Closed stooit closed 5 years ago
Further to this, the inverse is sometimes true: query strings are not always desirable, nor always a relevant indicator of content uniqueness.
The solution should be configurable (e.g. include_query: true).
You can now exclude the query and fragment from all URLs, or override these settings for a single URL:
---
domain: http://cleanenergyregulator.gov.au
urls:
  - '/About/Pages/News%20and%20updates/NewsItem.aspx?ListId=19b4efbb-6f5d-4637-94c4-121c1f96fcfe&ItemId=659'
  - '/About/Pages/News%20and%20updates/NewsItem.aspx?ListId=19b4efbb-6f5d-4637-94c4-121c1f96fcfe&ItemId=667'
  - '/About/Pages/News%20and%20updates/NewsItem.aspx?ListId=19b4efbb-6f5d-4637-94c4-121c1f96fcfe&ItemId=669'
  - '/About/Pages/News%20and%20updates/NewsItem.aspx?ListId=19b4efbb-6f5d-4637-94c4-121c1f96fcfe&ItemId=666'
  - '/About/Pages/News%20and%20updates/NewsItem.aspx?ListId=19b4efbb-6f5d-4637-94c4-121c1f96fcfe&ItemId=668#anchor-me'
  - '/About/Pages/News%20and%20updates/NewsItem.aspx?ListId=19b4efbb-6f5d-4637-94c4-121c1f96fcfe&ItemId=665#anchor-me-too'
url_options:
  include_query: false
  include_fragment: false
  urls:
    -
      url: '/About/Pages/News%20and%20updates/NewsItem.aspx?ListId=19b4efbb-6f5d-4637-94c4-121c1f96fcfe&ItemId=668#anchor-me'
      include_query: true
      include_fragment: true
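To illustrate how the global defaults and a per-URL override might interact, here is a minimal Python sketch. It is not the tool's actual implementation: the helper names normalise_url and resolve_url are hypothetical, and the override lookup by exact URL string is an assumption made for the example.

```python
from urllib.parse import urlsplit, urlunsplit


def normalise_url(url, include_query=False, include_fragment=False):
    """Return url with the query and/or fragment dropped, per the flags."""
    parts = urlsplit(url)
    query = parts.query if include_query else ""
    fragment = parts.fragment if include_fragment else ""
    return urlunsplit((parts.scheme, parts.netloc, parts.path, query, fragment))


def resolve_url(url, defaults, overrides):
    """Apply the per-URL override if one exists, else the global defaults.

    Hypothetical helper: overrides are keyed by the exact URL string.
    """
    options = {**defaults, **overrides.get(url, {})}
    return normalise_url(url, options["include_query"], options["include_fragment"])


# Mirrors the url_options block: strip everything by default...
defaults = {"include_query": False, "include_fragment": False}

base = "/About/Pages/News%20and%20updates/NewsItem.aspx"
listed = base + "?ListId=19b4efbb-6f5d-4637-94c4-121c1f96fcfe&ItemId=668#anchor-me"
other = base + "?ListId=19b4efbb-6f5d-4637-94c4-121c1f96fcfe&ItemId=665#anchor-me-too"

# ...but keep the query and fragment for the one URL listed under url_options.urls.
overrides = {listed: {"include_query": True, "include_fragment": True}}

kept = resolve_url(listed, defaults, overrides)      # query and fragment preserved
stripped = resolve_url(other, defaults, overrides)   # reduced to the bare path
```

In this sketch the overridden URL passes through unchanged, while every other URL is reduced to its path.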
Describe the bug
URLs that contain query params are sometimes valid as a unique path to content (e.g. /path/to/news?ItemID=123). These query params get stripped by the Alias type, so data may be lost from the resulting dataset.
Sample configuration
Expected behavior
6 unique items in the resulting JSON file.
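The effect on the item count can be shown with a short Python sketch. It only assumes that uniqueness is decided by the URL key that remains after stripping; the variable names are illustrative, and the six URLs are the ones from the configuration in this thread.

```python
from urllib.parse import urlsplit

BASE = (
    "/About/Pages/News%20and%20updates/NewsItem.aspx"
    "?ListId=19b4efbb-6f5d-4637-94c4-121c1f96fcfe&ItemId="
)
urls = [
    BASE + item
    for item in ("659", "667", "669", "666", "668#anchor-me", "665#anchor-me-too")
]

# Keying on the path alone (query stripped) collapses every URL to one entry.
path_only = {urlsplit(u).path for u in urls}

# Keying on path plus query keeps each news item distinct.
path_and_query = {(urlsplit(u).path, urlsplit(u).query) for u in urls}

print(len(path_only), len(path_and_query))
```

With the query stripped, all six URLs share a single key; with the query kept, each one stays distinct, which matches the expected six items.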