Closed. stooit closed this 5 years ago.
Query part should now be handled by #52.
Semi-related #41
Unfortunately, the hash of the content might not be usable because of dynamic data generated server-side (timestamps, cache-busters, etc.) that bring doom. For instance, in the very first example I tried, from http://cleanenergyregulator.gov.au, there is a hidden field which is itself a hash of the request.
Hmm, yeah.
We should add a ‘uniqueness selector’ so you can specify which part to hash.
It could be as simple as ‘body’; some sites may use canonical URL meta tags, etc.
This may need to be a root-level option with optional group overrides (e.g. perhaps a landing page has a different uniqueness selector to a basic page).
Yeah, I thought about using body text or some stripped version of the content, which probably works for most cases. Another route is a similarity index of some kind; need to try a few things. I did quickly try stuff like similar_text, which works well but is crazy slow for this kind of large string.
This has been implemented in feature/issue-48-url-uniqueness. Ended up going for a stripped out content hash, which can be tweaked via the config.
You can specify whether to find duplicates, the uniqueness selector and what node types to strip.
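To illustrate the stripped-content-hash idea, here is a minimal sketch in Python (not the project's actual code). It assumes the selected region is the body element, that script, style and head content plus comments are dropped, and that SHA-256 is the digest; the class and function names are hypothetical. Hidden input values live in attributes, so a text-only extraction ignores them as a side effect.

```python
import hashlib
from html.parser import HTMLParser

# Void elements have no closing tag and no text, so they must not change depth.
VOID_TAGS = {"input", "br", "img", "hr", "meta", "link"}


class StrippedTextExtractor(HTMLParser):
    """Collect text inside root_tag, skipping excluded elements (hypothetical helper)."""

    def __init__(self, root_tag="body", exclude=("script", "style", "head")):
        super().__init__()
        self.root_tag = root_tag
        self.exclude = set(exclude)
        self.in_root = 0      # nesting depth inside the selected root element
        self.skip_depth = 0   # nesting depth inside an excluded element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if tag in self.exclude:
            self.skip_depth += 1
        elif tag == self.root_tag:
            self.in_root += 1

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if tag in self.exclude and self.skip_depth:
            self.skip_depth -= 1
        elif tag == self.root_tag and self.in_root:
            self.in_root -= 1

    def handle_data(self, data):
        # Comments never reach here: HTMLParser routes them to handle_comment,
        # which does nothing by default.
        if self.in_root and not self.skip_depth:
            self.chunks.append(data)


def content_hash(html):
    """Hash the whitespace-normalised text of the selected region (assumed SHA-256)."""
    parser = StrippedTextExtractor()
    parser.feed(html)
    parser.close()
    text = " ".join(" ".join(parser.chunks).split())
    return hashlib.sha256(text.encode("utf-8")).hexdigest()
```

With this approach, two responses that differ only in a hidden request hash, an inline script or an HTML comment produce the same digest, while a change to the visible text produces a different one.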
Currently, the class is instantiated in GenerateCommand::execute() and passed into runWeb and into the cURL requestCallback. It could potentially live inside the output class. Anyone have any strong where-it-should-live feelings?
It will build a url-content-duplicates.json file that contains URLs that are content duplicates.
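As a rough sketch of how such a duplicates file could be produced, the Python below groups URLs by content hash and treats the first URL seen for each hash as the original. The JSON layout (original URL mapped to a list of its duplicates) and the function names are assumptions for illustration; the real file format is not described in this thread.

```python
import json
from collections import defaultdict


def find_content_duplicates(url_hashes):
    """Group (url, content_hash) pairs, given in crawl order, by hash.

    Returns a dict mapping the first URL seen for each hash to the list of
    later URLs that share it. URLs with unique content are omitted.
    """
    by_hash = defaultdict(list)
    for url, digest in url_hashes:
        by_hash[digest].append(url)
    return {urls[0]: urls[1:] for urls in by_hash.values() if len(urls) > 1}


def write_duplicates(duplicates, path="url-content-duplicates.json"):
    """Write the duplicates mapping as JSON (layout is an assumption)."""
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(duplicates, fh, indent=2)
```

Keeping the original-to-duplicates mapping, rather than a flat list, would make the file directly usable for generating redirects.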
Here is an example config:
---
domain: http://cleanenergyregulator.gov.au
urls:
  - '/About/Pages/News%20and%20updates/NewsItem.aspx?ListId=19b4efbb-6f5d-4637-94c4-121c1f96fcfe&ItemId=659'
  - '/About/Pages/News%20and%20updates/NewsItem.aspx?ListId=19b4efbb-6f5d-4637-94c4-121c1f96fcfe&ItemId=667'
  - '/About/Pages/News%20and%20updates/NewsItem.aspx?ListId=19b4efbb-6f5d-4637-94c4-121c1f96fcfe&ItemId=669'
  - '/About/Pages/News%20and%20updates/NewsItem.aspx?ListId=19b4efbb-6f5d-4637-94c4-121c1f96fcfe&ItemId=669#something-something'
  - '/About/Pages/News%20and%20updates/NewsItem.aspx?ListId=19b4efbb-6f5d-4637-94c4-121c1f96fcfe&ItemId=666'
  - '/About/Pages/News%20and%20updates/NewsItem.aspx?ListId=19b4efbb-6f5d-4637-94c4-121c1f96fcfe&ItemId=668#anchor-me'
  - '/About/Pages/News%20and%20updates/NewsItem.aspx?ListId=19b4efbb-6f5d-4637-94c4-121c1f96fcfe&ItemId=665#Anchor-Me'
  - '/About/Pages/News%20and%20updates/NewsItem.aspx?ListId=19b4efbb-6f5d-4637-94c4-121c1f96fcfe&ItemId=665#Anchor-Me-Duplicate'
  - '/About/Pages/News%20and%20updates/NewsItem.aspx?ListId=19b4efbb-6f5d-4637-94c4-121c1f96fcfe&ItemId=665#Anchor-Me-Duplicate-Another'
url_options:
  # Default false
  include_query: true
  # Default false
  include_fragment: true
  # Default true
  find_content_duplicates: true
  # Default '//body'
  hash_selector: '//body'
  # Default script, comment, style, input, head
  hash_exclude_nodes:
    - '//script'
    - '//comment()'
    - '//style'
    - '//input'
    - '//head'
entity_type: news
mappings:
  -
    field: alias
    type: alias
  -
    field: title
    selector: '.contentPage-content-area h1'
    # selector: '.this-class-isnt-found-eh'
    type: text
    processors:
      whitespace: { }
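For readers wondering what include_query and include_fragment might mean for URL uniqueness, here is a small Python sketch. It assumes each flag simply controls whether that URL component is kept when building the key used to compare URLs; the url_key function is hypothetical and the tool's real behaviour may differ.

```python
from urllib.parse import urlsplit, urlunsplit


def url_key(url, include_query=False, include_fragment=False):
    """Build a comparison key for a URL, optionally keeping query and fragment."""
    parts = urlsplit(url)
    query = parts.query if include_query else ""
    fragment = parts.fragment if include_fragment else ""
    return urlunsplit((parts.scheme, parts.netloc, parts.path, query, fragment))
```

Under this assumption, the ItemId=665 entries in the config above collapse to a single key unless include_fragment is true, which is why both flags are enabled there to exercise the content-duplicate detection.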
Done, noting that a follow-up exists to refactor certain bits in #68
Description
When crawling URLs, several variations of the same content may be added, which can lead to duplicated content in the resulting Merlin run.
This happens due to case sensitivity in URLs, query params, or the same content existing on multiple URLs on the same page. e.g.:
It is not safe to assume query params can be stripped; there are cases where these do represent unique URLs (e.g. /news?ItemID=123).
Proposed solution
The safest solution is to track URLs against a hash of the content of the page. Duplicates should be tracked in a separate file, as this data may be useful for redirects.
Additional context
Related issue in Alias type: #47