URL uniqueness in crawler

stooit commented 5 years ago

Description: When crawling URLs, several variations of the same content may be added, which can lead to duplicated content in the resulting Merlin run.

This happens due to case differences in URLs, query params, or the same content existing under multiple URLs on the same page, e.g.:

/path/to/content
/Path/To/Content           # Case
/path/to/content?q=1       # Query
/path/to/content?q=1&a=2   # Alternate query
/unique/path               # Unique path

It is not safe to assume query params can be stripped; there are cases where they do represent unique URLs (e.g. /news?ItemID=123).

Proposed solution: The safest solution is to track URLs against a hash of the content of the page. Duplicates should be tracked in a separate file, as this data may be useful for redirects.
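A minimal sketch of the idea, in Python purely for illustration (Merlin itself is not written in Python, and the function name and the shape of the returned data are assumptions made for this example). The first URL seen for a given content hash is kept, and later URLs with the same hash are recorded against it:

```python
import hashlib

def track_by_content_hash(pages):
    """pages: iterable of (url, html) pairs.

    Returns (unique_urls, duplicates), where duplicates maps the first URL
    seen for a piece of content to the later URLs that served the same content.
    """
    seen = {}        # content hash -> first URL seen with that content
    duplicates = {}  # first URL -> later URLs with identical content
    for url, html in pages:
        digest = hashlib.sha256(html.encode("utf-8")).hexdigest()
        if digest in seen:
            duplicates.setdefault(seen[digest], []).append(url)
        else:
            seen[digest] = url
    return list(seen.values()), duplicates

# Hypothetical crawl results using the URL variations listed above.
pages = [
    ("/path/to/content", "<p>Hello</p>"),
    ("/Path/To/Content", "<p>Hello</p>"),
    ("/path/to/content?q=1", "<p>Hello</p>"),
    ("/unique/path", "<p>Other</p>"),
]
unique, duplicates = track_by_content_hash(pages)
```

The `duplicates` mapping is the kind of data that could be written to a separate file and later reused for redirects.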

Additional context: Related issue in Alias type: #47

derklempner commented 5 years ago

Query part should now be handled by #52.

derklempner commented 5 years ago

Semi-related #41

derklempner commented 5 years ago

Unfortunately, the hash of the content might not be usable, thanks to dynamic data generated server-side that may contain timestamps, cache-busters, etc. that bring doom. For instance, on the very first example of this I tried, from http://cleanenergyregulator.gov.au, there is a hidden field which is itself a hash of the request.

[Screenshot: Screen Shot 2019-08-09 at 23 36 14]
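To illustrate the problem, here is a small Python sketch (the markup, the field name `token`, and the values are invented for this example, not taken from the site). Two fetches of the same page differ only in a per-request hidden field, yet their full-page hashes differ:

```python
import hashlib

def full_page_hash(html):
    """Hash the entire response body, dynamic parts included."""
    return hashlib.sha256(html.encode("utf-8")).hexdigest()

# Hypothetical page whose hidden field changes on every request.
TEMPLATE = (
    '<body><input type="hidden" name="token" value="{token}"/>'
    "<h1>News item</h1></body>"
)
first_fetch = TEMPLATE.format(token="a1b2c3")
second_fetch = TEMPLATE.format(token="d4e5f6")

# The visible content is identical, but the hashes are not.
hashes_match = full_page_hash(first_fetch) == full_page_hash(second_fetch)
```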

stooit commented 5 years ago

Hmm, yeah.

We should add a ‘uniqueness selector’ so you can specify which part to hash.

Could be as simple as ‘body’; some sites may use canonical URL meta tags, etc.

This may need to be a root level option with optional group overrides (e.g. perhaps a landing page has a different uniqueness selector from a basic page).
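One way the root-level option with group overrides could resolve, sketched in Python. Every key name here (`uniqueness_selector`, `groups`, `landing_page`, `basic_page`) is hypothetical and only illustrates the lookup order, not an actual Merlin config schema:

```python
DEFAULT_SELECTOR = "body"  # assumed fallback when nothing is configured

def resolve_uniqueness_selector(config, group=None):
    """Return the group's selector if it overrides one, else the root-level one."""
    root = config.get("uniqueness_selector", DEFAULT_SELECTOR)
    overrides = config.get("groups", {})
    return overrides.get(group, {}).get("uniqueness_selector", root)

# Hypothetical config: landing pages override the root selector, basic pages do not.
config = {
    "uniqueness_selector": "body",
    "groups": {
        "landing_page": {"uniqueness_selector": "main"},
        "basic_page": {},
    },
}
landing = resolve_uniqueness_selector(config, "landing_page")
basic = resolve_uniqueness_selector(config, "basic_page")
```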

derklempner commented 5 years ago

Yeah, I thought about using body text or some stripped version of the content, which probably works for most cases. Another route is a similarity index of some kind; I need to try a few things. I did quickly try stuff like similar_text, which works well but is crazy slow for strings this large.
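For comparison, a similarity check can be sketched with Python's standard `difflib.SequenceMatcher`. This is only an analogue of the `similar_text` experiment, not what was tried here, and the 0.9 threshold is an arbitrary assumption. It shares the drawback mentioned above: the comparison has quadratic worst-case cost, so it gets slow on full-page strings.

```python
from difflib import SequenceMatcher

def similarity(a, b):
    """Return a ratio in [0, 1]; 1.0 means the strings are identical."""
    return SequenceMatcher(None, a, b).ratio()

def is_near_duplicate(a, b, threshold=0.9):
    """Treat two texts as duplicates when their similarity reaches the threshold."""
    return similarity(a, b) >= threshold

# Two invented snippets that differ by a single trailing character.
near = is_near_duplicate("Clean energy news for August",
                         "Clean energy news for August.")
```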

derklempner commented 5 years ago

This has been implemented in feature/issue-48-url-uniqueness. Ended up going for a stripped out content hash, which can be tweaked via the config.

You can specify whether to find duplicates, the uniqueness selector and what node types to strip.

Currently, the class is instantiated in GenerateCommand::execute() and passed into runWeb and into the cURL requestCallback. It could potentially live inside the output class. Anyone have any strong where-it-should-live feelings?

It will build a url-content-duplicates.json file that contains URLs that are content duplicates.
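A rough Python sketch of what a stripped-out content hash can look like. It is not the implementation in the branch: it assumes well-formed markup, uses plain tag names instead of XPath expressions, and relies on the standard-library XML parser, which drops comments by default. The `exclude_tags` parameter and the helper names are invented for this example.

```python
import hashlib
import xml.etree.ElementTree as ET

def _remove_keep_tail(parent, child):
    """Remove child but keep the text that follows it (its tail)."""
    idx = list(parent).index(child)
    tail = child.tail or ""
    if idx == 0:
        parent.text = (parent.text or "") + tail
    else:
        prev = parent[idx - 1]
        prev.tail = (prev.tail or "") + tail
    parent.remove(child)

def stripped_content_hash(html, hash_selector="body",
                          exclude_tags=("script", "style", "input", "head")):
    root = ET.fromstring(html)  # assumes well-formed markup
    # Strip excluded node types anywhere in the tree.
    for parent in list(root.iter()):
        for child in list(parent):
            if child.tag in exclude_tags:
                _remove_keep_tail(parent, child)
    # Pick the element whose remaining text gets hashed.
    target = root if root.tag == hash_selector else root.find(f".//{hash_selector}")
    if target is None:
        return None
    # Collapse whitespace so formatting differences do not change the hash.
    text = " ".join(" ".join(target.itertext()).split())
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

# Invented page with a per-request token in a hidden input and a script.
PAGE = (
    "<html><head><title>t</title></head><body><h1>News item</h1>"
    '<input type="hidden" value="{token}"/><p>Same  text</p>'
    '<script>var t = "{token}";</script></body></html>'
)
hash_a = stripped_content_hash(PAGE.format(token="a1"))
hash_b = stripped_content_hash(PAGE.format(token="b2"))
```

Because the hidden input and the script are stripped before hashing, `hash_a` and `hash_b` come out equal even though the raw responses differ.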

Here is an example config:

---
domain: http://cleanenergyregulator.gov.au

urls:
  - '/About/Pages/News%20and%20updates/NewsItem.aspx?ListId=19b4efbb-6f5d-4637-94c4-121c1f96fcfe&ItemId=659'
  - '/About/Pages/News%20and%20updates/NewsItem.aspx?ListId=19b4efbb-6f5d-4637-94c4-121c1f96fcfe&ItemId=667'
  - '/About/Pages/News%20and%20updates/NewsItem.aspx?ListId=19b4efbb-6f5d-4637-94c4-121c1f96fcfe&ItemId=669'
  - '/About/Pages/News%20and%20updates/NewsItem.aspx?ListId=19b4efbb-6f5d-4637-94c4-121c1f96fcfe&ItemId=669#something-something'
  - '/About/Pages/News%20and%20updates/NewsItem.aspx?ListId=19b4efbb-6f5d-4637-94c4-121c1f96fcfe&ItemId=666'
  - '/About/Pages/News%20and%20updates/NewsItem.aspx?ListId=19b4efbb-6f5d-4637-94c4-121c1f96fcfe&ItemId=668#anchor-me'
  - '/About/Pages/News%20and%20updates/NewsItem.aspx?ListId=19b4efbb-6f5d-4637-94c4-121c1f96fcfe&ItemId=665#Anchor-Me'
  - '/About/Pages/News%20and%20updates/NewsItem.aspx?ListId=19b4efbb-6f5d-4637-94c4-121c1f96fcfe&ItemId=665#Anchor-Me-Duplicate'
  - '/About/Pages/News%20and%20updates/NewsItem.aspx?ListId=19b4efbb-6f5d-4637-94c4-121c1f96fcfe&ItemId=665#Anchor-Me-Duplicate-Another'

url_options:
  # Default false
  include_query: true       

  # Default false
  include_fragment: true

  # Default true
  find_content_duplicates: true

  # Default '//body'
  hash_selector: '//body' 

  # Default script, comment, style, input, head  
  hash_exclude_nodes:      
    - '//script'  
    - '//comment()'
    - '//style'
    - '//input'
    - '//head'

entity_type: news
mappings:
  -
    field: alias
    type: alias
  -
    field: title
    selector: '.contentPage-content-area h1'
    # selector: '.this-class-isnt-found-eh' 
    type: text
    processors:
      whitespace: { }
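The effect of `include_query` and `include_fragment` on URL identity can be sketched as follows, in Python for illustration only. This assumes the options simply decide whether the query string and fragment are kept as part of the URL's key; the `url_key` helper and the shortened example paths are hypothetical.

```python
from urllib.parse import urlsplit, urlunsplit

def url_key(url, include_query=False, include_fragment=False):
    """Build the key used to decide whether two URLs count as the same."""
    parts = urlsplit(url)
    query = parts.query if include_query else ""
    fragment = parts.fragment if include_fragment else ""
    return urlunsplit((parts.scheme, parts.netloc, parts.path, query, fragment))

# Shortened stand-ins for the NewsItem.aspx URLs in the config above.
a = "/news/NewsItem.aspx?ItemId=665#Anchor-Me"
b = "/news/NewsItem.aspx?ItemId=665#Anchor-Me-Duplicate"

# With both options off, every NewsItem.aspx URL collapses to one key.
path_only = url_key(a)
# With the query kept but fragments dropped, a and b share a key.
same_item = url_key(a, include_query=True) == url_key(b, include_query=True)
# With both kept, a and b are distinct again.
distinct = url_key(a, True, True) != url_key(b, True, True)
```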

stooit commented 5 years ago

Done, noting that a follow-up to refactor certain bits exists in #68.

salsadigitalauorg / merlin-framework

URL uniqueness in crawler #48