salsadigitalauorg / merlin-framework

Merlin - migration framework
GNU General Public License v3.0

Query support in Alias type #47

Closed: stooit closed this issue 5 years ago

stooit commented 5 years ago

Describe the bug

URLs that contain query parameters are sometimes the only unique path to a piece of content (e.g. /path/to/news?ItemID=123).

The Alias type strips these query parameters, so distinct URLs can end up with the same alias and data can be lost from the resulting dataset.

Sample configuration

---
domain: http://cleanenergyregulator.gov.au

urls:
  - '/About/Pages/News%20and%20updates/NewsItem.aspx?ListId=19b4efbb-6f5d-4637-94c4-121c1f96fcfe&ItemId=659'
  - '/About/Pages/News%20and%20updates/NewsItem.aspx?ListId=19b4efbb-6f5d-4637-94c4-121c1f96fcfe&ItemId=667'
  - '/About/Pages/News%20and%20updates/NewsItem.aspx?ListId=19b4efbb-6f5d-4637-94c4-121c1f96fcfe&ItemId=669'
  - '/About/Pages/News%20and%20updates/NewsItem.aspx?ListId=19b4efbb-6f5d-4637-94c4-121c1f96fcfe&ItemId=666'
  - '/About/Pages/News%20and%20updates/NewsItem.aspx?ListId=19b4efbb-6f5d-4637-94c4-121c1f96fcfe&ItemId=668'
  - '/About/Pages/News%20and%20updates/NewsItem.aspx?ListId=19b4efbb-6f5d-4637-94c4-121c1f96fcfe&ItemId=665'

entity_type: news
mappings:
  -
    field: alias
    type: alias
  -
    field: title
    selector: '.contentPage-content-area h1'
    type: text
    processors:
      whitespace: { }

Expected behavior

6 unique items in the resulting JSON file.
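To illustrate why the query string matters for this configuration, here is a minimal Python sketch (not Merlin's actual PHP implementation). The `alias` helper and its `include_query` parameter are hypothetical names used only for illustration. It shows the six sample URLs collapsing to a single alias when the query is dropped.

```python
from urllib.parse import urlsplit

# The six sample URLs from the configuration above; only ItemId differs.
BASE = (
    "/About/Pages/News%20and%20updates/NewsItem.aspx"
    "?ListId=19b4efbb-6f5d-4637-94c4-121c1f96fcfe&ItemId="
)
URLS = [f"{BASE}{item_id}" for item_id in (659, 667, 669, 666, 668, 665)]


def alias(url, include_query=False):
    """Hypothetical alias builder: the path, optionally followed by the query."""
    parts = urlsplit(url)
    if include_query and parts.query:
        return f"{parts.path}?{parts.query}"
    return parts.path


# Dropping the query leaves one distinct alias; keeping it leaves six.
stripped = {alias(u) for u in URLS}
kept = {alias(u, include_query=True) for u in URLS}
print(len(stripped), len(kept))  # 1 6
```

If rows are keyed by alias, the stripped variant would leave one item where six are expected.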

stooit commented 5 years ago

Further to this, the inverse is sometimes true: query strings are not always desirable, nor always a relevant indicator of content uniqueness.

The solution should be configurable (e.g. include_query: true).

derklempner commented 5 years ago

It is now possible to exclude the query and fragment from all URLs, or to specify settings for a single URL:

---
domain: http://cleanenergyregulator.gov.au

urls:
  - '/About/Pages/News%20and%20updates/NewsItem.aspx?ListId=19b4efbb-6f5d-4637-94c4-121c1f96fcfe&ItemId=659'
  - '/About/Pages/News%20and%20updates/NewsItem.aspx?ListId=19b4efbb-6f5d-4637-94c4-121c1f96fcfe&ItemId=667'
  - '/About/Pages/News%20and%20updates/NewsItem.aspx?ListId=19b4efbb-6f5d-4637-94c4-121c1f96fcfe&ItemId=669'
  - '/About/Pages/News%20and%20updates/NewsItem.aspx?ListId=19b4efbb-6f5d-4637-94c4-121c1f96fcfe&ItemId=666'
  - '/About/Pages/News%20and%20updates/NewsItem.aspx?ListId=19b4efbb-6f5d-4637-94c4-121c1f96fcfe&ItemId=668#anchor-me'
  - '/About/Pages/News%20and%20updates/NewsItem.aspx?ListId=19b4efbb-6f5d-4637-94c4-121c1f96fcfe&ItemId=665#anchor-me-too'

url_options:
  include_query: false
  include_fragment: false
  urls:
    -
      url: '/About/Pages/News%20and%20updates/NewsItem.aspx?ListId=19b4efbb-6f5d-4637-94c4-121c1f96fcfe&ItemId=668#anchor-me'
      include_query: true
      include_fragment: true
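
One way to read the url_options block above is that the top-level flags act as defaults and each entry under urls overrides them for one exact URL. The Python sketch below follows that reading. It is an assumption about the semantics, not Merlin's actual code: the names `options_for` and `build_alias`, the exact-string matching, and the fallback defaults of False are all hypothetical.

```python
from urllib.parse import urlsplit

BASE = (
    "/About/Pages/News%20and%20updates/NewsItem.aspx"
    "?ListId=19b4efbb-6f5d-4637-94c4-121c1f96fcfe&"
)

# Python mirror of the url_options block above.
URL_OPTIONS = {
    "include_query": False,
    "include_fragment": False,
    "urls": [
        {
            "url": BASE + "ItemId=668#anchor-me",
            "include_query": True,
            "include_fragment": True,
        },
    ],
}


def options_for(url, url_options):
    """Start from the global flags, then apply an override for an exact URL match."""
    # Assumption: a missing flag falls back to False.
    opts = {
        "include_query": url_options.get("include_query", False),
        "include_fragment": url_options.get("include_fragment", False),
    }
    for override in url_options.get("urls", []):
        if override.get("url") == url:
            for key in opts:
                if key in override:
                    opts[key] = override[key]
    return opts


def build_alias(url, url_options):
    """Build the alias from the path plus whichever parts the options allow."""
    opts = options_for(url, url_options)
    parts = urlsplit(url)
    result = parts.path
    if opts["include_query"] and parts.query:
        result += "?" + parts.query
    if opts["include_fragment"] and parts.fragment:
        result += "#" + parts.fragment
    return result


print(build_alias(BASE + "ItemId=668#anchor-me", URL_OPTIONS))
print(build_alias(BASE + "ItemId=665#anchor-me-too", URL_OPTIONS))
```

Under these assumptions, the ItemId=668 URL keeps both its query and its fragment, while every other URL is reduced to its bare path.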