mediacloud / metadata-lib

How Media Cloud approaches extracting metadata from online news stories
Apache License 2.0

May want to remove story source related query parameters! #67

Closed by philbudne 1 year ago

philbudne commented 1 year ago

Seen in https://upstract.com/x/fdf95bf448e1f2a8?ref=rss. The URL works without the ref parameter, and I've seen similar things in other URLs in the past.

If we ever get stories from a source via BOTH RSS (whether a direct feed, or via aggregators) and site maps, having URLs that include "tracking" information that shows how the URL was acquired would cause us to ingest the stories twice!

rahulbot commented 1 year ago

The ref param is already being removed. I see it in the list of parameters_to_remove and verified this with a manual test. As long as we are using normalized_url as the URL we MD5 for uniqueness, this URL shouldn't present a problem.
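
For anyone following along, here is a minimal sketch of the idea, not the library's actual code. It assumes a small, made-up PARAMETERS_TO_REMOVE set and two hypothetical helpers, strip_tracking_params and url_hash, built only on the Python standard library. The real parameters_to_remove list and normalized_url logic in metadata-lib may differ.

```python
import hashlib
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical subset for illustration only; not the library's real list.
PARAMETERS_TO_REMOVE = {"ref", "utm_source", "utm_medium", "utm_campaign"}


def strip_tracking_params(url: str) -> str:
    """Return the URL with any query parameter named in PARAMETERS_TO_REMOVE dropped."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in PARAMETERS_TO_REMOVE
    ]
    # Rebuild the URL; an empty query string produces no trailing "?".
    return urlunsplit(
        (parts.scheme, parts.netloc, parts.path, urlencode(kept), parts.fragment)
    )


def url_hash(url: str) -> str:
    """MD5 hex digest of the stripped URL, used here as a uniqueness key."""
    return hashlib.md5(strip_tracking_params(url).encode("utf-8")).hexdigest()


via_rss = "https://upstract.com/x/fdf95bf448e1f2a8?ref=rss"
via_sitemap = "https://upstract.com/x/fdf95bf448e1f2a8"
print(url_hash(via_rss) == url_hash(via_sitemap))
```

With the ref parameter stripped before hashing, the RSS and sitemap variants of the same story map to the same key, so the story would only be ingested once.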