rdmpage / biorss

Harvest and repurpose RSS feeds

Merge duplicates #10

Open rdmpage opened 2 years ago

rdmpage commented 2 years ago

We can get duplicate references in several ways. For example, an aggregator such as PubMed or Google Scholar may have the same reference as one we retrieve directly from a journal's RSS feed. The feed https://biorss.herokuapp.com/?feed=Y291bnRyeT1DTiZwYXRoPSU1QiUyMkJJT1RBJTIyJTJDJTIyQW5pbWFsaWElMjIlMkMlMjJBcnRocm9wb2RhJTIyJTJDJTIyQXJhY2huaWRhJTIyJTJDJTIyQXJhbmVhZSUyMiU1RA== retrieves both https://pubmed.ncbi.nlm.nih.gov/34899007/?utm_source=Mobile%20Safari%20UI/WKWebView&utm_medium=rss&utm_campaign=pubmed-2&utm_content=1rE397IRBYU0-ogsyRnEw9o91K808u0evolcHK9IDZ0PVH5cqD&fc=20211108074834&ff=20211214094348&v=2.15.0 and https://zookeys.pensoft.net/article/73345/, which share the DOI 10.3897/zookeys.1072.73345.
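One possible way to merge such duplicates is to group harvested records by a normalized DOI. The following is only a sketch, not the code used in biorss: the record layout (dicts with `url`, `doi`, and `title` keys) and the function names `normalize_doi` and `merge_by_doi` are assumptions made for illustration.

```python
# Hypothetical sketch: merge harvested records that share a DOI.
# The record layout (dicts with "url", "doi", "title") is an assumption,
# not the actual biorss data model.

DOI_PREFIXES = (
    "https://doi.org/",
    "http://doi.org/",
    "https://dx.doi.org/",
    "http://dx.doi.org/",
    "doi:",
)


def normalize_doi(doi):
    """Lowercase a DOI and strip common resolver prefixes."""
    doi = doi.strip().lower()
    for prefix in DOI_PREFIXES:
        if doi.startswith(prefix):
            doi = doi[len(prefix):]
            break
    return doi


def merge_by_doi(records):
    """Return one entry per DOI, collecting every URL seen for it.

    Records without a DOI are passed through unmerged at the end.
    """
    merged = {}
    without_doi = []
    for record in records:
        doi = record.get("doi")
        if not doi:
            without_doi.append({**record, "urls": [record["url"]]})
            continue
        key = normalize_doi(doi)
        if key not in merged:
            merged[key] = {"doi": key, "title": record.get("title"), "urls": []}
        entry = merged[key]
        if record["url"] not in entry["urls"]:
            entry["urls"].append(record["url"])
        # Keep the first non-empty title we encounter.
        if not entry["title"]:
            entry["title"] = record.get("title")
    return list(merged.values()) + without_doi


if __name__ == "__main__":
    # Tracking parameters on the PubMed URL are omitted here for brevity.
    harvested = [
        {"url": "https://pubmed.ncbi.nlm.nih.gov/34899007/",
         "doi": "10.3897/zookeys.1072.73345"},
        {"url": "https://zookeys.pensoft.net/article/73345/",
         "doi": "https://doi.org/10.3897/zookeys.1072.73345"},
    ]
    for entry in merge_by_doi(harvested):
        print(entry["doi"], entry["urls"])
```

Keying on the DOI rather than the URL means the PubMed and ZooKeys items above collapse into a single entry that keeps both links. Records with no DOI would still need some other matching strategy.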

rdmpage commented 2 years ago

Note also that some journal RSS feeds do not use stable URLs for feed items; the URLs can change with each harvest. Hence https://biorss.herokuapp.com/?feed=Y291bnRyeT1DTiZwYXRoPSU1QiUyMkJJT1RBJTIyJTJDJTIyQW5pbWFsaWElMjIlMkMlMjJDaG9yZGF0YSUyMiUyQyUyMlJlcHRpbGlhJTIyJTVE has multiple records for the same reference: