irff / ionscraper

Online Indonesian News Media Scraper

antaranews duplicate articles #72

Open irff opened 9 years ago

irff commented 9 years ago

There are many duplicate articles from antaranews.

for example: http://128.199.81.117:5601/#/doc/langgar/langgar/news?id=d0ca76853cfdfe8ad36d6c96f039d2344745f0d0&_g=()

and

http://128.199.81.117:5601/#/doc/langgar/langgar/news?id=d0ca76853cfdfe8ad36d6c96f039d2344745f0d0&_g=()

The article source is the same, but the scraper can't handle the difference in the URL postfix (the query string):

http://www.antaranews.com/berita/495849/komisi-vi-dpr-akan-pertanyakan-peningkatan-listrik-kepada-angkasa-pura-ii?utm_campaign=news&utm_medium=populer&utm_source=populer_home

http://www.antaranews.com/berita/495849/komisi-vi-dpr-akan-pertanyakan-peningkatan-listrik-kepada-angkasa-pura-ii?utm_campaign=news&utm_medium=related&utm_source=fly

I think it's best to ignore / trim the postfix string after the '?' character in the URL.
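A minimal sketch of that trimming, assuming the scraper is written in Python and derives each document id from the article URL. The helper names `canonical_url` and `article_id` are hypothetical, and hashing with SHA-1 is only a guess based on the 40-character ids in the links above, not something confirmed from the ionscraper code.

```python
import hashlib
from urllib.parse import urlsplit, urlunsplit


def canonical_url(url: str) -> str:
    """Return the URL without its query string and fragment."""
    parts = urlsplit(url)
    # Keep scheme, host and path; drop everything after '?' and '#'.
    return urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))


def article_id(url: str) -> str:
    """Hypothetical id: SHA-1 hex digest of the canonical URL (assumption)."""
    return hashlib.sha1(canonical_url(url).encode("utf-8")).hexdigest()


base = ("http://www.antaranews.com/berita/495849/"
        "komisi-vi-dpr-akan-pertanyakan-peningkatan-listrik-kepada-angkasa-pura-ii")
url_a = base + "?utm_campaign=news&utm_medium=populer&utm_source=populer_home"
url_b = base + "?utm_campaign=news&utm_medium=related&utm_source=fly"

# Both tracking variants collapse to the same URL, and so to the same id.
print(canonical_url(url_a) == canonical_url(url_b))
print(article_id(url_a) == article_id(url_b))
```

Dropping the whole query string is safe here only because the antaranews article is identified by its path. A site that puts the article id in a query parameter would need a filter that removes just the `utm_*` keys instead.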

irff commented 9 years ago

There are often three occurrences: http://128.199.81.117:5601/#/doc/langgar/langgar/news/?id=24776b50040aac9dfa11bc1f4559c8343b11070f

http://128.199.81.117:5601/#/doc/langgar/langgar/news/?id=2c0afa1806b738f35154024f207de91b32781df5

http://128.199.81.117:5601/#/doc/langgar/langgar/news/?id=ab36174c6b1a08b9bbda38db5d31788abd1d0b36

irff commented 9 years ago

And four:

http://128.199.81.117:5601/#/doc/langgar/langgar/news?id=84f6cd599d724f2176582054a8cb1c99920a3780&_g=()

http://128.199.81.117:5601/#/doc/langgar/langgar/news?id=0a84ea3d2a402105a24d4e2161e3af0f800b8550&_g=()

http://128.199.81.117:5601/#/doc/langgar/langgar/news?id=7ac241fc9d03d1f850a73cd9c8bc81ab0d4b0c89&_g=()

http://128.199.81.117:5601/#/doc/langgar/langgar/news?id=63080f3edd3a430987111919d7437fbb6e5a2102&_g=()

kandito commented 9 years ago

https://github.com/irff/ionscraper/issues/73