medialab / hyphe

Websites crawler with built-in exploration and control web interface
http://hyphe.medialab.sciences-po.fr/demo/
GNU Affero General Public License v3.0

Forbid Hyphe from creating entities with only a TLD (and no hostname) when crawling the web #444

Closed: Klocohdonou closed this issue 2 years ago

Klocohdonou commented 2 years ago

Hello!

In the comments section of an online article I crawled, Hyphe found a link to http://WWW.org.
Here's the comment (the permalink doesn't seem to work; but if you scroll down, it's the comment by Crackly Philippe from January 12, 2021). The link is on the author name.

Since there is no hostname in this URL, Hyphe created an entity matching the entire .org TLD (LRU prefix s:http|h:org|), and many of the websites with a .org TLD that Hyphe found while crawling were gathered into this entity.
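To make the problem concrete, here is a minimal Python sketch, not Hyphe's actual code. It assumes the TLD-only prefix appears when the leading `www` label of `www.org` is dropped. The helpers `url_to_lru`, `strip_www` and `is_tld_only` are hypothetical names, the LRU conversion is simplified (scheme and host only, no port or path), and `KNOWN_TLDS` is a tiny illustrative set where a real check would rely on the Public Suffix List.

```python
from urllib.parse import urlsplit

# Hypothetical, deliberately tiny TLD set; a real check would use the Public Suffix List.
KNOWN_TLDS = {"org", "com", "net", "fr"}


def url_to_lru(url):
    """Simplified URL -> LRU conversion: scheme stem, then host labels in reverse order."""
    parts = urlsplit(url)
    labels = parts.hostname.split(".")  # urlsplit already lowercases the hostname
    stems = ["s:" + parts.scheme] + ["h:" + label for label in reversed(labels)]
    return "|".join(stems) + "|"


def strip_www(lru):
    """Drop a trailing 'h:www|' stem (assumed source of the TLD-only prefix)."""
    suffix = "h:www|"
    return lru[:-len(suffix)] if lru.endswith(suffix) else lru


def is_tld_only(lru):
    """True when the only host stem in the LRU is a known TLD."""
    hosts = [stem[2:] for stem in lru.split("|") if stem.startswith("h:")]
    return len(hosts) == 1 and hosts[0] in KNOWN_TLDS


lru = url_to_lru("http://WWW.org")   # "s:http|h:org|h:www|"
prefix = strip_www(lru)              # "s:http|h:org|"
print(is_tld_only(prefix))           # True: this prefix would match every .org site
print(is_tld_only(strip_www(url_to_lru("http://www.example.org"))))  # False
```

A guard like `is_tld_only` could be used to refuse such a prefix when an entity is created, so that a link to `www.org` does not swallow unrelated `.org` websites.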

Would it be possible to prevent this behaviour from happening? Feel free to ask if you need additional information!

Thanks in advance, Kevin

boogheta commented 2 years ago

Yes, that is clearly a problem which should be addressed. Thanks for reporting its source (www.org): it is the best explanation so far for this weird behaviour. I had reported it in the past but couldn't find the cause or reproduce it (cf. #321).

boogheta commented 2 years ago

#341 should be taken care of at the same time as well.

Klocohdonou commented 2 years ago

Thanks for the fix! Glad this report helped.