medialab / hyphe

Websites crawler with built-in exploration and control web interface
http://hyphe.medialab.sciences-po.fr/demo/
GNU Affero General Public License v3.0

Adding www.google.com/url to redirection list not working? #388

Closed: ladelentes closed this issue 2 years ago

ladelentes commented 4 years ago

When I crawl a website built with Google Sites, the only “discovered” entity is [google.com], despite several links to external websites actually present.

Google Sites changes all hyperlinks from [href=“http://somesite”] to [href=“http://www.google.com/url?q=http://somesite”].
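The rewriting described above can be undone without any network request by reading the target from the `q` query parameter. The following is only a minimal sketch of that idea and is not part of Hyphe: the function name `unwrap_google_redirect` is hypothetical, and it assumes that `q` always holds the real target.

```python
from urllib.parse import urlsplit, parse_qs


def unwrap_google_redirect(url):
    """Return the target of a google.com/url?q=... link, or the URL unchanged.

    Hypothetical helper: assumes the `q` parameter carries the real target.
    """
    parts = urlsplit(url)
    host = parts.netloc.lower()
    if host in ("google.com", "www.google.com") and parts.path == "/url":
        # parse_qs returns a dict mapping each parameter to a list of values
        targets = parse_qs(parts.query).get("q")
        if targets:
            return targets[0]
    # Not a Google redirection link (or no `q` parameter): leave it untouched
    return url


print(unwrap_google_redirect("http://www.google.com/url?q=http://somesite"))
```

Links that do not point to `google.com/url` are returned as they are, so the helper could be applied to every extracted hyperlink before it is stored.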

The Redirection Domain ([www.google.com/url]) is not included in the current Configuration file.

I added both [www.google.com/url] and [google.com/url] to the list of Redirection Domains (I tried manually in the config file and through the web interface on localhost), but it hasn’t made a difference.

Even after adding this new redirection domain and restarting Hyphe, the real targets of the masked hyperlinks are not added as Discovered Web Entities.

Is this a bug? (please fix if possible) Or does something else need to happen for a new Redirection Domain to be recognised? (please add to documentation if possible)

Here's the example website http://feminicidiouruguay.net

Thank you for your help!

boogheta commented 4 years ago

Hi, and thanks for the report. It seems that Google Sites redirections are JavaScript redirections rather than server-side ones, which makes them a lot more complex to handle. Fixing this would require parsing the HTML content of the response to identify such JavaScript redirections and handle them. This should be done somewhere in here: https://github.com/medialab/hyphe/blob/master/hyphe_backend/crawler/hcicrawler/resolver.py#L30 The team is unfortunately quite busy with other tasks at the moment, but this is an issue that should indeed be addressed!

edit: it seems that most of the links could be handled using the META refresh field inside the response content (cf. https://github.com/medialab/minet/blob/71d771cfaad2d3eb9d9893a54b63380b13ef4c68/minet/utils.py#L142-L164 ). Others such as https://www.google.com/url?q=https://www.facebook.com/Contaniunamenos/&sa=D&ust=1603455678482000&usg=AFQjCNFSANkezX4k8Fk4sY6xg30u6CHO2Q seem far harder to handle.
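To illustrate the META refresh idea, here is a minimal sketch using only the Python standard library. It is neither the minet implementation linked above nor Hyphe code: the names `MetaRefreshParser` and `find_meta_refresh` are hypothetical, and the sketch assumes the HTML body of the response is already available as a string.

```python
import re
from html.parser import HTMLParser


class MetaRefreshParser(HTMLParser):
    """Collect the target of the first <meta http-equiv="refresh"> tag."""

    def __init__(self):
        super().__init__()
        self.target = None

    def handle_starttag(self, tag, attrs):
        if tag != "meta" or self.target is not None:
            return
        attrs = dict(attrs)
        if (attrs.get("http-equiv") or "").lower() != "refresh":
            return
        # The content attribute looks like "0;URL='http://somesite'"
        match = re.search(r"url\s*=\s*['\"]?([^'\"]+)", attrs.get("content") or "", re.I)
        if match:
            self.target = match.group(1).strip()


def find_meta_refresh(html):
    """Return the META refresh target found in `html`, or None."""
    parser = MetaRefreshParser()
    parser.feed(html)
    return parser.target


page = "<html><head><meta http-equiv=\"refresh\" content=\"0;URL='http://somesite'\"></head></html>"
print(find_meta_refresh(page))
```

A pure JavaScript redirection would not be caught by this approach, since it only inspects `meta` tags.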

edit2: well actually, taking a closer look at the code, it performs HEAD queries, so it never sees the response content; handling this would require bigger changes.

ladelentes commented 4 years ago

Thanks for the response! I'll see if I can find a workaround.