medialab / hyphe

Websites crawler with built-in exploration and control web interface
http://hyphe.medialab.sciences-po.fr/demo/
GNU Affero General Public License v3.0

different results for same search, two years later #496

Open sofiatipa opened 6 months ago

sofiatipa commented 6 months ago

Hi,

I repeated a search I did nearly 2 years ago through Hyphe. I am trying to find the co-linkages between two webentities, but the results are quite different. The original search came up with 6 pages that were used by both sites, while the new search shows 3 different pages. Why is that happening? And is there any way to retrieve the original search from your online version?

boogheta commented 6 months ago

Hello @sofiatipa, I can only guess, but over two years it seems reasonable that the websites you crawled have changed quite a bit, which would logically return different results today. You can try using the web archives to retrieve the corpus as it was back then (by activating the option in the Settings tab of an empty corpus), but archives are not always complete, so there's no guarantee.

sofiatipa commented 6 months ago

Hi Benjamin,

Thanks for your swift response. I did something similar: I used web archives through a site called web.archive.org, because the site I want to crawl cannot be crawled from Hyphe any longer (I don't know why). My guess is that this is the problem: something changed in the webpage I am trying to crawl that made it "un-crawlable" through Hyphe.

The problematic page is https://www.geopolitika.ru. I am trying to find the co-linkages with the page https://www.paulcraigroberts.org - it is a very simple search, even for me.

The search through web.archive.org does not give the original results, although it is from the same year and month as the original search (as you say, probably because archives are not always complete). This is turning into a big problem for my team: we are about to send our findings for publication, and now we are in trouble. Please help!

I am a basic user of Hyphe, so I don't really know how to activate it from an empty corpus (I can only see the pages I selected for the corpus in Settings). Maybe I could pass you the project name & password to take a look?

All the best,

Sofia

Dr Sofia Tipaldou Assistant Professor Department of Political Science and History Panteion University of Social and Political Sciences

boogheta commented 6 months ago

Hello again,

It looks like the Geopolitika.ru website has quite an aggressive approach towards web crawlers and basically refuses most robots through some (quite smart) methods, which apparently also block web.archive.org from archiving it (see for instance https://web.archive.org/web/20200417113623/https://www.geopolitika.ru/).

There is no way to make Hyphe work with this website as of today unfortunately.

You can, however, go back far enough in time, to before they put those measures in place: just explore the web archives until you find a functional version and ask Hyphe to crawl at that date. You can do so by entering the URL of the web archive directly into the IMPORT box of Hyphe.

For instance, I got a crawl working with more than 70 pages visited in 2018 by using this URL as the start point: https://web.archive.org/web/20180212120000/https://www.geopolitika.ru
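
If you want to try several dates, a start URL of this kind can be built by hand or with a few lines of Python. This is only a minimal sketch: it assumes the usual web.archive.org URL pattern (a 14-digit `YYYYMMDDhhmmss` timestamp followed by the original address), and the `wayback_url` helper name is hypothetical, not part of Hyphe. It does not check that a snapshot actually exists for that date.

```python
from datetime import datetime

# Assumed URL pattern of the Wayback Machine:
#   https://web.archive.org/web/<YYYYMMDDhhmmss>/<original URL>
WAYBACK_PREFIX = "https://web.archive.org/web"


def wayback_url(target_url: str, when: datetime) -> str:
    """Build an archive URL for target_url at the given date and time.

    Hypothetical helper for illustration; it only formats the string
    and does not contact web.archive.org.
    """
    timestamp = when.strftime("%Y%m%d%H%M%S")  # 14-digit timestamp
    return f"{WAYBACK_PREFIX}/{timestamp}/{target_url}"


# Rebuild the start point mentioned above (12 February 2018, noon).
start_url = wayback_url("https://www.geopolitika.ru", datetime(2018, 2, 12, 12, 0, 0))
print(start_url)
```

The printed URL can then be pasted into the IMPORT box. Opening it in a browser first is a quick way to see whether the archived version at that date is functional.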