medialab / hyphe

Websites crawler with built-in exploration and control web interface
http://hyphe.medialab.sciences-po.fr/demo/
GNU Affero General Public License v3.0

different results for same search, two years later #496

Open sofiatipa opened 11 months ago

sofiatipa commented 11 months ago

Hi,

I repeated a search I did nearly 2 years ago through Hyphe, I am trying to find the co-linkages between two webentities, but the results are quite different. The original search came up with 6 pages that were used by both sites, while the new search shows 3 different pages. Why is that happening? And, is there any way to retrieve the original search from your online version?

boogheta commented 11 months ago

Hello @sofiatipa, I can only guess, but over two years it sounds reasonable that the websites you crawled have changed quite a bit, which would logically return different results today. You can try using the web archives to retrieve the corpus as it was back then (by activating them from an empty corpus in the Settings tab), but archives are not always complete, so there's no guarantee.

sofiatipa commented 11 months ago

Hi Benjamin,

thanks for your swift response. I did something similar: I used web archives through web.archive.org, because the site I want to crawl can no longer be crawled from Hyphe (I don't know why). My hunch is that this is the problem, that something changed in the webpage I am trying to crawl that made it "un-crawlable" through Hyphe.

The problematic page is https://www.geopolitika.ru. I am trying to find the co-linkages with the page https://www.paulcraigroberts.org - it is a very simple search, even for me.

The search through web.archive.org does not give the original results, although it is from the same year and month as the original search (as you say, probably because archives are not always complete). This is turning into a big problem for my team, as we are about to send our findings for publication, but now we are in trouble, please help!


I am a basic user of hyphe, so I don’t really know how to activate from an empty corpus (I can only see the pages I selected for the corpus in settings). Maybe I could pass you the project name & password to take a look?

All the best,

Sofia

Dr Sofia Tipaldou Assistant Professor Department of Political Science and History Panteion University of Social and Political Sciences

boogheta commented 11 months ago

Hello again,

It looks like the Geopolitika.ru website takes quite an aggressive approach towards web crawlers and basically refuses most robots through some (quite smart) methods, which apparently also block Web.Archive.org from archiving it (see for instance here https://web.archive.org/web/20200417113623/https://www.geopolitika.ru/).

There is no way to make Hyphe work with this website as of today unfortunately.

You can, however, go back far enough in time, before they put those measures in place: just explore the web archives until you find a functional version and ask Hyphe to crawl at that date. You can do so by inputting the URL of the web archive snapshot directly into the IMPORT box of Hyphe.

For instance I got a crawl working with more than 70 pages visited in 2018 by using this url as startpoint: https://web.archive.org/web/20180212120000/https://www.geopolitika.ru
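The startpoint URL above follows the Wayback Machine's snapshot scheme, where a `YYYYMMDDhhmmss` timestamp is inserted between `https://web.archive.org/web/` and the archived site's URL. As a minimal illustrative sketch (the helper name is my own, not part of Hyphe):

```python
from datetime import datetime

def wayback_url(site: str, when: datetime) -> str:
    """Build a Wayback Machine snapshot URL from a site and a date.

    The Wayback Machine redirects to the closest available snapshot
    if no capture exists at this exact timestamp.
    """
    return f"https://web.archive.org/web/{when:%Y%m%d%H%M%S}/{site}"

# Reproduces the startpoint used in this thread:
print(wayback_url("https://www.geopolitika.ru", datetime(2018, 2, 12, 12, 0)))
# https://web.archive.org/web/20180212120000/https://www.geopolitika.ru
```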

sofiatipa commented 1 month ago

Hi Benjamin,

I have a new question to ask: the installed version of Hyphe stopped creating web entities out of some sites it previously crawled (in fact the last crawl was in August). I tried the crawl in the online demo version and it works perfectly. Any ideas why it might be happening with the desktop version?

Also, is it possible that the amount of pages hyphe crawls may vary from one day to another?

Many thanks!

Sofia

boogheta commented 1 month ago

Hello @sofiatipa, it's hard to tell without more information. But there's a priori no reason your desktop version of Hyphe would behave differently than the online demo. Did you try in a new corpus or in a preexisting one?

sofiatipa commented 1 month ago

Hi Benjamin,

I hope you are receiving my answer from my email. I tried it in a new corpus, twice. Today I am trying again to run the crawl in that new corpus, but it is still unable to define the web entities. In the meantime, the online version ran the crawl in just a few minutes.

What further information could I send you?

May I take the chance to ask you something else regarding crawl depth, in light of a publication my team and I are working on: how could I best explain 'crawl depth' to a non-expert audience?

Many thanks!

Sofia

Dr Sofia Tipaldou Assistant Professor in International Relations Department of Political Science and History Panteion University of Social and Political Sciences


boogheta commented 1 month ago

I apologize Sofia, but I don't really understand what you mean by "unable to define the web entities". Could you explain precisely the steps you took and where you get stuck? It might be that your whole local Hyphe instance needs to be restarted; have you tried that?

Regarding the crawl depth, you can present it as the number of links a user would click from the starting page. For instance, if you start from a specific startpage with a depth of 2, the crawler will first visit all pages linked from that page (only those belonging to the same website); these are the pages at depth 1. It will then similarly visit all pages of the website linked from those depth-1 pages; these are the pages at depth 2. Then it stops.
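The depth-limited exploration described above is essentially a breadth-first traversal. This is a minimal sketch of the general technique, not Hyphe's actual implementation; `get_links` is a hypothetical stand-in for fetching a page and extracting its same-site links:

```python
from collections import deque

def crawl(start_page, get_links, max_depth=2):
    """Breadth-first crawl of pages up to max_depth clicks from start_page.

    get_links(page) is a stand-in for fetching a page and returning the
    same-site pages it links to. Returns the set of pages visited.
    """
    seen = {start_page}
    queue = deque([(start_page, 0)])  # pairs of (page, depth)
    while queue:
        page, depth = queue.popleft()
        if depth == max_depth:
            continue  # do not follow links beyond the depth limit
        for link in get_links(page):
            if link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return seen

# Toy link graph: "d" sits 3 clicks from "home", so a depth-2 crawl skips it.
links = {"home": ["a", "b"], "a": ["c"], "b": [], "c": ["d"], "d": []}
print(crawl("home", lambda p: links.get(p, []), max_depth=2))
# {'home', 'a', 'b', 'c'}
```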