medialab / hyphe

Websites crawler with built-in exploration and control web interface
http://hyphe.medialab.sciences-po.fr/demo/
GNU Affero General Public License v3.0

Possible to duplicate/copy corpus #427

Open adamveng opened 2 years ago

adamveng commented 2 years ago

This is for the purpose of experimenting with taking a "turn" in corpus curation: is it possible to make a copy of an existing corpus and then continue working from that copy? This would allow researchers to follow tangents and add new features to the curation. For example, I can now trace how different actors link to each other, but I would ALSO like to incorporate the news articles that they link to. Making a copy would allow me to experiment without the fear of making my existing corpus too messy by including a lot of new entities (e.g. the news articles). I don't know if this is at all possible without exporting the CSV and then re-crawling all the imported URLs?

Hope it makes some kind of sense?

boogheta commented 2 years ago

Hi Adam,

It makes complete sense, and is something we really would want to be able to do... but...

Currently there is no easy way to do this and it can only be done manually.

If you control the server where Hyphe is running, the simplest solution is probably to make an identical copy by messing a bit with the databases:

This should do the trick. A script was written a long time ago to do this here, but it hasn't been maintained or used in a while, so I'd recommend running it manually, step by step.

Another solution, if this is not your own server, is to do it programmatically using the exports and the API: first collect exports of all webentities of the corpus as well as all crawls (EXPORT and CRAWL/All crawl jobs pages), create a new corpus with the same settings, then write a small program that calls Hyphe's API to declare within the new corpus the definitions of all webentities from the original corpus, and then runs all the same crawls (a simple shell client to the API is available in hyphe_backend/test_client.py and the full documentation of the API is there). Of course this won't ensure perfect reproducibility, since it will recrawl the webpages at a different time.
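To give an idea of what such a small program could look like, here is a minimal, untested-against-a-server sketch in Python. It only turns a webentities CSV export into JSON-RPC style request payloads and does not send anything. Several things in it are assumptions to check against the API documentation and your own export before use: the column names `NAME`, `PREFIXES` and `STATUS`, the space used to separate prefixes, the method name in `DECLARE_METHOD`, and the order of the parameters. The sample row is made up.

```python
import csv
import io
import json

# ASSUMPTION: placeholder method name, verify it in Hyphe's API documentation.
DECLARE_METHOD = "store.declare_webentity_by_lrus"


def read_export(csv_text, prefix_sep=" "):
    """Parse a webentities CSV export into a list of plain dicts.

    ASSUMPTION: the export has NAME, PREFIXES and STATUS columns and
    separates the prefixes of one webentity with `prefix_sep`.
    """
    entities = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        entities.append({
            "name": row["NAME"],
            "prefixes": [p for p in row["PREFIXES"].split(prefix_sep) if p],
            "status": row["STATUS"],
        })
    return entities


def jsonrpc_payload(method, params, request_id):
    """Build one JSON-RPC style request body (positional parameters)."""
    return {"method": method, "params": params, "id": request_id}


def build_declarations(entities, corpus):
    """Build one declaration request per webentity for the target corpus.

    ASSUMPTION: the parameter order (prefixes, name, status, corpus) is
    illustrative only and must be adapted to the documented signature.
    """
    return [
        jsonrpc_payload(
            DECLARE_METHOD,
            [entity["prefixes"], entity["name"], entity["status"], corpus],
            request_id,
        )
        for request_id, entity in enumerate(entities, start=1)
    ]


if __name__ == "__main__":
    # Made-up sample standing in for a real export file.
    sample = (
        "NAME,PREFIXES,STATUS\n"
        "Example,s:http|h:com|h:example| s:https|h:com|h:example|,IN\n"
    )
    for request in build_declarations(read_export(sample), "my-corpus-copy"):
        print(json.dumps(request))
```

Each printed payload would then be POSTed to the API endpoint of your Hyphe instance. Re-running the crawls would follow the same pattern, using the crawl jobs export and the webentity IDs that the new corpus returns for the declarations.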

This is a feature we want to add to the interface, but it is quite complex and we have not yet found the time to do it.