Possible to duplicate/copy corpus

This is for the purpose of experimenting with taking a "turn" in corpus curation: is it possible to make a copy of an existing corpus and then continue working from that copy? This would allow researchers to follow tangents and add new features to the curation. For example, I can now trace how different actors link to each other, but I would ALSO like to incorporate the news articles that they link to. Making a copy would allow me to experiment without the fear of making my existing corpus too messy by including a lot of new entities (e.g. the news articles). I don't know if this is at all possible without exporting the CSV and then re-crawling all the imported URLs?

Hope it makes some kind of sense?

Hi Adam,

It makes complete sense, and is something we really would want to be able to do... but...

Currently there is no easy way to do this and it can only be done manually.

If you control the server where Hyphe is running, the simplest solution is probably to make an identical copy by messing a bit with the databases:

stop Hyphe
duplicate the corpus entry within the corpus collection of MongoDB's hyphe database and just change the id in the copy from projectid to some cloneprojectid
duplicate the whole MongoDB database of the corpus (named hyphe-projectid) into another one named after the other id, such as hyphe-cloneprojectid
within the traph directory, copy the projectid directory into a new one named cloneprojectid
restart Hyphe
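
The three copy steps above can be sketched in Python as follows. This is only a minimal sketch, not the old script mentioned in this thread. It assumes that the corpus id is stored in the `_id` field of the corpus entry, that the databases are named exactly as written above, and that `client` behaves like a pymongo `MongoClient`. The function names are made up for the illustration. Documents are copied in memory and indexes are not recreated, so check the result before relying on it.

```python
import copy
import shutil
from pathlib import Path


def clone_corpus_entry(entry, clone_id, id_field="_id"):
    """Return a deep copy of a corpus document with only its id changed.

    Assumption: the corpus id lives in the `_id` field.
    """
    clone = copy.deepcopy(entry)
    clone[id_field] = clone_id
    return clone


def copy_database(client, source_name, target_name):
    """Copy the documents of every collection from one database to another.

    `client` is expected to behave like a pymongo MongoClient.
    Indexes are not copied.
    """
    source, target = client[source_name], client[target_name]
    for name in source.list_collection_names():
        docs = list(source[name].find())
        if docs:  # insert_many rejects an empty list
            target[name].insert_many(docs)


def copy_traph_dir(traph_root, project_id, clone_id):
    """Copy <traph_root>/<project_id> to <traph_root>/<clone_id>."""
    src = Path(traph_root) / project_id
    dst = Path(traph_root) / clone_id
    shutil.copytree(src, dst)  # raises if the destination already exists
    return dst


def duplicate_corpus(client, traph_root, project_id, clone_id):
    """Run the three copy steps. Hyphe must be stopped beforehand."""
    corpora = client["hyphe"]["corpus"]
    entry = corpora.find_one({"_id": project_id})
    if entry is None:
        raise KeyError(f"no corpus entry with id {project_id!r}")
    corpora.insert_one(clone_corpus_entry(entry, clone_id))
    copy_database(client, f"hyphe-{project_id}", f"hyphe-{clone_id}")
    copy_traph_dir(traph_root, project_id, clone_id)
```

With Hyphe stopped, this could be called as `duplicate_corpus(MongoClient("localhost", 27017), "/path/to/traph", "projectid", "cloneprojectid")`, where the host, port and traph path are placeholders for your own setup.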

This should do the trick. A script was written a long time ago to do this here, but it hasn't been maintained or used in a while, so I'd recommend running it manually, step by step.

Another solution, if this is not your own server, would be to do it programmatically, using the exports and the API: first collect exports of all webentities of the corpus as well as all crawls (EXPORT & CRAWL/All crawl jobs pages), then create a new corpus with the same settings, and finally write a small programme that calls Hyphe's API to declare within the new corpus the definitions of all webentities from the original corpus and then runs all the same crawls (a simple shell client to the API is available in hyphe_backend/test_client.py and the full documentation of the API is there). Of course, this won't ensure perfect reproducibility, since it will recrawl the webpages at a different time.
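
As a rough illustration of the "small programme" idea, the sketch below turns an exported webentities CSV into a list of JSON-RPC request bodies, one per webentity, without sending anything. Everything specific in it is an assumption to be checked against the API documentation and a real export: the method name, the parameter names (`prefixes`, `name`, `corpus`), the CSV column names, and the space-separated prefix format are all hypothetical.

```python
import csv
import io

# Assumed method name: verify it against Hyphe's API documentation.
DECLARE_METHOD = "store.declare_webentity_by_lrus"


def jsonrpc_request(method, params, request_id):
    """Build a JSON-RPC 2.0 request body as a plain dict."""
    return {"jsonrpc": "2.0", "method": method,
            "params": params, "id": request_id}


def plan_declarations(csv_text, corpus,
                      name_col="NAME", prefixes_col="PREFIXES"):
    """Turn each exported webentity row into one declaration request.

    Column names and the space-separated prefix format are assumptions.
    """
    requests = []
    reader = csv.DictReader(io.StringIO(csv_text))
    for request_id, row in enumerate(reader, start=1):
        params = {
            "prefixes": row[prefixes_col].split(),  # hypothetical parameter
            "name": row[name_col],                  # hypothetical parameter
            "corpus": corpus,                       # hypothetical parameter
        }
        requests.append(jsonrpc_request(DECLARE_METHOD, params, request_id))
    return requests
```

Each resulting dict could then be serialised with `json.dumps` and posted to the API endpoint of your Hyphe instance. The crawls listed in the crawl jobs export could be replayed the same way once the webentities exist in the new corpus.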

This is a feature we want to add to the interface, but it is quite complex and we have not found the time for it yet.

medialab / hyphe

Possible to duplicate/copy corpus #427