step21 opened 7 years ago
I would like to take this up, but I might need some help breaking it down into bite-sized tasks.
Could you update the issue with some pointers to the sites used for data scraping and the kind of cleaning that went into extracting the information?
Updated the issue. Hope this helps. Let me know if you need more info.
Intro: Originally, the harvesting was done manually, and the results were then pruned/cleaned by hand to remove texts/translations with little or no content. Ideally, this would be rewritten so that it runs automatically, ingesting and cleaning new content on its own. This should probably be broken down into multiple parts.
Regarding the data: An explanation of the source data can be found at http://www.steinheim-institut.de/cgi-bin/epidat?info=howtoharvest (the data endpoint is also in the gist below). The data was originally parsed into one fixed list, 'locs', of all locations; this list was used to generate all the URLs, and each URL was then fetched and parsed into a SQLite db. This code was only partially automated and, if I remember correctly, run somewhat manually.

The data itself is quite structured (as XML), so not much cleaning was necessary. The main cleaning that was needed was done by hand: some records contain no text (or very little) because the tombstone they are based on was, for example, too weathered or broken, or only mentioned names and dates. Those records were removed, as otherwise the display would not have all the necessary elements (see www.poeticrelief.org for the live site if you have not seen it yet). I am not sure yet how to automate this, as these text fields often contained some data, just a few letters or so, which is not 'enough', and I am not sure where the cutoff should be. In any case, initially all records should be imported. In the past I may also have left out some fields or parsed them only incompletely, e.g. graphics: I think there are sometimes multiple graphics per record, but I may have imported only the first. You can find the current database here: https://www.dropbox.com/s/llq2cszh0kok0c7/teidb_dev.sqlite?dl=0
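To make the fetch-and-filter step above concrete, here is a minimal sketch. The query parameter name, the TEI paths, and the `MIN_TEXT_LEN` cutoff are all assumptions for illustration, not the real epidat API; the actual parameters are documented on the `?info=howtoharvest` page and in the gist.

```python
# Hypothetical sketch only: the query parameter ("id"), the TEI path,
# and MIN_TEXT_LEN are assumptions to illustrate the shape of the step.
import xml.etree.ElementTree as ET

import requests

BASE_URL = "http://www.steinheim-institut.de/cgi-bin/epidat"
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}
MIN_TEXT_LEN = 40  # assumed cutoff for "enough" text; needs tuning on real records


def fetch_record_xml(locid: str) -> ET.Element:
    """Fetch one record as TEI XML (the query string is a placeholder)."""
    resp = requests.get(BASE_URL, params={"id": locid}, timeout=30)
    resp.raise_for_status()
    return ET.fromstring(resp.content)


def has_enough_text(root: ET.Element) -> bool:
    """Crude cleaning heuristic: total text inside <tei:text> above a threshold.

    Since initially all records should be imported, this would flag records
    for the display rather than drop them at harvest time.
    """
    text = " ".join("".join(el.itertext())
                    for el in root.iterfind(".//tei:text", TEI_NS))
    return len(text.strip()) >= MIN_TEXT_LEN
```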
Where to start? You could maybe start by writing a 'harvest' class. To some degree I have already started on this while writing support for mongodb, but that was just experimenting: the idea is to have a 'hav.py' (or another name) file next to run.py that starts collecting data etc. This is not trivial, as it should reuse as much code from the Flask app as possible without necessarily starting the webapp (you do not have to focus on that, but maybe keep it in mind; see the sketch below). At https://gist.github.com/step21/cc7fe9829fa89c4077dd1075d430945c I posted a rough, probably non-functional draft of a 'harvest' class; it was WIP and mixes some old and some new parts, but it should give you an idea, and it also includes the source URL etc. You will also find some XML documents in the folder poerelief/data_harvesting/doc as examples or for testing. Furthermore, the endpoints can also produce JSON, which you could use to look at the data.
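For the "reuse Flask code without starting the webapp" part, one common pattern is to push a Flask application context from a standalone script. A minimal sketch, assuming the package exposes an app factory and a SQLAlchemy `db` object (`create_app` and `db` are assumed names, not necessarily what the repo actually has):

```python
# hav.py -- hypothetical entry point; create_app and db are assumed names.
from poerelief import create_app, db  # adjust imports to the actual package layout


def main():
    app = create_app()
    # An app context makes db.session and the existing models usable
    # without running the HTTP server.
    with app.app_context():
        pass  # harvesting logic goes here


if __name__ == "__main__":
    main()
```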
So the subtasks could roughly be: harvesting (fetching the location list and the per-record XML), parsing, cleaning/filtering, and storage.
For storage: if you save to SQL, the db model should stay similar, I think, though you can extend it if you like. Otherwise you can save to file(s), e.g. one JSON file per document/locid, or to mongodb, where I would also treat one locid/document as one mongodb document. (However, this may be easier once I have written the mongodb support, so maybe start with one of the other options? See the sketch below.)
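If you go with the file-based option first, a sketch of "one JSON file per locid/document" could look like this (the output directory is an assumption):

```python
import json
from pathlib import Path

OUT_DIR = Path("data_harvesting/json")  # assumed output location


def save_record(locid: str, record: dict) -> None:
    """Write one harvested record as <locid>.json."""
    OUT_DIR.mkdir(parents=True, exist_ok=True)
    path = OUT_DIR / f"{locid}.json"
    path.write_text(json.dumps(record, ensure_ascii=False, indent=2),
                    encoding="utf-8")
```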
So I hope this makes it a bit clearer; let me know if you need any other info.