Open cncoleman opened 2 years ago
Note from meeting discussions: since this will likely require a fair bit of tedious manual work, we should split it up systematically into manageable chunks. A Google spreadsheet strikes me as the easiest option for shared editing? Checkboxes in GitHub comments don't handle concurrent editing by multiple people well.
Do we want to wait on this until we have more detail from Dave Maas?
To keep track of the versions of the documents, we should first accession them with the relevant metadata. Since we will want to provide search results and access at the level of the individual file, I think each document needs to have its own druid.
It looks like Datashare will help us tremendously in evaluating the scraped documents. @jmartin-sul @gbasel @cncoleman will meet to review via Datashare and come up with a strategy.
Where did we over-collect and under-collect with the crawl?
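One rough way to start answering this, assuming we can export both a list of the URLs we expected to capture and a list of the URLs the crawl actually captured (neither list exists yet as far as this thread shows, and the function name and example URLs below are hypothetical), is a simple set comparison:

```python
def compare_crawl(expected_urls, crawled_urls):
    """Compare two URL lists and report the differences.

    Returns (over_collected, under_collected) as sorted lists.
    Both inputs are hypothetical exports; the only normalization
    applied is stripping whitespace and trailing slashes.
    """
    expected = {u.strip().rstrip("/") for u in expected_urls if u.strip()}
    crawled = {u.strip().rstrip("/") for u in crawled_urls if u.strip()}

    # In the crawl but not on the expected list: candidates for over-collection.
    over_collected = sorted(crawled - expected)
    # On the expected list but missing from the crawl: candidates for under-collection.
    under_collected = sorted(expected - crawled)
    return over_collected, under_collected


if __name__ == "__main__":
    # Placeholder URLs for illustration only.
    expected = ["https://example.org/a", "https://example.org/b/"]
    crawled = ["https://example.org/b", "https://example.org/c"]
    over, under = compare_crawl(expected, crawled)
    print("over-collected:", over)
    print("under-collected:", under)
```

This only flags candidates; deciding whether a given document really belongs would still need the manual review via Datashare mentioned above.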