Closed: stephbuon closed this 3 years ago
M2 has spaCy v2, not spaCy v3. As a result, saving labels like "dep" to disk is proving difficult and a real time sink, since we don't even know whether it is possible.
To move forward during a time crunch, we have decided to change the data set so spaCy can just parse in real time.
When we update to spaCy v3, the code will be waiting for us in digital-history/utilities.
Instead of students waiting 10 years for spaCy to parse the Hansard corpus, it would be great if they could just load an already parsed spaCy doc object.
Here are some instructions for saving and loading spaCy doc objects.
@alexanderr - can you please save a parsed spaCy doc object of /scratch/group/history/hist_3368-jguldi/hansard_1970_79.csv as /scratch/group/history/hist_3368-jguldi/hansard_1870_9_doc_object?
You should be able to parse the Hansard corpus in parallel like this:
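A minimal sketch of one way to do it, assuming spaCy 2.2.2 or later (where `nlp.pipe` accepts `n_process`), the `en_core_web_sm` model, and a CSV column named `text`. The model name, the column name, and both helper functions are illustrative assumptions, not details taken from the course setup:

```python
import csv


def read_texts(csv_path, text_column="text"):
    """Yield the contents of one CSV column, skipping empty cells.

    `text_column` is a hypothetical column name; change it to match the file.
    """
    with open(csv_path, newline="", encoding="utf-8") as handle:
        for row in csv.DictReader(handle):
            text = row.get(text_column)
            if text:
                yield text


def parse_in_parallel(texts, model="en_core_web_sm", n_process=4, batch_size=500):
    """Return a generator of parsed Doc objects, using several worker processes.

    The model name and the default worker and batch sizes are assumptions.
    """
    # Imported here so that read_texts can be used without spaCy installed.
    import spacy

    nlp = spacy.load(model)
    # nlp.pipe streams Doc objects; the n_process argument needs spaCy >= 2.2.2.
    return nlp.pipe(texts, n_process=n_process, batch_size=batch_size)
```

Calling `parse_in_parallel(read_texts("/scratch/group/history/hist_3368-jguldi/hansard_1970_79.csv"))` would then stream parsed docs one at a time rather than holding the whole corpus in memory before parsing starts.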
Then save the parsed doc object using the instructions linked above.
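For the saving step, one option is spaCy's `DocBin`, sketched here under the assumption that spaCy 2.2 or later is installed. This may differ from the linked instructions. The function names and the default attribute list are hypothetical choices for illustration; the attributes are listed explicitly so that labels such as "dep" are requested for storage rather than left to the version's defaults:

```python
from pathlib import Path

# Token attributes to store; an illustrative selection, not a required set.
DEFAULT_ATTRS = ("LEMMA", "POS", "TAG", "DEP", "HEAD", "ENT_IOB", "ENT_TYPE")


def save_docs(docs, out_path, attrs=DEFAULT_ATTRS):
    """Serialize an iterable of parsed Doc objects into a single file."""
    # Imported here so the module loads even where spaCy is not installed.
    from spacy.tokens import DocBin

    doc_bin = DocBin(attrs=list(attrs), store_user_data=True)
    for doc in docs:
        doc_bin.add(doc)
    # to_bytes() is available in spaCy v2.2+ and v3, so the bytes are
    # written with pathlib rather than relying on a version-specific helper.
    Path(out_path).write_bytes(doc_bin.to_bytes())


def load_docs(in_path, vocab):
    """Load Doc objects saved by save_docs, using the given pipeline's vocab."""
    from spacy.tokens import DocBin

    doc_bin = DocBin().from_bytes(Path(in_path).read_bytes())
    return list(doc_bin.get_docs(vocab))
```

Students could then call `load_docs(path, nlp.vocab)` after loading the same model. For a corpus the size of Hansard it may be more practical to write one file per batch of speeches than a single large file.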
Please add your code to digital-history/utilities.