stephbuon / digital-history

Instructional repository for "Text Mining as Historical Method"
GNU General Public License v3.0

Save a spaCy doc object of the Hansard corpus to disk #43

Closed stephbuon closed 3 years ago

stephbuon commented 3 years ago

Instead of students waiting 10 years for spaCy to parse the Hansard corpus, it would be great if they could just load an already parsed spaCy doc object.

Here are some instructions for saving and loading spaCy doc objects.

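As a rough illustration of what those instructions cover, here is a minimal save/load sketch assuming spaCy's DocBin API (shipped from v2.2 on); the file name and example texts below are placeholders, not the actual Hansard paths:

import spacy
from spacy.tokens import DocBin

nlp = spacy.load("en_core_web_sm")

# Serialize parsed docs, keeping lemmas, POS tags, and the dependency parse.
doc_bin = DocBin(attrs=["LEMMA", "POS", "DEP", "HEAD"], store_user_data=True)
for doc in nlp.pipe(["Example speech one.", "Example speech two."]):
    doc_bin.add(doc)

# Placeholder file name; in practice this would live on scratch.
with open("hansard_doc_object.spacy", "wb") as f:
    f.write(doc_bin.to_bytes())

# Later: load the docs back without re-parsing.
with open("hansard_doc_object.spacy", "rb") as f:
    docs = list(DocBin().from_bytes(f.read()).get_docs(nlp.vocab))
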
@alexanderr - can you please save a parsed spaCy doc object of /scratch/group/history/hist_3368-jguldi/hansard_1970_79.csv as /scratch/group/history/hist_3368-jguldi/hansard_1870_9_doc_object?

You should be able to parse the Hansard corpus in parallel like this:

import multiprocessing as mp
import spacy
import pandas as pd

nlp = spacy.load("en_core_web_sm")

hansard = pd.read_csv('/scratch/group/history/hist_3368-jguldi/hansard_1970_79.csv')

def spacy_nlp_pipe(texts):
    tokens = []
    lemma = []
    pos = []

    for doc in nlp.pipe(texts, batch_size=1000):
        if doc.is_parsed:  # spaCy v2 attribute (removed in v3)
            tokens.append([n.text for n in doc])
            lemma.append([n.lemma_ for n in doc])
            pos.append([n.pos_ for n in doc])
        else:
            tokens.append(None)
            lemma.append(None)
            pos.append(None)

    return [tokens, lemma, pos]

# Split the speeches into one chunk per worker so that each process
# runs nlp.pipe() over a list of texts rather than a single string.
texts = hansard['text'].astype(str).to_list()
n_workers = 36
chunk_size = -(-len(texts) // n_workers)  # ceiling division
chunks = [texts[i:i + chunk_size] for i in range(0, len(texts), chunk_size)]

pool = mp.Pool(processes=n_workers)
results = pool.map(spacy_nlp_pipe, chunks)
pool.close()
pool.join()

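To put the per-chunk output back together, one option (just an illustration built on the chunked pool.map call above; the new column names and output path are placeholders) is:

# Flatten the per-chunk results back into corpus order
# (pool.map returns chunks in the order they were submitted).
all_tokens, all_lemmas, all_pos = [], [], []
for chunk_tokens, chunk_lemmas, chunk_pos in results:
    all_tokens.extend(chunk_tokens)
    all_lemmas.extend(chunk_lemmas)
    all_pos.extend(chunk_pos)

hansard['tokens'] = all_tokens
hansard['lemmas'] = all_lemmas
hansard['pos'] = all_pos

# Placeholder output path.
hansard.to_pickle('hansard_parsed.pkl')
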
Then save the parsed doc object using the instructions linked above.

Please add your code to digital-history/utilities

stephbuon commented 3 years ago

M2 has spaCy v2, not spaCy v3. As a result, saving labels like "dep" to disk is proving difficult and a real time sink, since we don't even know whether it is possible.

To move forward during a time crunch, we have decided to change the data set so spaCy can just parse in real time.

When we update to spaCy v3, the code will be waiting for us in digital-history/utilities.