propublica / Capitol-Words

Scraping, parsing and indexing the daily Congressional Record to support phrase search over time, and by legislator and date
BSD 3-Clause "New" or "Revised" License

Memory Error for CREC parser #112

Closed dwillis closed 6 years ago

dwillis commented 6 years ago

For some dates, a memory error occurs when parsing that day's Congressional Record files:

Traceback (most recent call last):
  File "/mnt/capitolwords/capitolweb/parser/management/commands/run_crec_parser.py", line 85, in handle
    es_doc = crec.to_es_doc()
  File "/mnt/capitolwords/capitolweb/parser/crec_parser.py", line 383, in to_es_doc
    segments=self.segments,
  File "/usr/local/lib/python3.5/dist-packages/django/utils/functional.py", line 35, in __get__
    res = instance.__dict__[self.name] = self.func(instance)
  File "/mnt/capitolwords/capitolweb/parser/crec_parser.py", line 324, in segments
    sents = (sent.string for sent in self.textacy_text.spacy_doc.sents)
  File "/usr/local/lib/python3.5/dist-packages/django/utils/functional.py", line 35, in __get__
    res = instance.__dict__[self.name] = self.func(instance)
  File "/mnt/capitolwords/capitolweb/parser/crec_parser.py", line 231, in textacy_text
    return textacy.Doc(SPACY_NLP(text))
  File "/usr/local/lib/python3.5/dist-packages/spacy/language.py", line 341, in __call__
    doc = proc(doc)
  File "nn_parser.pyx", line 337, in spacy.syntax.nn_parser.Parser.__call__
  File "nn_parser.pyx", line 400, in spacy.syntax.nn_parser.Parser.parse_batch
  File "nn_parser.pyx", line 725, in spacy.syntax.nn_parser.Parser.get_batch_model
  File "nn_parser.pyx", line 84, in spacy.syntax.nn_parser.precompute_hiddens.__init__
  File "/usr/local/lib/python3.5/dist-packages/spacy/_ml.py", line 148, in begin_update
    self.W.reshape((self.nF*self.nO*self.nP, self.nI)).T)
MemoryError

will-horning commented 6 years ago

Can you post the command-line arguments this failed on, or any dates for which the bug occurs?

dwillis commented 6 years ago

Sure thing. This fails for a handful of dates so far. Among them: 2016-09-13 and 2016-09-12. The command:

python3 manage.py run_crec_parser --start_date=2016-09-11 --end_date=2016-09-13

will-horning commented 6 years ago

@dwillis

I wasn't able to reproduce this on my laptop, but it has 16 GB of memory, so it's possible that the days that trigger this error just have a larger-than-normal amount of text to process. So I would first try running this on a machine with more RAM, if you haven't already done so. Alternatively, you can try running it with an older version of spaCy (pip install "spacy<2.0"), as this may be related to an issue in the newer version (nothing we're doing in the Capitol Words code requires any newer features).
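
If the failing days really are just unusually large, one possible workaround (not something tried in this thread) is to split each day's text into smaller pieces and parse them one at a time. A minimal sketch, assuming the text separates paragraphs with blank lines; chunk_text and max_chars are hypothetical names, not part of the Capitol Words code:

```python
def chunk_text(text, max_chars=100000):
    """Yield pieces of `text`, breaking only on blank lines.

    Each piece stays at or under `max_chars` characters unless a single
    paragraph is itself longer, in which case it is yielded on its own.
    """
    chunk = []  # paragraphs collected for the current piece
    size = 0    # length of the current piece once joined
    for para in text.split("\n\n"):
        # The separator is restored on join, so count it for every
        # paragraph after the first one in a piece.
        added = len(para) + (2 if chunk else 0)
        if chunk and size + added > max_chars:
            yield "\n\n".join(chunk)
            chunk, size = [], 0
            added = len(para)
        chunk.append(para)
        size += added
    if chunk:
        yield "\n\n".join(chunk)


if __name__ == "__main__":
    sample = "First paragraph.\n\nSecond paragraph.\n\nThird paragraph."
    for piece in chunk_text(sample, max_chars=40):
        print(repr(piece))
```

Each piece could then be passed to SPACY_NLP separately instead of the whole day's text. Parsing pieces independently may give slightly different results than parsing the full document in one call, so this trades some fidelity for lower peak memory.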

dwillis commented 6 years ago

@will-horning Ok, thanks! I'll try both of those options.