statsmaths / cleanNLP

R package providing annotators and a normalized data model for natural language processing
GNU Lesser General Public License v2.1

Memory limit error #71

Closed dhicks closed 2 years ago

dhicks commented 4 years ago

I'm using cleanNLP with the spaCy backend to process a set of about 13k documents. Most of the documents are short, but some are quite long. I received this error:

Error in py_call_impl(callable, dots$args, dots$keywords) : 
  ValueError: [E088] Text of length 1142787 exceeds maximum of 1000000. The v2.x parser and NER models require roughly 1GB of temporary memory per 100,000 characters in the input. This means long texts may cause memory allocation errors. If you're not using the parser or NER, it's probably safe to increase the `nlp.max_length` limit. The limit is in number of characters, so you can check whether your inputs are too long by checking `len(text)`. 

It looks like I need to raise spaCy's `nlp.max_length` a bit to accommodate the very long texts. But I don't see any way to do that: the cleanNLP and reticulate functions I'm using don't seem to take arguments that get passed on to the backend. Any suggestions would be appreciated.
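
In the meantime, the workaround I'm considering is the one the error message hints at: check `nchar()` on each text and split anything over the 1,000,000-character limit into smaller pseudo-documents before annotation. A rough sketch (assuming the input is a data frame with `doc_id` and `text` columns, as `cnlp_annotate()` accepts; the 900,000-character chunk size is just an arbitrary value under the default limit):

```r
library(stringi)

## how long is each document? anything over 1,000,000 characters triggers E088
n_chars <- nchar(docs$text)
which(n_chars > 1000000L)

## split a single text into pieces of at most `chunk_size` characters
chunk_text <- function(text, chunk_size = 900000L) {
  starts <- seq(1L, nchar(text), by = chunk_size)
  stri_sub(text, from = starts, length = chunk_size)
}

## rebuild the input table, turning each over-long document into several rows
pieces <- lapply(seq_len(nrow(docs)), function(i) {
  chunks <- chunk_text(docs$text[i])
  data.frame(
    doc_id = if (length(chunks) == 1L) docs$doc_id[i]
             else paste(docs$doc_id[i], seq_along(chunks), sep = "_"),
    text   = chunks,
    stringsAsFactors = FALSE
  )
})
docs_chunked <- do.call(rbind, pieces)

anno <- cleanNLP::cnlp_annotate(docs_chunked)
```

Splitting at arbitrary character positions can cut a sentence in two, so breaking on paragraph boundaries instead would probably be gentler on the annotations. The cleaner fix would still be some way to raise `nlp.max_length` on the spaCy model that cleanNLP initializes.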