milsanore closed this issue 6 years ago
Hey @milsanore -
To answer your question, this module absolutely should work with any of the spaCy models (including the much-compressed en_core_web_sm English model), though I haven't explicitly tested those (a PR that added more unit tests would be awesome!).
But it sounds more like you're trying to avoid parsing your documents more than once. To be explicit: you don't need to load the models every time you parse a document. For example, the following code is totally valid, and only loads the en model into memory once:
import spacy
nlp = spacy.load('en')
doc1 = nlp('This is some text.')
doc2 = nlp('This is some more text.')
Unfortunately, it's not possible with this module to do the language detection outside of a specific model's pipeline. So you're stuck with either parsing every document twice (once to detect the language and a second time with the correct model to do whatever else you need) or detecting the language from a sample of each doc. Getting fancier with language detection would require changes to spaCy core (which was actually talked about in the issue that inspired this component).
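For what it's worth, here is a rough sketch of the "detect from a sample" option. It is not part of this module: `LanguageRouter`, `detect`, `load_pipeline`, `model_names` and `sample_chars` are all hypothetical names made up for illustration. The idea is just to run detection on the first few hundred characters, then hand the full text to a per-language pipeline that gets loaded once and cached.

```python
from typing import Any, Callable, Dict


class LanguageRouter:
    """Detect the language from a short sample, then parse the full text
    with a per-language pipeline that is loaded at most once.

    All names here are hypothetical; nothing in this class is spaCy API.
    """

    def __init__(
        self,
        detect: Callable[[str], str],          # assumption: text -> language code, e.g. "en"
        load_pipeline: Callable[[str], Any],   # assumption: model name -> callable pipeline
        model_names: Dict[str, str],           # language code -> model name
        default_lang: str = "en",
        sample_chars: int = 300,
    ) -> None:
        self._detect = detect
        self._load = load_pipeline
        self._model_names = model_names
        self._default_lang = default_lang
        self._sample_chars = sample_chars
        self._cache: Dict[str, Any] = {}

    def pipeline_for(self, lang: str) -> Any:
        # Fall back to the default language when no model is registered.
        if lang not in self._model_names:
            lang = self._default_lang
        name = self._model_names[lang]
        # Load each model only the first time it is requested.
        if name not in self._cache:
            self._cache[name] = self._load(name)
        return self._cache[name]

    def parse(self, text: str) -> Any:
        # Only the leading sample goes through language detection.
        lang = self._detect(text[: self._sample_chars])
        # The full text is parsed once, with the matching pipeline.
        return self.pipeline_for(lang)(text)
```

With spaCy you would presumably pass `spacy.load` as `load_pipeline`, and `detect` could wrap a small pipeline that has the language detector attached and returns the detected code. That wiring is an assumption on my part and untested. The trade-off is that detection on a short sample can be wrong for mixed-language documents, so pick `sample_chars` with your data in mind.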
Hi, first of all thanks, this seems like a super-useful library. Do I have to load the EN model to make it work? Can I load something more lightweight?
The way I envisage using this library is something like the following pseudocode:
If I have to load EN to get language detection to work, then I will be running EN twice. Just trying to avoid that.
Thanks