nickdavidhaynes / spacy-cld

Language detection extension for spaCy 2.0+
MIT License

Do I have to load EN model? #3

Closed milsanore closed 6 years ago

milsanore commented 6 years ago

Hi, first of all thanks, this seems like a super-useful library. Do I have to load the EN model to make it work? Can I load something more lightweight?

The way I envisage using this library is something like the following pseudocode:

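en_bios = []
de_bios = []

# Assumes nlp_lang has the language detector in its pipeline and that
# arr_bio_tuples holds (text, context) pairs, hence as_tuples=True.
for doc, _context in nlp_lang.pipe(arr_bio_tuples, as_tuples=True):
    # doc._.languages is a list of language codes, e.g. ['en']
    if 'en' in doc._.languages:
        en_bios.append(doc.text)
    elif 'de' in doc._.languages:
        de_bios.append(doc.text)

for doc in nlp_en.pipe(en_bios):
    pass  # DO ENGLISH STUFF

for doc in nlp_de.pipe(de_bios):
    pass  # DO GERMAN STUFF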

If I have to load EN to get language detection to work, then I will be running EN twice. Just trying to avoid that.

Thanks

nickdavidhaynes commented 6 years ago

Hey @milsanore -

To answer your question, this module absolutely should work with any of the spaCy models (including the much-compressed en_core_web_sm English model), though I haven't explicitly tested those (a PR that added more unit tests would be awesome!).

But it sounds more like you're trying to avoid parsing your documents more than once - I should explicitly say that you don't need to load the models every time you parse a document. For example, the following code is totally valid, and only loads the en model into memory once:

import spacy
nlp = spacy.load('en')
doc1 = nlp('This is some text.')
doc2 = nlp('This is some more text.')
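If several language-specific models are in play, the same load-once idea could be sketched as a small cache. This is only an illustration: `PipelineCache` and its `loader` argument are hypothetical names, not part of spaCy or this module. In real use `loader` would presumably be `spacy.load`.

```python
class PipelineCache:
    """Load each pipeline at most once and reuse it afterwards."""

    def __init__(self, loader):
        # loader: any callable mapping a model name to a loaded pipeline
        # (assumed to be spacy.load in real use).
        self._loader = loader
        self._pipelines = {}

    def get(self, name):
        # Only call the (expensive) loader the first time a name is seen.
        if name not in self._pipelines:
            self._pipelines[name] = self._loader(name)
        return self._pipelines[name]
```

With something like `cache = PipelineCache(spacy.load)`, repeated calls to `cache.get('en')` would hand back the same object instead of reloading the model.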

Unfortunately, it's not possible with this module to do the language detection outside of a specific model's pipeline. So you're stuck with either parsing every document twice (once to detect the language and another time with the correct model to do whatever else you need) or detecting the language from a sample of each doc. Getting fancier with language detection would require changes to spaCy core (which was actually talked about in the issue that inspired this component).
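
The "sample of each doc" option might look roughly like the sketch below. `bucket_by_language`, `detect_languages` and `sample_chars` are made-up names for illustration. `detect_languages` is assumed to wrap a pipeline that has the language detector added (for example `lambda s: nlp(s)._.languages`) and to return codes with the most likely language first.

```python
from collections import defaultdict


def bucket_by_language(texts, detect_languages, sample_chars=500):
    """Group texts by the first language detected in a leading sample.

    detect_languages: callable taking a string and returning a list of
    language codes (hypothetical wrapper around the detector pipeline).
    """
    buckets = defaultdict(list)
    for text in texts:
        # Only the first sample_chars characters go to the detector, so
        # the full text is parsed once, later, by the matching model.
        languages = detect_languages(text[:sample_chars])
        key = languages[0] if languages else 'unknown'
        buckets[key].append(text)
    return dict(buckets)
```

Each bucket can then be passed to the pipe of the matching model. How large a sample is needed for reliable detection is an open question here, and 500 characters is an arbitrary default.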