pablodms / spacy-spanish-lemmatizer

Spanish rule-based lemmatization for spaCy
MIT License

Is very accurate but too slow... #5

Open mmaguero opened 4 years ago

mmaguero commented 4 years ago

Hi @pablodms,

Thanks for this valuable resource :+1:

The lemmatizer is very accurate, but it is very slow compared with the default spaCy lemmatizer. Do you have any idea what causes this behaviour?

Many thanks!

pablodms commented 4 years ago

Hello @mmaguero,

Thanks for the positive feedback,

Currently, the lookup table extracted from the file "exceptions.json" (which is built from Wiktionary) is really big, so I think the lemmatizer spends a lot of time accessing its data in memory. This is just a guess, and I would have to measure times to be sure.

If this assumption were true, the only thing that comes to mind is optimizing the runtime by adding more rules to "rules.json" so a lot of entries in the exception lookup table can be removed.
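
A minimal sketch of that trade-off, assuming simple `[old_suffix, new_suffix]` rules per POS; the actual layouts of "rules.json" and "exceptions.json" in this repo may differ:

```python
# Hypothetical illustration: one suffix rule can replace many exception
# entries, shrinking the lookup table to truly irregular forms only.
RULES = {"verb": [["ando", "ar"], ["iendo", "er"]]}
EXCEPTIONS = {"verb": {"fue": "ser"}}  # only irregular forms remain

def lemmatize(word, pos):
    # Irregular forms are looked up first...
    if word in EXCEPTIONS.get(pos, {}):
        return EXCEPTIONS[pos][word]
    # ...while regular forms are handled by suffix rules, so they never
    # need an entry in the big in-memory table.
    for old, new in RULES.get(pos, []):
        if word.endswith(old):
            return word[: -len(old)] + new
    return word

print(lemmatize("cantando", "verb"))  # -> "cantar" via rule, no table entry
print(lemmatize("fue", "verb"))       # -> "ser" via exception lookup
```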

In the meantime, could you give me an example that you know is slow?

Best regards, Pablo

mmaguero commented 4 years ago

You're welcome @pablodms! Great, and thanks for the explanation. Sure! Here is a sample of the texts (and how I use your valuable resource). They are tweets in Spanish, filtered to content words only (nouns, verbs, adjectives and adverbs):

[Screenshot: execution time over a sample]

The default lemmatizer takes only 7.43% of the time that spacy-spanish-lemmatizer takes...
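
For reference, a rough sketch of how such a comparison can be run. The model name and the `replace_pipe` registration follow the package README for recent spaCy versions and may need adjusting for older setups; the sample texts are stand-ins:

```python
import timeit

import spacy
import spacy_spanish_lemmatizer  # assumed to register the "spanish_lemmatizer" factory

texts = ["Los niños estaban jugando en el parque ayer"] * 100  # stand-in for the tweet sample

nlp_default = spacy.load("es_core_news_sm")
nlp_custom = spacy.load("es_core_news_sm")
nlp_custom.replace_pipe("lemmatizer", "spanish_lemmatizer")

def lemmatize_all(nlp):
    for doc in nlp.pipe(texts):
        _ = [token.lemma_ for token in doc]

t_default = timeit.timeit(lambda: lemmatize_all(nlp_default), number=3)
t_custom = timeit.timeit(lambda: lemmatize_all(nlp_custom), number=3)
print(f"default: {t_default:.2f}s  spanish_lemmatizer: {t_custom:.2f}s "
      f"-> default takes {t_default / t_custom:.1%} of the custom time")
```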

Thanks for your time!

Langbraue commented 2 years ago

Hi, I think it takes some time to load the JSON file into memory. How long does it take when you keep a reference to nlp and execute a second lemmatization? Is it faster then?
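
A quick way to check this, timing the first call against a second call on the same nlp object; whether the table is loaded at pipe construction or on first use depends on the implementation, so this only separates the two if loading is lazy:

```python
import time

import spacy
import spacy_spanish_lemmatizer  # assumed to register the "spanish_lemmatizer" factory

nlp = spacy.load("es_core_news_sm")
nlp.replace_pipe("lemmatizer", "spanish_lemmatizer")

text = "Los estudiantes leyeron muchos libros interesantes"

t0 = time.perf_counter()
nlp(text)  # first call: may include the one-off cost of loading exceptions.json
t1 = time.perf_counter()
nlp(text)  # second call on the same nlp: the table is already in memory
t2 = time.perf_counter()

print(f"first run:  {t1 - t0:.3f}s")
print(f"second run: {t2 - t1:.3f}s")
```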

jzohrab commented 1 year ago

> If this assumption were true, the only thing that comes to mind is optimizing the runtime by adding more rules to "rules.json" so a lot of entries in the exception lookup table can be removed.

One possible solution after briefly scanning the code: the words could be sharded across different data files, and the dictionaries could then be loaded lazily as lookups require them. For example, take the first few characters of md5(word) to derive a shard filename, and on lookup load that file if it is not already in memory. A sketch follows.
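
A minimal sketch of that idea; the filenames, prefix length, and flat word-to-lemma JSON format are illustrative assumptions, not the repo's actual layout (the real exceptions.json is likely nested by POS and would need adapting):

```python
import hashlib
import json
import os

def shard_exceptions(src="exceptions.json", out_dir="shards", prefix_len=2):
    """One-off preprocessing: split the monolithic lookup table into shards.

    Assumes a flat {word: lemma} JSON for illustration.
    """
    with open(src, encoding="utf-8") as f:
        table = json.load(f)
    shards = {}
    for word, lemma in table.items():
        prefix = hashlib.md5(word.encode("utf-8")).hexdigest()[:prefix_len]
        shards.setdefault(prefix, {})[word] = lemma
    os.makedirs(out_dir, exist_ok=True)
    for prefix, entries in shards.items():
        path = os.path.join(out_dir, f"exceptions_{prefix}.json")
        with open(path, "w", encoding="utf-8") as f:
            json.dump(entries, f, ensure_ascii=False)

class ShardedLookup:
    """Lazily load shards on demand instead of the whole table upfront."""

    def __init__(self, shard_dir="shards", prefix_len=2):
        self.shard_dir = shard_dir
        self.prefix_len = prefix_len
        self._shards = {}  # prefix -> loaded shard dict

    def lookup(self, word):
        prefix = hashlib.md5(word.encode("utf-8")).hexdigest()[: self.prefix_len]
        if prefix not in self._shards:
            path = os.path.join(self.shard_dir, f"exceptions_{prefix}.json")
            # Load only the shard this word hashes to; a missing shard file
            # simply means no exception entries for this prefix.
            try:
                with open(path, encoding="utf-8") as f:
                    self._shards[prefix] = json.load(f)
            except FileNotFoundError:
                self._shards[prefix] = {}
        return self._shards[prefix].get(word)
```

With `prefix_len=2` there are at most 256 shards, so each file stays small; the trade-off is one file read the first time each shard is touched, in exchange for a much smaller upfront memory and load cost.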