pablodms / spacy-spanish-lemmatizer

Spanish rule-based lemmatization for spaCy
MIT License

Very very slow "Generating lemmatization..." #8

Closed: jsoladur closed this issue 3 years ago

jsoladur commented 3 years ago

We have a Docker image whose build runs the following command:

python -m spacy_spanish_lemmatizer download wiki

In previous weeks this command was slow, but not as slow as it is now. The Docker image build now does not get past this command even after 2 hours, whereas before the image was built in about 40 minutes.

`user>@<user-zenbook-ubuntu:~$ python3 -m spacy_spanish_lemmatizer download wiki

Downloading wiktionary dump from: https://dumps.wikimedia.org/eswiktionary/latest/eswiktionary-latest-pages-articles.xml.bz2 (it may take some time)

Decompressing dump file: /home//.local/lib/python3.8/site-packages/spacy_spanish_lemmatizer/tmp/eswiktionary-latest-pages-articles.xml.bz2

Parsing downloaded file...

Generating lemmatization... ... ... ... `

What happened to the wiki? Has the lemmatization generation process been modified so that it now takes longer? How can we solve this problem?

Thanks a lot. Best regards.

pablodms commented 3 years ago

Hello @josemariasoladuran

The lemmatization generation process has not changed, but the downloaded data has, and the new data caused an endless loop in the code. That is why the "Generating lemmatization..." step never finished. The bug has been fixed, and the process should now complete normally.

Thanks for reporting this bug. Let me know if the problem persists.
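
For builds that should fail fast rather than hang if the generation step ever stalls again, the download command can be wrapped in a timeout. The following is only a minimal sketch: `run_with_timeout` and `DOWNLOAD_CMD` are hypothetical names that are not part of spacy_spanish_lemmatizer, and the time limit is an assumption to be tuned to what a normal build takes.

```python
import subprocess
import sys

# The command from this issue, run with the current Python interpreter.
DOWNLOAD_CMD = [sys.executable, "-m", "spacy_spanish_lemmatizer", "download", "wiki"]


def run_with_timeout(cmd, timeout_s):
    """Run cmd and return its exit code, or None if it ran longer than timeout_s seconds.

    Hypothetical helper for illustration; it is not part of the package.
    subprocess.run kills the child process when the timeout expires.
    """
    try:
        return subprocess.run(cmd, timeout=timeout_s).returncode
    except subprocess.TimeoutExpired:
        return None
```

A build script could then call `run_with_timeout(DOWNLOAD_CMD, timeout_s=60 * 60)` (one hour is an assumed limit) and exit with a non-zero status when the result is `None`, so that the Docker build stops with an error instead of running indefinitely.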