simjanos-dev / LinguaCafe

LinguaCafe is self-hosted software that helps language learners read foreign languages.
https://simjanos-dev.github.io/LinguaCafeHome/
GNU General Public License v3.0
888 stars, 32 forks

Turkish language package changed. #332

Open simjanos-dev opened 3 months ago

simjanos-dev commented 3 months ago

https://huggingface.co/turkish-nlp-suite/tr_core_news_md/tree/main

The Turkish language model was renamed. It also seems to require spaCy >=3.4.2,<3.5.0, a requirement that was present even before the name change. When I change the URL and try to install it, the tokenizer.py script dies and leaves a spaCy 3.4 install in the model directory. The script also stops working when it is restarted after an attempted Turkish install.
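For reference, here is a minimal sketch of how a requirement like `>=3.4.2,<3.5.0` can be checked against an installed version before attempting an install. The helper names `parse_version` and `satisfies` are hypothetical and not part of LinguaCafe or spaCy, and the parser only handles simple numeric versions. In practice the `packaging` library's `SpecifierSet` covers this more completely.

```python
import operator
import re

# Comparison operators supported by this simplified checker.
_OPS = {
    ">=": operator.ge,
    "<=": operator.le,
    "==": operator.eq,
    ">": operator.gt,
    "<": operator.lt,
}


def parse_version(text):
    """Turn '3.7.5' into (3, 7, 5), padding short versions with zeros."""
    numbers = [int(n) for n in re.findall(r"\d+", text)[:3]]
    return tuple(numbers + [0] * (3 - len(numbers)))


def satisfies(installed, requirement):
    """Return True if `installed` meets every comma-separated clause."""
    for clause in requirement.split(","):
        match = re.fullmatch(r"\s*(>=|<=|==|>|<)\s*([\w.]+)\s*", clause)
        if match is None:
            raise ValueError(f"unsupported clause: {clause!r}")
        op, bound = match.groups()
        if not _OPS[op](parse_version(installed), parse_version(bound)):
            return False
    return True


# The requirement reported for the Turkish model in this issue.
print(satisfies("3.7.5", ">=3.4.2,<3.5.0"))  # False
print(satisfies("3.4.4", ">=3.4.2,<3.5.0"))  # True
```

A check like this could refuse the install up front instead of letting the package manager downgrade spaCy, although whether that fits the install flow here is an assumption.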

@sergiolaverde0 I don't know yet how to fix this issue. I tagged you in case you are interested, and have an idea.

sergiolaverde0 commented 3 months ago

I think pinning the version when installing the packages in DockerfilePython should solve this, but I won't be able to write a fix until tomorrow or the day after.

simjanos-dev commented 3 months ago

> I think pinning the version when installing the packages in DockerfilePython should solve this, but I won't be able to write a fix until tomorrow or the day after.

I use the latest v13.0 image for personal use, and it has spaCy 3.7.5. I would be surprised if we went from 3.4 to 3.7.5 just by not pinning a version number.

Pinning it to an older version would solve this, but I'm not sure we should use an older spaCy version. Also, the other installable packages use spaCy 3.7.0, based on their URLs. I think this change could also mess up the model folder for people who already have models installed.
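As a rough illustration of how an existing model folder could be inspected, the sketch below reads the `spacy_version` field that spaCy model packages declare in their `meta.json` and reports whether the folder contains models with differing requirements. The function names and the folder layout (any `meta.json` found under the directory) are assumptions for illustration, not how LinguaCafe stores or checks its models. Differing strings are only a hint, since two ranges can differ and still overlap.

```python
import json
from pathlib import Path


def requirements_by_model(model_dir):
    """Map each model name to the spaCy requirement in its meta.json."""
    found = {}
    for meta_path in sorted(Path(model_dir).rglob("meta.json")):
        try:
            meta = json.loads(meta_path.read_text(encoding="utf-8"))
        except (OSError, json.JSONDecodeError):
            continue  # skip unreadable or non-JSON files
        if not isinstance(meta, dict) or "spacy_version" not in meta:
            continue  # not a spaCy model meta file
        # spaCy package names are built as "<lang>_<name>".
        name = f"{meta.get('lang', '?')}_{meta.get('name', '?')}"
        found[name] = meta["spacy_version"]
    return found


def looks_mixed(requirements):
    """True if the models declare more than one distinct requirement."""
    return len(set(requirements.values())) > 1
```

Calling `looks_mixed(requirements_by_model(path))` on the model folder would flag an install where, for example, a Turkish model built for spaCy 3.4 sits next to models built for 3.7.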

I think maybe we should also host these files on the LinguaCafe GitHub, if possible.

Thank you so much for your help with it! Also please take your time, it is not urgent.

simjanos-dev commented 3 months ago

If it's something we cannot fix reasonably simply, maybe we could solve it by replacing it with Stanza, if Turkish is available there.

sergiolaverde0 commented 3 months ago

Well, I have tried the simple solution of just updating the link, and installing Turkish does indeed downgrade spaCy, which triggers #323. I'm actually ashamed I didn't notice this before; it is quite big.

I have a "hotfix" that enables you to install Turkish at the expense of breaking every other language, but we need a better solution. For the time being this should be announced as a known issue so people know it happens, know if it already affected them, and can decide whether to use Turkish anyway or not.

> Pinning it to an older version would solve this, but I'm not sure we should use an older spaCy version. Also, the other installable packages use spaCy 3.7.0, based on their URLs.

We could, in theory, downgrade those packages so that they are compatible with the older spaCy, but I don't like the idea, and it will 100% break every other extra package, which is probably worse.

> I think maybe we should also host these files on the LinguaCafe GitHub, if possible.

I thought about that at some point but was unsure given their size. At this point it is probably worth giving it a second chance.

> we could solve it by replacing it with Stanza, if Turkish is available there

Actually not a bad idea, but I need to go back to the open PR first. Cobbling something together that fits our use case is far easier than making something fit for upstream, and it will also be useful for the prior point. However, it will still take some time.

As a side note, lxml[html_clean] had a breaking change, which I already addressed, in case you try a dev build and it fails for you.