simjanos-dev / LinguaCafe

LinguaCafe is a self-hosted software that helps language learners read foreign languages.
https://simjanos-dev.github.io/LinguaCafeHome/
GNU General Public License v3.0
797 stars, 23 forks

What is needed to add additional languages? #3

Open cccyberwolke opened 6 months ago

cccyberwolke commented 6 months ago

This project is very interesting. I am wondering what I need to do to add an additional language to it; in my case, I want to use it for Finnish.

Maybe, if we had a list of the tasks needed for a new language, it would be possible to add more languages dynamically or in bulk.

simjanos-dev commented 6 months ago

I will update the project with further languages. I'll add Dutch, Italian and French tomorrow.

Sorry, I didn't expect so much interest, and I cannot add everything people want that quickly. I also don't have any documentation yet, so it might take a while.

I think you need to modify these things to add a new language:

I don't know if there's more. It seems like a lot, but it goes pretty fast if the language separates words with spaces, is written from left to right, and uses the Latin alphabet.

I'll try my best to add more languages.

systemcrash commented 6 months ago

@cccyberwolke Perhaps take a look at 55dd704da21af43ee70867ab42c6863a8189952b - and see whether adding a language in your local copy works.

systemcrash commented 6 months ago

@simjanos-dev thanks for a very interesting tool.

If adding a language requires changes in several places, those places should probably be consolidated into one, so that adding a language is less error-prone.

PS please put a link to this repo from the github web page so it's easier to find 👓

simjanos-dev commented 6 months ago

@systemcrash

I think there are some things that could be better, like /config/langapp.php and the language selection dialog. But there are parts that are completely disconnected and cannot be pulled together, like adding a flag image, adding the Spacy model in the Dockerfile, and modifying the Python tokenizer service.

Today I'll try to make the commits in a way that shows the process step by step.

I added the GitHub link yesterday. Someone posted only the overview page on HN without the GitHub link. I only posted it on Reddit, where I also added the GitHub link, and didn't know it had been shared in other places as well.

simjanos-dev commented 6 months ago

I've added support for Chinese, Dutch, Finnish, French, Italian and Korean.

Korean lemmas are separated with + symbols and I think Chinese will need a custom font type.
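As a minimal sketch of what handling such lemmas could look like, the helper below splits a "+"-joined lemma string into its parts. The function name and the sample strings are hypothetical and only illustrate the format described above; how LinguaCafe actually processes these lemmas is not shown in this thread.

```python
from __future__ import annotations


def split_compound_lemma(lemma: str, separator: str = "+") -> list[str]:
    """Split a lemma whose parts are joined by `separator` (assumed to be '+')."""
    # Drop empty pieces so stray or doubled separators do not produce "" entries.
    return [part.strip() for part in lemma.split(separator) if part.strip()]


# Illustrative inputs only; real tokenizer output may look different.
print(split_compound_lemma("먹+었+다"))  # ['먹', '었', '다']
print(split_compound_lemma("학교"))      # ['학교']
```

A lemma without a separator comes back as a single-element list, so callers can treat both cases the same way.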

There are 7 numbered commits named "Adding language support 1". They document the steps on how to add additional languages.

This way you can add any language that Spacy supports and that is written left-to-right and top-to-bottom. I will add more in the near future.

simjanos-dev commented 5 months ago

So, adding languages is a bit more difficult than I thought. There are multiple different language codes which have to be added for different parts to work: DeepL, the dict cc dictionary, Jellyfin, and an internal short-form code used for database naming.

Some of these codes are scattered around the files, so I'll rewrite the code a bit and add them to the config/langapp.php file, which will be renamed to config/linguacafe.php. Every PHP and Vue file is supposed to use these codes, and there shouldn't be any language lists or codes elsewhere, except in the Python files and Dockerfiles, which would be very difficult to unify.
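The idea of keeping every per-service code in one place can be sketched as follows. This is an illustration in Python, not the project's PHP config: the `LanguageCodes` class, the `code_for` helper, and all code values are hypothetical placeholders. The only detail taken from this thread is that DeepL has no Croatian support while dict cc does.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class LanguageCodes:
    """All service-specific codes for one language, kept in a single record."""
    internal: str              # short form used for database naming
    deepl: Optional[str]       # None when the service does not support the language
    dict_cc: Optional[str]
    jellyfin: Optional[str]


# Placeholder values for illustration; check each service's documentation.
LANGUAGES: Dict[str, LanguageCodes] = {
    "finnish": LanguageCodes(internal="fi", deepl="FI", dict_cc="FI", jellyfin="fin"),
    "croatian": LanguageCodes(internal="hr", deepl=None, dict_cc="HR", jellyfin="hrv"),
}

SERVICES = ("internal", "deepl", "dict_cc", "jellyfin")


def code_for(language: str, service: str) -> Optional[str]:
    """Look up the code a given service expects, or None if unsupported."""
    if service not in SERVICES:
        raise ValueError(f"unknown service: {service}")
    return getattr(LANGUAGES[language.lower()], service)


print(code_for("Finnish", "deepl"))   # FI
print(code_for("croatian", "deepl"))  # None
```

With a single table like this, adding a language means adding one record, and a missing code shows up as an explicit `None` instead of a forgotten entry in some other file.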

I will add a bunch of languages that Spacy does not support using the Spacy multi-language model, so they won't increase RAM usage and Docker image size. I haven't tested it yet, but I think it should work for languages with spaces.

The languages that use the multi-language model will not support lemmatisation or gender tagging for words. I will add more Spacy-supported languages and may look for further tokeniser libraries in the future, but before that I want to find a way to enable or disable support for each language, so the image size and RAM usage won't increase too much on slower PCs.
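A rough sketch of that fallback logic: languages with a dedicated model get lemmas, everything else shares the multi-language model and gets tokens only. The model names below are real spaCy packages, but which ones LinguaCafe installs is an assumption, and `choose_model` is a hypothetical helper that does not load anything.

```python
from typing import Dict, NamedTuple


class ModelChoice(NamedTuple):
    model: str        # name of the spaCy package to load
    has_lemmas: bool  # False when only the shared multi-language model is used


# Assumed mapping for illustration; not taken from the project's Dockerfile.
DEDICATED_MODELS: Dict[str, str] = {
    "dutch": "nl_core_news_sm",
    "finnish": "fi_core_news_sm",
    "french": "fr_core_news_sm",
    "italian": "it_core_news_sm",
}

# One shared model for every language without a dedicated one.
MULTI_LANGUAGE_MODEL = "xx_ent_wiki_sm"


def choose_model(language: str) -> ModelChoice:
    """Pick a dedicated model when available, else the shared fallback."""
    key = language.lower()
    if key in DEDICATED_MODELS:
        return ModelChoice(DEDICATED_MODELS[key], has_lemmas=True)
    return ModelChoice(MULTI_LANGUAGE_MODEL, has_lemmas=False)


print(choose_model("Finnish"))  # ModelChoice(model='fi_core_news_sm', has_lemmas=True)
print(choose_model("Welsh"))    # ModelChoice(model='xx_ent_wiki_sm', has_lemmas=False)
```

Because every fallback language maps to the same model, adding Welsh or Czech this way costs no extra download, which matches the RAM and image-size argument above.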

First two languages will be Welsh and Czech.

arvigeus commented 4 months ago

I would really try that if it had Vietnamese. No dice, Spacy doesn’t have it.

simjanos-dev commented 4 months ago

I found this. Even if Spacy does not have it, we can add a separate library for it.

One of the next things we are working on with @sergiolaverde0 is making languages installable, so we can add as many libraries as we like without increasing the Docker image size indefinitely for everyone. We were talking about it in #104.

I'll try to add Vietnamese, but it might take quite a while to get there. Eventually every possible language will be added.

simjanos-dev commented 3 months ago

An update on adding more languages.

I've added English, Latin and Greek as supported languages. They will be out in the next update.

I can add more languages now that use the multi language model, or have no additional dependencies. We still have a problem with adding new languages that have large dependencies.

I will keep adding languages that I can.

jacovanc commented 3 months ago

I would also love to add Turkish. I can see Spacy doesn't support it out of the box unfortunately. Let me know if there's anything I can do to help.

simjanos-dev commented 3 months ago

@jacovanc

I found some, but they are large. We can add it after we have the language installation feature.

gelbziesel commented 2 months ago

Is there any update on this? Trying to install Croatian.

sergiolaverde0 commented 2 months ago

Working on it, but it is quite a big feature. We have settled on a plan and are currently executing it. I'm literally testing the changes on the Python side right now, but I can't promise a release date for it.

simjanos-dev commented 2 months ago

We are currently working on it. Even if we have issues with installing languages, Croatian has a small language model, and nothing blocks adding it. I just got sidetracked with other features in the latest update.

I'm planning on adding Catalan, Croatian, Danish, Lithuanian, Macedonian, Polish, Portuguese, Romanian and Slovenian next, if I can find a dictionary for them or DeepL supports them. These are the remaining small Spacy language models, and they can be added easily.

If we succeed with the language install feature, I also want to add Thai, Vietnamese and Tagalog, because they were also requested.

But the same applies here: no promises on release dates.

simjanos-dev commented 2 months ago

Added Thai and Turkish.

I'm currently working on the 9 languages mentioned above. After that it will probably get harder to add new languages with lemmas, because most of them have no Spacy language models, and I will have to write a separate tokenizer function for each language, using a separate Python library for each.

simjanos-dev commented 2 months ago

I finished going through some languages.

Unsure:

Languages supported by text parser, but will need more work:

Added languages:

Other languages that will need their own Python library. These can be added, but each one will take some work:

It's possible that some languages that are not mentioned will not have lemmas.

gelbziesel commented 2 months ago

Thank you so much! This went way faster than I expected! I tried using it on the language-install feature branch, but with all of the new languages, I always get this error when I try to import something:

An error has occurred while importing your text.

But I am aware you are still working on it, so I will try again when it is ready. Maybe I also messed something up on my part. Nevertheless, thanks again, this is awesome!

simjanos-dev commented 2 months ago

"Thank you so much! This went way faster than I expected!"

It will take some time to make it into an actual version update. I think I'll make this into its own update.

"Maybe I also messed something up on my part. Nevertheless, thanks again, this is awesome!"

It should be working, I've tested all the new languages manually.

Did you have a development environment before? If so, you should rebuild the docker image: docker compose -f ./docker-compose-dev.yml build or docker compose -f ./docker-compose-dev-macos.yml build. It's possible that you have the new code, but the docker image does not have the new language models installed.

Edit: You should also run docker compose up -d --force-recreate after that, to apply the changes in the docker-compose.yml file.

gelbziesel commented 2 months ago

It worked! Lemmatization isn't perfect for Croatian, but it works for most words with no issues. This will be a big help for my reading :)

simjanos-dev commented 2 months ago

I'm happy it will be useful. :) Sadly, for Croatian I could only find the dict cc dictionary, and the DeepL API does not support it either. I'll look for custom dictionaries to add support for in the future, but it will take a long time. I found one, but it's English-Croatian, and I'm not sure if it would be smart to reverse it.
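To make the "reverse it" idea concrete, here is a minimal sketch that inverts a headword-to-translations mapping. The data format and the function name are assumptions for illustration; the dictionary mentioned above may be structured quite differently.

```python
from collections import defaultdict
from typing import Dict, List


def reverse_dictionary(entries: Dict[str, List[str]]) -> Dict[str, List[str]]:
    """Turn {headword: [translations]} into {translation: [headwords]}."""
    reversed_entries: Dict[str, List[str]] = defaultdict(list)
    for headword, translations in entries.items():
        for translation in translations:
            # Skip duplicates so each headword is listed once per translation.
            if headword not in reversed_entries[translation]:
                reversed_entries[translation].append(headword)
    return dict(reversed_entries)


# Tiny made-up sample in the assumed English-to-Croatian format.
english_croatian = {"house": ["kuća", "dom"], "home": ["dom"]}
print(reverse_dictionary(english_croatian))
# {'kuća': ['house'], 'dom': ['house', 'home']}
```

The inversion itself is simple, but the result only contains Croatian words that happened to appear as translations, and only in the forms used there, so coverage can be noticeably worse than in a dictionary written from the Croatian side.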

simjanos-dev commented 2 months ago

The new update has been released with the new languages. I will continue to add new languages, but it will get a bit more difficult after adding Catalan and Lithuanian, because every language will need its own separate Python library and a separate function.

I plan on adding every language I can, but it will take a lot of time to get there. I keep track of language requests and try to focus on the more requested ones. Sorry if your language is not in LinguaCafe yet.

simjanos-dev commented 1 month ago

There is some progress related to adding more languages in #255. Even without this nothing is blocking me from adding languages, it just takes time.