simjanos-dev / LinguaCafe

LinguaCafe is a self-hosted software that helps language learners read foreign languages.
https://simjanos-dev.github.io/LinguaCafeHome/
GNU General Public License v3.0
869 stars 30 forks

Adding English as a supported language #87

Closed cs-m closed 7 months ago

cs-m commented 9 months ago

Hi,

First, thank you very much for releasing this project!

Would it be possible to add English to the list of supported languages? I know quite a few people who would be thrilled to use it to improve their language skills.

simjanos-dev commented 9 months ago

Hi!

I've seen English requested elsewhere as well, but there is a problem with adding languages at the moment. Whenever I add a language, the Docker image gets larger and uses more RAM. I will need to figure out how to install languages dynamically before I add any more tokenizer language models.

There is a multilingual tokenizer model that I can use to add more languages easily, but those languages would have no automatically generated lemmas (the dictionary forms of words). Also, for English I will have to modify the supported dictionary imports and DeepL search.

I will add English in the future, but it might take a few months to figure out this issue, as I want to prioritize a few other features first.

If you have a custom importable .csv dictionary, do not need DeepL, and do not mind having no lemmas, you can temporarily use Welsh or Czech for English, as they use the multilingual tokenizer model.

I'll try to find the time to add it soon.

simjanos-dev commented 9 months ago

@sergiolaverde0 It seems like spaCy models can be loaded directly from disk, and their license allows me to package them with LinguaCafe.

So I can download them, reupload them as a release, and create a UI that enables/disables languages. When a user enables a language, it will download the spaCy model to a shared folder on disk.

This will decrease the Docker image size a bit, and allow us to add any number of languages.

@cs-m I think I will do it in March, and add English support.
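
The "download on enable" idea described in this comment can be sketched as follows. This is only an illustration under stated assumptions, not LinguaCafe's implementation: the function name `ensure_language_model`, the one-folder-per-language layout, and the injected `download` callable are all hypothetical.

```python
from pathlib import Path
from typing import Callable


def ensure_language_model(
    lang: str,
    shared_dir: Path,
    download: Callable[[str, Path], None],
) -> Path:
    """Return the model folder for `lang`, downloading it only if it is missing.

    Assumption: each language model lives in its own sub-folder of a shared
    folder, and `download(lang, target_dir)` creates and fills that folder.
    """
    model_dir = shared_dir / lang
    if not model_dir.is_dir():
        # Make sure the shared folder exists before the first download.
        shared_dir.mkdir(parents=True, exist_ok=True)
        download(lang, model_dir)
    return model_dir
```

The returned path could then be handed to `spacy.load`, which accepts a path to a model directory as well as a package name. Injecting the downloader keeps the sketch independent of where the models are actually hosted.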

sergiolaverde0 commented 9 months ago

I'm concerned about whether this will install dependencies too, but it is a good start. I will look into this after we sort out tagged images and LXC/podman, so probably not this week.

PsychoThinker commented 8 months ago

I am also interested in learning English with your software. Regarding the size of the docker image, wouldn't it be possible to create several versions of the docker image for different languages? Or is it possible to add support for other languages on your own?

simjanos-dev commented 8 months ago

We thought about that too, but there are too many languages for that. We have found a solution for it. The image will come with a default Spanish language, and you will be able to install languages for it from the admin page. You can read about it in #104.

Currently I'm working on a few other tasks, but I think I will be able to get to it around the end of March (no promises though) or April. After this solution is implemented, I can add languages very quickly.

There will be three groups of languages added soon after it's done:

simjanos-dev commented 7 months ago

@cs-m @PsychoThinker

I've added support for English. It will be out in the next update.

ralienpp commented 1 month ago

I would like some assistance in setting it up for learning English. I followed the instructions with the aim of adding English for a speaker of Romanian, with the expectation that I can click a word in the text and have its translation or explanation shown on the right side in Romanian.

The dictionary generated by dict.cc is a TAB-separated plaintext file of this form:

# EN-RO vocabulary database compiled by dict.cc
# Date and time 2024-10-01 10:07

'til [criticized coll. abbreviation of until] [till]    până la prep    
(a) quarter after five [Am.] [time] cinci și un sfert       
(a) quarter past five [time]    cinci și un sfert   

When I attempt to import it, the software says "The selected file is not a supported dictionary file. Please upload a file from the sources listed in the user manual."

To the best of my knowledge, I did what the manual told me to do and used an official source. I'd like some help in understanding the process.
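
For readers who want to inspect such a file before importing it, the layout quoted above can be read with a few lines of Python. This is a minimal sketch, not LinguaCafe's importer: the `Entry` type and `parse_dictcc` function are hypothetical names, and the column meaning (source term, translation, optional word class) is an assumption inferred from the sample lines.

```python
from typing import Iterable, Iterator, NamedTuple


class Entry(NamedTuple):
    source: str      # English term, including any bracketed notes
    target: str      # Romanian translation
    word_class: str  # e.g. "prep"; empty when the column is blank


def parse_dictcc(lines: Iterable[str]) -> Iterator[Entry]:
    """Yield entries from a TAB-separated dict.cc export.

    Header lines starting with '#' and blank lines are skipped.
    """
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.split("\t")
        if len(fields) < 2:
            continue  # not a source/translation pair
        word_class = fields[2].strip() if len(fields) > 2 else ""
        yield Entry(fields[0].strip(), fields[1].strip(), word_class)
```

Opening the file with `open(path, encoding="utf-8")` and passing the file object to `parse_dictcc` is enough to check that the entries look sane.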

simjanos-dev commented 1 month ago

Hi!

It seems like you did everything correctly, and it is supported. I'll check it myself after work (in about 6 hours); maybe the dict.cc format changed.

Edit: renaming the file or changing its extension could cause issues.

simjanos-dev commented 1 month ago

Hi!

Sorry, I did not have the energy yesterday. I checked it, and it works for me. Can I ask which version of LinguaCafe you have installed? You can check it at the bottom of the home page.

I also just realized that there's something that's missing from the user manual: you have to unzip the plain text file, and upload the .txt from dict.cc, not the .zip file.

Did you modify the file in any way?

> Is the use-case supported in principle? i.e., is LinguaCafe suitable for learning English? (or is it for improving one's English once they already know it, so the dictionary is not EN-RO, but rather EN-EN)

I missed this yesterday. Yes, EN-RO dict.cc translation is supported. You can also add a DeepL machine translation API dictionary for EN-RO, and in v0.14 you will have 2 additional services (MyMemory and LibreTranslate) that will have EN-RO translations. I'm trying to get the beta out as soon as I can, but there was a problem with the build.

> in principle

In principle all you would need is enough English to understand the UI, and very basic grammar knowledge, so you can understand the text you want to read to some level, and a suitable dictionary.

ralienpp commented 1 month ago

Thank you for your feedback. Your questions guided me to the solution, so I document it here for the benefit of others.

The last step is the root cause of the problem. If I upload it with the original name, then the file is accepted. I find it a bit counter-intuitive, but it is what it is.
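
The conclusion of this thread (unzip the dict.cc download and upload the .txt under its original name) can be scripted. The following is a minimal sketch using only Python's standard `zipfile` module; the function name `extract_dictionary` is hypothetical, and any file names are placeholders rather than the names dict.cc actually uses.

```python
import zipfile
from pathlib import Path


def extract_dictionary(zip_path: Path, dest_dir: Path) -> list[Path]:
    """Extract every .txt member of the archive, keeping its original name."""
    extracted = []
    with zipfile.ZipFile(zip_path) as archive:
        for member in archive.namelist():
            if member.lower().endswith(".txt"):
                # extract() writes the member under dest_dir without renaming it.
                extracted.append(Path(archive.extract(member, dest_dir)))
    return extracted
```

The returned paths point at the files to upload as they are, without renaming them or changing the extension.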