sspanak / tt9

A T9 keyboard for Android devices with a hardware keypad.
Apache License 2.0

Support for Catalan Language #592

Closed Roconx closed 2 months ago

Roconx commented 3 months ago

I was wondering if you could add support for the Catalan language. I have been using this app for some days now and it has been flawless for English and Spanish! The only thing I'm missing is being able to type in my native language (Catalan).

It's pretty similar to Spanish but it has apostrophes both at the start of words (l'aigua) and at the end (Dona-t’ho) as well as - in the middle (anti-inflamatori).

Possible Keys (adapted from Portuguese):

dictionaryFile: ca-ES-utf8.csv
name: Català / CA
layout:
  - [SPECIAL] # 0
  - [PUNCTUATION] # 1
  - [a, b, c, ç, à] # 2
  - [d, e, f, è, é] # 3
  - [g, h, i, í, ï] # 4
  - [j, k, l] # 5
  - [m, n, o, ò, ó] # 6
  - [p, q, r, s] # 7
  - [t, u, v, ú, ü] # 8
  - [w, x, y, z] # 9

We also have l·l (Col·lecció), I don't know if that should be in the same key as l (which would be the best option in my opinion, even if theoretically it's 3 keys long, as it is pronounced pretty much the same as a single l).

As for punctuation, it's the same as Spanish but without ¡ and ¿, so I guess PUNCTUATION_FR?

I have been looking for good dictionaries for Catalan, but I have not had much success. Here are the best two I have found:

This one is the most promising. It also lists the most common words after each word (which is interesting, but I don't think it will be useful in our case) and has words that contain - or ' listed as whole words (I don't know if that will be useful, but I guess it is better than nothing). There are other dictionaries (both Catalan and other languages) in that repository, so take a look if this one is not useful for our use case.

And this one, which also has - and ' words listed as whole words.

Sadly, I have been unable to find an official dictionary.

I can also provide both menu and documentation translations for both Catalan and Spanish.

Thank you very much for providing a keyboard as good as this one free of charge! I'll make sure to make a donation when I can spare some money!

Let me know if you need any extra information!

Roconx commented 3 months ago

Went ahead and implemented it myself!

sspanak commented 3 months ago

Thank you for your effort in making a valid dictionary. It looks very good. However, there are some issues that prevent me from merging it right away.

First it's the letter ŀ / Ŀ, which is missing from your proposed layout. I think this is unacceptable and some people will definitely want it instead of attempting to type L. or other alternatives. Also, this way it is impossible to type addresses, proper names and whatnot. I believe we must add it to the 5-key characters.

As for the punctuation, the French variant includes the French quote marks: « ... ». I suppose you have noticed that, because in the pull request you have used "PUNCTUATION" which has the English quote marks "...". We can go with whatever is appropriate.

Regarding the words with apostrophes and hyphens, I usually extract them as separate words to save storage space and for maximum flexibility. You may have noticed how in English, the possessive 's is a separate word, as opposed to having all nouns with 's at the end. I haven't reviewed your dictionary yet, but if needed, I'll transform it the same way, extracting l', d' and whatever you guys have (I am not familiar with the language at all, but I'll figure it out).
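To illustrate the idea of storing apostrophe and hyphen parts as separate entries, here is a minimal Python sketch. The splitting rule, the regular expressions and the name `split_parts` are assumptions made for illustration; they are not the scripts used in the TT9 repository, and the real rules for Catalan may need to differ.

```python
import re

# Assumed rule 1: a one- or two-letter particle followed by an apostrophe
# at the start of a word (l', d', ...) is stored as its own entry.
LEADING_PARTICLE = re.compile(r"^([^\W\d_]{1,2}['’])(.+)$")

# Assumed rule 2: the rest is cut before every hyphen or apostrophe,
# so each piece keeps the punctuation mark that introduces it.
PIECE = re.compile(r"[-'’]?[^-'’]+")


def split_parts(word: str) -> list[str]:
    """Split a word into pieces that could be stored as separate entries."""
    parts = []
    match = LEADING_PARTICLE.match(word)
    if match:
        parts.append(match.group(1))
        word = match.group(2)
    parts.extend(PIECE.findall(word))
    return parts


for example in ["l'aigua", "dona-t’ho", "anti-inflamatori"]:
    print(example, "->", split_parts(example))
```

With a rule like this, `l'` is stored once instead of once per noun, which is the storage saving described above.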

The word lists you have found look quite OK. They have a lot of words, which is good for the typing experience. But if there is a respected university or an institute that regulates the language or has released something like "the big dictionary of Catalan", it would be the best source. Such official "big" dictionaries exist for many languages and are great to use, because they have no spelling mistakes and sometimes contain rarely used words. On the contrary, the word lists generated from subtitles (the OpenBoard/Gboard ones, by Helium314, from your first link), often have mistakes or nonsense words. In my experience they work mostly fine for simple everyday conversations, but that's about it. So, if you know of such "big" dictionary, please recommend it.

And as for the Spanish documentation, well, let's skip that for now, unless you are willing to participate actively in the project and update it for every release. I still change it a lot.

Having said all this, I'll need some time to take care of other technical problems, and only after that, I'll review, finalize and merge the Catalan PR.

Roconx commented 3 months ago

First it's the letter ŀ / Ŀ, which is missing from your proposed layout. I think this is unacceptable and some people will definitely want it instead of attempting to type L. or other alternatives.

The reason I did not add it to the letters, is because l·l is not a letter but an l followed by · and then another l, and I did not know if I could put two letters there, l and ·, but I now realize that · should just be a symbol...

As for the punctuation, the French variant includes the French quote marks: « ... »

I indeed did that on purpose, as I had never seen those used in Catalan, but upon further research, I have found out that Catalan has those...

Regarding the words with apostrophes and hyphens, I usually extract them as separate words to save storage space and for maximum flexibility. You may have noticed how in English, the possessive 's is a separate word, as opposed to having all nouns with 's at the end.

Sure, I'll do that!

The word lists you have found look quite OK. They have a lot of words, which is good for the typing experience. But if there is a respected university or an institute that regulates the language or has released something like "the big dictionary of Catalan", it would be the best source. Such official "big" dictionaries exist for many languages and are great to use, because they have no spelling mistakes and sometimes contain rarely used words. On the contrary, the word lists generated from subtitles (the OpenBoard/Gboard ones, by Helium314, from your first link), often have mistakes or nonsense words. In my experience they work mostly fine for simple everyday conversations, but that's about it. So, if you know of such "big" dictionary, please recommend it.

Yes, the wordlists were pretty bad: they had lots of words that simply do not exist in Catalan or are misspelled. Upon further research, I have been able to find an official dictionary.

I have made another commit with all those fixes. I'll be trying it over the next few days to check that everything works correctly!

sspanak commented 3 months ago

The reason I did not add it to the letters, is because l·l is not a letter but an l followed by · and then another l, and I did not know if I could put two letters there, l and ·, but I now realize that · should just be a symbol...

There is a symbol, here it is. I highly recommend using that instead of combining regular L + a middle dot. The combining characters may cause Backspace to fail sometimes, or may cause weird word suggestions.
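The difference between the two spellings can be checked with Python's standard `unicodedata` module. This is only a sketch of how the characters are encoded, not of how TT9 handles them; the helper name `describe` is made up for the example.

```python
import unicodedata

single = "\u0140"   # ŀ as one code point
pair = "l\u00b7"    # plain l followed by a separate middle dot


def describe(text: str) -> list[str]:
    """List the code point and Unicode name of every character in text."""
    return [f"U+{ord(ch):04X} {unicodedata.name(ch)}" for ch in text]


print(describe(single))
print(describe(pair))

# NFC keeps the single character as it is, while the compatibility
# normalization NFKC rewrites it as the two-character sequence.
print(unicodedata.normalize("NFC", single) == single)
print(unicodedata.normalize("NFKC", single) == pair)
```

Software that does not apply a compatibility normalization treats the two spellings as different strings, so a dictionary has to pick one of them consistently.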

Yes, the wordlists were pretty bad, they had lots of words that simply do not exist in Catalan or are misspelled, upon further research, I have been able to find an official dictionary. I have made another commit with all those fixes, I'll be trying it over the next few days to check that everything works correctly!

Go ahead, but I am pretty sure 32k words are way too few even for simple conversations. English, even being a simple language, with almost no verb conjugations, with no genders for adjectives and so on, has 173k words. Yet, sometimes, it feels some words are missing.

I can see you are enthusiastic about adding Catalan, so you may try these sources:

  1. Winedt. They are usually very good, but do not contain many words.
  2. hunspell spell checker. I usually use them to verify huge wordlists like the ones you attempted to use above. They may also contain mistakes, but are usually useful for filtering out most of the garbage from another wordlist. On Linux, you can build the entire wordlist text file from the .dic and .aff files using unmunch language.dic language.aff. Not sure about other operating systems though.
  3. https://github.com/elastic/hunspell - yet another hunspell repository. However, the wordlists there do not seem to have been updated in years.

Roconx commented 3 months ago

There is a symbol, here it is. I highly recommend using that instead of combining regular L + a middle dot. The combining characters may cause Backspace to fail sometimes, or may cause weird word suggestions.

How would that work? Would ŀ get automatically replaced with l·?

Thanks for the dictionaries, I'll take a look! The one I found is indeed too small, I missed words the first time I tried to type.

sspanak commented 3 months ago

How would that work? Would ŀ get automatically replaced to l·?

Um... no. Isn't the single character the preferred way of typing? From what I read on Wikipedia, it seems so. And I was under the impression you even wanted to be able to type ŀl in a single key press? Or am I missing something?

From a technical perspective, the single character is more optimal. You will be able to type ŀl by just pressing 5-5, instead of 5-1-5. Also, the suggestions will be much more accurate. TT9 does suggest some garbage words when it tries to figure out letter+punctuation combinations.
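The key-press difference can be sketched in Python using the layout proposed at the top of this issue. Two things here are assumptions for illustration only: that ŀ would be appended to key 5 in the single-character variant, and that any character not on a letter key (such as ·) is typed with key 1. The function `key_sequence` is hypothetical and not part of TT9.

```python
# Letter keys from the proposed Catalan layout (keys 0 and 1 are special).
LAYOUT = {
    "2": "abcçà",
    "3": "defèé",
    "4": "ghiíï",
    "5": "jkl",
    "6": "mnoòó",
    "7": "pqrs",
    "8": "tuvúü",
    "9": "wxyz",
}

# Assumed variant: the single character ŀ added to key 5.
LAYOUT_WITH_L_DOT = {**LAYOUT, "5": "jklŀ"}


def key_sequence(word: str, layout: dict[str, str], punctuation_key: str = "1") -> str:
    """Return the digits pressed to type word, one digit per character."""
    char_to_key = {ch: key for key, chars in layout.items() for ch in chars}
    # Characters missing from the letter keys are assumed to live on key 1.
    return "".join(char_to_key.get(ch, punctuation_key) for ch in word.lower())


print(key_sequence("col·lecció", LAYOUT))             # 2651532246
print(key_sequence("coŀlecció", LAYOUT_WITH_L_DOT))   # 265532246
```

The first sequence contains 5-1-5 for l·l, the second 5-5 for ŀl, which is the one-key-press saving mentioned above.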

I am not sure if there is any advantage in using two separate characters. Perhaps, maybe, some websites or apps will prefer that, I don't know. But some other may as well prefer a single character.

Roconx commented 3 months ago

Um... no. Isn't the single character the preferred way of typing? From what I read on Wikipedia, it seems so. And I was under the impression you even wanted to be able to type ŀl in a single key press? Or am I missing something?

Then using · as a separate symbol would be the better option. Although one could argue that it is better to have ŀ as a separate letter, it is not used at all, at least not in Catalan. In fact, I did not even know that it existed until you pointed it out. The computer keyboard has · as a separate symbol and does not have the ŀ character, so using it would be non-standard. The same Wikipedia page you mentioned makes no mention of that character when you switch it to the Catalan language.

As for the single key press thing, I meant to map the key 5 to the whole l·l, as I did not consider the possibility to add · as a symbol.

I am not sure if there is any advantage in using two separate characters. Perhaps, maybe, some websites or apps will prefer that, I don't know. But some other may as well prefer a single character.

I do not think there is any inherent advantage in using l·l over ŀl; it is in fact one more key press. However, at least on Google, it is impossible to find anything that has been written using l·l by typing ŀl, so I would guess that all other websites behave in a similar way.

On the other hand, I have been trying the Winedt dictionary you recommended, and it's pretty good! I'll try it for a few days before making a final decision, but it's pretty big (about 300k words) and I have not missed any word at all (at least for now)! Thanks for the recommendation, I would have never found it!

sspanak commented 3 months ago

Fair enough, let's use a separate middle dot on the 1-key. Pardon my ignorance, it is very difficult to make the right decision when I don't speak the language and I am not familiar with the culture.

As for the dictionary, I usually merge several files to provide the maximum amount of words. I would clean up the original 800k word file and merge it with the one from Winedt.
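The clean-up and merge approach could look roughly like the Python sketch below. It is not one of the repository's scripts: the function names `clean` and `merge_wordlists` are hypothetical, the lists are kept in memory instead of being read from files, and using a spell-checker wordlist as the reference for filtering is an assumption based on the hunspell suggestion earlier in this thread.

```python
import unicodedata


def clean(words) -> set[str]:
    """Trim and NFC-normalize entries, dropping empty ones and ones with digits."""
    cleaned = set()
    for raw in words:
        word = unicodedata.normalize("NFC", raw.strip())
        if word and not any(ch.isdigit() for ch in word):
            cleaned.add(word)
    return cleaned


def merge_wordlists(large_noisy, trusted, reference) -> list[str]:
    """Keep noisy words only if the reference list knows them, then add all trusted words."""
    verified = clean(large_noisy) & clean(reference)
    return sorted(verified | clean(trusted))


# Tiny made-up inputs standing in for the real files.
noisy = ["aigua", "aigua ", "xyzzq", "casa"]
trusted = ["col·lecció"]
reference = ["aigua", "casa", "gat"]

print(merge_wordlists(noisy, trusted, reference))
```

Words from the large list survive only when the reference also contains them, while the trusted list is taken as is, so the result keeps the size benefit of the big file without most of its garbage.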

I've also quickly checked your last commit and it seems the dictionary is missing contractions or other punctuation combinations ( being the brightest example!). There are handy scripts for processing wordlists here, in the repo, but if you find it difficult to use them, I can take care later.

Roconx commented 3 months ago

Thanks, I'll try to clean up and merge some dictionaries in the following days, I'll let you know if I have any problem!