languagetool-org / portuguese-pos-dict

Portuguese POS tagger
GNU Lesser General Public License v2.1

[pt-PT] Adding words to .dic AO45 + AO90 #22

Closed: marcoagpinto closed this issue 6 months ago

marcoagpinto commented 8 months ago

Heya, @p-goulart and @susanaboatto

In the coming days, I want to add missing words to both pt-PT dictionaries.

The other day I opened added.txt to see what it looked like, and it had comments.

Where do I add the POS tags if they are missing? In added.txt?

Thanks!

p-goulart commented 8 months ago
  1. What missing words? Are you talking about the affix files and the new verb forms?

  2. What comments in the added.txt file? The only actual comment there is about entries to be reviewed.

  3. Missing entries are to be added to added.txt only as a hotfix. Ideally, the new words should be added directly to the sources in this repo, to be automatically inflected and included in a recompiled version of the binaries.

marcoagpinto commented 8 months ago

@p-goulart

I mean normal words that appear underlined in red.

I have been saving tons of texts in the editor with missing words in pt-PT.

So, where do I add the POS data?

I will work on the verb derivations at the weekend; right now I am focusing on other important things (they are all important, I know 😛)

p-goulart commented 8 months ago

We are not going to release a new version of the dictionaries until the verb morphology is normalised across dialects and we can proceed with the new tokenisation schema. Adding new words now will have no effect whatsoever, so I suggest you set them aside for now. Maybe add them to a GitHub Issue for visibility?

Whatever these words are, I suggest you only add them to added.txt if it is urgent. Very niche words, foreign terms, proper names, etc. can definitely wait.

marcoagpinto commented 8 months ago

Ahhh… thanks, Pedro, it is not urgent 😄 I was just watching my list of missing words grow.

p-goulart commented 8 months ago

Maybe create an issue in this repo, so we can track them better? It's definitely a good thing to know we have a backlog.

p-goulart commented 8 months ago

At any rate, adding hotfixes is done the same way as before, in added.txt and spelling.txt.
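
As a rough illustration of such a hotfix, here is a minimal Python sketch. It assumes the usual LanguageTool layout: added.txt holds tab-separated `form`, `lemma`, `tag` lines, and spelling.txt holds one word per line. The helper names are hypothetical, and the word and the tag `NCMP000` are placeholders that have not been checked against the Portuguese tagset.

```python
def added_line(form: str, lemma: str, tag: str) -> str:
    """Build one added.txt entry: form, lemma and POS tag, tab-separated (assumed layout)."""
    for field in (form, lemma, tag):
        # Empty fields or embedded tabs/newlines would corrupt the three-column layout.
        if not field or "\t" in field or "\n" in field:
            raise ValueError(f"invalid field: {field!r}")
    return "\t".join((form, lemma, tag))


def spelling_line(word: str) -> str:
    """Build one spelling.txt entry: a single word on its own line (assumed layout)."""
    word = word.strip()
    if not word or word.startswith("#"):
        raise ValueError("empty word or comment marker")
    return word


# Placeholder word and tag, for illustration only.
print(added_line("exemplos", "exemplo", "NCMP000"))
print(spelling_line("exemplos"))
```

Validating the fields before writing keeps a stray tab or blank line from silently breaking the hotfix file.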

If you want a word to be in the binary, you need to add it both to the Hunspell source files (with the appropriate affix flag) and the PoS tagger source files (with the appropriate inflectional pattern).

Specifically on the last one, since it is LT-specific, there is information in the documentation. Do let me know if it needs to be clearer, though.
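
For the Hunspell side, the sketch below shows the general shape of a `.dic` entry: the word, optionally followed by `/` and its affix flags, with the entry count on the first line of the file. It assumes single-character flags. The function names and the flag `p` are hypothetical and not taken from this repo's affix file, and the PoS tagger source format is deliberately left out because it is specific to this repo's documentation.

```python
def dic_entry(word: str, flags: str = "") -> str:
    """Format a Hunspell .dic entry as word/FLAGS (single-character flags assumed)."""
    # This sketch rejects words that would need escaping or special handling.
    if not word or "/" in word or any(c.isspace() for c in word):
        raise ValueError(f"unsupported word: {word!r}")
    return f"{word}/{flags}" if flags else word


def dic_file(entries: list[str]) -> str:
    """Join entries into .dic text; the first line carries the entry count."""
    unique = sorted(set(entries))
    return "\n".join([str(len(unique)), *unique]) + "\n"


# The flag "p" is a made-up example, not a real flag from the pt-PT affix file.
print(dic_file([dic_entry("exemplo", "p"), dic_entry("hoje")]))
```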