languagetool-org / portuguese-pos-dict

Portuguese POS tagger
GNU Lesser General Public License v2.1

info: some numbers about PT dictionaries and other languages #9

Closed by jaumeortola 7 months ago

jaumeortola commented 11 months ago
Number of lines in the tagger dicts for different languages:
FR   634004
CA  1265005
PT  1489625
ES  3603260

The FR dictionary looks small in comparison, but it is in fact enormous; we do all the FR spelling with it. The main difference between FR and CA/PT is the number of verb forms: Catalan and Portuguese have more than twice as many. Spanish has many more verb forms still, counting those with attached enclitic pronouns.
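
To put the tagger figures in proportion, here is a minimal sketch that expresses each line count as a multiple of the FR count. The numbers are copied from the list above; the names `tagger_lines` and `ratios_to` are made up for illustration. Note that these are ratios of total lines, not of verb forms specifically, so they only loosely reflect the verb-form comparison.

```python
# Line counts of the tagger dicts, copied from the figures in this issue.
tagger_lines = {"FR": 634004, "CA": 1265005, "PT": 1489625, "ES": 3603260}


def ratios_to(baseline, counts):
    """Return each count divided by the count of the baseline language."""
    base = counts[baseline]
    return {lang: n / base for lang, n in counts.items()}


ratios_to_fr = ratios_to("FR", tagger_lines)

# Print the languages from smallest to largest dictionary.
for lang, ratio in sorted(ratios_to_fr.items(), key=lambda kv: kv[1]):
    print(f"{lang}: {ratio:.2f}x the FR line count")
```

By total lines, CA comes out at almost exactly twice FR, PT at roughly 2.35 times, and ES at roughly 5.7 times.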

Portuguese spelling dicts:

Lines in PT spelling dicts before tokenization:
  9960408 pt_AO1.txt
 10485607 pt_BR1.txt
  9163224 pt_MZ1.txt
 11535687 pt_PT1.txt

Lines in PT spelling dicts after tokenization, removing enclitics:
  5047492 pt_AO2.txt
  2787300 pt_BR2.txt
  4283146 pt_MZ2.txt
  6360406 pt_PT2.txt
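
For a quick view of how much the tokenization step shrinks each variant, here is a minimal sketch that computes the relative reduction from the line counts listed above. The counts are copied from this issue; the dictionary and function names are hypothetical, and the sketch does not reproduce the tokenization itself.

```python
# Line counts of the PT spelling dicts, copied from the figures in this issue.
before = {"pt_AO": 9960408, "pt_BR": 10485607, "pt_MZ": 9163224, "pt_PT": 11535687}
after = {"pt_AO": 5047492, "pt_BR": 2787300, "pt_MZ": 4283146, "pt_PT": 6360406}


def reduction(before_count, after_count):
    """Fraction of lines removed, e.g. 0.25 means 25% fewer lines."""
    return 1 - after_count / before_count


reductions = {variant: reduction(before[variant], after[variant]) for variant in before}

for variant, frac in reductions.items():
    print(f"{variant}: {before[variant]} -> {after[variant]} ({frac:.1%} fewer lines)")
```

With these figures, pt_BR shrinks the most (about 73% fewer lines), while the other three variants lose between roughly 45% and 53%.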