harunzafer / nuve

Natural Language Processing Library for Turkish in C#
MIT License

longest Turkish word #70

Open garfieldnate opened 9 years ago

garfieldnate commented 9 years ago

I found this cool word on Wikipedia:

muvaffakiyetsizleştiricileştiriveremeyebileceklerimizdenmişsinizcesine

It seems the system can't handle it. That might be fine, but it might also be a good future benchmark :)

garfieldnate commented 9 years ago

It does seem to handle muvaffakiyetsizleştirecekleri, though.

hrzafer commented 9 years ago

When the final 'cesine' part is removed, the system can handle the word:

muvaffakiyetsizleştiricileştiriveremeyebileceklerimizdenmişsiniz

This is because the CAsInA suffix is defined so that it cannot come after a person suffix. Apart from the first one, I'm not sure that the following are valid Turkish words:

insanmışcasına [valid]
insanmışımcasına
insanmışsıncasına
insanmışızcasına
insanmışsınızcasına [the same case as the Wikipedia example]
insanmışlarcasına

If they are, it is trivial to change the definition.
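
To illustrate the kind of rule described above, here is a minimal Python sketch of a "may not directly follow" constraint. It is not Nuve's actual definition format or API: the labels (`EVID`, `P1SG`, `P2PL`, and so on), the `FORBIDDEN_PREDECESSORS` table, and the `is_allowed` helper are all hypothetical names made up for this example.

```python
# Toy morphotactics check. All labels and the rule table are hypothetical;
# they only illustrate the idea and are not taken from Nuve.
PERSON_SUFFIXES = {"P1SG", "P2SG", "P1PL", "P2PL", "P3PL"}

# Maps a suffix to the set of suffixes that may NOT directly precede it.
FORBIDDEN_PREDECESSORS = {
    "CAsInA": PERSON_SUFFIXES,
}


def is_allowed(sequence):
    """Return True if no suffix directly follows a forbidden predecessor."""
    for prev, cur in zip(sequence, sequence[1:]):
        if prev in FORBIDDEN_PREDECESSORS.get(cur, set()):
            return False
    return True


# insanmışcasına: evidential copula, then CAsInA
print(is_allowed(["EVID", "CAsInA"]))          # True
# insanmışsınızcasına: a person suffix sits between the two
print(is_allowed(["EVID", "P2PL", "CAsInA"]))  # False
```

Under this sketch, relaxing the rule would amount to deleting the `CAsInA` entry from the table, after which every form in the list above would pass.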

By the way, the system can handle the following:

muvaffakiyetsizleştiricileştiriveremeyebileceklerimizdenmişçesine

hrzafer commented 9 years ago

I'll need to check my Turkish morphology book which is like 1300 pages :)

garfieldnate commented 9 years ago

Neat! 17 out of 18 isn't bad. Nice long analysis:

muvaffakiyet/ISIM
sUz/IY_SIFAT_sUz
lAş/IY_FIIL_lAş
DUr/FY_ETTIRGENDUr(U)t
yUcU/FYTANIMLAMA(y)UcU
lAş/IY_FIIL_lAş
DUr/FY_ETTIRGENDUr(U)t
yUver/FC_YFTEZLIK(y)Uver
yAmA/FC_YFYETERSIZLIK(y)AmA
yAbil/FC_YFYETERLILIK(y)Abil
yAcAk/FIILIMSISIFAT(y)AcAK
lAr/IC_COGUL_lAr
UmUz/IC_SAHIPLIKBIZ(U)mUz
DAn/IC_HAL_AYRILMA_DAn
ymUş/EKFIILRIVAYET(y)mUş
CAsInA/IY_ZARF_CAsInA
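
For anyone who wants to inspect an analysis like this programmatically, here is a small Python sketch that splits the pasted text into (surface, id) pairs. The `parse_analysis` helper is a made-up name, not part of Nuve, and the ids are copied exactly as they appear in this comment, so they may not match the library's internal ids character for character.

```python
# Analysis text copied verbatim from the comment above.
ANALYSIS = (
    "muvaffakiyet/ISIM sUz/IY_SIFAT_sUz lAş/IY_FIIL_lAş "
    "DUr/FY_ETTIRGENDUr(U)t yUcU/FYTANIMLAMA(y)UcU lAş/IY_FIIL_lAş "
    "DUr/FY_ETTIRGENDUr(U)t yUver/FC_YFTEZLIK(y)Uver "
    "yAmA/FC_YFYETERSIZLIK(y)AmA yAbil/FC_YFYETERLILIK(y)Abil "
    "yAcAk/FIILIMSISIFAT(y)AcAK lAr/IC_COGUL_lAr "
    "UmUz/IC_SAHIPLIKBIZ(U)mUz DAn/IC_HAL_AYRILMA_DAn "
    "ymUş/EKFIILRIVAYET(y)mUş CAsInA/IY_ZARF_CAsInA"
)


def parse_analysis(text):
    """Split whitespace-separated 'surface/id' tokens into (surface, id) pairs."""
    pairs = []
    for token in text.split():
        # Split on the first '/' only; the id part is kept as-is.
        surface, _, morpheme_id = token.partition("/")
        pairs.append((surface, morpheme_id))
    return pairs


morphemes = parse_analysis(ANALYSIS)
root, suffixes = morphemes[0], morphemes[1:]

# Print one morpheme per line: surface form, then its id.
for surface, morpheme_id in morphemes:
    print(f"{surface:14}{morpheme_id}")
```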