segment-any-text / wtpsplit

Toolkit to segment text into sentences or other semantic units in a robust, efficient and adaptable way.
MIT License

Could not find a mixture for the Universal Dependencies (UD) style in Thai language #107

Closed pavaris-pm closed 1 year ago

pavaris-pm commented 1 year ago

I have been trying to use wtpsplit on Thai text with the 'ud' style, as follows:

# specify language code to be 'th' and style='ud' according to the paper
wtp.split(text, lang_code="th", style='ud')

However, this raised the following error:

ValueError: Could not find a mixture for the style 'ud'.

I also checked the language_info.csv file and found that the UD style is listed for Thai as UD_Thai-PUD.

I tried another supported style, OPUS100, and it works; only the UD style returns an error. Is this a bug, or did I misunderstand something?

Thank you

bminixhofer commented 1 year ago

This is a bit of a subtle problem: we have a UD test set for Thai, but no training set. That's why we evaluate on UD_Thai-PUD in the paper but don't train on it (so there is no mixture). You can verify this by checking whether the tables in the Appendix (in this case Table 13) report a number for WtP_PUNCT.
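
For anyone hitting the same error, one possible workaround is to try the preferred style first and fall back to a style that does have a mixture for the language. The snippet below is only a sketch: `split_with_fallback` and `StubSplitter` are hypothetical names, not part of wtpsplit, and the lowercase style names are assumptions. The stub stands in for a loaded model so the example runs without wtpsplit installed; the only behavior taken from this thread is that a missing mixture raises a `ValueError` whose message mentions "mixture".

```python
def split_with_fallback(splitter, text, lang_code, styles):
    """Try each style in order and return (style_used, sentences).

    Hypothetical helper: `splitter` is any object with a
    split(text, lang_code=..., style=...) method, and `text` is a single string.
    """
    for style in styles:
        try:
            return style, list(splitter.split(text, lang_code=lang_code, style=style))
        except ValueError as err:
            # Only swallow the "no mixture" error seen in this issue;
            # re-raise anything else so unrelated problems are not hidden.
            if "mixture" not in str(err):
                raise
    raise ValueError(
        f"No mixture found for lang_code={lang_code!r} in styles {list(styles)!r}"
    )


class StubSplitter:
    """Stand-in for a real model (assumption: only 'opus100' is available for 'th')."""

    available = {("th", "opus100")}

    def split(self, text, lang_code, style):
        if (lang_code, style) not in self.available:
            raise ValueError(f"Could not find a mixture for the style '{style}'.")
        # Placeholder segmentation on spaces, not real Thai sentence splitting.
        return text.split(" ")


style_used, sentences = split_with_fallback(
    StubSplitter(), "สวัสดี ครับ", "th", ["ud", "opus100"]
)
print(style_used, sentences)
```

With a real model, the stub would be replaced by the loaded splitter object. Which styles are actually available for a given language should be checked against the paper's tables rather than assumed from language_info.csv alone.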