segment-any-text / wtpsplit

Toolkit to segment text into sentences or other semantic units in a robust, efficient and adaptable way.
MIT License

Opus100 FR not in mixtures #99

Closed (intelliqua closed this issue 1 year ago)

intelliqua commented 1 year ago

Hi,

Table 8 in the paper indicates that the training data includes Opus100 FR. However, it does not appear to be present in the mixtures I checked.

from wtpsplit import WtP

# Load the model, then request the opus100 style for French
wtp = WtP("wtp-canine-s-12l")
wtp.split("Bonjour", lang_code="fr", style="opus100")

bminixhofer commented 1 year ago

Table 8 in the paper shows evaluation dataset sizes, not training data. Unfortunately, there is no OPUS100 training data for French, so there are no mixtures for it. That's why there are also no numbers for $WtP_T$ and $WtP_U$ on French OPUS100 in Table 10.
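For anyone who wants to check up front whether a style exists for a language, here is a minimal sketch. It assumes the mixtures form a nested mapping from language code to style name; that the loaded model exposes this mapping as `wtp.mixtures` is an assumption, not something confirmed in this thread. The helper names and the toy dictionary are hypothetical and do not reflect the real contents of any model's mixtures.

```python
def available_styles(mixtures, lang_code):
    """Return the sorted style names stored for lang_code, or [] if there are none."""
    return sorted(mixtures.get(lang_code, {}))


def has_style(mixtures, lang_code, style):
    """Return True if a mixture for (lang_code, style) is present."""
    return style in mixtures.get(lang_code, {})


# Toy stand-in for the real mixtures (hypothetical values, for illustration only).
toy_mixtures = {
    "fr": {"ud": {}},
    "de": {"ud": {}, "opus100": {}},
}

print(available_styles(toy_mixtures, "fr"))
print(has_style(toy_mixtures, "fr", "opus100"))
```

If the assumption about the attribute holds, passing `wtp.mixtures` in place of `toy_mixtures` would list the styles that can be given to `wtp.split(..., style=...)` for a language.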