segment-any-text / wtpsplit

Toolkit to segment text into sentences or other semantic units in a robust, efficient and adaptable way.
MIT License

Opus100 FR not in mixtures #99

Closed: intelliqua closed this issue 12 months ago

intelliqua commented 12 months ago

Hi,

Table 8 in the paper indicates that the training data includes Opus100 FR. However, it does not seem to be present in the mixtures I checked.

from wtpsplit import WtP

wtp = WtP("wtp-canine-s-12l")
wtp.split("Bonjour", lang_code="fr", style="opus100")

bminixhofer commented 12 months ago

Table 8 in the paper shows evaluation dataset sizes. Unfortunately, there's no training data for OPUS100 in French, so there are no mixtures for it. That's why there are also no numbers for $WtP_T$ and $WtP_U$ on French OPUS100 in Table 10.
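For anyone hitting the same situation, one way to cope with a missing (language, style) mixture is to check for it before asking for a style, and fall back to no style otherwise. The sketch below is only an illustration of that guard: `resolve_style` and `toy_available` are hypothetical names, and the mapping is placeholder data, not the actual contents of the wtpsplit mixtures. The only fact taken from this thread is that French has no `opus100` mixture.

```python
from typing import Mapping, Optional, Set


def resolve_style(
    available: Mapping[str, Set[str]], lang_code: str, style: str
) -> Optional[str]:
    """Return `style` if a mixture exists for (lang_code, style), else None."""
    if style in available.get(lang_code, set()):
        return style
    return None


# Placeholder data (assumption): "xx" and "some-other-style" are made up.
# Only the absence of "opus100" for "fr" reflects what this thread states.
toy_available = {"fr": {"some-other-style"}, "xx": {"opus100"}}

print(resolve_style(toy_available, "fr", "opus100"))  # None
print(resolve_style(toy_available, "xx", "opus100"))  # opus100
```

When the result is `None`, a caller could presumably call `wtp.split(text, lang_code="fr")` without a `style` argument, assuming the installed version accepts that.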