Accuracy: Error in Split (EN)

segment-any-text / wtpsplit

Toolkit to segment text into sentences or other semantic units in a robust, efficient and adaptable way.

MIT License

Environment: Linux with a 4090 GPU. We found unexpected output from split.

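from wtpsplit import WtP

# Load the model, then move it to the GPU in half precision
wtp = WtP("wtp-canine-s-12l-no-adapters")
wtp.half().to("cuda")

# Three short metadata lines separated by single newlines
txt = """Title: A monkey's Tale
Rating: O
Words: 104"""
r = wtp.split(txt, "en")
print(r)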

Actual:

['Title: ', 'A ', "monkey's Tale\n", 'Rating: O\n', 'Words: 104']

Expected:

["Title: A monkey's Tale\n", 'Rating: O\n', 'Words: 104']

The input lines are separated by single newlines. We did not expect "A monkey's Tale" to be split into two segments, or "Title: " to be separated from it.

Perhaps adding a few training samples in this kind of short, list-like format would eliminate such corner cases.
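In the meantime, a possible workaround is to keep short lines whole and pass only longer lines to the model. The sketch below is only an illustration under assumptions: split_short_lines, naive_segment, and the max_len cutoff of 40 characters are hypothetical names and values, not part of wtpsplit, and naive_segment is a stand-in so the example runs without loading a model.

```python
from typing import Callable, List


def split_short_lines(
    text: str,
    segment: Callable[[str], List[str]],
    max_len: int = 40,  # hypothetical cutoff; tune for your data
) -> List[str]:
    """Keep lines of at most max_len characters whole; segment longer ones."""
    out: List[str] = []
    # keepends=True preserves the trailing "\n" on each line
    for line in text.splitlines(keepends=True):
        if len(line.strip()) <= max_len:
            out.append(line)  # short, list-like line: treat as one unit
        else:
            out.extend(segment(line))  # longer line: defer to the segmenter
    return out


def naive_segment(s: str) -> List[str]:
    """Stand-in segmenter (hypothetical) that returns its input unsplit."""
    return [s]


txt = """Title: A monkey's Tale
Rating: O
Words: 104"""

result = split_short_lines(txt, naive_segment)
print(result)
```

With the model from the report, the segmenter argument could be lambda s: wtp.split(s, "en"). Joining the returned pieces should reproduce the input as long as the segmenter itself preserves the text of each line.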

Accuracy: Error in Split (EN) #113