segment-any-text / wtpsplit

Toolkit to segment text into sentences or other semantic units in a robust, efficient and adaptable way.
MIT License

Accuracy: Error in Split (EN) #113

Closed: Qubitium closed this issue 5 months ago

Qubitium commented 6 months ago

Environment: Linux with a 4090 GPU. We found unexpected output from wtp.split().

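from wtpsplit import WtP

# Load the model, then run it in half precision on the GPU
wtp = WtP("wtp-canine-s-12l-no-adapters")
wtp.half().to("cuda")

# Three short "key: value" lines separated by single newlines
txt = """Title: A monkey's Tale
Rating: O
Words: 104"""
r = wtp.split(txt, "en")
print(r)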

Actual:

['Title: ', 'A ', "monkey's Tale\n", 'Rating: O\n', 'Words: 104']

Expected:

["Title: A monkey's Tale\n", 'Rating: O\n', 'Words: 104']

The input contains single newlines. We did not expect the first line, "Title: A monkey's Tale", to be split into several segments.

Perhaps a few training samples with this type of short, list-like format would eliminate such corner cases.
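
For this particular input, the expected output is the same as splitting on newlines while keeping each newline attached to its line. The sketch below uses only the standard library to build that reference and to compare a list of segments against it. The helper name check_segments is hypothetical and is not part of wtpsplit; the "actual" list is copied from the report above.

```python
txt = """Title: A monkey's Tale
Rating: O
Words: 104"""

# Reference segmentation: one segment per line, newline kept at the end
expected = txt.splitlines(keepends=True)


def check_segments(segments, text):
    """Return (lossless, one_per_line) for a list of segments.

    lossless: the segments concatenate back to the original text.
    one_per_line: the segments match a plain newline split exactly.
    """
    lossless = "".join(segments) == text
    one_per_line = segments == text.splitlines(keepends=True)
    return lossless, one_per_line


# Output reported in this issue
actual = ['Title: ', 'A ', "monkey's Tale\n", 'Rating: O\n', 'Words: 104']

print(check_segments(actual, txt))
print(check_segments(expected, txt))
```

The reported output is still lossless (no characters are dropped), so the problem is only where the boundaries fall, not missing text.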

bminixhofer commented 5 months ago

Hi!

There was a subtle bug in the hash embeddings which affected some texts in some models. This is fixed in 1.3.0. The example should now give the expected output ["Title: A monkey's Tale\n", 'Rating: O\n', 'Words: 104'].
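
To confirm that an environment has the release containing the fix, one option is to compare the installed version against 1.3.0. This is a minimal sketch, assuming the package is installed under the distribution name "wtpsplit"; the helpers parse_version, has_fix and installed_has_fix are hypothetical names for illustration, and the version parsing is deliberately simplified (pre-release suffixes are ignored).

```python
from importlib import metadata

FIXED_IN = (1, 3, 0)  # release named in the comment above


def parse_version(version):
    """Turn a string like '1.3.0' into (1, 3, 0).

    Simplification: only the leading digits of the first three
    dot-separated parts are used; suffixes such as 'rc1' are ignored.
    """
    parts = []
    for piece in version.split(".")[:3]:
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        parts.append(int(digits or 0))
    while len(parts) < 3:
        parts.append(0)
    return tuple(parts)


def has_fix(version):
    """True if the given version string is at least 1.3.0."""
    return parse_version(version) >= FIXED_IN


def installed_has_fix():
    """Check the installed package; None if it is not installed."""
    try:
        return has_fix(metadata.version("wtpsplit"))
    except metadata.PackageNotFoundError:
        return None


print(installed_has_fix())
```

If this prints False, upgrading the package and rerunning the original snippet is the natural next step.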