segment-any-text / wtpsplit

Toolkit to segment text into sentences or other semantic units in a robust, efficient and adaptable way.
MIT License
758 stars, 44 forks

Any string that isn't a multiple of 4 causes an assert failure #98

Closed. intelliqua closed this issue 1 year ago

intelliqua commented 1 year ago

Hi,

Any input string whose length isn't a multiple of 4 causes an assert failure at line 548 in models.py: "assert char_encoding.shape[1] % self.conv.stride[0] == 0"

The stride is initialised to config.downsampling_rate (4) in modeling_canine.py in the transformers library.

Sample code causing the assert failure (the input string has length 35):

    from wtpsplit import WtP
    wtp = WtP("wtp-canine-s-12l")
    wtp.split("This is a test This is another test", lang_code="en")

Sample code that works (the added full stop brings the input string length to 36):

    from wtpsplit import WtP
    wtp = WtP("wtp-canine-s-12l")
    wtp.split("This is a test This is another test.", lang_code="en")
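For reference, the arithmetic behind the two cases can be checked without loading the model. This is only a sketch of the divisibility condition in the assert, assuming the checked sequence length equals the character count of the input (which matches the 35 vs. 36 observation above but may not hold in general). The helper name padding_needed is hypothetical and is not part of wtpsplit or transformers, and this is not the fix that was applied in the library.

```python
# Value of config.downsampling_rate cited in this report.
DOWNSAMPLING_RATE = 4


def padding_needed(length: int, stride: int = DOWNSAMPLING_RATE) -> int:
    """Hypothetical helper: extra positions needed to reach the next multiple of stride."""
    return (-length) % stride


failing = "This is a test This is another test"   # 35 characters
working = "This is a test This is another test."  # 36 characters

for text in (failing, working):
    n = len(text)
    # Print the length, the remainder the assert checks, and the padding that would align it.
    print(n, n % DOWNSAMPLING_RATE, padding_needed(n))
```

A remainder of 0 corresponds to the case where the assert passes; any other remainder corresponds to the failure reported here.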

bminixhofer commented 1 year ago

Oof, that's a big one, sorry about that. It's a symptom of being lazy and only testing wtp-bert-mini in CI.

It's fixed in v1.2.3, can you confirm it works now?

intelliqua commented 1 year ago

Thanks! That was quick. Yes, it is fixed 👍