segment-any-text / wtpsplit

Toolkit to segment text into sentences or other semantic units in a robust, efficient and adaptable way.
MIT License

ValueError when using wtp-canine-s-12l-no-adapters on Danish #93

Closed lise-brinck closed 1 year ago

lise-brinck commented 1 year ago

When using wtp-canine-s-12l-no-adapters for Danish with style "ud", I encounter a ValueError on one specific text.

Specs:

Python version: 3.9.15

Steps to reproduce:

In a clean environment, I installed only wtpsplit (plus pandas, which is a missing requirement).

from wtpsplit import WtP

# Danish input that triggers the error
text = 'Vinderne af Club Syds quiz er fundet\n06 februar 2012 kl. 16.58\nVinderne af Club Syds quiz er fundet. Stort tillykke til de tre vindere af en iPad. Quizzen fortsætter i denne uge, hvor præmierne er tre flotte fladskærms-TV.\nSidste uges rigtige svar var:\nFredericia Stadion (Monjasa Park)\nPræmierne er en iPad til hver af de heldige vindere, og de er nu på vej til:\nJørgen Ladegaard\ni Asperup\nIngelise Smith Hansen\ni Haderslev\nog \nGudrun Zederkof\nLunderskov\n'
model = WtP("wtp-canine-s-12l-no-adapters")
sents = model.split(text, lang_code="da", style="ud")

Output:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/.env/lib/python3.9/site-packages/wtpsplit/__init__.py", line 285, in split
    return next(
  File "/.env/lib/python3.9/site-packages/wtpsplit/__init__.py", line 365, in _split
    for text, probs in zip(
  File "/.env/lib/python3.9/site-packages/wtpsplit/__init__.py", line 232, in _predict_proba
    outer_batch_logits = extract(
  File "/.env/lib/python3.9/site-packages/wtpsplit/extract.py", line 175, in extract
    out = model(
  File "/.env/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/.env/lib/python3.9/site-packages/transformers/models/canine/modeling_canine.py", line 1521, in forward
    outputs = self.canine(
  File "/.env/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/.env/lib/python3.9/site-packages/transformers/models/canine/modeling_canine.py", line 1145, in forward
    molecule_attention_mask = self._downsample_attention_mask(
  File "/.env/lib/python3.9/site-packages/transformers/models/canine/modeling_canine.py", line 1061, in _downsample_attention_mask
    batch_size, char_seq_len = char_attention_mask.shape
ValueError: too many values to unpack (expected 2)

bminixhofer commented 1 year ago

Thanks so much for the minimal repro! And that's really weird, will look into it asap.

bminixhofer commented 1 year ago

Can you try again? There was an error in the model config on HuggingFace. No need for any wtpsplit update, just rerun the code.
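
For readers wondering how the error surfaced: the last frame of the traceback unpacks the attention mask's shape into exactly two names, so any mask with more than two dimensions raises this ValueError. The sketch below reproduces only that mechanism with plain tuples instead of tensors. The helper names and the 3-D shape are hypothetical, and which config field was wrong is not stated in this thread.

```python
def unpack_mask_shape(char_attention_mask_shape):
    # Mirrors the unpacking in the last traceback frame:
    # exactly two entries (batch, sequence length) are expected.
    batch_size, char_seq_len = char_attention_mask_shape
    return batch_size, char_seq_len


def try_unpack(shape):
    # Return the unpacked values, or the error message if unpacking fails.
    try:
        return unpack_mask_shape(shape)
    except ValueError as err:
        return str(err)


ok = try_unpack((1, 512))      # 2-D shape: unpacks fine
bad = try_unpack((1, 1, 512))  # hypothetical 3-D shape: one entry too many

print(ok)
print(bad)
```

The second call fails in the same way as the traceback, which fits the fix having been made on the model side rather than in wtpsplit.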

lise-brinck commented 1 year ago

Thank you for the quick reply and the quick fix! It seems to be working now :)

bminixhofer commented 1 year ago

Great! Closing this then.