tingofurro / summac

Codebase, data and models for the SummaC paper in TACL
https://arxiv.org/abs/2111.09525
Apache License 2.0
81 stars 20 forks

IndexError: list index out of range #12

Open UntotaufUrlaub opened 1 year ago

UntotaufUrlaub commented 1 year ago

Hi,

I encountered an error:

File "/add_score.py", line 53, in add_score
    res = function(["? I haven't had a birthday since 2007. I have a b-day in October and it's almost completely ignored."], ["",])
  File "/add_score_summac.py", line 28, in <lambda>
    "my_summacZS_batched": lambda summs, docs: modelZS.score(docs, summs)['scores'],
  File "/usr/local/lib/python3.9/site-packages/summac/model_summac.py", line 351, in score
    score = self.score_one(source, gen)
  File "/usr/local/lib/python3.9/site-packages/summac/model_summac.py", line 322, in score_one
    image = self.imager.build_image(original, generated)
  File "/usr/local/lib/python3.9/site-packages/summac/model_summac.py", line 113, in build_image
    generated_chunks = self.split_text(generated, granularity=gran_sum)
  File "/usr/local/lib/python3.9/site-packages/summac/model_summac.py", line 94, in split_text
    return self.split_sentences(text)
  File "/usr/local/lib/python3.9/site-packages/summac/model_summac.py", line 71, in split_sentences
    sentences = nltk.tokenize.sent_tokenize(text)
  File "/usr/local/lib/python3.9/site-packages/nltk/tokenize/__init__.py", line 107, in sent_tokenize
    return tokenizer.tokenize(text)
  File "/usr/local/lib/python3.9/site-packages/nltk/tokenize/punkt.py", line 1276, in tokenize
    return list(self.sentences_from_text(text, realign_boundaries))
  File "/usr/local/lib/python3.9/site-packages/nltk/tokenize/punkt.py", line 1332, in sentences_from_text
    return [text[s:e] for s, e in self.span_tokenize(text, realign_boundaries)]
  File "/usr/local/lib/python3.9/site-packages/nltk/tokenize/punkt.py", line 1332, in <listcomp>
    return [text[s:e] for s, e in self.span_tokenize(text, realign_boundaries)]
  File "/usr/local/lib/python3.9/site-packages/nltk/tokenize/punkt.py", line 1322, in span_tokenize
    for sentence in slices:
  File "/usr/local/lib/python3.9/site-packages/nltk/tokenize/punkt.py", line 1421, in _realign_boundaries
    for sentence1, sentence2 in _pair_iter(slices):
  File "/usr/local/lib/python3.9/site-packages/nltk/tokenize/punkt.py", line 318, in _pair_iter
    prev = next(iterator)
  File "/usr/local/lib/python3.9/site-packages/nltk/tokenize/punkt.py", line 1395, in _slices_from_text
    for match, context in self._match_potential_end_contexts(text):
  File "/usr/local/lib/python3.9/site-packages/nltk/tokenize/punkt.py", line 1382, in _match_potential_end_contexts
    before_words[match] = split[-1]
IndexError: list index out of range

I think it is caused by the leading "? ", which might result in an empty sentence within the metric. Is this expected and documented somewhere, or is it a bug?

kind regards

Edit: I circumvented (not fixed) this issue for now using this code:

import re

# strip a leading punctuation-only "sentence" (e.g. "? ") before scoring
match = re.match(r"(\s*[.?!]+\s)", summaries[i])
if match:
    summaries[i] = summaries[i][len(match.group(1)):]

because leading empty sentences with symbols other than "?" also caused this issue.
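For reference, a slightly more general version of this workaround (my own sketch, not part of SummaC) could be wrapped in a helper that strips any leading punctuation-only "sentence" from a summary before scoring. The function name and regex are my assumptions, not from the library:

```python
import re


def strip_leading_punct(text: str) -> str:
    """Remove a leading run of sentence-ending punctuation (e.g. "? ", "!! ")
    that NLTK's Punkt tokenizer may otherwise split into an empty sentence,
    triggering the IndexError above."""
    match = re.match(r"\s*[.?!]+\s+", text)
    if match:
        return text[match.end():]
    return text
```

This could then be applied to each summary before calling `modelZS.score`, leaving inputs without the problematic prefix untouched.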