pierre-24 / pyiso4

Implementation of the ISO 4 standard for journal titles abbreviations in Python.
MIT License

IndexError for certain words at the end of input #12

Closed klb2 closed 2 months ago

klb2 commented 3 months ago

I ran into an odd issue: abbreviation fails with an IndexError when certain words appear at the end of the input string.

I found the following examples:

MWE:

from pyiso4.ltwa import Abbreviate
a = Abbreviate.create()
a("IEEE Transactions on Automatic Control")
a("IEEE Transactions on Wireless Communications, to appear")

The following IndexError is thrown:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File ".../python3.12/site-packages/pyiso4/ltwa.py", line 282, in __call__
    abbrv, len_ = self.abbreviate(
                  ^^^^^^^^^^^^^^^^
  File ".../python3.12/site-packages/pyiso4/ltwa.py", line 196, in abbreviate
    return Abbreviate.match_capitalization_and_diacritic(pattern.replacement, guide), len(pattern.pattern)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File ".../python3.12/site-packages/pyiso4/ltwa.py", line 180, in match_capitalization_and_diacritic
    unided = unidecode(original[i])
                       ~~~~~~~~^^^
IndexError: string index out of range

klb2 commented 3 months ago

I found a more relevant example where this issue occurs: "IEEE Transactions on Automatic Control" throws the same IndexError.

klb2 commented 2 months ago

I did some debugging. For the example above, the arguments passed to match_capitalization_and_diacritic are abbrv = 'control.' and original = 'Control'.

Since the abbreviation is longer than the original word (because of the unnecessary trailing .), the loop indexes past the end of original and raises an IndexError. PR #14 solves this by shortening the abbreviation whenever it is longer than the original word, i.e., whenever it is not actually an abbreviation.
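
For anyone following along, here is a minimal, self-contained sketch of the failure mode and of the kind of guard described above. It is not the pyiso4 source and not the code from PR #14: the function names match_capitalization_naive and match_capitalization_guarded are hypothetical, diacritic handling is left out, and the loop shape is an assumption based on the traceback (indexing original[i] while walking the abbreviation).

```python
def match_capitalization_naive(abbrv: str, original: str) -> str:
    """Copy the case of each character of `original` onto `abbrv`.

    Hypothetical simplification: assumes the loop walks `abbrv`
    and reads `original[i]`, as the traceback suggests.
    """
    out = []
    for i, ch in enumerate(abbrv):
        # Raises IndexError as soon as i >= len(original)
        out.append(ch.upper() if original[i].isupper() else ch)
    return "".join(out)


def match_capitalization_guarded(abbrv: str, original: str) -> str:
    """Same idea, but the abbreviation is never longer than the original."""
    if len(abbrv) > len(original):
        # Not really an abbreviation: cut it down to the original's length
        abbrv = abbrv[:len(original)]
    return match_capitalization_naive(abbrv, original)


# Arguments reported in the debugging comment above
try:
    match_capitalization_naive("control.", "Control")
    naive_failed = False
except IndexError:
    naive_failed = True

guarded = match_capitalization_guarded("control.", "Control")
print(naive_failed, guarded)
```

With the reported arguments, the naive version fails on the trailing "." (index 7 of a 7-character word), while the guarded version drops the extra character before matching the capitalization.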