repodiac / german_compound_splitter

Compound splitter for the German language ("Komposita-Zerlegung"), based on a large dictionary combined with highly efficient multi-pattern string search
Creative Commons Attribution 4.0 International

fix index out of range #2

Closed sebag90 closed 3 years ago

sebag90 commented 3 years ago

This should fix the problem where some words throw an index out of range error.

repodiac commented 3 years ago

Hi @sebag90, thanks for filing the issue and for the PR! (And again: sorry for the delay, I did not see/get your message(s) ... :( )

Unfortunately, I cannot reproduce - I just downloaded the most recent version of german.dic (last change: 5/4/2021) and used this code:

from german_compound_splitter import comp_split

compound = 'Pflanzenart'
input_file = '/tmp/german.dic'
# Load the dictionary from the downloaded file
ahocs = comp_split.read_dictionary_from_file(input_file)

# Split the compound, then print the plain and the post-merge result
dissection = comp_split.dissect(compound, ahocs, only_nouns=True)
print('SPLIT WORDS (plain):', dissection)
print('SPLIT WORDS (post-merge):', comp_split.merge_fractions(dissection))

The output was this:

Loading data file - /tmp/german.dic
Dissect compound:  Pflanzenart
SPLIT WORDS (plain): ['Pflanze', 'n', 'Art']
SPLIT WORDS (post-merge): ['Pflanze', 'n', 'Art']

Can you maybe retry or provide the german.dic file?

repodiac commented 3 years ago

I tried different things with "Pflanzenart" and variations of it. No error. Looking at how the results list is modified, it also seems pretty difficult (not to say impossible) to run into an index out of range error, I suppose.

Again, if you can provide me with an example and maybe the precise dictionary file you used, I can try to reproduce. "Unfortunately", it (still) works for me so far...
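
One way to narrow down a reproducing example might be to run a word list through the splitter and collect every word that raises an IndexError. The following is only a minimal sketch: find_failing_words and its dissect_fn parameter are hypothetical names introduced here, not part of german_compound_splitter, and the splitter call is passed in as a function so the helper itself makes no assumptions about the library.

```python
def find_failing_words(words, dissect_fn):
    """Return (word, error message) pairs for every word on which
    dissect_fn raises IndexError.

    dissect_fn is a hypothetical callable taking a single word; it is
    expected to wrap the actual splitter call.
    """
    failures = []
    for word in words:
        try:
            dissect_fn(word)
        except IndexError as exc:
            # Keep the word and the message so the case can be reported
            failures.append((word, str(exc)))
    return failures
```

With the snippet from the earlier comment, the callable could be something like lambda w: comp_split.dissect(w, ahocs, only_nouns=True), using the same ahocs object loaded from the dictionary file in question. Any word returned, together with that dictionary file, would be the kind of example asked for above.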

repodiac commented 3 years ago

As described under section "Issues", I will close the PR without merging it.