repodiac / german_compound_splitter

Compound splitter for the German language ("Komposita-Zerlegung") based on a large dictionary combined with a highly efficient multi-pattern string search
Creative Commons Attribution 4.0 International

IndexError: list index out of range #3

Open danilyef opened 2 years ago

danilyef commented 2 years ago

Unfortunately, the "index out of range" issue is not fixed:

    246     if only_nouns:
    247         # workaround to prevent unwanted behaviour (only nouns are eligible)
--> 248         results[0] = results[0][0].upper() + results[0][1:]
    249         for ri in range(len(results) - 1):
    250             if results[ri].islower():

Example words: 'Rechtsanwält', 'Schätzmeister', 'Ferialjob', 'Infrasturktur'

dissection = comp_split.dissect(one_of_example_words, ahocs, make_singular=True)
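
The message "list index out of range" (rather than "string index out of range") indicates that `results` itself is empty when line 248 runs. Why it ends up empty for these words is not established in this thread. As a minimal sketch of the failure mode and of a guard, the following uses a hypothetical helper, `capitalize_first`, which is not part of german_compound_splitter:

```python
def capitalize_first(results):
    """Capitalize the first character of the first fragment, if there is one.

    Hypothetical helper for illustration only; it is not part of
    german_compound_splitter.
    """
    # Guard against both an empty list and an empty first string.
    if results and results[0]:
        results[0] = results[0][0].upper() + results[0][1:]
    return results


# Unguarded indexing into an empty list reproduces the reported error type.
try:
    [][0]
except IndexError as err:
    print(f"IndexError: {err}")

# With the guard, an empty list passes through unchanged.
print(capitalize_first([]))
print(capitalize_first(["rechts", "anwalt"]))
```

The guard only avoids the crash; it does not explain or change what the splitter finds for the example words.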

repodiac commented 2 years ago

Thanks @danilyef - I will look into it. In any case, you are more than welcome to provide a PR :-)

PythonJDoe commented 2 years ago

I'm not sure if this is the right place to post, but I couldn't find a forum for this, so I'm posting here. I'm having a problem with german_compound_splitter. I have a large list of German texts that I want to split and use for text mining. texts is the list containing the German texts, and I want to store the split results in another list, text. So I wrote the following code:

text=list()
for i in range(length):
    s=comp_split.merge_fractions(comp_split.dissect(texts[i], ahocs, make_singular=True))
    text.append(s)

But I'm getting the following error:

---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
/tmp/ipykernel_483/1238906536.py in <module>
      1 text=list()
      2 for i in range(length):
----> 3     s=comp_split.merge_fractions(comp_split.dissect(texts[i], ahocs, make_singular=True))
      4     text.append(s)

/opt/conda/lib/python3.9/site-packages/german_compound_splitter/comp_split.py in dissect(compound, ahocs, only_nouns, make_singular, mask_unknown)
    246     if only_nouns:
    247         # workaround to prevent unwanted behaviour (only nouns are eligible)
--> 248         results[0] = results[0][0].upper() + results[0][1:]
    249         for ri in range(len(results) - 1):
    250             if results[ri].islower():

IndexError: list index out of range

Can you please guide me to resolve the error?
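
Until the library itself is fixed, one possible workaround is to catch the IndexError per item, so that a single failing word does not abort the whole loop. The sketch below is an assumption-laden illustration: `split_all` and `fake_split` are hypothetical names, and `fake_split` only imitates a splitter that indexes into an empty result list.

```python
def split_all(texts, split_fn):
    """Apply split_fn to every item and record the items that fail.

    Hypothetical helper for illustration; split_fn stands in for whatever
    splitting call is being used.
    """
    results, failed = [], []
    for item in texts:
        try:
            results.append(split_fn(item))
        except IndexError:
            # Keep a placeholder so results stays aligned with texts.
            results.append(None)
            failed.append(item)
    return results, failed


def fake_split(word):
    # Stand-in splitter for demonstration only: it mimics a splitter that
    # indexes into an empty result list for words it cannot match.
    table = {"Haustür": ["Haus", "Tür"]}
    matches = table.get(word, [])
    matches[0]  # raises IndexError when matches is empty
    return matches


split, failed = split_all(["Haustür", "Ferialjob"], fake_split)
print(split)
print(failed)
```

In the loop from the question, `split_fn` could be `lambda t: comp_split.merge_fractions(comp_split.dissect(t, ahocs, make_singular=True))`. The `failed` list then shows which inputs trigger the error, which may also help when reporting them.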

repodiac commented 2 years ago

Hi @PythonJDoe - thanks for your inquiry. Sorry to hear you ran into this error. My time is limited at the moment, but I will look into it and get back to you asap. It seems to be the same or a similar error, and you are the second person to report it, so I agree it should be addressed.

emphasize commented 2 years ago

You shouldn't remove list items inside a (for) loop. The following should solve it:

# empties the list entry (if necessary) and removes it afterwards
# for single letters, it searches backwards for a non-empty entry and appends the letter to it
# str.title() capitalizes the first letter of each word and lowercases the rest

    if only_nouns and results:        
        results[0] = results[0].title()
        for ri in range(len(results) - 1):
            if results[ri].islower():
                merged = results[ri] + results[ri + 1].lower()
                if ahocs.exists(merged):   # does ahocs.exists() disregard capitalization?
                    results[ri] = merged.title()
                    results[ri + 1] = ""
                else:
                    if len(results[ri]) == 1:
                        artifact_single_letter = results[ri]
                        for i in range(1, ri+1):
                            if results[ri - i]:
                                results[ri - i] += artifact_single_letter
                                break
                        results[ri] = ""

    results = list(filter(None, results))
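
The underlying idea is that deleting entries while looping over `range(len(...))` shortens the list, so a later index can run past the end. Blanking entries and filtering once after the loop keeps the length stable. Here is a minimal, self-contained sketch of that pattern, where `merge_known_pairs` and the `known` set are hypothetical stand-ins for the dictionary lookup and are not the library's code:

```python
def merge_known_pairs(parts, known):
    """Merge adjacent fragments whose concatenation is in `known`.

    Hypothetical illustration of the blank-then-filter pattern; `known`
    stands in for a dictionary lookup.
    """
    parts = list(parts)  # work on a copy; the length never changes below
    for i in range(len(parts) - 1):
        if not parts[i]:
            continue
        merged = parts[i] + parts[i + 1]
        if merged in known:
            # Blank the left entry instead of deleting it, and carry the
            # merged word forward so it can merge again with the next one.
            parts[i] = ""
            parts[i + 1] = merged
    # Remove the blanked entries in one pass after the loop.
    return [p for p in parts if p]


print(merge_known_pairs(["haus", "tür", "x"], {"haustür"}))
print(merge_known_pairs([], {"haustür"}))
```

Because nothing is removed inside the loop, every index produced by `range(len(parts) - 1)` stays valid, including for an empty input.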

repodiac commented 2 years ago

Thanks @emphasize - I appreciate your efforts. I haven't had time to look further into this issue yet, sorry. I'll try to check it asap.