Open danilyef opened 2 years ago
Thanks @danilyef - I will look into it. In any case, you are more than welcome to provide a PR :-)
I'm not sure if this is the right place to post, but I couldn't find a forum for this, so I'm posting here. I'm having a problem with german_compound_splitter. I have a large list of German texts that I want to split and use for text mining. texts is the list containing the German text that I want to split, and text is the list in which I want to store the results. So I wrote the following code:
text=list()
for i in range(length):
    s=comp_split.merge_fractions(comp_split.dissect(texts[i], ahocs, make_singular=True))
    text.append(s)
But I'm getting the following error:
---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
/tmp/ipykernel_483/1238906536.py in <module>
      1 text=list()
      2 for i in range(length):
----> 3     s=comp_split.merge_fractions(comp_split.dissect(texts[i], ahocs, make_singular=True))
      4     text.append(s)

/opt/conda/lib/python3.9/site-packages/german_compound_splitter/comp_split.py in dissect(compound, ahocs, only_nouns, make_singular, mask_unknown)
    246     if only_nouns:
    247         # workaround to prevent unwanted behaviour (only nouns are eligible)
--> 248         results[0] = results[0][0].upper() + results[0][1:]
    249         for ri in range(len(results) - 1):
    250             if results[ri].islower():

IndexError: list index out of range
Can you please guide me to resolve the error?
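Until the library itself is fixed, one possible workaround is to catch the IndexError per entry, so that a single problematic string does not abort the whole loop. The sketch below is an assumption and not part of german_compound_splitter: safe_dissect is a hypothetical helper that accepts any splitting callable and returns the input unsplit when that callable raises IndexError. fake_splitter is only a stand-in that imitates the failure shown in the traceback, so the example runs without the library or a dictionary.

```python
from typing import Callable, List


def safe_dissect(word: str, splitter: Callable[[str], List[str]]) -> List[str]:
    """Hypothetical helper: call `splitter`, fall back to [word] on IndexError."""
    try:
        return splitter(word)
    except IndexError:
        # The splitter could not handle this input; keep the word unsplit.
        return [word]


def fake_splitter(word: str) -> List[str]:
    """Stand-in for the real dissect call; fails the way the traceback does."""
    results: List[str] = [] if word.islower() else [word]
    results[0] = results[0][0].upper() + results[0][1:]  # IndexError if empty
    return results


texts = ["Haus", "unbekannt"]
text = [safe_dissect(t, fake_splitter) for t in texts]
print(text)
```

With the real library, the callable could presumably wrap the call from the question, for example lambda w: comp_split.dissect(w, ahocs, make_singular=True). This only hides the symptom; the affected words still come back unsplit.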
Hi @PythonJDoe - thanks for your inquiry. Sorry to hear you experienced this error. My time is limited currently, but I will look into it and get back to you asap. It seems to be the same or a similar error, and you are the second to mention it, so I agree it should be addressed.
You shouldn't remove list items inside a for loop that iterates over the same list. With the following change it should be solved:
# empties the list entry (if necessary) and removes it afterwards
# with single letters it reverse searches for non-empty entries and applies the letter
# str.title() capitalizes the first letter (and lowercases the rest)
if only_nouns and results:
    results[0] = results[0].title()
    for ri in range(len(results) - 1):
        if results[ri].islower():
            merged = results[ri] + results[ri + 1].lower()
            if ahocs.exists(merged):  # does ahocs.exists() disregard capitalization?
                results[ri] = merged.title()
                results[ri + 1] = ""
            else:
                if len(results[ri]) == 1:
                    artifact_single_letter = results[ri]
                    for i in range(1, ri + 1):
                        if results[ri - i]:
                            results[ri - i] += artifact_single_letter
                            break
                    results[ri] = ""
    results = list(filter(None, results))
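To illustrate the general point about not deleting list items inside a loop: range(len(parts) - 1) is evaluated once, so removing entries while the loop runs leaves later indices pointing past the end of the list. The following toy example is independent of the library; KNOWN and both function names are made up for illustration, and it does not claim to reproduce the library's actual code. It contrasts deleting in place with blanking entries and filtering them out afterwards.

```python
KNOWN = {"ab", "cd"}  # hypothetical stand-in for the dictionary lookup


def merge_with_del(parts):
    """Deletes inside the loop; the precomputed range no longer fits the list."""
    parts = list(parts)
    for i in range(len(parts) - 1):
        if parts[i] + parts[i + 1] in KNOWN:
            parts[i] += parts[i + 1]
            del parts[i + 1]  # list shrinks, later indices become invalid
    return parts


def merge_with_filter(parts):
    """Blanks merged entries and removes them once the loop is finished."""
    parts = list(parts)
    for i in range(len(parts) - 1):
        if parts[i] and parts[i] + parts[i + 1] in KNOWN:
            parts[i] += parts[i + 1]
            parts[i + 1] = ""  # mark for removal, keep the length stable
    return [p for p in parts if p]


fragments = ["a", "b", "c", "d"]
try:
    merge_with_del(fragments)
except IndexError as err:
    print("deleting in the loop failed:", err)
print(merge_with_filter(fragments))
```

Keeping the list length stable during the loop is what makes the index arithmetic safe; the cleanup then happens in a single pass at the end.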
Thanks @emphasize - I appreciate your efforts. I didn't have the time yet to look further into this issue, I am sorry. I'll try to check it asap.
Unfortunately, the "index out of range" issue is not fixed.
Example words: 'Rechtsanwält', 'Schätzmeister', 'Ferialjob', 'Infrasturktur'
dissection = comp_split.dissect(one_of_example_words, ahocs, make_singular=True)