radomirbosak / duden

CLI for http://duden.de dictionary written in Python
MIT License
99 stars 19 forks

Worttrennung can contain multiple values separated by comma #186

Open tbm opened 1 year ago

tbm commented 1 year ago

Some words have multiple word separation options, e.g. https://www.duden.de/rechtschreibung/Zivildienstleistender has:

Worttrennung Zi|vil|dienst|leis|ten|der, Zi|vil|dienst Leis|ten|der

This isn't handled well:

>>> w = duden.get("Zivildienstleistender")
>>> w.word_separation
['Zi', 'vil', 'dienst', 'leis', 'ten', 'der, Zi', 'vil', 'dienst Leis', 'ten', 'der']

ackhh commented 8 months ago

There are two spellings, "Zivildienstleistender" and "Zivildienst Leistender". Each has one word separation.

radomirbosak commented 6 months ago

@tbm Thank you for reporting this!

I didn't know that this problem existed. It is indeed not great.

As for how to handle this, my idea is:

  1. Return only word separation of first option, i.e. ['Zi', 'vil', 'dienst', 'leis', 'ten', 'der']
  2. Add another property, e.g. .word_separation_variants which would return all options split.

This way, a kind of backward compatibility would be preserved, but also different variants could be read.
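The two-step idea above could be sketched roughly as follows. This is only an illustration, not the library's code: it assumes the raw Worttrennung text is already available as one string, and the helper name `split_word_separation_variants` is hypothetical.

```python
def split_word_separation_variants(raw):
    """Split a raw Worttrennung string into one list of parts per variant.

    Hypothetical helper: variants are assumed to be separated by commas,
    and syllables within a variant by "|".
    """
    variants = []
    for variant in raw.split(","):
        variant = variant.strip()
        if variant:
            # A space inside a variant (e.g. "dienst Leis") is kept as-is.
            variants.append(variant.split("|"))
    return variants


raw = "Zi|vil|dienst|leis|ten|der, Zi|vil|dienst Leis|ten|der"

# What a .word_separation_variants property might return: all options.
all_variants = split_word_separation_variants(raw)

# What .word_separation might return: only the first option.
first_variant = all_variants[0] if all_variants else None

print(first_variant)
print(all_variants)
```

Returning the first variant from the existing property means callers that expect a flat list of syllables keep working, while the new property exposes everything.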

I see that the .name property is handled fine:

In [3]: w.name
Out[3]: 'Zivildienstleistender'

but the other variant "Zivildienst Leistender" cannot be read. Maybe this can be another property, e.g. .name_variants

This sounds like it would not be hard to implement.
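One possible way to get the spelling variants, sketched under an assumption: here they are derived from the Worttrennung string by dropping the "|" markers. The real page may expose the variants elsewhere, and the function name `name_variants_from_separation` is hypothetical.

```python
def name_variants_from_separation(raw):
    """Derive spelling variants from a raw Worttrennung string.

    Hypothetical helper: assumes comma-separated variants whose
    syllables are joined by "|".
    """
    return [
        variant.strip().replace("|", "")
        for variant in raw.split(",")
        if variant.strip()
    ]


raw = "Zi|vil|dienst|leis|ten|der, Zi|vil|dienst Leis|ten|der"
print(name_variants_from_separation(raw))
# → ['Zivildienstleistender', 'Zivildienst Leistender']
```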

tbm commented 6 months ago

I'm not so good at designing APIs. However, your suggestion to keep the current field as is, returning only the first option, and to add another property with all options sounds good to me.