Open simjanos-dev opened 5 months ago
Let me know if you have any questions about separable verbs in German (I have B2 or C1 reading/writing in German). I don't know much about how spaCy or lemmas work, though.
@jacovanc Thank you! :)
I'm learning German and wanted to let you know I 100% support this endeavor :raised_hands:
I don't know how much help I can be, but I know Python, though I've never worked with spaCy or NLP stuff. I'll have to do some research.
@simjanos-dev I think your proposed solution would probably work in most situations, but after digging through the spaCy docs a bit, it looks like separable verbs can be detected without needing to check the lemmas.
When spaCy tokenizes the text, it also marks each token's dependency relation. For German, all separable verb prefixes (SVP) should be tagged with the svp dependency. So if you identify any SVP tokens in a sentence, calling the .head property will get the "parent" verb token.
Here's an example script with some example sentences I copied from this Verbklammer article.
import spacy
nlp = spacy.load("de_core_news_sm")
text = """
Unsere Dozenten bieten regelmäßig Kurse zur Gestaltung von Webseiten, zum Schreiben von E-Books, zur Vermarktung und zu Verdienstmöglichkeiten an.
Unsere Einrichtungslösungen drücken sich aus in exklusivem Design, erstklassigen Materialien, handwerklicher Qualität, kurz: in Schönheit und passender Funktionalität.
Deshalb stockt die Regierung die Förderung für Baumaßnahmen, mit denen Barrieren im Haus und in der Wohnung reduziert werden, deutlich auf.
"""
doc = nlp(text)
svps = [tok for tok in doc if tok.dep_ == "svp"]
for svp in svps:
    print(f"PREFIX={svp}({svp.i}), VERB={svp.head}({svp.head.i})")
This identifies the prefix and the main verb without relying on matching lemmas or reducing the sentence scope.
PREFIX=an(21), VERB=bieten(3)
PREFIX=aus(28), VERB=drücken(26)
PREFIX=auf(70), VERB=stockt(49)
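If you also wanted the combined dictionary form (e.g. anbieten for "bieten … an"), one simple approach would be to prepend the prefix's lemma to the head verb's lemma. A minimal sketch, assuming you pass in svp.lemma_ and svp.head.lemma_ from the script above (the helper name is mine, and the result quality depends on the model's lemmatizer):

```python
def full_verb_lemma(prefix_lemma, verb_lemma):
    """Combine a separable prefix with its base verb's lemma,
    e.g. ("an", "bieten") -> "anbieten"."""
    return prefix_lemma.lower() + verb_lemma.lower()

# For a spaCy token: full_verb_lemma(svp.lemma_, svp.head.lemma_)
print(full_verb_lemma("an", "bieten"))  # -> anbieten
```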
Sorry for the late response, I didn't have time yet.
It is already done in the tokenizer.py file. They are assigned the same lemma; they just have to be highlighted in the UI.
We could mark those words in the DB. I'm not sure it's necessary, but it can be done.
Oh nice, guess I should have read through that first. :sweat_smile:
I don't really see a need for marking them in the DB, unless you want to make it possible for users to override specific instances, e.g. when spaCy identifies one incorrectly. Most language learners probably won't be able to spot those inconsistencies unless they're very advanced, at which point it wouldn't even be needed.
Just the UI highlighting probably makes sense for an initial implementation.
Feel free to ping me whenever you get it set up and I can do some testing. :smiley:
It is possible to mark German verbs that have a prefix in the tokenizer Python script.
If a word is marked and has the same lemma as another word in the same sentence, I think 99% of the time they belong together, and the pair can be highlighted in some way. Maybe adding a bottom border for the two words, and a dashed bottom border for the words between them.
Possibly the scope of the word-pair highlighting could be reduced to sentence parts separated by commas, for higher accuracy.
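A minimal Python sketch of that heuristic, with hand-written (surface, lemma) pairs standing in for real tokenizer output (find_separable_pairs is a hypothetical helper, not code from the repo):

```python
def find_separable_pairs(tokens):
    """Return index pairs (i, j) of tokens that share a lemma within
    the same comma-separated sentence part.

    tokens: list of (surface, lemma) tuples for one sentence.
    """
    pairs = []
    start = 0
    for end in range(len(tokens) + 1):
        # Split the matching scope at commas (and at sentence end).
        if end == len(tokens) or tokens[end][0] == ",":
            seen = {}  # lemma -> index of first occurrence in this part
            for i in range(start, end):
                surface, lemma = tokens[i]
                if lemma in seen:
                    pairs.append((seen[lemma], i))
                else:
                    seen[lemma] = i
            start = end + 1
    return pairs

# "Unsere Dozenten bieten Kurse an." -> the tokenizer assigns the
# lemma "anbieten" to both "bieten" and "an".
tokens = [
    ("Unsere", "unser"), ("Dozenten", "Dozent"), ("bieten", "anbieten"),
    ("Kurse", "Kurs"), ("an", "anbieten"), (".", "."),
]
print(find_separable_pairs(tokens))  # -> [(2, 4)]
```

Restricting the scope to comma-separated parts means a stray lemma match in a different clause would not be paired up.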
I do not completely understand how separable verbs work in German, so I'm not completely sure that this solution works.