simjanos-dev / LinguaCafe

LinguaCafe is a self-hosted application that helps language learners read foreign-language texts.
https://simjanos-dev.github.io/LinguaCafeHome/
GNU General Public License v3.0
847 stars 25 forks

Mark German separable verbs that belong together. #178

Open simjanos-dev opened 5 months ago

simjanos-dev commented 5 months ago

It is possible, in the tokenizer Python script, to mark German verbs that have a separable prefix.

If a marked word has the same lemma as another word in the same sentence, I think in 99% of cases they belong together, and the pair could be highlighted in some way: maybe a solid bottom border for the two words, and a dashed bottom border for the words between them.

Possibly the scope of the word pair highlighting could be reduced to sentence parts separated by commas, for higher accuracy.
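A minimal sketch of what that matching could look like (the token format and the find_separable_pairs helper are hypothetical illustrations, not LinguaCafe's actual data model):

# Hypothetical sketch: find word pairs in a sentence that share a lemma
# and were marked as separable-verb candidates by the tokenizer.
def find_separable_pairs(sentence_tokens):
    """Return (i, j) index pairs of marked tokens sharing a lemma.

    sentence_tokens: list of dicts with "text", "lemma", "marked" keys.
    """
    pairs = []
    chunk_start = 0
    # Restrict matching to comma-separated chunks, per the accuracy idea above.
    comma_idxs = [i for i, t in enumerate(sentence_tokens) if t["text"] == ","]
    for end in comma_idxs + [len(sentence_tokens)]:
        marked = [i for i in range(chunk_start, end) if sentence_tokens[i]["marked"]]
        for pos, i in enumerate(marked):
            for j in marked[pos + 1:]:
                if sentence_tokens[i]["lemma"] == sentence_tokens[j]["lemma"]:
                    pairs.append((i, j))
        chunk_start = end + 1
    return pairs

tokens = [
    {"text": "Unsere", "lemma": "unser", "marked": False},
    {"text": "Dozenten", "lemma": "Dozent", "marked": False},
    {"text": "bieten", "lemma": "anbieten", "marked": True},
    {"text": "Kurse", "lemma": "Kurs", "marked": False},
    {"text": "an", "lemma": "anbieten", "marked": True},
]
print(find_separable_pairs(tokens))  # [(2, 4)]

With the sample tokens above it prints [(2, 4)], i.e. "bieten" and "an" would get the solid underline and everything between them the dashed one.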

I do not completely understand how separable verbs work in German, so I'm not sure this solution works.

jacovanc commented 5 months ago

Let me know if you have any questions about separable verbs in German (I have B2 or C1 reading/writing in German). I don't know much about how spaCy / lemmas work, though.

simjanos-dev commented 5 months ago

@jacovanc Thank you! :)

cblanken commented 2 weeks ago

I'm learning German and wanted to let you know I 100% support this endeavor :raised_hands:

I don't know how much help I can be, but I know Python, though I've never worked with spaCy or NLP in general. I'll have to do some research.

cblanken commented 2 weeks ago

@simjanos-dev I think your proposed solution would probably work in most situations, but after digging through the spaCy docs a bit, it looks like separable verbs can be detected without needing to check the lemmas.

When spaCy parses the text, it also assigns a dependency label to each token. For German, all separable verb prefixes (SVP) should be marked with the svp dependency. So if you identify any SVP tokens in a sentence, accessing the .head attribute gives you the "parent" verb token.

Example

Here's an example script with some sentences I copied from this Verbklammer article.

import spacy

nlp = spacy.load("de_core_news_sm")

text = """
Unsere Dozenten bieten regelmäßig Kurse zur Gestaltung von Webseiten, zum Schreiben von E-Books, zur Vermarktung und zu Verdienstmöglichkeiten an.
Unsere Einrichtungslösungen drücken sich aus in exklusivem Design, erstklassigen Materialien, handwerklicher Qualität, kurz: in Schönheit und passender Funktionalität.
Deshalb stockt die Regierung die Förderung für Baumaßnahmen, mit denen Barrieren im Haus und in der Wohnung reduziert werden, deutlich auf.
"""

doc = nlp(text)
# Collect tokens whose dependency label marks them as a separable verb prefix.
svps = [tok for tok in doc if tok.dep_ == "svp"]
for svp in svps:
    print(f"PREFIX={svp}({svp.i}), VERB={svp.head}({svp.head.i})")

This identifies the prefix and the main verb without relying on matching lemmas or reducing the sentence scope.

PREFIX=an(21), VERB=bieten(3)
PREFIX=aus(28), VERB=drücken(26)
PREFIX=auf(70), VERB=stockt(49)

simjanos-dev commented 2 weeks ago

Sorry for the late response, I didn't have time yet.

This is already done in the tokenizer.py file: the prefix and verb are assigned the same lemma, they just have to be highlighted in the UI.

We could mark those words in the DB. I'm not sure it's necessary, but it can be done.
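For reference, here is a minimal sketch of that kind of lemma merging on top of spaCy's output, following the convention that a separable verb's dictionary form is prefix + verb (bieten ... an → anbieten). The actual code in tokenizer.py may do this differently.

import spacy

nlp = spacy.load("de_core_news_sm")
doc = nlp("Unsere Dozenten bieten regelmäßig Kurse an.")

# Merge each separable prefix ("svp" dependency) into its head verb's
# lemma so both tokens end up with the same dictionary form.
# (Sketch only; tokenizer.py may implement this differently.)
lemmas = {tok.i: tok.lemma_ for tok in doc}
for tok in doc:
    if tok.dep_ == "svp":
        merged = tok.text + tok.head.lemma_  # "an" + "bieten" -> "anbieten"
        lemmas[tok.i] = merged
        lemmas[tok.head.i] = merged

print([(tok.text, lemmas[tok.i]) for tok in doc])
# "bieten" and "an" now share the lemma "anbieten"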

cblanken commented 2 weeks ago

Oh nice, guess I should have read through that first. :sweat_smile:

I don't really see a need for marking them in the DB, unless you want to make it possible for users to override specific instances, for example when spaCy identifies a pair incorrectly. Most language learners probably won't be able to spot those inconsistencies unless they're very advanced, at which point the override wouldn't even be needed.

Just the UI highlighting probably makes sense for an initial implementation.

Feel free to ping me whenever you get it set up and I can do some testing. :smiley: