jzohrab / lute

DEPRECATED: LUTE (Learning Using Texts) is a self-hosted web app for learning language through reading, based on Learning with Texts (LWT)
The Unlicense

Add some way to auto-lemmatize (auto-assign parents) a book. #32

Closed jzohrab closed 1 year ago

jzohrab commented 1 year ago

Summary

The current method of defining all terms can be streamlined by some form of "lemmatization", i.e., finding root terms of words.

Currently, Lute treats every word form as a separate term. For example, "blancas" and "blancos" are different terms, though both have the same parent term "blanco"; likewise "escribo" and "escribieron", though both are forms of the verb "escribir". When I first started out, I didn't mind making all of these mappings manually, but as I progress it has become a hassle. I often want to have the parent images available for the child terms, just for my own enjoyment.

It would be nice to have an "auto-lemmatize" feature that can take a given text or book, and automatically map terms to existing parents.
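A minimal sketch of the "map terms to existing parents" idea, assuming the lemmas are already available as a plain dict and the existing terms as a list. The function name `map_to_existing_parents` and both inputs are hypothetical, not part of Lute:

```python
def map_to_existing_parents(lemmas, existing_terms):
    """Return {child: parent} for pairs whose parent is already a known term.

    lemmas: dict of word form -> lemma (e.g. produced by a lemmatizer).
    existing_terms: iterable of terms already defined (hypothetical input).
    """
    known = {t.lower() for t in existing_terms}
    mapping = {}
    for child, parent in lemmas.items():
        child_l, parent_l = child.lower(), parent.lower()
        # Skip terms that are their own lemma, and parents not yet defined.
        if child_l != parent_l and parent_l in known:
            mapping[child_l] = parent_l
    return mapping


lemmas = {"blancas": "blanco", "blancos": "blanco", "escribo": "escribir", "casa": "casa"}
existing = ["blanco", "casa"]
print(map_to_existing_parents(lemmas, existing))
# {'blancas': 'blanco', 'blancos': 'blanco'}
```

Here "escribo" is left unmapped because "escribir" is not yet a defined term; whether missing parents should be created automatically is a separate design decision.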

Currently, the only functionality around parent terms is the ability to see a list of sentences for a term when looking at its references, but it is a significant one in my experience. E.g., for me, the term "albergado" is linked to the parent term "albergar", and when I click on the "sentences" link of "albergado" I get an extensive list of sentences with albergar, albergaba, albergó, albergado, etc., which is great because I can see the term in my readings. In the future, I can also see this being useful for something like "create Anki cards for only parent terms, with examples of child terms".

First iteration: create a mapping file outside of Lute, then import.

This iteration would be good enough for me, at present!

Sample code using spacy-stanza

This only finds lemmas that differ from the original term.

import stanza
import spacy_stanza

# Download the stanza model if necessary
# print("downloading model ...");
# stanza.download("es")

# Initialize the pipeline
nlp = spacy_stanza.load_pipeline("es")

text = """
Los acomodé contra las paredes, pensando en la comodidad y no en la estética.
"""

# with nlp.select_pipes(enable=['tok2vec', 'tagger', 'attribute_ruler', 'lemmatizer']):
doc = nlp(text)

# for token in doc:
#     print(token.text, token.lemma_, token.pos_, token.dep_, token.ent_type_)
# print(doc.ents)
lemmatized = [token for token in doc if token.text != token.lemma_]
for token in lemmatized:
    print(token.text, token.lemma_)

Run with python3 -W ignore ex.py (with all dependencies installed in a Python venv):

Output:

Los él
acomodé acomodar
las el
paredes pared
pensando pensar
la el
la el

The lemmatizing code takes a while to load due to the extensive data, but that's ok. If people run the process outside of Lute, they'll understand the processing needs. And this is a first-pass idea anyway.

This data could be loaded into a file and then passed back to Lute for magic processing.
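As a rough sketch of the "load into a file" step, the (term, lemma) pairs could be de-duplicated and written as CSV. The `parent,child` column layout and the `write_mapping` helper are assumptions for illustration only, not Lute's actual import format:

```python
import csv
import io


def write_mapping(pairs, out):
    """Write unique (parent, child) rows as CSV, skipping terms equal to their lemma."""
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["parent", "child"])  # hypothetical header
    seen = set()
    for text, lemma in pairs:
        row = (lemma.lower(), text.lower())
        if row[0] == row[1] or row in seen:
            continue
        seen.add(row)
        writer.writerow(row)


# (token.text, token.lemma_) pairs, as the sample script would produce them.
pairs = [("Los", "él"), ("acomodé", "acomodar"), ("la", "el"), ("la", "el"), ("en", "en")]
buf = io.StringIO()
write_mapping(pairs, buf)
print(buf.getvalue(), end="")
```

With the spaCy script above, `pairs` would be `[(t.text, t.lemma_) for t in doc]`, and `out` would be an open file instead of a `StringIO`.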

ref code links for spacy

Future iterations

Obviously, having Lute manage this would be great, but it implies a full installation of some form of Python and spaCy or similar. This could be done with Docker containers too, managed by compose.

I don't think this would need a constantly running server for the lemma process; it could just run a "docker command" style microcontainer that processes some input (a list of terms) and returns the mapping.
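A one-shot process like that could be as small as the sketch below: read one term per line, write `term<TAB>lemma` for each term that changes, then exit. The `stub_lemmatize` table is a placeholder assumption standing in for a real spaCy or stanza call, and the tab-separated output is only one possible format:

```python
import io


def stub_lemmatize(term):
    """Placeholder lemmatizer (assumption); a real one would call spaCy or stanza."""
    table = {"paredes": "pared", "pensando": "pensar"}
    return table.get(term, term)


def run(lines_in, out, lemmatize=stub_lemmatize):
    """Read one term per line; write 'term<TAB>lemma' for terms that change."""
    for line in lines_in:
        term = line.strip()
        if not term:
            continue
        lemma = lemmatize(term)
        if lemma != term:
            out.write(f"{term}\t{lemma}\n")


# In a container entrypoint this would be: run(sys.stdin, sys.stdout)
demo_out = io.StringIO()
run(io.StringIO("paredes\ncontra\n\npensando\n"), demo_out)
print(demo_out.getvalue(), end="")
```

Because the process reads its input and exits, the slow model load is paid once per batch rather than once per term.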

However, in the future it might be nice to do the lemmatization on the fly, which would need some kind of REST API server running. This might require a bunch of config, though, to get the corpus(es) necessary for users with their specific languages.
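For the on-the-fly variant, a very small server using only the Python standard library might look like the following. Everything here is an assumption for illustration: the JSON request shape `{"terms": [...]}`, the handler name, the port, and the stub lemmatizer that stands in for a loaded spaCy or stanza pipeline:

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def stub_lemmatize(term):
    """Placeholder (assumption); a real server would keep an NLP pipeline loaded."""
    return {"albergado": "albergar"}.get(term, term)


def lemmatize_terms(terms, lemmatize=stub_lemmatize):
    """Return {term: lemma} for each requested term."""
    return {t: lemmatize(t) for t in terms}


class LemmaHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        try:
            terms = json.loads(self.rfile.read(length))["terms"]
        except (ValueError, KeyError, TypeError):
            self.send_error(400, 'expected a JSON body like {"terms": [...]}')
            return
        body = json.dumps(lemmatize_terms(terms), ensure_ascii=False).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port=8000):
    """Block and serve requests on localhost; call this from a script entrypoint."""
    HTTPServer(("127.0.0.1", port), LemmaHandler).serve_forever()
```

The advantage over the one-shot process is that the model stays loaded between requests; the cost is a long-running service and per-language model configuration.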

jzohrab commented 1 year ago

Pushed branch parent_mapping with some starting code, service layer for doing mapping. Still need everything else ... could even do this with a "symfony command" or script just to start.

jzohrab commented 1 year ago

More detail on the lemmatizing I have in mind:

jzohrab commented 1 year ago

Done in the develop branch; docs on this are at https://github.com/jzohrab/lute/wiki/Bulk-Mapping-Parent-Terms. Will work with it for a bit on my instance before launching, but I think it's good to go.

jzohrab commented 1 year ago

Launched in v2.0.2.