jzohrab commented 1 year ago

Summary

The current method of defining all terms can be streamlined by some form of "lemmatization", i.e., finding root terms of words.

Currently, Lute treats every word as a different term: e.g., "blancas" and "blancos" are different, though both have the same parent term "blanco", and so are "escribo" and "escribieron", though both are forms of the verb "escribir". When I first started out, I didn't mind making all of these mappings manually, but as I progress it has become a hassle. I often want to have the parent images available for the child terms, just for my own enjoyment.

It would be nice to have an "auto-lemmatize" feature that can take a given text or book, and automatically map terms to existing parents.

Currently, the only functionality around parent terms (though a significant one, in my experience) is the ability to see sentences for a whole family of terms when looking at the references. E.g., for me, the term "albergado" is linked to the parent term "albergar", and when I click the "sentences" link of "albergado" I get an extensive list of sentences with albergar, albergaba, albergó, albergado, etc., which is great because I can see the term in my readings. In the future, I can also see this being useful for something like "create Anki cards for only parent terms, with examples of child terms".

First iteration: create a mapping file outside of Lute, then import.

This iteration would be good enough for me, at present!

Lemmatizing could, at first, be handled outside of Lute, using a tool like spaCy. This could generate a mapping file of terms in a given text/book, child -> parent. See code below.
The resulting file could be imported into Lute, and mappings done. New children (status = unknown) could be created and auto-assigned to the parent, with the same status as the parent (or with status = 1, maybe).
Potentially, new parents could also be made ... but that gets into new term creation, which I'm really not sure how much I want to get into!
The lemmatization could also be applied after-the-fact to existing terms, but then things might get weird with people creating terms with a given status being mapped to parents with different status ... not sure! For the first iteration, it could just work when importing a new book, perhaps.

Sample code using spacy-stanza

This only finds lemmas that differ from the original term.

import stanza
import spacy_stanza

# Download the stanza model if necessary
# print("downloading model ...");
# stanza.download("es")

# Initialize the pipeline
nlp = spacy_stanza.load_pipeline("es")

text = """
Los acomodé contra las paredes, pensando en la comodidad y no en la estética.
"""

# with nlp.select_pipes(enable=['tok2vec', 'tagger', 'attribute_ruler', 'lemmatizer']):
doc = nlp(text)

# for token in doc:
#     print(token.text, token.lemma_, token.pos_, token.dep_, token.ent_type_)
# print(doc.ents)
lemmatized = [token for token in doc if token.text != token.lemma_]
for token in lemmatized:
    print(token.text, token.lemma_)

Run with python3 -W ignore ex.py (when all dependencies are installed in a python venv):

Output:

Los él
acomodé acomodar
las el
paredes pared
pensando pensar
la el
la el

The lemmatizing code takes a while to load due to the extensive data, but that's ok. If people run the process outside of Lute, they'll understand the processing needs. And this is a first-pass idea anyway.

This data could be loaded into a file and then passed back to Lute for magic processing.
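As a rough illustration of that step, here is a minimal sketch that turns (term, lemma) pairs into a child -> parent mapping file. The tab-separated layout, the lowercasing, and the names `mapping_rows` and `write_mapping_file` are all assumptions made for the example; nothing here is a format Lute is known to accept.

```python
from typing import Iterable, List, Tuple


def mapping_rows(pairs: Iterable[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Return unique (child, parent) rows, keeping first-seen order."""
    rows: List[Tuple[str, str]] = []
    seen = set()
    for child, parent in pairs:
        # Assumption: terms are compared case-insensitively.
        child, parent = child.strip().lower(), parent.strip().lower()
        # Skip blanks, terms that are their own lemma, and repeated pairs.
        if not child or not parent or child == parent or (child, parent) in seen:
            continue
        seen.add((child, parent))
        rows.append((child, parent))
    return rows


def write_mapping_file(path: str, pairs: Iterable[Tuple[str, str]]) -> int:
    """Write one 'child<TAB>parent' line per row (hypothetical format)."""
    rows = mapping_rows(pairs)
    with open(path, "w", encoding="utf-8", newline="") as handle:
        for child, parent in rows:
            handle.write(f"{child}\t{parent}\n")
    return len(rows)
```

With the spaCy document from the sample code, this could be called as write_mapping_file("mapping.tsv", ((t.text, t.lemma_) for t in doc)), which would also collapse the repeated "la el" pair into a single row.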

ref code links for spacy

Future iterations

Obviously, having Lute manage this would be great, but it implies a full installation of some form of Python and spaCy or similar. This could be done with Docker containers too, managed by compose.

I don't think this would need a constantly running server for the lemma process. It could just run a "docker command" style microcontainer that processes some input (a list of terms) and returns the mapping.
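A one-shot process like that could be as small as the sketch below: read terms, one per line, and write back only the ones that map to a different parent. The `lemmatize` callable is a placeholder for whatever real lemmatizer the container would wrap (for instance the spaCy pipeline above), and `map_terms` and `run` are hypothetical names.

```python
from typing import Callable, Dict, Iterable, Optional, TextIO


def map_terms(
    terms: Iterable[str],
    lemmatize: Callable[[str], Optional[str]],
) -> Dict[str, str]:
    """Return {child: parent} for every term whose lemma differs from it."""
    mapping: Dict[str, str] = {}
    seen = set()
    for raw in terms:
        term = raw.strip()
        if not term or term in seen:
            continue  # ignore blank lines and repeated terms
        seen.add(term)
        lemma = lemmatize(term)
        if lemma and lemma != term:
            mapping[term] = lemma
    return mapping


def run(source: TextIO, sink: TextIO, lemmatize: Callable[[str], Optional[str]]) -> None:
    """Read one term per line from source, write 'child<TAB>parent' lines to sink."""
    for child, parent in map_terms(source, lemmatize).items():
        sink.write(f"{child}\t{parent}\n")
```

The container's entry point would then only need to call run(sys.stdin, sys.stdout, some_lemmatizer) and exit, so nothing stays resident between runs.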

However, possibly in the future it would be nice to do the lemmatization on-the-fly, which would need some kind of REST API server running. This might require a bunch of config though, to get the corpus(es) necessary for users with their specific languages.
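For comparison, an on-the-fly lookup service might look roughly like this, using only the Python standard library. The query parameter `term`, the JSON shape, and the names `lemma_payload`, `make_handler`, and `serve` are invented for the sketch, and the `lemmatize` callable again stands in for a real, language-specific lemmatizer.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer
from typing import Callable, Optional, Tuple
from urllib.parse import parse_qs, urlparse

Lemmatizer = Callable[[str], Optional[str]]


def lemma_payload(path: str, lemmatize: Lemmatizer) -> Tuple[int, dict]:
    """Build (HTTP status, JSON body) for a request path such as '/lemma?term=paredes'."""
    query = parse_qs(urlparse(path).query)
    term = (query.get("term") or [""])[0].strip()
    if not term:
        return 400, {"error": "missing 'term' query parameter"}
    # Fall back to the term itself when the lemmatizer has no answer.
    return 200, {"term": term, "lemma": lemmatize(term) or term}


def make_handler(lemmatize: Lemmatizer):
    """Create a request handler class bound to the given lemmatizer."""

    class LemmaHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            status, payload = lemma_payload(self.path, lemmatize)
            body = json.dumps(payload, ensure_ascii=False).encode("utf-8")
            self.send_response(status)
            self.send_header("Content-Type", "application/json; charset=utf-8")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

    return LemmaHandler


def serve(lemmatize: Lemmatizer, host: str = "127.0.0.1", port: int = 8000) -> None:
    """Run the lookup server until interrupted."""
    HTTPServer((host, port), make_handler(lemmatize)).serve_forever()
```

The slow model load would happen once at startup rather than per request, which is the main argument for a long-running server over the one-shot container.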

jzohrab commented 1 year ago

Pushed branch parent_mapping with some starting code, service layer for doing mapping. Still need everything else ... could even do this with a "symfony command" or script just to start.

jzohrab commented 1 year ago

More detail on the lemmatizing I have in mind:

if an existing Term ("dogs") has a root form ("dog"), and that root form exists, that should be set as the parent. ("dogs" has "dog" as parent)
if an existing Term ("dogs") has a root form in the mapping file or function, and that root form does not exist, create the root form and map it.
if a new term in a book ("cats") has a root form ("cat") and that root form exists, create the new term and map it to the existing parent, with a note in the new term saying that it was auto-created.
if a new term in a book ("parrots") has a root form ("parrot"), but that root form does not exist, don't do anything!
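The four cases above reduce to a small decision table. The sketch below expresses it as a pure function; the action names and `plan_mapping` are made up for illustration and do not reflect the code in the parent_mapping branch.

```python
from typing import List, Set


def plan_mapping(child: str, parent: str, existing: Set[str]) -> List[str]:
    """Return the ordered actions for one child -> parent pair.

    `existing` is the set of terms already defined; action names are hypothetical.
    """
    child_exists = child in existing
    parent_exists = parent in existing

    if child_exists and parent_exists:
        return ["link"]                             # "dogs" -> existing "dog"
    if child_exists:
        return ["create_parent", "link"]            # root form missing: create it, then map
    if parent_exists:
        return ["create_child_with_note", "link"]   # "cats" auto-created under "cat"
    return []                                       # "parrots"/"parrot": do nothing
```

For example, with existing = {"dogs", "dog", "cat"}, plan_mapping("cats", "cat", existing) yields the create-and-link actions, while plan_mapping("parrots", "parrot", existing) yields an empty list.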

jzohrab commented 1 year ago

Done in the develop branch; docs on this are at https://github.com/jzohrab/lute/wiki/Bulk-Mapping-Parent-Terms. Will work with it for a bit on my instance before launching, but I think it's good to go.

jzohrab commented 1 year ago

Launched in v2.0.2.

jzohrab / lute

Add some way to auto-lemmatize (auto-assign parents) a book. #32
