jzohrab closed this issue 1 year ago
Pushed branch parent_mapping
with some starting code: a service layer for doing the mapping. Still need everything else ... could even do this with a "symfony command" or script just to start.
More detail on the lemmatizing I have in mind:
Done in the develop branch; docs on this are at https://github.com/jzohrab/lute/wiki/Bulk-Mapping-Parent-Terms. Will work with it for a bit on my instance before launching, but I think it's good to go.
Launched in v2.0.2.
Summary
The current method of defining all terms can be streamlined by some form of "lemmatization", i.e., finding the root forms of words.
Currently, Lute treats every word as different: e.g., "blancas" and "blancos" are different terms, though both have the same parent term "blanco", as are "escribo" and "escribieron", though both are forms of the verb "escribir". When I first started out, I didn't mind having to make all of these mappings manually, but as I progress, I feel that's a hassle. I often want to have the parent images available for the child terms, just for my own enjoyment.
It would be nice to have an "auto-lemmatize" feature that can take a given text or book, and automatically map terms to existing parents.
Currently, the only functionality around parent terms, but a significant one in my experience, is the ability to see a bunch of sentences for a term when looking at the references. E.g., for me, the term "albergado" is linked to the parent term "albergar", and when I click on the "sentences" link of "albergado" I get an extensive list of sentences with albergar, albergaba, albergó, albergado, etc., which is great b/c I can see the term in my readings. In the future, I can also see this being useful for something like "create Anki cards for only parent terms, with examples of child terms", etc.
First iteration: create a mapping file outside of Lute, then import.
This iteration would be good enough for me, at present!
Sample code using spacy-stanza
This only finds lemmas that differ from the original term.
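The original snippet didn't survive extraction, but a minimal sketch of what ex.py might look like, assuming the Spanish ("es") models and the spacy_stanza.load_pipeline API:

```python
# Hypothetical sketch of ex.py: lemmatize terms with spacy-stanza and
# keep only lemmas that differ from the original term.

def differing_lemmas(pairs):
    """Keep (term, lemma) pairs where the lemma differs from the term."""
    return [(t, l) for (t, l) in pairs
            if l is not None and l.lower() != t.lower()]

if __name__ == "__main__":
    # Heavy imports/downloads kept out of module scope; the model
    # download only happens on the first run.
    import stanza
    import spacy_stanza

    stanza.download("es")
    nlp = spacy_stanza.load_pipeline("es")

    text = "Las casas blancas. Escribo y escribieron."
    pairs = [(tok.text, tok.lemma_) for tok in nlp(text) if tok.is_alpha]
    for term, lemma in differing_lemmas(pairs):
        print(f"{term}\t{lemma}")
```

The filtering mirrors the note above: identical term/lemma pairs (e.g. "casa" → "casa") are dropped, so only mappings worth importing remain.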
Run with `python3 -W ignore ex.py` (when all dependencies are installed in a Python venv).
The lemmatizing code takes a while to load due to the large models it needs, but that's OK. If people run the process outside of Lute, they'll understand the processing needs. And this is a first-pass idea anyway.
This data could be loaded into a file and then passed back to Lute for magic processing.
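As a sketch of that hand-off, here's a hypothetical tab-separated parent/term layout; the actual format Lute's bulk import expects is defined on the wiki page linked above, not confirmed here:

```python
# Hypothetical sketch: dump (parent, term) pairs to a tab-separated
# file that a Lute-side import step could then read back.
import csv

def write_mapping(pairs, path):
    """Write (parent, term) rows as tab-separated values."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f, delimiter="\t")
        for parent, term in pairs:
            writer.writerow([parent, term])

def read_mapping(path):
    """Read (parent, term) rows back from a tab-separated file."""
    with open(path, encoding="utf-8") as f:
        return [tuple(row) for row in csv.reader(f, delimiter="\t")]
```

TSV keeps the file trivially diffable and hand-editable, which matters for a "create the mapping outside of Lute, then import" workflow where users will want to review the mappings before importing.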
Ref code links for spaCy
Future iterations
Obviously, having Lute manage this would be great, but it implies a full installation of some form of Python and spaCy or similar. This could also be done with Docker containers managed by Compose.
I don't think this would need a constantly running server for the lemma process; it could just run a "docker command"-style microcontainer that processes some input (a list of terms) and returns the mapping.
However, it might be nice in the future to do the lemmatization on the fly, which would need some kind of REST API server running. That might require a bunch of config, though, to get the corpus(es) necessary for users' specific languages.
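A minimal sketch of what such a service could look like, using only the standard library and a stub lookup table in place of a real spacy-stanza pipeline; the endpoint shape, port, and JSON payload are all assumptions, not an actual Lute design:

```python
# Hypothetical sketch: a tiny lemma-mapping HTTP service. POST a JSON
# list of terms, get back {term: lemma} for terms whose lemma differs.
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

# Stand-in for a real spacy-stanza pipeline; a real service would
# lemmatize with nlp() here instead of consulting a lookup table.
FAKE_LEMMAS = {"blancas": "blanco", "escribieron": "escribir"}

def map_terms(terms):
    """Return {term: lemma} for terms that have a differing lemma."""
    return {t: FAKE_LEMMAS[t] for t in terms if t in FAKE_LEMMAS}

class LemmaHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        terms = json.loads(self.rfile.read(length))
        body = json.dumps(map_terms(terms)).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(body)

if __name__ == "__main__":
    # Load models once at startup, then serve requests on-the-fly.
    HTTPServer(("127.0.0.1", 8080), LemmaHandler).serve_forever()
```

The appeal of the long-running server over the one-shot container is exactly the startup cost noted above: the models load once, and each request after that is cheap.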