pjheslin / diogenes

Diogenes: an environment for reading Latin and Greek
https://d.iogen.es/d

Weeding/editing pre-built data? #87

Closed helmadik closed 2 years ago

helmadik commented 2 years ago

I was looking at the Greek analyses and Greek lemmata. I can of course remove single analyses wholesale, if a more sensical one shows up after the first one, but I haven't figured out how to swap out lemmata. Where does the lemma number, e.g. 100546453 for sumpi/tnw, get baked in? Any way to handle this? What are the files I'm failing to look at? Pointers appreciated!

pjheslin commented 2 years ago

Hi Helma, sorry for the slow reply. I'm back from my holidays and am doing some work on Diogenes now. I'm not sure what you mean by "swap out lemmata". What were you trying to do? Here are some points that might help:

My guess is that you are trying to remove spurious parses from the analysis of inflected forms (because you hate that kind of thing :-). If so, the way forward is not to edit the file greek-analyses.txt, which is generated automatically from several sources and might change again in the future. For example, when I swapped the Perseus version of LSJ for the Logeion version, all of the lemma offsets changed, and I simply regenerated that file with the new lemma numbers.

The right approach would be to fix the errors at their source: the file build/grc.morph, which is the output of running Morpheus (the old Perseus tagger) on a list of words from the TLG. That file is not in git because it is automatically generated, and it is not distributed with Diogenes because the application does not use it; it is only used in the process of generating greek-analyses.txt.

I don't know if you have a working copy of Morpheus, but if not I can send you a copy of grc.morph. To give you an idea, the format of the file is like this:

sumpi/tnei
<NL>V sumpi/tnw  pres ind mp 2nd sg             poetic  w_stem</NL><NL>V sumpi/tnw  pres ind act 3rd sg         poetic  w_stem</NL>
sumpi/tnousin
<NL>P sumpi/tnw  pres part act masc/neut dat pl attic epic doric ionic  nu_movable poetic       w_stem</NL><NL>V sumpi/tnw  pres ind act 3rd pl      attic epic doric ionic  nu_movable poetic       w_stem</NL>
sumpiw/n

Each inflected form appears on its own line, followed on the next line by its parses, each wrapped in <NL> tags. If you would be interested in contributing a patch to fix this file, I'm sure all Diogenes users would be very grateful!
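For anyone who wants to inspect the file before patching it, here is a rough Python sketch of reading that layout. It is not part of Diogenes: the names `Parse` and `parse_morph` are made up for illustration, and it assumes only what the sample above shows, namely that the first token inside each <NL> tag is a part-of-speech code, the second is the lemma, and the rest is the analysis. The real field separators in grc.morph may differ (tabs, for instance), so treat it as a starting point.

```python
import re
from typing import Dict, List, NamedTuple

# Matches the body of each <NL>...</NL> parse on an analysis line.
NL_RE = re.compile(r"<NL>(.*?)</NL>")


class Parse(NamedTuple):
    """One parse of an inflected form (hypothetical structure)."""
    pos: str       # first token, e.g. "V" or "P" in the sample
    lemma: str     # second token, e.g. "sumpi/tnw"
    analysis: str  # remaining tokens, whitespace-normalised


def parse_morph(text: str) -> Dict[str, List[Parse]]:
    """Map each inflected form to the list of parses on the line after it."""
    entries: Dict[str, List[Parse]] = {}
    current_form = None
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            continue
        if line.startswith("<NL>"):
            if current_form is None:
                continue  # analysis line with no preceding form: skip it
            for body in NL_RE.findall(line):
                fields = body.split()
                if len(fields) < 2:
                    continue  # assumption: a parse has at least POS and lemma
                pos, lemma, *rest = fields
                entries[current_form].append(Parse(pos, lemma, " ".join(rest)))
        else:
            # Any other non-empty line is taken to be an inflected form.
            current_form = line
            entries.setdefault(current_form, [])
    return entries


SAMPLE = """sumpi/tnei
<NL>V sumpi/tnw  pres ind mp 2nd sg             poetic  w_stem</NL><NL>V sumpi/tnw  pres ind act 3rd sg         poetic  w_stem</NL>
sumpi/tnousin
<NL>P sumpi/tnw  pres part act masc/neut dat pl attic epic doric ionic  nu_movable poetic       w_stem</NL><NL>V sumpi/tnw  pres ind act 3rd pl      attic epic doric ionic  nu_movable poetic       w_stem</NL>
sumpiw/n
"""

if __name__ == "__main__":
    for form, parses in parse_morph(SAMPLE).items():
        print(form)
        for p in parses:
            print("   ", p.pos, p.lemma, "|", p.analysis)
```

With the parses in a dictionary like this, spotting a spurious analysis is a matter of filtering the list for a given form. Writing corrections back would need more care, since this sketch normalises whitespace and so does not preserve the original field spacing.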

Alternatively, since you presumably have already done this work for Logeion, I wonder if there might be a way for me to import all of your corrections into this file automatically, if you would be willing to share your data. That would be a much better way to do it than to fiddle around with manual corrections!