Weeding/editing pre-built data?

Hi Helma, sorry for the slow reply. I'm back from my holidays and am doing some work on Diogenes now. I'm not sure what you mean by "swap out lemmata". What were you trying to do? Here are some points that might help:

If you are worried about the file greek-lemmata.txt, that is only used for morphological search, not for analysing inflected forms.
The file used for analysing inflected forms is greek-analyses.txt (and its index file greek-analyses.idt). That file is created by the script utils/make_greek_lemmata.pl, which is where the lemma numbers get baked into that file.
The lemma numbers represent the byte offset within the LSJ file where the lemma is found, in order to read the entry from disk directly.
The lemma numbers come from the file build/lsj-index.txt, which maps lemmata to byte offsets, and which is created by the script utils/index_lsj.pl from the XML of the lexicon.
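
To illustrate the byte-offset idea in the points above, here is a minimal Python sketch. It is not the Diogenes code (the real scripts are Perl), and the toy lexicon format, with one entry per line and the lemma before a colon, is an assumption made up for the example; the real LSJ file is XML and the layout of build/lsj-index.txt is not shown here.

```python
import io

def build_index(lexicon):
    """Map each lemma to the byte offset where its entry starts.

    Assumes a toy format (hypothetical): one entry per line, "lemma: text".
    """
    index = {}
    lexicon.seek(0)
    while True:
        offset = lexicon.tell()      # byte position of the line about to be read
        line = lexicon.readline()
        if not line:
            break
        lemma = line.split(b":", 1)[0].decode("utf-8")
        index[lemma] = offset
    return index

def read_entry(lexicon, offset):
    """Jump straight to a byte offset and read one entry, without scanning."""
    lexicon.seek(offset)
    return lexicon.readline().decode("utf-8").rstrip("\n")

# An in-memory stand-in for the lexicon file on disk.
lexicon = io.BytesIO(b"alpha: first entry\nbeta: second entry\ngamma: third entry\n")
index = build_index(lexicon)
entry = read_entry(lexicon, index["beta"])
```

Because the offsets depend on the exact bytes of the lexicon file, any change to the lexicon shifts them, and everything that stores those numbers has to be regenerated.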

My guess is that you are trying to remove spurious parses from the analysis of inflected forms (because you hate that kind of thing :-). If so, the route forward is not to edit the file greek-analyses.txt directly, because it is automatically generated from several sources and might change again in the future. For example, when I swapped the Perseus version of LSJ for the Logeion version, all of the lemma offsets changed and I just regenerated that file with the new lemma numbers.

The right way to go would be to fix the errors at their source: the file build/grc.morph, which is the output of running Morpheus (the old Perseus tagger) on a list of words from the TLG. That file is not in git because it is automatically generated, and it is not distributed with Diogenes because it is not used by the application; it is only used in the process of generating greek-analyses.txt.

I don't know if you have a working copy of Morpheus, but if not I can send you a copy of grc.morph. To give you an idea, the format of the file is like this:

sumpi/tnei
<NL>V sumpi/tnw  pres ind mp 2nd sg             poetic  w_stem</NL><NL>V sumpi/tnw  pres ind act 3rd sg         poetic  w_stem</NL>
sumpi/tnousin
<NL>P sumpi/tnw  pres part act masc/neut dat pl attic epic doric ionic  nu_movable poetic       w_stem</NL><NL>V sumpi/tnw  pres ind act 3rd pl      attic epic doric ionic  nu_movable poetic       w_stem</NL>
sumpiw/n

Each inflected form is on one line, and its parses follow on the next line, each one wrapped in <NL> tags. If you would be interested in contributing a patch to fix this file, I'm sure all Diogenes users would be very grateful!
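
To make the format concrete, here is a rough Python sketch that reads it into (form, parses) pairs. It is only an illustration based on the sample lines above, not the parser Diogenes uses. Splitting each parse on whitespace is an assumption: the real file may use tabs to separate fields such as dialect and stem type, which this sketch lumps together as "tags".

```python
import re

# Each parse is wrapped in <NL>...</NL>; several may sit on one line.
NL_RE = re.compile(r"<NL>(.*?)</NL>")

def parse_morph(lines):
    """Yield (form, parses) for each form line followed by a parse line.

    A form with no parse line after it is silently dropped.
    """
    form = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith("<NL>"):
            if form is None:
                continue             # parse line without a preceding form
            parses = []
            for body in NL_RE.findall(line):
                fields = body.split()
                if len(fields) < 2:
                    continue
                parses.append({"pos": fields[0], "lemma": fields[1], "tags": fields[2:]})
            yield form, parses
            form = None
        else:
            form = line

SAMPLE = """sumpi/tnei
<NL>V sumpi/tnw  pres ind mp 2nd sg             poetic  w_stem</NL><NL>V sumpi/tnw  pres ind act 3rd sg         poetic  w_stem</NL>
sumpi/tnousin
<NL>P sumpi/tnw  pres part act masc/neut dat pl attic epic doric ionic  nu_movable poetic       w_stem</NL><NL>V sumpi/tnw  pres ind act 3rd pl      attic epic doric ionic  nu_movable poetic       w_stem</NL>
sumpiw/n
"""

parsed = list(parse_morph(SAMPLE.splitlines()))
for form, parses in parsed:
    print(form, [(p["pos"], p["lemma"]) for p in parses])
```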

Alternatively, since you presumably have already done this work for Logeion, I wonder if there might be a way for me to import all of your corrections into this file automatically, if you would be willing to share your data. That would be a much better way to do it than to fiddle around with manual corrections!
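
An automatic import could, very roughly, look like the Python sketch below. Everything about the corrections data is hypothetical here: I am assuming a set of (form, parse) pairs to reject, with the parse normalised to single spaces, which may be nothing like the shape of the Logeion data. The rejected parse in the example is picked arbitrarily to show the mechanics, not because it is actually wrong.

```python
import re

# Matches each whole <NL>...</NL> chunk so kept chunks can be written back verbatim.
NL_RE = re.compile(r"<NL>.*?</NL>")

def drop_rejected(form, parse_line, rejected):
    """Return parse_line with every rejected parse for this form removed.

    `rejected` is a set of (form, normalised_parse) pairs (hypothetical format).
    """
    kept = []
    for chunk in NL_RE.findall(parse_line):
        body = chunk[len("<NL>"):-len("</NL>")]
        normalised = " ".join(body.split())   # collapse runs of whitespace
        if (form, normalised) not in rejected:
            kept.append(chunk)
    return "".join(kept)

form = "sumpi/tnei"
parse_line = (
    "<NL>V sumpi/tnw  pres ind mp 2nd sg             poetic  w_stem</NL>"
    "<NL>V sumpi/tnw  pres ind act 3rd sg         poetic  w_stem</NL>"
)
# Arbitrary example entry, for illustration only.
rejected = {("sumpi/tnei", "V sumpi/tnw pres ind mp 2nd sg poetic w_stem")}

cleaned = drop_rejected(form, parse_line, rejected)
print(cleaned)
```

Filtering grc.morph this way, and then rerunning the generation step, would mean the corrections survive any later change in the lemma numbers.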

pjheslin / diogenes

Weeding/editing pre-built data? #87