morphgnt / sblgnt

morphological tagging of the SBL Greek New Testament

functional dependency issues in lemmatization #32

Open jtauber opened 8 years ago

jtauber commented 8 years ago

(I mean functional dependency in the database normalization sense, nothing to do with linguistics)

The number of unique form/tag/lemma combinations is 18819:

$ cat *.txt | awk '{print $2,$3,$6,$7}' | sort | uniq | wc -l
   18819

The number of unique form/tag pairs is 18808:

$ cat *.txt | awk '{print $2,$3,$6}' | sort | uniq | wc -l
   18808
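
The same check can be sketched in Python: group the rows by the determinant columns (form and tag) and collect the set of values seen in the dependent column (lemma). Any key with more than one value breaks the functional dependency. This is only a minimal sketch. The function name `fd_violations` and the sample rows are illustrative assumptions, and the rows merely mimic the seven whitespace-separated columns that the awk commands imply; they are not lines copied from the data files.

```python
from collections import defaultdict


def fd_violations(rows, determinant, dependent):
    """Return {key: values} for every determinant key that maps to
    more than one value in the dependent column.

    Column numbers are 1-based, matching awk's $2, $3, ... convention.
    """
    seen = defaultdict(set)
    for row in rows:
        key = tuple(row[i - 1] for i in determinant)
        seen[key].add(row[dependent - 1])
    return {key: values for key, values in seen.items() if len(values) > 1}


# Illustrative rows only: the first column is a made-up reference and the
# middle form columns are simply repeated.
rows = [
    "000001 V- 3IMI-P-- ἤρχοντο ἤρχοντο ἤρχοντο ἄρχω".split(),
    "000002 V- 3IMI-P-- ἤρχοντο ἤρχοντο ἤρχοντο ἔρχομαι".split(),
    "000003 A- ----GSN- ἱεροῦ ἱεροῦ ἱεροῦ ἱερόν".split(),
    "000004 A- ----GSN- ἱεροῦ ἱεροῦ ἱεροῦ ἱερόν".split(),
]

violations = fd_violations(rows, determinant=(2, 3, 6), dependent=7)
for key, lemmas in violations.items():
    print(" ".join(key), "=>", " or ".join(sorted(lemmas)))
```

With these sample rows only the ἤρχοντο key is reported, because the two ἱεροῦ rows agree on a single lemma.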

The following 11 form/tag pairs have more than one lemma. Some of these are known issues (some more subtle than others), but we should make a decision on each of them and correct the data.

A- ----ASF- μακράν => μακράν or μακρός
A- ----ASN- ταχύ => ταχύ or ταχύς
A- ----ASNC ὕστερον => ὕστερον or ὕστερος
A- ----ASNC ἀνώτερον => ἀνώτερον or ἀνώτερος
A- ----DSN- ἱερῷ => ἱερόν or ἱερός
A- ----GSN- ἱεροῦ => ἱερόν or ἱερός
V- 2AAS-P-- συνῆτε => συνίημι or σύνιημι
V- 2PAD-P-- θαρσεῖτε => θαρρέω or θαρσέω
V- 2PAD-S-- ἄγε => ἄγε or ἄγω
V- 3AAI-S-- προώρισε(ν) => προοράω or προορίζω
V- 3IMI-P-- ἤρχοντο => ἄρχω or ἔρχομαι

jtauber commented 8 years ago

Some of these have been resolved in discussions with @emg in the child issues.

jtauber commented 8 years ago

http://jktauber.com/2015/12/15/functional-dependency-morphgnt-table/ describes a new tool to help with this sort of analysis. The latest results are:

$ ./dep.py -v 2,3,6 7
A- ----ASN- ταχύ {'ταχύ', 'ταχύς'}
A- ----ASNC ὕστερον {'ὕστερος', 'ὕστερον'}
A- ----ASF- μακράν {'μακράν', 'μακρός'}
A- ----ASNC ἀνώτερον {'ἀνώτερος', 'ἀνώτερον'}
V- 3IMI-P-- ἤρχοντο {'ἄρχω', 'ἔρχομαι'}
5

jcuenod commented 8 years ago

I don't know whether this is a separate issue, but using this data with https://github.com/billmounce/dictionary is problematic in places because his lemmatization doesn't match yours (e.g. you have "ἔξεστι(ν)" where Mounce has "ἔξεστιν", and "Ἀχαΐα" where Mounce has "Ἀχαία"). Sometimes matches can be made by ignoring accents or stripping out brackets, but more commonly the difference is whether the lexical form is middle or active (e.g. μνηστευομαι vs μνηστευω). I don't have any suggestions, but maybe you could advise on using these two datasets together. I generated the attached list of non-matching lemmas (ignoring case and accenting but not brackets) and included the first reference where each occurs, in case you're interested (there are just short of 300).

nonmatchinglemmas.txt
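
For reference, a comparison that ignores case and accents, with bracket stripping as an option, could look roughly like the sketch below. It uses only Python's standard `unicodedata` module. The helper name `match_key` and its `strip_brackets` parameter are hypothetical, and this is not necessarily how the attached list was produced.

```python
import unicodedata


def match_key(lemma, strip_brackets=False):
    """Build a comparison key that ignores case and diacritics.

    Hypothetical helper: decompose to NFD, drop combining marks
    (accents, breathings, diaeresis), optionally drop parentheses,
    then recompose and casefold.
    """
    decomposed = unicodedata.normalize("NFD", lemma)
    bare = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    if strip_brackets:
        bare = bare.replace("(", "").replace(")", "")
    return unicodedata.normalize("NFC", bare).casefold()


# Differs only in the diaeresis, so the keys agree.
print(match_key("Ἀχαΐα") == match_key("Ἀχαία"))
# Agrees only once the brackets are stripped.
print(match_key("ἔξεστι(ν)", strip_brackets=True) == match_key("ἔξεστιν"))
# A middle vs active lexical form still does not match.
print(match_key("μνηστευομαι") == match_key("μνηστευω"))
```

Keys like this only bridge the orthographic differences. The middle/active cases would still need an explicit mapping between the two lemma lists.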

jtauber commented 8 years ago

@jcuenod that's really a separate issue. Of course I'd be delighted if you'd help on https://github.com/morphgnt/morphological-lexicon/tree/master/projects/lemmatization_differences