sanskrit-lexicon / alternateheadwords

Prepare list of alternate headwords for all Cologne dictionaries

STC embedded headwords #6

Open drdhaval2785 opened 7 years ago

drdhaval2785 commented 7 years ago

In #5, @funderburkjim raised the possibility of incorporating embedded headwords into the fold of this repository. He also suggested that STC may be a good starting point. I have started exploring the stc.txt file. Currently the code is under heavy development. There are three steps of derivation as of now.

data/STC/STCehw0.txt (STC embedded headwords in AS)
data/STC/STCehw1.txt (STC headwords in SLP1)
data/STC/STCehw2.txt (STC headwords in SLP1, with suggestions)

The file under examination is STCehw2.txt. Entries therein are of the format headword@subheadwordtag@combined@code
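
For concreteness, here is a minimal parsing sketch in Python. The @-separated four-field layout is taken from the description above; the file path, field names, and error handling are my assumptions.

```python
# Minimal sketch: read STCehw2.txt and split each entry into its four fields,
# assuming one entry per line in the form headword@subheadwordtag@combined@code.
from collections import namedtuple

Entry = namedtuple('Entry', ['headword', 'subheadwordtag', 'combined', 'code'])

def read_entries(path='data/STC/STCehw2.txt'):
    entries = []
    with open(path, encoding='utf-8') as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            parts = line.split('@')
            if len(parts) != 4:
                print('Skipping malformed line:', line)  # flag rather than guess
                continue
            entries.append(Entry(*parts))
    return entries
```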

Rough documentation of the resolution codes is given at https://github.com/sanskrit-lexicon/alternateheadwords#embedded-headwords

Statistics as of 06 Oct 2016

Code 0 - 28547 entries
Code 1 - 19961 entries
Total - 48508 entries

Statistics as of 09 Oct 2016

Total 8090 entries with code 0
Total 26578 entries with code 1
Total 8484 entries with code 2
Total 1835 entries with code 3
Total 753 entries with code 4
Total 415 entries with code 5
Total 331 entries with code 8
Total 117 entries with code 99
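
Such per-code statistics can be regenerated mechanically. A minimal sketch, assuming (as above) that the code is the last @-separated field of each line in STCehw2.txt:

```python
# Sketch: tally STCehw2.txt entries by resolution code (last @-separated field)
# to reproduce statistics like those quoted above.
from collections import Counter

def code_statistics(path='data/STC/STCehw2.txt'):
    counts = Counter()
    with open(path, encoding='utf-8') as f:
        for line in f:
            line = line.strip()
            if line:
                counts[line.rsplit('@', 1)[-1]] += 1
    for code, n in sorted(counts.items()):
        print('Total %d entries with code %s' % (n, code))
    print('Total - %d entries' % sum(counts.values()))
```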

drdhaval2785 commented 7 years ago

I will keep adding known resolutions as and when I come up with them, increasing the code 1 / possible outcomes ratio.

drdhaval2785 commented 7 years ago

23504 / 48508 explained till now.

funderburkjim commented 7 years ago

Regarding the 'AS' words.

One of the things which I think would simplify life in programmatic access of various dictionaries, including STC, would be to replace the AS encoding with Unicode where possible (and I suspect that for STC, at least, it would be possible everywhere.)

The reason for this is that the AS encoding is hard to read, because it is not widely accepted as a way to encode Latin alphabet characters adorned with diacritics.

AS was a reasonable solution in its time, which was before the wide acceptance of Unicode (and in particular the UTF-8 encoding of Unicode). However, now that Unicode is well-supported (although, in my opinion, imperfectly so, especially where accents or multiple diacritics are concerned), perhaps we should do our transcoding from AS to unicode at the level of the X.txt digitizations. In other words, use IAST (Unicode) instead of AS in stc.txt.

I think this might be one of those transformations that is bijective, by which is meant that one could also do the inverse transformation (Unicode to AS), to retrieve the original AS form without loss of information. If this bijectivity proves out (and only writing programs can determine if any little features inhibit bijectivity), then there is no logical difference between a unicode-based IAST representation and an AS representation. (Note: In STC, the so-called 'AS' is also used to represent French diacritics - the acute, grave and circumflex accents common with French vowels. But still these should also be changed to Unicode, I am suggesting.)
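
For illustration, here is a minimal round-trip sketch of the kind of program meant here. The AS_TO_IAST table below is a placeholder, not the real AS convention; the actual correspondences would have to be read off from stc.txt.

```python
# Sketch of a bijectivity check: transcode AS -> IAST and back, and verify that
# the original text is recovered.  The mapping below is a PLACEHOLDER, not the
# actual AS convention used in stc.txt.
AS_TO_IAST = {
    'a1': 'ā',   # hypothetical AS sequence for long a
    'i1': 'ī',   # hypothetical AS sequence for long i
    's2': 'ṣ',   # hypothetical AS sequence for retroflex s
}
IAST_TO_AS = {v: k for k, v in AS_TO_IAST.items()}

def transcode(text, table):
    # Greedy longest-match substitution over the mapping table.
    keys = sorted(table, key=len, reverse=True)
    out, i = [], 0
    while i < len(text):
        for k in keys:
            if text.startswith(k, i):
                out.append(table[k])
                i += len(k)
                break
        else:
            out.append(text[i])  # characters outside the table pass through
            i += 1
    return ''.join(out)

def roundtrip_ok(as_text):
    """True if AS -> IAST -> AS returns the input unchanged."""
    return transcode(transcode(as_text, AS_TO_IAST), IAST_TO_AS) == as_text
```

Running roundtrip_ok over every headword in stc.txt would surface exactly the kind of little features that could inhibit bijectivity.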

@drdhaval2785 Any thoughts on this?

In any case, whether the Sanskrit in STC is represented in its current AS coding, or whether it is represented in the Unicode representation of IAST, you will need a transcoder (to SLP1).
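
As a sketch of such a transcoder, here is a longest-match IAST (Unicode) to SLP1 converter. The table shows only a handful of the standard correspondences and would need to be completed for real use.

```python
# Partial IAST -> SLP1 transcoder sketch; the mapping table is incomplete.
import unicodedata

IAST_TO_SLP1 = {
    'ā': 'A', 'ī': 'I', 'ū': 'U', 'ṛ': 'f', 'ṝ': 'F',
    'ai': 'E', 'au': 'O',
    'kh': 'K', 'gh': 'G', 'ch': 'C', 'jh': 'J',
    'ṭh': 'W', 'ḍh': 'Q', 'th': 'T', 'dh': 'D', 'ph': 'P', 'bh': 'B',
    'ṅ': 'N', 'ñ': 'Y', 'ṭ': 'w', 'ḍ': 'q', 'ṇ': 'R',
    'ś': 'S', 'ṣ': 'z', 'ṃ': 'M', 'ḥ': 'H',
}

def iast_to_slp1(text):
    text = unicodedata.normalize('NFC', text)  # use precomposed diacritics
    keys = sorted(IAST_TO_SLP1, key=len, reverse=True)
    out, i = [], 0
    while i < len(text):
        for k in keys:
            if text.startswith(k, i):
                out.append(IAST_TO_SLP1[k])
                i += len(k)
                break
        else:
            out.append(text[i])  # a, i, u, k, g, ... are identical in both schemes
            i += 1
    return ''.join(out)

# e.g. iast_to_slp1('kṛṣṇa') -> 'kfzRa', iast_to_slp1('dharma') -> 'Darma'
```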

gasyoun commented 7 years ago

> 06 Dec 2016

Oct is more optimistic.

> 23504 / 48508

This is why I adore NLP (in general) and Dhaval (in detail).

> AS encoding with Unicode where possible (and I suspect that for STC, at least, it would be possible everywhere.)

As far as I remember, not everything can be moved to Unicode. I hope STC is simpler and contains no combinations that do not exist in Unicode. Anyway, even a pseudo-Unicode that depends on a pseudo web font would work better than that AS.

> AS was a reasonable solution in its time

Right, and none of the details were lost.

> imperfectly so, especially where accents or multiple diacritics are concerned

And so it will remain. Peter has done a lot, and still the rock is mostly there. They just do not care.

> accents common with French vowels. But still these should also be changed to Unicode, I am suggesting.

These should be Unicode; no additional issues arise.

drdhaval2785 commented 7 years ago

33859 / 48508 explained.

drdhaval2785 commented 7 years ago

Removed pure verb entries from the list. Their brackets usually have verb forms, which are not needed as headwords. The total dropped from 48508 to 46603.

gasyoun commented 7 years ago

> brackets usually have verb forms

Makes sense.

drdhaval2785 commented 7 years ago

37908 / 46603 entries explained as of commit https://github.com/sanskrit-lexicon/alternateheadwords/commit/5048f876d53e46189e5433a3e1ac7f7e325658bf.

drdhaval2785 commented 7 years ago

Now individual corrections take a long time, so I am stopping here. 38416 / 46603 is the final score. Leaving those 8000-odd entries as they are (with code 0).

The correctness of various tagging rules has to be examined.

gasyoun commented 7 years ago

> Leaving those 8000-odd entries as they are (with code 0).

Sure.

> 38416 / 46603 is the final score.

Intermediate working draft.

drdhaval2785 commented 7 years ago

Codes 6 and 7 are not worthwhile. Removed them. Also, there are many code 8 resolutions which were already available in sanhw1.txt. Changed them to code 2. They gave mostly false positives.
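
A sketch of that recoding pass, under assumptions: the suggested headword is the third @-separated field of STCehw2.txt, and each sanhw1.txt line starts with a colon-terminated headword (both layouts would need to be checked against the actual files).

```python
# Sketch: demote code 8 entries to code 2 when the suggested headword is
# already present in sanhw1.txt.  Field positions and the sanhw1.txt layout
# are assumptions; adjust to the actual formats.
def load_known_headwords(path='sanhw1.txt'):
    known = set()
    with open(path, encoding='utf-8') as f:
        for line in f:
            line = line.strip()
            if line:
                known.add(line.split(':')[0].strip())
    return known

def recode_8_to_2(ehw_path='data/STC/STCehw2.txt', known=frozenset()):
    out = []
    with open(ehw_path, encoding='utf-8') as f:
        for line in f:
            parts = line.rstrip('\n').split('@')
            if len(parts) == 4 and parts[3] == '8' and parts[2] in known:
                parts[3] = '2'  # already a known headword
            out.append('@'.join(parts))
    return out
```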

After this change:

Total 8090 entries with code 0
Total 26578 entries with code 1
Total 8484 entries with code 2
Total 1835 entries with code 3
Total 753 entries with code 4
Total 415 entries with code 5
Total 331 entries with code 8
Total 117 entries with code 99

So 26578 + 8484 = 35062 out of 46603 entries are resolved into known headwords (sanhw1.txt). They can be merged without much issue. There may be a few false positives, but we can take that risk; at least each entry is a known headword.
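
A sketch of how those code 1 and code 2 entries could be pulled out as (STC headword, suggested alternate) pairs for merging; which of the four fields forms the pair is my assumption.

```python
# Sketch: extract entries resolved into known headwords (codes 1 and 2) as
# (STC headword, suggested alternate) pairs, ready for merging.
def mergeable_pairs(path='data/STC/STCehw2.txt'):
    pairs = []
    with open(path, encoding='utf-8') as f:
        for line in f:
            parts = line.strip().split('@')
            if len(parts) == 4 and parts[3] in ('1', '2'):
                pairs.append((parts[0], parts[2]))
    return pairs
```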