sanskrit-lexicon / COLOGNE

Development of http://www.sanskrit-lexicon.uni-koeln.de/

AP90 IAST conversion completion #159

Closed: funderburkjim closed this issue 7 years ago

funderburkjim commented 7 years ago

Recall that AP90's intrinsic IAST coding uses, in part, italicized letters. Although the previous conversion to IAST took this into account, it was discovered that the previous approach missed numerous cases. These include:

A significant attempt was made to remedy these problems.

The end result was that 3022 lines of the digitization were changed; these may be reviewed in temp_allchanges.txt

gasyoun commented 7 years ago

AP90's intrinsic IAST coding uses in part italicized letters

Similar to MD, and that's the worst possible transliteration.

chh -> ch

We do have a list of what was changed to what in which dictionary, right, Jim?

funderburkjim commented 7 years ago

Methodology

Words starting with an upper-case letter

Informal review of the text suggests that most of the IAST-Sanskrit words are capitalized. This work limited the analysis to just this class of words. There are roughly 25,000-30,000 words starting with an upper-case letter from the English alphabet.

Exclude hyphenated words

The AP90 digitization respects line breaks appearing in the text. About 15% of the 200,000 lines of the digitization have a '-' at the end of the line, indicating a hyphenated word. For this analysis, such hyphenated words were excluded.

Exclude Devanagari words

There are many Devanagari words or phrases embedded within the entries of AP90; these are coded in SLP1, which uses both upper-case and lower-case Latin letters. These transliterated Devanagari words must of course be side-stepped.
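
A minimal Python sketch of these two exclusions (the helper name scannable_text is illustrative, not one of the actual scripts):

    import re

    def scannable_text(line):
        # Blank out {#...#} spans, which hold the SLP1-coded Devanagari.
        line = re.sub(r'\{#.*?#\}', ' ', line)
        # Drop a hyphenated word fragment at the end of a line.
        line = re.sub(r'\S+-\s*$', ' ', line)
        return line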

funderburkjim commented 7 years ago

'ch' is the modern IAST form for the letter whose SLP1 coding is 'C' (aspirated hard palatal).

In IAST conversion, I'm aiming for consistency in the result, so that subsequent identification of Sanskrit words represented in IAST will have a firm basis in the IAST spelling.

In other words, I'm more interested in the result than in the wildly varying forms appearing in the original digitizations. E.g., who cares that Wilson used "ch'h" and AP90 used "chh" for the modern 'ch', and that both used 'ch' for the modern 'c'.
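
As a sketch, this sort of normalization can be done in a single regex pass, with the longer legacy digraph tried first so that 'chh' is not clobbered by the 'ch' rule (Wilson's "ch'h" variant is omitted for brevity; the function name is illustrative):

    import re

    _LEGACY = {'chh': 'ch', 'ch': 'c'}

    def modern_iast(word):
        # The alternation tries 'chh' before 'ch' at each position, and
        # already-rewritten text is never re-scanned.
        return re.sub(r'chh|ch', lambda m: _LEGACY[m.group(0)], word)

    # modern_iast('chhandas') -> 'chandas'; modern_iast('chandra') -> 'candra'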

funderburkjim commented 7 years ago

Methodology, continued

Exclude short words

Only words with at least 3 characters are considered.

Initial filtering : English and Non-Sanskrit exclusions

The above points suggest how words in each line of the digitization may be selected with regular expressions. As mentioned, there would be roughly 30,000 such words.

The large majority of these capitalized words are English (e.g. words at the beginning of an English sentence). Also, some of these words are Latin (plant or animal genus-species names), anglicized spellings of words in one of the modern Indian languages, abbreviations of literary sources, etc.

Thus, the initial filtering excludes words identified as non-Sanskrit. This identification is specifically done by generating a file of non-Sanskrit words: in part by consulting the Enchant English dictionary, in part by including the non-Sanskrit words developed in a similar IAST conversion of the Wilson dictionary, and finally by an iterative process of contextual review. This file is called 'english.txt', since it mostly consists of English words.

The initial number of distinct non-excluded words is on the order of 2000, and these words occur in approximately 12000 lines of the digitization.
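
A rough sketch of this selection and filtering, with pyenchant standing in for whatever Enchant binding the actual scripts use, and scannable_text being the exclusion helper sketched earlier:

    import re
    import enchant  # pyenchant

    english = enchant.Dict('en_US')
    non_sanskrit = set(open('english.txt', encoding='utf-8').read().split())

    def candidate_words(line):
        # Capitalized words of 3+ letters, diacritics included.
        for word in re.findall(r'\b[A-Z][^\W\d_]{2,}\b', scannable_text(line)):
            if word not in non_sanskrit and not english.check(word):
                yield word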

Contextual isolation

Many of the changes must be made within a very limited context, in order to avoid unwanted changes that would be present in a broader context. For instance, we want to change 'sh' to 'ṣ' in Vishṇu, but we don't want to make this change within the English word 'Worship'.

The way the current methodology controls the context is by making an intermediate version of the digitization lines that contain one of the filtered words. An example may clarify.

For line 806:
<>of Vishṇu, the first of the three      PREVIOUS CODING
<>of {_Vishṇu_}, the first of the three  INTERMEDIATE CODING. The word is marked by {_..._}

For line 77653:
<>per of Gaṇeśa. {#--tyaM#} {@1@} Worship of   PREVIOUS CODING
<>per of {_Gaṇeśa_}. {#--tyaM#} {@1@} Worship of   INTERMEDIATE CODING
      Note Gaṇeśa  is marked for further examination (actually it is correct and will require no change)
      Note Worship is NOT marked for further examination -- it has been identified as non-Sanskrit
               from the english.txt file

For line 28553:
<>or stars in the Ursa Major {#--darSa-#}     PREVIOUS CODING
<>or stars in the {_Ursa_} Major {#--darSa-#}  INTERMEDIATE CODING
     This is relative to an early version of english.txt that didn't know Ursa was a Latin
     word. In the final version of english.txt, Ursa *was* included.
     Keep this example in mind as the further steps of the methodology are described.
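
A sketch of the marking step itself (the function name is illustrative):

    import re

    def mark_candidates(line, words):
        # Wrap each filtered word in {_..._} so that later edits apply only
        # inside the marker, never to look-alike letters elsewhere on the line.
        for w in words:
            line = re.sub(r'\b' + re.escape(w) + r'\b', '{_' + w + '_}', line)
        return line

    # mark_candidates('of Vishṇu, the first of the three', ['Vishṇu'])
    #   -> 'of {_Vishṇu_}, the first of the three'
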
funderburkjim commented 7 years ago

Change identification and implementation

temp_allchanges.txt shows all the changes (3022) made in the process of completing the conversion to modern IAST. But what is the process by which these change transactions were generated?

analysis_nonenglish.txt

The filtering identifies a large collection (2647) of words that potentially need to be converted to IAST. Refer to analysis_nonenglish.txt (this is another file in the same Gist).

In this file we see several pieces of information for each word. Consider the 3rd record for an explanation of the fields:

12 Abhimanyu Abhimanyu abhimanyu aBimanyu OK
12 = # of instances 
Abhimanyu  = The original spelling of the word in the digitization
Abhimanyu  = The modern IAST spelling of this word.   No difference for this word
                        This uses a particular transcoding: ap90as_roman.xml
abhimanyu  = lower-case of modern IAST spelling -- needed for next form
aBimanyu   = SLP1 form of previous.
                      This uses a second transcoding:  romanuni_slp1.xml
OK = status.  The 'OK' means that the SLP1 spelling was recognized as a Sanskrit word.
                      There are two ways this can happen:
                      - word is a headword in mw or ap90
                      - word is recognized as a Sanskrit word by some other means.
                         A section of the program generating the file  
                         (namely, the program analysis_apostrophe.py) has a list of these.
                          For instance, in the 1st record abDijO was classed as OK since
                          this word is the m. dual nominative form of headword abDija.
In addition to the OK/TODO values of the status, another value seen is OK-NONSAN, as in
     2 Behār Behār behār behAr OK-NONSAN
     This came from previous work with the Wilson dictionary, in which 'behAr' was posted as
     one of a list of non-Sanskrit words (in this case a place name).
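
The two transcodings named above do the real work; as a rough stand-in, the same chain can be sketched with the indic_transliteration package (the headwords set and the function name are illustrative):

    from indic_transliteration import sanscript
    from indic_transliteration.sanscript import transliterate

    def record_fields(modern, headwords):
        lowered = modern.lower()                # 'Abhimanyu' -> 'abhimanyu'
        slp1 = transliterate(lowered, sanscript.IAST, sanscript.SLP1)  # -> 'aBimanyu'
        # A word is OK if its SLP1 form is a known headword (mw or ap90);
        # the other recognition routes are omitted here.
        status = 'OK' if slp1 in headwords else 'TODO'
        return modern, lowered, slp1, status
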
funderburkjim commented 7 years ago

Resolution of the TODO cases, phase 1

Words classified as OK (or OK-NONSAN) are considered to be solved, in the sense that we have an explanation for the IAST spelling.

There are 1126 words whose classification began as TODO -- we consider these cases as unsolved at present. We don't know whether they are:

The rest of the work is to resolve these TODO cases.

Initial additions to english.txt

Quite a few of these words are identified by eye as English words that for some reason we missed in our use of Enchant, e.g. Wednesday, Wifehood, Williams. Adding these words to english.txt will remove them from analysis_nonenglish.txt in the next iteration.

The Wilson IAST conversion collected numerous Latin words, mostly present in genus-species taxonomy. Some of these also appear in AP90. So adding them to english.txt removes some more from analysis_nonenglish.txt.

Making these additions to english.txt and rerunning, the analysis_nonenglish.txt file now has 2003 cases, of which 482 are classified as TODO. The revised analysis file appears as analysis_nonenglish1.txt in the same Gist.

funderburkjim commented 7 years ago

Resolution of TODO cases, phase 2

The comparatively easy part is over. From here on, the remaining TODO cases have to be examined one by one, and some resolution made. This resolution might be:

funderburkjim commented 7 years ago

With these changes made, the meta-line conversion of #158 can be resumed.

funderburkjim commented 7 years ago

Extensions to the above work

A similar kind of analysis could be done for some of the excluded cases. One category would be hyphenated words. There are 4895 lines in the digitization where a capitalized word ends with a hyphen at the end of a line. Probably some of these need adjustment.
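
A sketch of how those hyphenated candidates might be collected for the same analysis:

    import re

    def hyphenated_candidates(lines):
        # Rejoin a capitalized word split by a line-final hyphen with the
        # start of the next line, and yield it for the usual filtering.
        for i, line in enumerate(lines[:-1]):
            m = re.search(r'([A-Z][^\W\d_]*)-\s*$', line)
            if m:
                tail = re.match(r'[^\W\d_]*', lines[i + 1]).group(0)
                yield i + 1, m.group(1) + tail  # (line number, rejoined word)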

There may be a need for analyzing words beginning with a lower-case letter, or there may be no such cases -- I'm not sure, but I suspect the latter (no cases).

funderburkjim commented 7 years ago

As the above description indicates, the conversion to modern IAST requires a more in-depth analysis than one might at first anticipate. Nonetheless, it seems to me that such a conversion is worth the trouble.

When I first began to study Sanskrit and to use dictionaries, I recall that the various spellings (some in Devanagari, some using Latin letters with diacritics) were quite confusing. One source of the confusion when trying to correlate the Devanagari and Latin-letter spellings was the irregularity of the Latin-letter spellings. If the Cologne versions of the various dictionaries use a common IAST spelling in place of the various IAST conventions of the original dictionaries, then this should increase the utility of the digital forms we are working on.

gasyoun commented 7 years ago

In other words, I'm more interested in the result than in the wildly varying forms appearing in the original digitizations.

Agree. Enough of wildlife.

who cares that Wilson used "ch'h" and AP90 used "chh" for the modern 'ch', and that both used 'ch' for the modern 'c'.

I care, but only to document the temperature in the clinic.

This is relative to an early version of english.txt, that didn't know that Ursa was a Latin word. In the final version of english.txt, Ursa was included.

How about adding a Latin word list, https://www.math.ubc.ca/~cass/frivs/latin/latin-dict-full.html, in addition to English? Or rather http://latin-dictionary.net/list/letter/a

my change to the prevalent dictionary spelling may be controversial.

Let it be.

Change a spelling when a common modern usage is almost a Sanskrit word

Will Dhaval agree? Maybe add the data in the metadata rather than changing the entry itself? How many such cases, Jim?

should increase the utility of the digital forms we are working on

For sure.

Offtopic: @Shalu411 is asking where the devanagari has gone (and it was green). I guess it's a byproduct of the conversion.

funderburkjim commented 7 years ago

where did devanagari go?

I confirm that this is a bug. Will investigate in a day or two, when the meta-line conversion is done.

@Shalu411 Hi! Thanks for pointing out this problem.

funderburkjim commented 7 years ago

Regarding the Latin dictionary links: there may be an Enchant Latin dictionary -- I haven't had time to investigate. If so, this would be better, since the links mentioned above for Latin appear to require screen-scraping.

Enchant may be related to the ABBYY OCR program.
We need someone who knows how to make a dictionary in ISPELL format to apply this knowledge to our Sanskrit dictionaries: namely to make a Sanskrit dictionary in ISPELL format. This would open up the possibility of using existing knowledge about European language spelling checkers to make a Sanskrit spell checker.
Since Enchant also has provisions for prefixes and suffixes, we might have a dictionary so encoded to have inflected forms.

gasyoun commented 7 years ago

Since Enchant also has provisions for prefixes and suffixes, we might have a dictionary so encoded to have inflected forms.

Dhaval and I have experimented in the past; I guess it's not as easy as you may think. So it's a whole new big task, but indeed a needed one.

funderburkjim commented 7 years ago

a whole new big task.

I've tried before to get an understanding of the ISPELL format, without success, so I realize the task is non-trivial. Probably best to defer it -- I wonder if @vvasuki has any knowledge of this format.

funderburkjim commented 7 years ago

The formatting problem with AP90 should now be solved.

vvasuki commented 7 years ago

@Shreeshrii is something of a hunspell expert, and has produced a spell checker for sanskrit and hindI. She would be interested in consuming this.

Shreeshrii commented 7 years ago

My efforts at trying to create a hunspell dictionary are documented at https://github.com/Shreeshrii/hindi-hunspell/issues/1

Files are at https://github.com/Shreeshrii/hindi-hunspell/tree/master/Sanskrit

funderburkjim commented 7 years ago

@Shreeshrii Thank you for the links. I've reviewed the material in a preliminary way.

My major question is whether the joining techniques available in Hunspell are adequately expressive for the inflected forms of Sanskrit. Based on my first look, my suspicion is that some of the Sanskrit inflected forms can be adequately expressed with the suffix mechanisms of Hunspell, but that some inflections (nominal as well as verbal) would be challenging to express in the Hunspell system.
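
For concreteness, here is a hedged sketch of what a fragment of an a-stem paradigm might look like in Hunspell's suffix notation (SLP1 spellings; syntax per hunspell(5); the file names and flag letter are invented for the example):

    # sa.dic -- stem count, then stems with their rule flags
    1
    rAma/A

    # sa.aff -- flag A: strip the final 'a' and append a case ending
    SET UTF-8
    SFX A Y 3
    SFX A a asya a
    SFX A a Aya a
    SFX A a eRa a

This yields rAmasya, rAmAya, rAmeRa. Even in this tiny fragment the retroflex R of rAmeRa is conditioned by the stem (deva gives devena), which hints at why a flat affix list starts to strain.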

Based on your experience, how would you evaluate the fit of Hunspell to Sanskrit?

Shreeshrii commented 7 years ago

First, I am no Hunspell expert :-)

I experimented with Hunspell since it was the only spell-check system that I could get to work somewhat for Hindi/Sanskrit.

Hunspell has both prefix and suffix capabilities and, from what I can make out, its development has been guided by the needs of the Hungarian language.

https://github.com/hunspell/hunspell/blob/master/man/hunspell.5 has details about the development of dictionaries for hunspell. There are pdf versions of the same available; I'll have to look for a link.

Shreeshrii commented 7 years ago

An old version of the hunspell documentation is at https://sourceforge.net/projects/hunspell/files/Hunspell/Documentation/

Shreeshrii commented 7 years ago

Attached is the Hunspell dictionary package for the Hungarian language, as an example of using the various prefix, suffix, and compound rules.

hu_HU.zip

funderburkjim commented 7 years ago

@Shreeshrii It seems to be an open question how much of Sanskrit's morphology can be shoe-horned into the Hunspell system. Thanks again for the additional links, which should be useful to anyone who tries to extend your experiment.

drdhaval2785 commented 7 years ago

From my knowledge of Sanskrit grammar, it seems that the hunspell format will fall short on at least four counts for Sanskrit. Maybe some extension of hunspell will be necessary.

  1. guRa (vowel gradation), e.g. mud -> modaka
  2. vfddhi (second-grade vowel strengthening), e.g. SUra -> SOri
  3. abhyAsa (reduplication), e.g. kf -> cakAra
  4. samprasAraRa (replacement of a semivowel by its vowel), e.g. hve -> hUta

These are intra-word changes, as compared to the prefix and suffix rules of hunspell.

But I guess it should be relatively easy to extend the Hunspell format for Sanskrit to incorporate such extra features.
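
As a minimal illustration of why these are intra-word changes, here is the guRa case from the mud -> modaka example above (a simplification that handles only single-vowel roots):

    # SLP1 guRa grades: the root vowel itself is replaced, so no
    # strip-and-append suffix rule can express the change.
    GUNA = {'i': 'e', 'I': 'e', 'u': 'o', 'U': 'o', 'f': 'ar', 'F': 'ar'}

    def guna_stem(root):
        return ''.join(GUNA.get(ch, ch) for ch in root)

    # guna_stem('mud') + 'aka' -> 'modaka'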

vvasuki commented 7 years ago

Not to mention sandhi-s and samAsa-s.