AP90's intrinsic IAST coding relies in part on italicized letters
Similar to MD, and that's the worst possible transliteration.
chh -> ch
We do have a list of what was changed to what in which dictionary, right, Jim?
Informal review of the text suggests that most of the IAST-Sanskrit words are capitalized. This work limited the analysis to just this class of words. There are roughly 25,000-30,000 words starting with an upper-case letter of the English alphabet.
The AP90 digitization respects line breaks appearing in the text. About 15% of the 200,000 lines of the digitization have a '-' at the end of the line, indicating a hyphenated word. For this analysis, such hyphenated words were excluded.
There are many Devanagari words or phrases embedded within the entries of AP90; these are coded in SLP1, which uses both upper-case and lower-case Latin letters. These transliterated Devanagari words must of course be side-stepped.
'ch' is the modern IAST form for the letter whose SLP1 coding is 'C' (aspirated hard palatal).
In IAST conversion, I'm aiming for consistency in the result, so that subsequent identification of Sanskrit words represented in IAST will have a firm basis in the IAST spelling.
In other words, I'm more interested in the result than in the wildly varying forms appearing in the original digitizations. E.g., who cares that Wilson used "ch'h" and AP90 used "chh" for the modern 'ch', and that both used 'ch' for the modern 'c'.
Only words with at least 3 characters are considered.
The above points suggest how words in each line of the digitization may be selected with regular expressions. As mentioned, there would be roughly 30,000 such words.
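A rough sketch of how this selection might be done in Python follows; the file name ap90.txt, the exact character class for IAST letters, and the handling of the {#...#} spans and end-of-line hyphens are illustrative assumptions, not the actual program:

```python
import re

# Illustrative sketch only -- file name and patterns are assumptions.
SLP1_SPAN = re.compile(r'\{#.*?#\}')      # embedded Devanagari coded in SLP1, e.g. {#--darSa-#}
CAP_WORD  = re.compile(r"\b[A-Z][a-zāīūṛṝḷḹṅñṭḍṇśṣṃḥ']{2,}\b")  # capitalized, at least 3 chars

candidates = set()
with open('ap90.txt', encoding='utf-8') as f:
    for line in f:
        line = SLP1_SPAN.sub(' ', line)           # side-step SLP1-coded Devanagari
        line = re.sub(r'\S+-\s*$', ' ', line)     # drop a word hyphenated at end of line
        candidates.update(CAP_WORD.findall(line))
```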
The large majority of these capitalized words are English (e.g., words at the beginning of an English sentence). Also, some of these words are Latin (plant or animal genus-species names), anglicized spellings of words in one of the modern Indian languages, abbreviations of literary sources, etc.
Thus, the initial filtering excludes words identified as non-Sanskrit. This identification is done by generating a file of non-Sanskrit words: in part by consulting the Enchant English dictionary, in part by including the non-Sanskrit words developed in a similar IAST conversion of the Wilson dictionary, and finally by an iterative process of contextual review. This file is called 'english.txt', since it mostly consists of English words.
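For the Enchant step, something along the following lines could be used (a sketch only; wilson_nonsan.txt is a hypothetical file standing in for the Wilson-derived word list, and 'candidates' is the set built in the sketch above):

```python
import enchant

d = enchant.Dict("en_US")                       # Enchant English dictionary
known_nonsan = set(open('wilson_nonsan.txt', encoding='utf-8').read().split())

with open('english.txt', 'w', encoding='utf-8') as out:
    for w in sorted(candidates):
        if d.check(w) or w in known_nonsan:     # English per Enchant, or known non-Sanskrit
            out.write(w + '\n')
```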
The initial number of distinct non-excluded words is on the order of 2000, and these words occur in approximately 12000 lines of the digitization.
Many of the changes must be made within a very limited context, in order to avoid unwanted changes that a broader context would allow. For instance, we want to change 'sh' to 'ṣ' in Vishṇu, but we don't want to make this change within the English word 'Worship'.
The current methodology controls the context by making an intermediate version of the digitization lines that contain one of the filtered words. An example may clarify.
For line 806:
<>of Vishṇu, the first of the three PREVIOUS CODING
<>of {_Vishṇu_}, the first of the three INTERMEDIATE CODING. The word is marked by {_..._}
For line 77653:
<>per of Gaṇeśa. {#--tyaM#} {@1@} Worship of PREVIOUS CODING
<>per of {_Gaṇeśa_}. {#--tyaM#} {@1@} Worship of INTERMEDIATE CODING
Note that Gaṇeśa is marked for further examination (actually it is correct and will require no change). Note that Worship is NOT marked for further examination -- it has been identified as non-Sanskrit from the english.txt file.
For line 28553:
<>or stars in the Ursa Major {#--darSa-#} PREVIOUS CODING
<>or stars in the {_Ursa_} Major {#--darSa-#} INTERMEDIATE CODING
This is relative to an early version of english.txt, which did not yet recognize Ursa as a Latin word. In the final version of english.txt, Ursa *was* included.
Keep this example in mind as the further steps of the methodology are described.
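Before moving on, here is a minimal sketch of the marking step itself and of a context-limited change such as 'sh' -> 'ṣ'. The function names are invented, and, for brevity, protection of the {#...#} SLP1 spans is omitted:

```python
import re

english = set(open('english.txt', encoding='utf-8').read().split())

def mark_line(line):
    """Wrap each filtered capitalized word in {_..._}; skip words listed in english.txt."""
    def wrap(m):
        w = m.group(0)
        return w if w in english else '{_' + w + '_}'
    return CAP_WORD.sub(wrap, line)          # CAP_WORD from the earlier sketch

def apply_change(line, old, new):
    """Apply a spelling change only inside marked words, e.g. sh -> ṣ in {_Vishṇu_} but not in Worship."""
    return re.sub(r'\{_(.*?)_\}',
                  lambda m: '{_' + m.group(1).replace(old, new) + '_}',
                  line)
```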
temp_allchanges.txt shows all the changes (3022) made in the process of completing the conversion to modern IAST. But what is the process by which these change transactions were generated?
The filtering identifies a large (2647) collection of words that potentially need to be converted to IAST. Refer to analysis_nonenglish.txt (another file in the same Gist).
In this file we see several pieces of information for each word. Consider the 3rd record for explanation of the fields:
12 Abhimanyu Abhimanyu abhimanyu aBimanyu OK
12 = # of instances
Abhimanyu = The original spelling of the word in the digitization
Abhimanyu = The modern IAST spelling of this word. No difference for this word
This uses a particular transcoding: ap90as_roman.xml
abhimanyu = lower-case of modern IAST spelling -- needed for next form
aBimanyu = SLP1 form of previous.
This uses a second transcoding: romanuni_slp1.xml
OK = status. The 'OK' means that the SLP1 spelling was recognized as a Sanskrit word.
There are two ways this can happen:
- word is a headword in mw or ap90
- word is recognized as a Sanskrit word by some other means.
A section of the program generating the file (namely, analysis_apostrophe.py) has a list of these. For instance, in the 1st record abDijO was classed as OK since this word is the m. dual nominative form of the headword abDija.
In addition to the OK/TODO values of the status, another value seen is OK-NONSAN, as in
2 Behār Behār behār behAr OK-NONSAN
This came from previous work with the Wilson dictionary, in which 'behAr' was posted as one of a list of non-Sanskrit words (in this case a place name).
Words classified as OK (or OK-NONSAN) are considered to be solved, in the sense that we have an explanation for the IAST spelling.
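The classification could be sketched roughly as follows. This is not the actual analysis_apostrophe.py code: the input file name, the extra-forms list, and the use of the indic_transliteration package (in place of the ap90as_roman.xml and romanuni_slp1.xml transcodings) are all stand-ins for illustration:

```python
from indic_transliteration import sanscript

headwords   = set(open('mw_ap90_headwords_slp1.txt').read().split())  # hypothetical mw+ap90 headword list, in SLP1
extra_forms = {'abDijO'}   # known inflected forms, e.g. the m. dual nominative of abDija
nonsan      = {'behAr'}    # non-Sanskrit words carried over from the Wilson work

def classify(iast_word):
    """Lower-case modern IAST -> SLP1 -> OK / OK-NONSAN / TODO, mirroring the record fields above."""
    slp1 = sanscript.transliterate(iast_word.lower(), sanscript.IAST, sanscript.SLP1)
    if slp1 in nonsan:
        return slp1, 'OK-NONSAN'
    if slp1 in headwords or slp1 in extra_forms:
        return slp1, 'OK'
    return slp1, 'TODO'
```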
There are 1126 words whose classification began as TODO -- we consider these cases as unsolved at present. We don't know yet whether they are Sanskrit or non-Sanskrit words.
The rest of the work is to resolve these TODO cases.
Quite a few of these words are identified by eye as English words that for some reason we missed in our use of Enchant, e.g. Wednesday, Wifehood, Williams. Adding these words to english.txt will remove them from analysis_nonenglish.txt in the next iteration.
The Wilson IAST conversion collected numerous Latin words, mostly present in genus-species taxonomy. Some of these also appear in AP90. So adding them to english.txt removes some more from analysis_nonenglish.txt.
After making these additions to english.txt and rerunning, the analysis_nonenglish.txt file has 2003 cases, of which 482 are classified as TODO. The revised analysis file appears as analysis_nonenglish1.txt in the same Gist.
The comparatively easy part is over. From here on, the remaining TODO cases have to be examined one by one, and some resolution made. One possible resolution is to classify the word as non-Sanskrit. For instance, one TODO word occurs in the context
... a mountain in the west of India (Abu)...
A Google search for 'mountain abu' confirms that Abu is the modern name of a certain mountain. Thus, we classify Abu as a non-Sanskrit word, and add this word to english.txt to exclude it from further iterations.
Similar resolutions occur for other categories of non-Sanskrit words. Another possible resolution is to add the word to a list of known Sanskrit words that are not found as headwords; this was mentioned with the abDijO example above.
When all these changes have been made, the resulting corrected IAST spellings of Sanskrit words are recognized as Sanskrit words, and the non-Sanskrit words are also handled properly, so the process is complete. Again, the total set of change transactions is available in temp_allchanges.txt in the Gist.
With these changes made, the meta-line conversion of #158 can be resumed.
A similar kind of analysis could be done for some of the excluded cases. One category would be hyphenated words. There are 4895 lines in the digitization where a capitalized word ends with a hyphen at the end of a line. Probably some of these need adjustment.
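A sketch of how those 4895 lines might be located (again, the file name and pattern are assumptions):

```python
import re

hyph = re.compile(r'\b[A-Z]\S*-\s*$')   # capitalized word split by a hyphen at end of line
with open('ap90.txt', encoding='utf-8') as f:
    count = sum(1 for line in f if hyph.search(line))
print(count)    # the text above reports 4895 such lines
```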
There may be a need for analyzing words beginning with a lower-case letter, or there may be no such cases -- I'm not sure but suspect the latter (no cases).
As the above description indicates, the conversion to modern IAST requires a more in-depth analysis than one might at first anticipate. Nonetheless, it seems to me that such a conversion is worth the trouble.
When I first began to study Sanskrit, and to use dictionaries, I recall that the various spellings (some in Devanagari, some using Latin letters with diacritics) were quite confusing. One source of the confusion when trying to correlate the Devanagari and Latin-letter spellings was the irregularity of the Latin-letter spellings. If the Cologne versions of the various dictionaries use a common IAST spelling in place of the various IAST conventions of the original dictionaries, then this should increase the utility of the digital forms we are working on.
In other words, I'm more interested in the result than in the wildly varying forms appearing in the original digitizations.
Agree. Enough of wildlife.
who cares that Wilson used "ch'h" and AP90 used "chh" for the modern 'ch', and that both used 'ch' for the modern 'c'.
I care, but only to document the temperature in the clinic.
This is relative to an early version of english.txt, that didn't know that Ursa was a Latin word. In the final version of english.txt, Ursa was included.
How about adding a Latin word list, e.g. https://www.math.ubc.ca/~cass/frivs/latin/latin-dict-full.html, in addition to English? Or rather http://latin-dictionary.net/list/letter/a
my change to the prevalent dictionary spelling may be controversial.
Let it be.
Change a spelling when a common modern usage is almost a Sanskrit word
Will Dhaval agree? Maybe add the data in the metadata, not changing the entry itself? How many such cases, Jim?
should increase the utility of the digital forms we are working on
For sure.
Off-topic: @Shalu411 is asking where the devanagari has gone (and it used to be green). I guess it's a byproduct of the conversion.
where did devanagari go?
I confirm that this is a bug. Will investigate in a day or two when meta-line conversion done.
@Shalu411 Hi! Thanks for pointing out this problem.
Regarding Latin dictionary links: There may be an Enchant Latin dictionary -- I haven't had time to investigate. If so, this would be better, since the links mentioned above for Latin appear to require screen-scraping.
Enchant may be related to the Abby OCR program.
We need someone who knows how to make a dictionary in ISPELL format to apply this knowledge to our Sanskrit dictionaries: namely, to make a Sanskrit dictionary in ISPELL format. This would open up the possibility of using existing knowledge about European-language spelling checkers to make a Sanskrit spell checker.
Since Enchant also has provisions for prefixes and suffixes, we might have a dictionary so encoded to have inflected forms.
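To make the idea concrete, a toy Hunspell-style pair might look like the following (SLP1 spellings, a few a-stem singular endings only; this is purely illustrative and not part of any existing dictionary):

```
# sanskrit_toy.aff  (illustrative)
SET UTF-8
# suffix class A: strip final 'a', add case endings; applies to words ending in 'a'
SFX A Y 3
SFX A a asya a
SFX A a ena a
SFX A a Aya a

# sanskrit_toy.dic  (illustrative)
2
deva/A
gaja/A
```

With rules of this kind a checker would accept devasya, devena, devAya, and so on; the hard part, as discussed below, is that much of Sanskrit inflection (not to mention sandhi) is not a simple strip-and-append operation.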
Since Enchant also has provisions for prefixes and suffixes, we might have a dictionary so encoded to have inflected forms.
Dhaval and I have experimented in the past; I guess it's not as easy as you may think. So it's a whole new big task, but indeed a needed one.
a whole new big task.
I've tried before to get an understanding of the ISPELL format, without success, so I realize the task is non-trivial. Probably best to defer it -- I wonder if @vvasuki has any knowledge of this format.
The formatting problem with AP90 should now be solved.
@Shreeshrii is something of a hunspell expert, and has produced a spell checker for Sanskrit and Hindi. She would be interested in consuming this.
My efforts at trying to create a hunspell dictionary are documented at https://github.com/Shreeshrii/hindi-hunspell/issues/1
Files are at https://github.com/Shreeshrii/hindi-hunspell/tree/master/Sanskrit
@Shreeshrii Thank you for the links. I've reviewed the material in a preliminary way.
My major question is whether the joining techniques available in Hunspell are adequately expressive for the inflected forms of Sanskrit. Based on my first look, my suspicion is that some of the Sanskrit inflected forms can be adequately expressed with the suffix mechanisms of Hunspell, but that some inflections (nominal as well as verbal) would be challenging to express in the Hunspell system.
Based on your experience, how would you evaluate the fit of Hunspell to Sanskrit?
First, I am no Hunspell expert :-)
I experimented with Hunspell since it was the only spell-check system that I could get to work somewhat for Hindi/Sanskrit.
Hunspell has both prefix and suffix capabilities, and from what I can make out, its development was/has been guided by the needs of the Hungarian language.
https://github.com/hunspell/hunspell/blob/master/man/hunspell.5 has details about the development of dictionaries for hunspell. There are PDF versions of the same available; I'll have to look for a link.
Old version of documentation for hunspell is at https://sourceforge.net/projects/hunspell/files/Hunspell/Documentation/
Attached is the Hunspell dictionary package for the Hungarian language, as an example of using the various affix, suffix, and compound rules.
@Shreeshrii It seems to be an open question how many Sanskrit forms can be shoe-horned into the Hunspell system. Thanks again for the additional links, which should be useful to anyone who tries to extend your experiment.
From my knowledge of Sanskrit grammar, it seems that the hunspell format will fall short in at least four cases for Sanskrit. Maybe some extension of hunspell will be necessary.
These are intra-word operations, as compared to the prefix and suffix rules of hunspell.
But I guess it should be relatively easy to extend the Hunspell format for Sanskrit to incorporate such extra features.
Not to mention sandhi-s and samAsa-s.
Recall that AP90's intrinsic IAST coding relies in part on italicized letters. While the previous conversion to IAST took this into account, it was discovered that the previous approach missed numerous cases. These include
A significant attempt was made to remedy these problems.
The end result was that 3022 lines of the digitization were changed; these may be reviewed in temp_allchanges.txt