Converting Sanskrit in MW72 from AS to slp1

funderburkjim commented 8 years ago

This issue is opened to deal with the request made elsewhere.

The problem is that most Sanskrit in MW72 is represented only in a particular variant of IAST, and this IAST is coded in the digitization using the AS (Anglicized Sanskrit, letter-number) form. This form is hard to work with.

For comparison with other Sanskrit spellings, it would be useful to have an slp1 form computed.

The discussion of this issue is to decide exactly how to accomplish this goal.

gasyoun commented 8 years ago

a particular variant of IAST

For the sake of comparison and sanhw1 I would make it standard IAST.

funderburkjim commented 8 years ago

If we want to replace the AS-Sanskrit coding with a more useful coding in MW72, then it seems necessary to untangle the languages within this dictionary, at least to the extent of identifying the italicized text as Sanskrit or non-Sanskrit.
[This is in response to this comment.]

funderburkjim commented 7 years ago

A summary of proposed changes to mw72.txt digitization

This comment summarizes the detailed notes of the 20161107 directory. Except for numerous corrections to the current digitization (made for the purpose of facilitating this work), the work has not yet been made part of the installed Cologne materials. I wanted to get the reaction of others first.

The main result of this work is mw72adj1_roman2.txt, which I propose as the new mw72.txt.

A significant ancillary result is italics4_slp.txt, which contains the slp coding for all italicized Sanskrit text of the digitization.

mw72adj1_roman2.txt

In this version of the digitization, all the AS (Anglicized Sanskrit, letter-number) coding has been replaced by Unicode characters. Here is an example:

old (current) mw72.txt
<P>{%Kiri1t2in, i1, ini1, i,%} decorated with a diadem, crested,
<>crowned; ({%i1%}), m. a king; an epithet of Indra; one
<>of the attendants of S4iva; a N. of Arjuna.
-----
<P>{%C4andra, as, a1, am%} (originally {%s4c4andra;%} cf. {%as4va-
<>s4c4andra, puru-s4c4º,%} &c.), Ved. glittering, shining
<>(as gold), having the brilliancy or hue of light;

mw72adj1_roman2
<P>{%Kirīṭin, ī, inī, i,%} decorated with a diadem, crested,
<>crowned; ({%ī%}), m. a king; an epithet of Indra; one
<>of the attendants of Śiva; a N. of Arjuna.
----
<P>{%Ćandra, as, ā, am%} (originally {%śćandra;%} cf. {%aśva-%}
<>{%śćandra, puru-śćº,%} &c.), Ved. glittering, shining
<>(as gold), having the brilliancy or hue of light;

Note that there are some differences between the Roman Unicode of mw72adj1_roman2 and the coding system considered to be current standard IAST. The exact comparison is shown in the file mwiast-iast.txt.

Some reasons for maintaining these differences are:

- To maintain information equivalence between the proposed version of the digitization and the original version.
- To facilitate comparisons between the digitization and the printed text. The roman2 version is quite close to the printed text.
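
A minimal Python sketch of the AS-to-Unicode replacement illustrated in the example above. The table is partial: it lists only the letter-number pairs that occur in the examples in this thread. The name as_to_roman is hypothetical and this is not the script actually used to produce mw72adj1_roman2.txt. The function rewrites every letter-digit pair it knows, so a real run would need to guard against non-AS text that happens to contain such a pair.

```python
import re

# Partial AS -> MW-style Roman Unicode table (assumption: only the pairs
# visible in the examples of this thread; the full table is larger).
AS_TO_ROMAN = {
    "a1": "ā",
    "i1": "ī",
    "t2": "ṭ",
    "n2": "ṇ",
    "n6": "ṉ",
    "s4": "ś",
    "c4": "ć",
}

# An AS code is one ASCII letter followed by one digit.
AS_PAIR = re.compile(r"[A-Za-z][0-9]")


def as_to_roman(text):
    """Replace known AS letter-number pairs, preserving capitalization."""
    def repl(match):
        pair = match.group(0)
        roman = AS_TO_ROMAN.get(pair.lower())
        if roman is None:
            return pair  # unknown pair: leave untouched
        return roman.upper() if pair[0].isupper() else roman
    return AS_PAIR.sub(repl, text)


print(as_to_roman("<P>{%Kiri1t2in, i1, ini1, i,%} decorated with a diadem"))
print(as_to_roman("of the attendants of S4iva; a N. of Arjuna."))
```

Keeping capitalization in the output is what preserves information equivalence with the original AS coding: the lower-case table is looked up first and the result is upper-cased only when the AS letter was.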

funderburkjim commented 7 years ago

italics4_slp.txt

To understand what is in this file, consider part of the digitization under headword aMSa.

current digitization (lines 1683-1690)
<P>{%An6s4a, as,%} m. a share, portion, part, party;
<>partition, inheritance; a share of booty; earnest
<>money; a fraction; the denominator of one; a
<>degree of lat. or long.; N. of an A1ditya; the
<>shoulder or shoulder-blade, more usually spelt {%an6sa,%}
<>q. v. [cf. Old Germ. {%ahsala;%} Mod. Germ. {%achsel;%}
<>Lat. {%axilla.%}]. {%--An6s4a-karan2a, am,%} n. act of dividing.
<>{%--An6s4a-bha1j, k, k, k,%} one who has a share, an heir, a

Proposed digitization:
<P>{%Aṉśa, as,%} m. a share, portion, part, party;
<>partition, inheritance; a share of booty; earnest
<>money; a fraction; the denominator of one; a
<>degree of lat. or long.; N. of an Āditya; the
<>shoulder or shoulder-blade, more usually spelt {%aṉsa,%}
<>q. v. [cf. Old Germ. <nsi>ahsala;</nsi> Mod. Germ. <nsi>achsel;</nsi>
<>Lat. <nsi>axilla.</nsi>]. {%--Aṉśa-karaṇa, am,%} n. act of dividing.
<>{%--Aṉśa-bhāj, k, k, k,%} one who has a share, an heir, a

And here is the corresponding part of the italics4 files:

italics4_roman2.txt
9@aMSa@1683@hw2@Aṉśa, as,
9@aMSa@1687@@aṉsa,
9@aMSa@1688@Germ@ahsala;
9@aMSa@1688@Germ@achsel;
9@aMSa@1689@Lat@axilla.
9@aMSa@1689@@--Aṉśa-karaṇa, am,
9@aMSa@1690@@--Aṉśa-bhāj, k, k, k,

italics4_slp.txt : only Sanskrit, last field is slp1 version of prior roman2 field.
9@aMSa@1683@hw2@Aṉśa, as,@aMSa, as,
9@aMSa@1687@@aṉsa,@aMsa,
9@aMSa@1689@@--Aṉśa-karaṇa, am,@--aMSa-karaRa, am,
9@aMSa@1690@@--Aṉśa-bhāj, k, k, k,@--aMSa-BAj, k, k, k,

Note that the two italics files have each snippet of italicized text appearing in the range of lines considered; however, the slp file does not include the snippets that are non-Sanskrit.
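
The relation between the two files can be sketched in Python as follows. This is only an illustration under stated assumptions: the field names, the SANSKRIT_TAGS set (an empty tag or hw2 means Sanskrit; anything else, such as Germ or Lat, means non-Sanskrit) and the partial roman2-to-slp1 table are inferred from the sample lines above, not taken from the actual conversion program.

```python
import unicodedata
from typing import NamedTuple, Optional


class ItalicRecord(NamedTuple):
    prefix: str    # first field, copied through (its meaning is not stated in the thread)
    headword: str  # slp1 headword, e.g. "aMSa"
    line: int      # line number in the digitization
    tag: str       # "", "hw2", or a language tag such as "Germ" / "Lat"
    text: str      # italic snippet in roman2 spelling


# Assumption: tags that mark a snippet as Sanskrit in the sample above.
SANSKRIT_TAGS = {"", "hw2"}

# Partial roman2 -> slp1 table, limited to what the sample needs.
# The digraph "bh" is listed first so it is replaced before single letters.
ROMAN_TO_SLP1 = [
    (unicodedata.normalize("NFC", roman), slp)
    for roman, slp in [("bh", "B"), ("ā", "A"), ("ṉ", "M"), ("ś", "S"), ("ṇ", "R")]
]


def parse_record(line: str) -> ItalicRecord:
    prefix, headword, lnum, tag, text = line.rstrip("\n").split("@", 4)
    return ItalicRecord(prefix, headword, int(lnum), tag, text)


def roman_to_slp1(text: str) -> str:
    # slp1 is case-sensitive, so the European capitalization is dropped first.
    out = unicodedata.normalize("NFC", text).lower()
    for roman, slp in ROMAN_TO_SLP1:
        out = out.replace(roman, slp)
    return out


def slp_line(line: str) -> Optional[str]:
    """Return the italics4_slp-style line, or None for non-Sanskrit snippets."""
    rec = parse_record(line)
    if rec.tag not in SANSKRIT_TAGS:
        return None
    return line.rstrip("\n") + "@" + roman_to_slp1(rec.text)


for sample in ["9@aMSa@1687@@aṉsa,", "9@aMSa@1688@Germ@ahsala;"]:
    print(slp_line(sample))
```

Because every slp1 replacement in this partial table is an upper-case ASCII letter and the input has already been lower-cased, the sequential replacements cannot interfere with one another.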

This italics4_slp file can be used as the basis for further investigations, such as

- spelling checks (e.g., by looking for odd n-grams in comparison to those of hwnorm1)
- identification of sub-headwords (identifiable by the '--X' pattern)
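
Both investigations could start from something like the following Python sketch. It is an assumption-laden illustration: the function names are hypothetical, the trigram size is an arbitrary choice, and the tiny inline reference list merely stands in for a real headword list such as hwnorm1.

```python
import re


def ngrams(word, n=3):
    """All contiguous substrings of length n."""
    return [word[i:i + n] for i in range(len(word) - n + 1)]


def known_ngrams(headwords, n=3):
    """Set of n-grams seen in a reference headword list."""
    known = set()
    for hw in headwords:
        known.update(ngrams(hw, n))
    return known


def odd_ngrams(word, known, n=3):
    """n-grams of word that never occur in the reference list."""
    return [g for g in ngrams(word, n) if g not in known]


# A sub-headword snippet starts with "--"; the word runs up to the first
# comma or whitespace (assumption based on the sample lines above).
SUBHEADWORD = re.compile(r"^--([^,\s]+)")


def subheadword(slp_text):
    m = SUBHEADWORD.match(slp_text)
    return m.group(1) if m else None


reference = known_ngrams(["aMSa", "karaRa"])  # stand-in for hwnorm1
print(odd_ngrams("aMSza", reference))         # trigrams absent from the reference
print(subheadword("--aMSa-karaRa, am,"))      # sub-headword with its hyphen kept
```

A snippet whose slp1 form yields unseen n-grams is only a candidate for review; with a reference list this small almost everything would be flagged, so the check is meaningful only against the full headword list.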

There are several reasons that it seems currently impractical to embed these slp codings of Sanskrit into the digitization.

- Capitalization. The roman coding in mw72 has capitalization, a feature of European languages that is absent in Sanskrit. In addition to the rather regular capitalization of the above examples, there are about 150 cases of capitalization of Sanskrit that do not follow this pattern.
- Non-italic Sanskrit (for instance, Āditya). The present work has focused entirely on italicized text, and has identified Sanskrit/non-Sanskrit italic text. Identifying non-italic Sanskrit text would require quite different techniques.
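
One way to look for the irregular capitalization cases is sketched below in Python. It assumes that the regular pattern consists of a capital at the very start of a snippet (as in the headword Aṉśa) or directly after the "--" of a sub-headword; both the rule and the function name are assumptions made for illustration, and the second test string is a made-up example, not a line from the digitization.

```python
def irregular_capitals(snippet):
    """Return (index, char) for capitals outside the two regular positions.

    Regular positions (assumed): the first character of the snippet, or the
    character immediately following "--".
    """
    hits = []
    for i, ch in enumerate(snippet):
        if not ch.isupper():
            continue
        at_start = i == 0
        after_dashes = i >= 2 and snippet[i - 2:i] == "--"
        if not (at_start or after_dashes):
            hits.append((i, ch))
    return hits


print(irregular_capitals("--Aṉśa-karaṇa, am,"))  # regular: nothing flagged
print(irregular_capitals("puru-Candra"))         # hypothetical irregular case
```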

funderburkjim commented 7 years ago

I hope others will give some thought to this work. After some time to consider comments, my current intent is to have the proposed digitization become the base-line mw72.txt, and to make related changes to the headword-generation and web display programs.

gasyoun commented 7 years ago

about 150 cases of capitalization of Sanskrit that do not follow this pattern

List?

Identifying non-italic Sanskrit text would require quite different techniques.

Let's forget about it.

make related changes to the headword-generation and web display programs

They will only improve. I'm not aware of anybody who is using MW72 on a regular basis. I know about 200 people who use MW regularly. It's interesting as part of the history, but not as a reference book, so you should not waste too much of your valuable time on it, Jim. That's my opinion.

funderburkjim commented 7 years ago

Here is the list of cases of capitalization of Sanskrit that do not follow the sub-headword pattern of --<Cap>....

funderburkjim commented 7 years ago

The changes to mw72 (replacement of AS coding with Roman Unicode coding) mentioned above have been completed. No more AS to worry about.

gasyoun commented 7 years ago

No more AS to worry about.

Hurray! Only here, or was that the last one?

funderburkjim commented 7 years ago

@gasyoun Ideally, the same conversion from AS to Roman Unicode would be tackled for all the dictionaries.

It would be useful to investigate the X-meta.txt files for the other dictionaries with regard to their use of AS, so as to have an idea of the magnitude of such a conversion task.

If you would be willing to do such an investigation, I could get you a zip file with all the X-meta files.

gasyoun commented 7 years ago

It would be useful to investigate the X-meta.txt files for the other dictionaries with regard to their use of AS, so as to have an idea of the magnitude of such a conversion task.

Hmm, to be fair, first I want clean headwords. All headwords are clean now. The text is not what I'm interested in, only occasional cleaning for words I quote for my students, like graha. Because if I start, I will not be able to stop. I could look to see if one can compile one list for all? Maybe there is no need for different encoding files. I've researched the AS for 3 years now. Can't promise, but could try to take a look in 2017. How does it sound, Jim?

funderburkjim commented 7 years ago

take a look in 2017 ?

Sure. Any time you're up for helping is a good time.

gasyoun commented 7 years ago

No change needed
Change: it's a print error
Change: it's a typo (digitization error)

What if the options were not all green when checked, but had different colors? @drdhaval2785, does that make sense?

In longer articles like Puru-nisḥshidh I have to scroll to the middle of the article to find (line # 133801). What if the link were duplicated in one static place, so I could see and click it right away? What if the word in question were in a div above the pdf page? As for the latter, it is too bad that my JS coder does not reply; with it I would have no need to see the starting page, only the pdf itself.

http://www.sanskrit-lexicon.uni-koeln.de/scans/MW72Scan/2014/pywork/correctionwork/issue-320/02/update.php done, @funderburkjim

funderburkjim commented 7 years ago

A couple of changes to the UI. See this batch 301.

added the 'word in question' as part of word list. So wordlist shows:
- sequence number within batch
- headword where the potential error occurs
- the word to be examined for correction
In the case detail, the display is limited to a window of about 15 lines around the line to be corrected. Thus, the user won't need to scroll to see the (line #xxx) link.

Improvement?

drdhaval2785 commented 7 years ago

added the 'word in question' as part of word list.

Keep the link over whole keyword (questionable-word) pair. I kept on clicking on questionable-word.

gasyoun commented 7 years ago

limited to a window of about 15 lines

Let's have 13, to eliminate the scrolling in the window. Small, yet still there.

the word to be examined for correction

And in the original transliteration. Can we have IAST on the page as well? The nowadays one.

Improvement?

A lot!

funderburkjim commented 7 years ago

Have revised UI as follows:

Keep the link over whole keyword (questionable-word) pair: Done
Let's have 13 : Done
Also, added the MW-iast alphabet to the right, so user may copy-paste the non-ascii characters.

Re "Can we have IAST on the page as well? The nowadays one": Didn't do this. Since the digitization is coded close to the printed version, I think it would be confusing for corrections to show modern IAST. I'm hoping the MW-IAST table will limit confusion and make corrections involving non-ascii letters easier to enter.

gasyoun commented 7 years ago

I think it would be confusing for corrections to show modern IAST

For example, SLP1 is confusing for Sergey. So I would still vote to have IAST, anywhere on the page. I guess we have enough screen real estate.

Let's have 13 : Done

Scrolling gone, great.

Also, added the MW-iast alphabet to the right, so user may copy-paste the non-ascii characters.

Well done.
