sanskrit-lexicon / COLOGNE

Development of http://www.sanskrit-lexicon.uni-koeln.de/

KRM meta-line conversion #200

Closed funderburkjim closed 6 years ago

funderburkjim commented 6 years ago

This issue for the meta-line conversion of KRM (Kṛdantarūpamālā).

gasyoun commented 6 years ago

@drdhaval2785 is there an automated way of checking the correctness of Kṛdantarūpamālā's forms based on your verb analysis?

drdhaval2785 commented 6 years ago

No. There is not. Currently only tiNanta forms are generated. Not kfdanta.

gasyoun commented 6 years ago

Not kfdanta

Bad luck.

funderburkjim commented 6 years ago

Footnotes

The conversion is almost complete. The most important change is regarding Footnotes.

The text is organized as a sequence of entries, numbered 1 to 2039 (with a few extra entries labeled, e.g., (41-A)); headwords are roots in dhātu-pāṭha form (i.e., with anubandhas). Each entry consists primarily of a list or table of krdanta forms derived from the root. There are copious footnotes.

To understand the original digitization conventions for coding footnotes, and the changes introduced by the new meta-line coding, you need to look at pages 3 and 4 of the printed text. Page 3 has the first part of the entry for 'aka', then a section of footnotes for the page. Page 4 has the remaining part of the aka entry (with two more footnotes) and the beginning of the second entry, 'aki', which also has some footnote indicators; the bottom of the page has the footnotes for the page. The next two comments show scans of these two pages.

funderburkjim commented 6 years ago

Page 3 : aka begins

image

funderburkjim commented 6 years ago

Page 4: aka finishes, and aki begins

image

funderburkjim commented 6 years ago

Original strategy for coding the footnotes

It's difficult to know how to code the footnotes in such a way that the footnotes associated with a particular entry are within the scope of coding of the entry itself. A naive coding would just code the data line-by-line. But then there would be the problem of associating the first two footnotes of page 4 with aka entry.

So instead, Thomas decided to shoehorn each entire footnote in at the location of its mention. Here is how the beginning of 'aka' looks, up through the second line of the table (IAST coding). This is excerpted from the Basic Display before the current conversion:

(1) “aka kuṭilāyāṃ gatau” (ī-bhvādiḥ-792 sakarmakaḥ-seṭ-parasmaipadī) ghaṭādiḥ mit .
‘iditastvaṅkate tatra kuṭilāyāṃ gatāvaket .’ (ślo 41) iti devaḥ .
ṇic- san-
ṇvul ākakaḥ— kikā,
 [Footnote: 1. ‘mitāṃ hrasvaḥ’ (6-4-92) 
iti ṇau upadhāyā hrasvaḥ .] 
akakaḥ— kikā, acikiṣa
 [Footnote: 1A ‘ajāderdvitīyasya’ (6-1-2) iti dvitīya- 
syaikācaḥ dvitvam . ‘kuhoścuḥ’ 
(7-4-62) ityabhyāsasya cutvam .] 
kaḥ— ṣikā;
tṛc (tṛn) akitā-trī, akayitā-trī, acikiṣitā-trī;

While the problem of footnote attachment is clearly solved by this coding, the resulting display grossly distorts the reading of the table of krdantas.

funderburkjim commented 6 years ago

Current strategy

The main idea of the current strategy of coding is to place a footnote marker within the table, and then to collect the corresponding footnotes for the entry at the bottom of the entry. The next comment shows how the total entry for aka looks (snapshot from mobile1 display).
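A minimal sketch of this idea in Python, assuming the older digitization marks each inline footnote as `[Footnote: N. text]`, as in the excerpt above. The function name `collect_footnotes` and the `<sup>` marker format are illustrative assumptions, not the actual conversion code. The sketch does not try to decide whether a footnote interrupted a word; it always leaves one space after the marker.

```python
import re

# Matches an inline footnote such as "[Footnote: 1. text ...]", possibly
# wrapped over several lines, together with the whitespace around it.
# The label is e.g. "1" or "1A".
FOOTNOTE_RE = re.compile(r"\s*\[Footnote:\s*(\w+)\.?\s*(.*?)\]\s*", re.DOTALL)


def collect_footnotes(entry_text):
    """Replace inline footnotes by <sup> markers; return (body, notes)."""
    notes = []

    def replace(match):
        label = match.group(1)
        text = " ".join(match.group(2).split())  # undo line wrapping
        notes.append((label, text))
        return f"<sup>{label}</sup> "

    body = FOOTNOTE_RE.sub(replace, entry_text)
    return body, notes


if __name__ == "__main__":
    sample = "row a,\n [Footnote: 1. first \nnote .] \nrow b;"
    body, notes = collect_footnotes(sample)
    print(body)
    for label, text in notes:
        print(label, text)
```

The collected `notes` list can then be rendered once at the bottom of the entry, so the table rows stay readable.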

funderburkjim commented 6 years ago

Current display of aka

image

funderburkjim commented 6 years ago

There are a few more comments that need to be made. I'll get to them tomorrow.

gasyoun commented 6 years ago

problem of footnote attachment is clearly solved by this coding, the resulting display grossly distorts the reading of the table of krdantas

Exactly. I'll be off till 24th February; do not lose me, I'm heading to Poona.

funderburkjim commented 6 years ago

Although the changes in footnote coding definitely improve the display of the tabular data within this work, there remain several weaknesses; here are a couple that catch my eye.

Multiline tabular entries

The last entry in the aka table illustrates this phenomenon:

image image

The underlying digitization uses a tag <note n=""/> to identify this as a problem area. This is quite common, occurring 700+ times.

Table headings vs. data

In the case of aka, the table has both column labels (ṇic- san-) and row labels (ṇvul, tṛc, etc.). Additional markup is required to distinguish these grammatical labels from the kridanta entries. Such markup would make it possible to develop a search facility whereby a user could determine that, for instance, AkaH is a kridanta of 'aka'.

The aki entry does not similarly show such labels; perhaps the labels are implicit, or perhaps there is some other organizing principle -- situation is unclear to me.
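The search facility mentioned above could be sketched as a reverse index from form to root, once row labels and forms are marked up separately. Everything here is an assumption for illustration: the nested-dictionary layout, the function name `build_form_index`, and the ASCII (SLP1-style) spellings of two forms and row labels taken from the aka excerpt.

```python
from collections import defaultdict


def build_form_index(entries):
    """Map each kridanta form to the (root, row_label) pairs that list it."""
    index = defaultdict(list)
    for root, rows in entries.items():
        for row_label, forms in rows.items():
            for form in forms:
                index[form].append((root, row_label))
    return dict(index)


# Hypothetical marked-up data: root -> row label -> list of forms.
entries = {"aka": {"Rvul": ["AkakaH"], "tfc": ["akitA"]}}
index = build_form_index(entries)

if __name__ == "__main__":
    print(index.get("AkakaH"))  # roots and labels under which the form occurs
```

A form that occurs under several roots simply gets several pairs, which is why the index values are lists.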

Line breaks

Line breaks are significant in many parts of the text (such as to indicate table rows in aka, aki). In cases where a footnote is the first element in a line, the original footnote coding obscures the fact that a line break precedes the footnote marker. This happens, for instance, at footnote marker '9' in the third entry, akṣū. This error can be corrected by inserting a <div n="lb"> tag prior to the footnote marker <sup>9</sup>.

Position of footnote markers within words

The footnote marker occasionally occurs within a kridanta, for instance under 'aka':

image

This positioning, although consistent with the printed text, obscures the full spelling of the kridanta.
My inclination would be to move such footnote markers to the end or beginning of words.
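Moving a word-internal marker to the end of its word is a small regular-expression substitution. This is only a sketch: it assumes markers look like `<sup>1A</sup>`, that words consist of `\w` characters (true for ASCII SLP1 text or precomposed Unicode, not for text with separate combining marks), and the function name is made up for illustration.

```python
import re

# A <sup> marker with word characters directly before and after it,
# i.e. a marker sitting inside a word.
INNER_MARKER_RE = re.compile(r"(?<=\w)(<sup>\w+</sup>)(\w+)")


def move_markers_to_word_end(text):
    """Move word-internal footnote markers to the end of the word."""
    return INNER_MARKER_RE.sub(r"\2\1", text)


if __name__ == "__main__":
    print(move_markers_to_word_end("acikiza<sup>1A</sup>kaH"))
```

Markers that already stand at the beginning or end of a word are left where they are.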

Additional markup

There is a wealth of information in this text. Exposing it to programmatic manipulation will require the efforts of a team with (a) sufficient technical knowledge of Sanskrit grammar to interpret the details of the text, and (b) sufficient technical knowledge of markup principles to devise a markup scheme that captures the grammatical information.

These brief observations may provide some hints when further work on this 'dictionary' is undertaken.

funderburkjim commented 6 years ago

Other aspects of the conversion

Headwords uncovered

There are 2061 entries in krm after this work. About 20 of these were previously missed as separate entries due to a variation in the coding.

Correction sections

There are two correction sections in the full krm.txt digitization; these are separate from the entries exposed by the Cologne displays. They are identified by the text '; BEGIN CORRECTIONS 1' and '; BEGIN CORRECTIONS 2'. These sections occur at pages 1143 and 1427.
Here is the beginning of the second correction section:

<H><s>SoDanikA</s>
<NI><s>puwam paNktiH aSudDam SudDam</s>
<>501 17 <s>cAyakA cAyakaH</s>   <<< first example

image

It would be a fairly straightforward task to implement these corrections. There are approximately 80 corrections in each of the two sections, or 160 corrections in all. Maybe someone can volunteer to do this.
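As a starting point for a volunteer, here is a sketch of parsing one correction line. It assumes the four columns named in the header line (page, line, incorrect form, correct form) and single-word forms, as in the one example shown; the names `Correction` and `parse_correction` are hypothetical. Lines that do not fit this shape return None so they can be reviewed by hand.

```python
import re
from typing import NamedTuple, Optional


class Correction(NamedTuple):
    page: int
    line: int
    wrong: str
    right: str


# e.g. "<>501 17 <s>cAyakA cAyakaH</s>"
CORRECTION_RE = re.compile(r"^<>(\d+)\s+(\d+)\s+<s>(\S+)\s+(\S+)</s>")


def parse_correction(text: str) -> Optional[Correction]:
    """Parse one line of a correction section, or return None."""
    match = CORRECTION_RE.match(text)
    if match is None:
        return None
    page, line_no, wrong, right = match.groups()
    return Correction(int(page), int(line_no), wrong, right)


if __name__ == "__main__":
    print(parse_correction("<>501 17 <s>cAyakA cAyakaH</s>"))
```

Applying a parsed correction would still need a way to locate the printed page and line in krm.txt, which this sketch does not attempt.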

funderburkjim commented 6 years ago

off till 24th February; do not lose me, I'm heading to Poona.

Will miss your comments.

If you talk to the PD team at Poona, maybe you can ask if they'll give permission for Cologne to display our digitization of their dictionary. This would be a way for there to be a much wider audience for their monumental work.

funderburkjim commented 6 years ago

Specialization of display

The main disp.php program used in the Cologne displays for krm was adapted from the pwg version. A few alterations were required for:

funderburkjim commented 6 years ago

IAST conversion

This was quite simple for krm. In fact, the only IAST text appears in the appendices. The body of the text is all Devanagari and English.

funderburkjim commented 6 years ago

The krm conversion task is now completed and the results installed.

gasyoun commented 6 years ago

Such markup would make it possible to develop a search facility whereby a user could determine that, for instance, AkaH is a kridanta of 'aka'.

Yeah, without it the scan is rather useless.

The aki entry does not similarly show such labels; perhaps the labels are implicit, or perhaps there is some other organizing principle -- situation is unclear to me.

Let's call for @Shalu411 .

My inclination would be to move such footnote markers to the end or beginning of words.

Makes sense. But isn't it too big a task for the result it would bring?

Maybe someone can volunteer to do this.

If only @SergeA is around.

maybe you can ask if they'll give permission for Cologne to display our digitization of their dictionary.

Let me try.

gasyoun commented 5 years ago

Currently only tiNanta forms are generated. Not kfdanta.

https://gitlab.inria.fr/huet/Heritage_Resources/ subdirectory XML as well?