CCS meta-line/iast conversion

funderburkjim commented 6 years ago

Begin meta-line conversion for Cappeller Sanskrit Wörterbuch (cologne dictionary id = ccs).

funderburkjim commented 6 years ago

The conversion is now complete and installed. The next few comments document some of the changes that were made.

The steps taken in the conversion are described in this readme.org file. In broad terms, the conversion had two parts:

Part 1: convert ccs.txt into 'meta-line' format. This means that certain 'meta' information is computed and made part of the digitization file. This meta information consists of:
- cologne record id (L) to identify each entry
- page-column reference to the printed text serving as a basis for the digitization.
- primary headword spelling (key1)
- original headword spelling (key2)
In terms of form, the digitization at this stage looks like a sequence of entries, each of which has the form: [meta-line] + [almost original digitization] + [end of entry marker]. For instance:
```
<L>1<pc>001-1<k1>a<k2>a<h>1
1. {#a#}¦
{%Pron-Stamm
der
3.
Person.%}   
<LEND>
```
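The entry form shown above can be read back with a few lines of Python. This is only a minimal sketch, not the code used in the conversion: the function names `parse_metaline` and `read_entries` are hypothetical, and it assumes every entry starts with a line beginning `<L>` and ends with a line beginning `<LEND>`.

```python
import re

# Each meta-line field is introduced by a tag such as <L>, <pc>, <k1>, <k2>, <h>.
FIELD = re.compile(r"<(\w+)>([^<]*)")

def parse_metaline(line):
    """Return a dict mapping field name to value for one meta-line."""
    return dict(FIELD.findall(line.strip()))

def read_entries(lines):
    """Yield (meta, body_lines) pairs from text in meta-line format."""
    meta, body = None, []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("<L>"):
            # A meta-line opens a new entry.
            meta, body = parse_metaline(line), []
        elif line.startswith("<LEND>"):
            # The end marker closes the current entry.
            if meta is not None:
                yield meta, body
            meta, body = None, []
        elif meta is not None:
            body.append(line)
```

For the entry above, `parse_metaline` gives the keys `L`, `pc`, `k1`, `k2` and `h`; all values are kept as strings so that decimal L-numbers survive unchanged.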
Part 2: adjust various details of the body portion of the entries. The main goal here is to change conventions peculiar to the original ccs digitization to the conventions used in the other dictionaries. Some details are described below.

funderburkjim commented 6 years ago

New headwords

Any time a program devised to alter the form of a digitization is applied, that program uncovers one or more irregularities in the digitization. The programs devised to improve pagination markup uncovered a section of one page of text which was missing in the digitization. Adding this missing section had the effect of introducing 28 new headwords.

Since this addition was done to the meta-line form, it was possible to use decimal L-numbers so that the previous dynamically assigned L-numbers remained as they were before the addition.

The additions are from page 511 of the text, with the first addition being <L>28521.02<pc>511-1<k1>skanDAvAra<k2>skanDAvAra and the last addition being <L>28521.56<pc>511-1<k1>stambamitra<k2>stambamitra.
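Any program that sorts or compares L-numbers has to treat them as decimals once such insertions exist. A small sketch, assuming L-numbers are read as strings (the helper name `l_key` is hypothetical):

```python
from decimal import Decimal

def l_key(l_number):
    """Sort key for an L-number given as a string, e.g. '28521.02'."""
    return Decimal(l_number)

ids = ["28522", "28521.56", "28521", "28521.02"]
ordered = sorted(ids, key=l_key)
# ordered is ["28521", "28521.02", "28521.56", "28522"]
```

`Decimal` is used rather than plain string comparison because strings would sort "9" after "10".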

You can guess what happened. When the typist finished 'skanDas' in the first column, his or her eye went to the same vertical position in the second column (stambin), thus missing the intervening words at the bottom of the first column and top of the second column.

funderburkjim commented 6 years ago

Other missing headwords?

In the current work, a couple of false headwords were removed (at letter breaks 'r', 'l', etc).

Also one additional missing headword (SatAyus) was noticed.

Several years ago, some headword garbling was also noticed on another page (228, as I recall).

So we might have some additional missing headwords in this older digitization.

One way to check for a string of missing words would be to generate a character count of each column of each page, and look for anomalies in the distribution.
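A rough sketch of that check, assuming the meta-line format described earlier. The function names and the `ratio` threshold are hypothetical, and each entry is credited entirely to the column where it starts, so an entry that runs across a column break slightly distorts both counts.

```python
import re
from collections import Counter
from statistics import median

PC = re.compile(r"<pc>([^<]+)")

def column_sizes(lines):
    """Count body characters per page-column (the <pc> value of each entry)."""
    sizes = Counter()
    current = None
    for line in lines:
        if line.startswith("<L>"):
            m = PC.search(line)
            current = m.group(1) if m else None
        elif line.startswith("<LEND>"):
            current = None
        elif current is not None:
            sizes[current] += len(line.strip())
    return sizes

def short_columns(sizes, ratio=0.6):
    """Return the columns whose character count is below ratio * median."""
    if not sizes:
        return []
    typical = median(sizes.values())
    return sorted(pc for pc, n in sizes.items() if n < ratio * typical)
```

Columns at the start or end of a letter section would probably show up as false alarms, so the output is a list of places to look at, not a list of errors.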

funderburkjim commented 6 years ago

Column breaks

The printed text of ccs has two columns per page. The original ccs digitization uses [PagePPP-1] markup prior to the first text on page PPP; this is 'regular' relative to the other digitizations. However, to indicate the break between the first and second columns it uses the markup |<UL>, which is irregular. This irregular markup was changed to [PagePPP-2] .

Another oddity occurs when the column break falls in the middle of an entry. Here is an example, the entry for headword 'avara', which begins on the last line of the first column of page 38 and continues at the top of the second column of page 38.

.{#a/vara#}¦
der
untere,
geringer,
nachstehend,   <<< last word first col
|<UL>              <<< column break
µ|jünger,          <<< first word of second column    with gratuitous µ character
näher

The change (at this stage of the conversion) is to:

.{#a/vara#}¦
der
untere,
geringer,
nachstehend,   <<< last word first col
[Page038-2]    <<< column break
|jünger,       <<< first word of second column, with gratuitous µ character removed
näher


The vertical bar '|' to indicate line breaks is dealt with in a separate step of the conversion.
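The column-break change can be sketched as follows. This is an illustration under assumptions, not the actual conversion program: it assumes `|<UL>` always stands alone on its line, that the page number can be taken from the most recent [PagePPP-1] marker, and that the stray µ only occurs at the start of the line right after the break. The function name `fix_column_breaks` is hypothetical.

```python
import re

PAGE = re.compile(r"\[Page(\d+)-1\]")

def fix_column_breaks(lines):
    """Replace |<UL> with [PagePPP-2] and drop a stray µ after the break."""
    out = []
    page = None
    after_break = False
    for line in lines:
        m = PAGE.search(line)
        if m:
            page = m.group(1)  # remember the current page number
        if line.strip() == "|<UL>":
            if page is None:
                raise ValueError("column break before any [PagePPP-1] marker")
            out.append("[Page%s-2]" % page)
            after_break = True
            continue
        if after_break:
            # First line of the second column: remove the gratuitous µ.
            if line.startswith("µ"):
                line = line[1:]
            after_break = False
        out.append(line)
    return out
```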

funderburkjim commented 6 years ago

Merge lines

As the examples above illustrate, the original form of the ccs digitization generally put each word of the text on a separate line, and also used the vertical bar character | to indicate line breaks. Examination of numerous cases indicates that this line break information is unexpectedly complete.

However, the format of one word per line in the digitization is quirky, and I changed this to the more usual form of one line of digitization per one line of text.

The basic programmatic idea is to gather the words in the lines of the original digitization one by one until a | is encountered. The words so gathered comprise a line of the text and become (with space separation) a line of the transformed digitization. There are two cases to consider when a | is encountered:

- Vertical bar at the beginning of a line. See the |jünger example above. This means that the new word (jünger) becomes the first word on the next transformed line.
- Vertical bar in the middle of an original line. The first example of this occurs in headword akfta. The original digitization is:
```
.{#a/kfta#}¦
ungethan,
unbearbeitet,
unvoll|kommen;   << vertical bar between unvoll and kommen;
unaufgefordert.
```
Following the idea introduced in the Burnouf digitization, the new coding is
```
{#a/kfta#}¦ ungethan, unbearbeitet, unvollkommen; <lbinfo n="7"/>
unaufgefordert.
```
The entire word unvollkommen; appears without the |, but the position of the line break is encoded in the empty xml tag <lbinfo n="7"/>. The tag asserts that an intra-word line break is present, and its n attribute says the break occurs 7 characters from the end of the word (i.e., before kommen;). So no information is lost, and the text is easier to process (e.g. for searching) since the full word is part of the text.
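The two cases can be sketched in Python as below. This is a minimal illustration of the gathering idea, not the program actually used; the function name `merge_words` is hypothetical, and removal of the leading period is left to the separate step described later.

```python
def merge_words(words):
    """Turn a one-word-per-line digitization into one line per printed line."""
    lines, current = [], []
    for word in words:
        word = word.strip()
        if not word:
            continue
        if word.startswith("|"):
            # Case 1: the break falls before this word, so close the
            # current line and start the next one with this word.
            if current:
                lines.append(" ".join(current))
            rest = word[1:]
            current = [rest] if rest else []
        elif "|" in word:
            # Case 2: the break falls inside this word. Keep the word whole
            # and record how many characters follow the break.
            head, tail = word.split("|", 1)
            current.append(head + tail)
            current.append('<lbinfo n="%d"/>' % len(tail))
            lines.append(" ".join(current))
            current = []
        else:
            current.append(word)
    if current:
        lines.append(" ".join(current))
    return lines
```

Applied to the five original lines of the akfta entry, this gives the two lines shown above (with the leading period still in place).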

We should consider this technique in other dictionaries (AE comes to mind) where the digitization is coded with the same lines as the printed text, since joining the parts of a word which starts on one line and continues on the next makes that joined word easier to process.

funderburkjim commented 6 years ago

Conversion of non-ascii characters to unicode.

Compared to other dictionaries, the use of Latin letters with diacritics to represent Sanskrit words occurs in a small fraction of the words. After clearing out some 30 or so anomalous cases, only about 700 words are coded using the AS (letter-number) diacritic representation system. And cursory examination confirms that these are Sanskrit words. The C-cedilla is also used in the text for the Latin alphabet equivalent of the palatal sibilant of the Sanskrit alphabet.

The printed text typically uses a circumflex as the diacritic for long vowels; the original AS coding uses a '10' for this circumflex. The current transformation changes this to the macron diacritic.

For example: î (circumflex, printed text) --> i10 (original digitization) --> ī (macron, current digitization).

Similarly, the other AS-encoded diacritics were changed to their modern IAST equivalents. Also, the c-cedilla is changed to ś.
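The two changes named above can be sketched as follows. Only the '10' code and the c-cedilla are handled, since those are the only codes described here; the restriction of '10' to the vowels a, i, u is an assumption, the function name `as_to_iast` is hypothetical, and the function should only be applied to words already identified as Sanskrit.

```python
import re
import unicodedata

MACRON = "\u0304"  # Unicode combining macron

def as_to_iast(word):
    """Convert vowel+'10' to a macron vowel and c-cedilla to s-acute."""
    # Compose first, so a decomposed c + cedilla is seen as one character.
    word = unicodedata.normalize("NFC", word)
    # Assumption: the '10' code only follows the vowels a, i, u.
    word = re.sub(r"([aiuAIU])10", lambda m: m.group(1) + MACRON, word)
    # The c-cedilla stands for the palatal sibilant.
    word = word.replace("\u00e7", "\u015b").replace("\u00c7", "\u015a")
    # Compose vowel + combining macron into a single character.
    return unicodedata.normalize("NFC", word)
```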

Conversion omissions

In contrast to recent conversions for PWG and similar dictionaries, I made no attempt in this conversion of ccs to present modern IAST where no diacritics are involved. The main case that comes to mind is the 'sh' representation of the cerebral sibilant -- I didn't try to identify words (whether with or without other AS encoded diacritics) where 'sh' should be changed to modern IAST ṣ.

This omission is one of the places where this IAST conversion is 'cursory' [see this comment].

When time and interest warrant, someone can complete this detail of the conversion to modern IAST.

funderburkjim commented 6 years ago

Miscellaneous changes

Only a few other small changes were made to the digitization.

- Replace the double-dash -- with the unicode em dash character —, which agrees with the text.
- Remove the initial period . used in the original digitization to identify headwords; there is no need for it since the meta-line now identifies headwords.
- Change the coding of homonyms to follow the form of the printed text. For example, the first homonym of the headword a was coded as .{#a#}^1. The current digitization codes it as 1. {#a#}
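These three changes are simple enough to sketch with regular expressions. The pattern for the homonym coding is inferred from the single example above, so treat it as an assumption; the function name `tidy_line` is hypothetical.

```python
import re

EM_DASH = "\u2014"
# Optional leading period, then {#headword#}, then ^ and the homonym number.
HOMONYM = re.compile(r"^\.?(\{#[^#]*#\})\^(\d+)")

def tidy_line(line):
    """Apply the three small changes to one line of the digitization."""
    # .{#a#}^1 becomes 1. {#a#}
    line = HOMONYM.sub(r"\2. \1", line)
    # Drop the period that used to mark a headword line.
    if line.startswith(".{#"):
        line = line[1:]
    # Double-dash becomes the em dash character.
    return line.replace("--", EM_DASH)
```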

funderburkjim commented 6 years ago

displays based on PWG disp.php

One sign of some progress of convergence of the coding conventions among the dictionaries relates to the main program fragment used in the displays of the dictionaries. This fragment, disp.php, is responsible for converting the xml form of the digitization (e.g., ccs.xml) into html for display in browsers.

In the current dictionary, ccs, I copied the pwg version of disp.php to see if it would work with ccs. And it seems to work just fine. This was a pleasant surprise.

funderburkjim commented 6 years ago

TODO

An implication of the 'cursory' nature of this conversion is that some things done with other conversions I intentionally left for another time for ccs.

Abbreviations

Probably the most important from the point of view of general user friendliness would be abbreviation markup. There appear to be a relatively small number of frequently used abbreviations. For instance, notice all the juicy abbreviations just in this one entry.

2. {#a,#}¦ {#an#} {#(°—)#} {%negat. Präfix -un, vor Subst.,%}
{%Adj., Adv., Partic. u. Ger.; selten vor%}
{%Inf. und Verb. fin.%}

With likely added abbreviation markup, this entry in ccs.txt might look like

2. {#a,#}¦ {#an#} {#(°—)#} {%<ab>negat.</ab> Präfix -un, vor <ab>Subst.</ab>,%}
{%<ab>Adj.</ab>, <ab>Adv.</ab>, <ab>Partic.</ab> <ab>u.</ab> <ab>Ger.</ab>; selten vor%}
{%<ab>Inf.</ab> und <ab>Verb. fin.</ab>%}

Then an expansion table, quite close to that for PW, could be developed. With these in place, then the display users would have tooltips for the abbreviations.
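A first pass at placing the tags could look like the sketch below. The abbreviation list is hypothetical (taken only from the sample entry above), and the matching is deliberately naive: it requires the exact spelling and capitalization, and leaves trailing punctuation outside the tag.

```python
import re

# Hypothetical list, drawn only from the sample entry above.
ABBREVIATIONS = ["negat.", "Subst.", "Adj.", "Adv.", "Partic.",
                 "u.", "Ger.", "Inf.", "Verb. fin."]

def build_pattern(abbreviations):
    """Compile one regex that matches any abbreviation, longest first."""
    ordered = sorted(abbreviations, key=len, reverse=True)
    alternation = "|".join(re.escape(a) for a in ordered)
    # The lookbehind rejects matches inside a word ('u.' in 'Subst.')
    # and matches that already follow a tag, so a second run is harmless.
    return re.compile(r"(?<![\w>])(%s)" % alternation)

def mark_abbreviations(text, pattern):
    """Wrap every matched abbreviation in <ab>...</ab>."""
    return pattern.sub(r"<ab>\1</ab>", text)
```

Running it twice over the same line leaves the line unchanged, which makes it easier to re-run after the abbreviation list grows.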

The trickiest part of this is the correct placement of the abbreviation tags, taking into account vagaries of punctuation, spelling and capitalization. The task is eminently doable, but will take some time to get the details right. That time requirement is why I omitted this markup enhancement for now.

Compare to cae

ccs has about 30,000 headwords. I am surprised to see that cae has 1/3 more, about 40,000 headwords. Nonetheless, it seems likely that a headword comparison of the two dictionaries by Cappeller might be useful. For instance, I would expect every one of the 30k CCS headwords to be among the 40k headwords of CAE -- and deviations from this would likely turn up spelling corrections among the headwords of one or the other dictionary.
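The comparison itself is a set difference on the key1 values. A sketch, assuming both digitizations are available in meta-line format with a `<k1>` field (the function names are hypothetical):

```python
import re

K1 = re.compile(r"<k1>([^<]+)")

def headwords(lines):
    """Collect the key1 spellings from the meta-lines of a digitization."""
    found = set()
    for line in lines:
        if line.startswith("<L>"):
            m = K1.search(line)
            if m:
                found.add(m.group(1))
    return found

def only_in_first(first_lines, second_lines):
    """Return the headwords of the first dictionary missing from the second."""
    return sorted(headwords(first_lines) - headwords(second_lines))
```

Calling `only_in_first` with the ccs lines first and the cae lines second would list the CCS headwords not found in CAE, which are the candidates for spelling checks.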

use pyenchant to spell-check German in ccs

One additional benefit of adding abbreviation markup to ccs is that it would help identify the German words of the dictionary. A list of likely German words could then be compared to a dictionary of German words, such as that available with pyenchant, to identify likely spelling errors among the German words of ccs. The peculiarity of German related to Old and New German (I'm not sure if this is quite the right way to refer to this distinction) might muddy the comparison between the words in this 1887 dictionary and the words in a modern German dictionary.
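A sketch of such a check is below. `enchant.Dict(tag).check(word)` is the pyenchant call for testing one word; everything else (the function names, the word pattern, the `de_DE` tag) is an assumption, and a German dictionary has to be installed for the enchant backend. The checker is passed in as a parameter so that the word-gathering part can be tried without pyenchant.

```python
import re

# Runs of Latin letters including the German umlauts and sharp s.
WORD = re.compile(r"[A-Za-zÄÖÜäöüß]+")

def german_checker(tag="de_DE"):
    """Return a function word -> bool backed by pyenchant."""
    import enchant  # imported here so the rest works without pyenchant
    return enchant.Dict(tag).check

def suspect_words(text, check):
    """Return the distinct words of text rejected by check, in first-seen order."""
    seen, suspects = set(), []
    for word in WORD.findall(text):
        if word not in seen:
            seen.add(word)
            if not check(word):
                suspects.append(word)
    return suspects
```

Typical use would be `suspect_words(german_text, german_checker())`. Spellings such as ungethan, which follow the orthography of the period, may be flagged by a modern dictionary, so the result needs manual review.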

funderburkjim commented 6 years ago

These are all the descriptive comments that come to mind regarding the recent ccs conversion.

drdhaval2785 commented 6 years ago

In the current dictionary, ccs, I copied the pwg version of disp.php to see if it would work with ccs. And it seems to work just fine.

A healthy sign for future maintenance indeed. So standardization has started paying off.

gasyoun commented 6 years ago

It's 2018, that means the 4th year of public discussions has started. And all thanks to Jim. Hurray!

However, to indicate the break between the first and second columns it uses the markup |<UL>, which is irregular. This irregular markup was changed to [PagePPP-2] .

I didn't follow: how will I now know which column to look at? Is it visible on the web display? It's not similar to what we have in Apte, MW or PWG, right?

c-cedilla is changed to ś.

Only here, or everywhere you encounter it?

pyenchant, to identify likely spelling errors among the German words of ccs

As seen in the past, it works quite well with Old German orthography.

funderburkjim commented 6 years ago

The pagination is now similar to that of other dictionaries. Here's a screenshot for a word in the second column:

funderburkjim commented 6 years ago

c-cedilla is changed to ś everywhere ?

Yes, provided it is a Sanskrit word. I think c-cedilla occurs in some modern languages (French), so if a French word appears in the text of an entry (such as in Burnouf or Stchoupak), then it should not be changed to the ś. That conversion c-cedilla to ś is just intended for the purpose of providing modern IAST spelling to Sanskrit words.

gasyoun commented 6 years ago

The pagination is now similar to that of other dictionaries.

Perfect.

c-cedilla occurs in some modern languages (French)

Indeed and it was taken from French for the transliteration of Sanskrit some 150 years ago.
