Closed funderburkjim closed 6 years ago
The conversion is now complete and installed. The next few comments document some of the changes that were made.
The steps taken in the conversion are described in this readme.org file. In broad terms, the conversion had two parts:
<L>1<pc>001-1<k1>a<k2>a<h>1
1. {#a#}¦
{%Pron-Stamm
der
3.
Person.%}
<LEND>
Any time a program devised to alter the form of a digitization is applied, that program uncovers one or more irregularities in the digitization. The programs devised to improve pagination markup uncovered a section of one page of text which was missing in the digitization. Adding this missing section had the effect of introducing 28 new headwords.
Since this addition was done to the meta-line form, it was possible to use decimal L-numbers so that the previous dynamically assigned L-numbers remained as they were before the addition.
The additions are from page 511 of the text, with the first addition being
<L>28521.02<pc>511-1<k1>skanDAvAra<k2>skanDAvAra
and the last addition being
<L>28521.56<pc>511-1<k1>stambamitra<k2>stambamitra
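As an illustration of the decimal L-number idea, here is a minimal sketch. It assumes (this is inferred from the first and last values above, not stated anywhere) that the 28 new headwords were numbered in steps of 0.02 after L-number 28521; the function name is hypothetical.

```python
from decimal import Decimal

def decimal_lnums(prev_lnum, count, step=Decimal("0.02")):
    """Return `count` decimal L-numbers that sort after prev_lnum
    and before prev_lnum + 1. The 0.02 step is an assumption."""
    base = Decimal(prev_lnum)
    return [str(base + step * (i + 1)) for i in range(count)]

new_lnums = decimal_lnums(28521, 28)
print(new_lnums[0], new_lnums[-1])
```

Because every inserted value lies strictly between 28521 and 28522, the existing integer L-numbers on either side are unchanged.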
You can guess what happened. When the typist finished 'skanDas' in the first column, his or her eye went to the same vertical position in the second column (stambin), thus missing the intervening words at the bottom of the first column and top of the second column.
In the current work, a couple of false headwords were removed (at letter breaks 'r', 'l', etc).
Also one additional missing headword (SatAyus) was noticed.
Several years ago, some headword garbling was also noticed on another page (228, as I recall).
So we might have some additional missing headwords in this older digitization.
One way to check for a string of missing words would be to generate a character count of each column of each page, and look for anomalies in the distribution.
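A rough sketch of that column-count check follows. It assumes the text carries a [PagePPP-C] marker at the start of every column, and the 0.6 ratio used to flag a short column is an arbitrary illustrative threshold; the function names are hypothetical.

```python
import re
from statistics import median

# Assumed marker form: [Page038-2] = page 38, column 2.
PAGE_RE = re.compile(r"\[Page(\d+)-(\d)\]")

def column_char_counts(text):
    """Count non-whitespace characters in each (page, column)."""
    counts = {}
    # With two capture groups, split yields:
    # [preamble, page, col, body, page, col, body, ...]
    parts = PAGE_RE.split(text)
    for i in range(1, len(parts), 3):
        key = (int(parts[i]), int(parts[i + 1]))
        body = parts[i + 2]
        n = sum(1 for ch in body if not ch.isspace())
        counts[key] = counts.get(key, 0) + n
    return counts

def short_columns(counts, ratio=0.6):
    """Return columns whose count is below ratio * median count."""
    if not counts:
        return []
    typical = median(counts.values())
    return sorted(k for k, n in counts.items() if n < ratio * typical)

sample = ("[Page001-1]" + "a" * 100 + "[Page001-2]" + "b" * 100 +
          "[Page002-1]" + "c" * 30 + "[Page002-2]" + "d" * 100)
print(short_columns(column_char_counts(sample)))
```

Columns at letter breaks or at the end of the book would naturally be short, so any flagged column still needs a manual look at the scan.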
The printed text of ccs has two columns per page.
The original ccs digitization uses [PagePPP-1] markup prior to the first text on page PPP; this is 'regular' relative to the other digitizations. However, to indicate the break between the first and second columns it uses the markup |<UL>, which is irregular. This irregular markup was changed to [PagePPP-2].
Another oddity occurs when the column break falls in the middle of an entry. Here is an example, the entry for headword 'avara', which begins on the last line of the first column of page 38 and continues at the top of the second column of page 38.
.{#a/vara#}¦
der
untere,
geringer,
nachstehend, <<< last word first col
|<UL> <<< column break
µ|jünger, <<< first word of second column with gratuitous µ character
näher
The change (at this stage of the conversion) is to:
.{#a/vara#}¦
der
untere,
geringer,
nachstehend, <<< last word first col
[Page038-2] <<< column break
|jünger, <<< first word of second column with gratuitous µ character removed
näher
The vertical bar '|' to indicate line breaks is dealt with in a separate step of the conversion.
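The column-break substitution described above could be sketched as follows. This is only an illustration, not the actual conversion program: it assumes the |<UL> marker sits alone on its line, that the page number can be taken from the most recent [PagePPP-1] marker, and that the stray µ (when present) is the first character of the following line.

```python
import re

PAGE1_RE = re.compile(r"\[Page(\d+)-1\]")

def regularize_column_breaks(lines):
    """Replace '|<UL>' with '[PagePPP-2]' and drop a leading 'µ'
    on the line that follows the column break."""
    out = []
    page = None          # page digits from the last [PagePPP-1]
    after_break = False  # True for the first line after a break
    for line in lines:
        m = PAGE1_RE.search(line)
        if m:
            page = m.group(1)
        if line.strip() == "|<UL>":
            if page is None:
                raise ValueError("column break before any page marker")
            out.append("[Page%s-2]" % page)
            after_break = True
            continue
        if after_break:
            if line.startswith("µ"):
                line = line[1:]
            after_break = False
        out.append(line)
    return out

sample = ["[Page038-1]", "nachstehend,", "|<UL>", "µ|jünger,", "näher"]
print(regularize_column_breaks(sample))
```

The page digits are kept as a string so that the zero padding (038) carries over to the new marker.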
As the examples above illustrate, the original form of the ccs digitization generally put each word of the text on a separate line, and also used the vertical bar character | to indicate line breaks. Examination of numerous cases indicates that this line break information is unexpectedly complete.
However, the format of one word per line in the digitization is quirky, and I changed this to the more usual form of one line of digitization per one line of text.
The basic programmatic idea is to gather the words in the lines of the original digitization one by one until a | is encountered. The words so gathered comprise a line of the text and become (with space separation) a line of the transformed digitization. There are two cases to consider when a | is encountered.
The first case is a vertical bar at the beginning of an original line, as in the |jünger example above. This means that the new word (jünger) becomes the first word on the next transformed line.
The second case is a vertical bar in the middle of an original line. The first example of this occurs in headword akfta.
The original digitization is:
.{#a/kfta#}¦
ungethan,
unbearbeitet,
unvoll|kommen; << vertical bar between unvoll and kommen;
unaufgefordert.
Following the idea introduced in the Burnouf digitization, the new coding is
{#a/kfta#}¦ ungethan, unbearbeitet, unvollkommen; <lbinfo n="7"/>
unaufgefordert.
The entire word unvollkommen; appears without the |, but the position of the line break is encoded in the empty xml tag <lbinfo n="7"/>. Since the line break occurs 7 characters from the end (i.e., before kommen;), the tag both asserts the presence of an intra-word line break and identifies its position; so no information is lost, and the text is easier to process (e.g. for searching) since we have the full word as part of the text.
We should consider this technique in other dictionaries (AE comes to mind) where the digitization is coded with the same lines as the printed text, since joining the parts of a word which starts on one line and continues on the next makes that joined word easier to process.
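The word-gathering step and the lbinfo encoding can be sketched like this. It is a minimal reconstruction from the description above, not the conversion program itself; it assumes one token per input line and at most one | per token, and the function name is hypothetical.

```python
def gather_lines(tokens):
    """Join one-word-per-line tokens into one line per printed line.

    '|word'    : the bar ends the current line; word starts the next.
    'left|right': the word is joined, followed by <lbinfo n="k"/>
                  where k = len(right), then the line ends.
    """
    lines, current = [], []

    def flush():
        if current:
            lines.append(" ".join(current))
            current.clear()

    for token in tokens:
        token = token.strip()
        if not token:
            continue
        left, bar, right = token.partition("|")
        if not bar:                 # ordinary word
            current.append(token)
        elif left and right:        # break inside a word
            current.append(left + right)
            current.append('<lbinfo n="%d"/>' % len(right))
            flush()
        elif left:                  # 'word|': break after the word
            current.append(left)
            flush()
        else:                       # '|word' or a bare '|'
            flush()
            if right:
                current.append(right)
    flush()
    return lines

akfta = [".{#a/kfta#}¦", "ungethan,", "unbearbeitet,",
         "unvoll|kommen;", "unaufgefordert."]
for line in gather_lines(akfta):
    print(line)
```

The leading period on the headword is left in place here, since its removal is a separate change. Counting from the end of the joined word means the original split can be restored by slicing k characters off the end.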
Compared to other dictionaries, the use of Latin letters with diacritics to represent Sanskrit words occurs in a small fraction of the words. After clearing out some 30 or so anomalous cases, only about 700 words are coded using the AS (letter-number) diacritic representation system. And cursory examination confirms that these are Sanskrit words. The C-cedilla is also used in the text for the Latin alphabet equivalent of the palatal sibilant of the Sanskrit alphabet.
The printed text typically uses a circumflex as the diacritic for long vowels; the original AS coding uses a '10' for this circumflex. The current transformation changes this to the macron diacritic.
For example: î (circumflex, printed text) -> i10 (original digitization) -> ī (macron, current digitization).
Similarly, the other AS-encoded diacritics were changed to their modern IAST equivalents.
Also, the c-cedilla is changed to ś.
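A small sketch of the two substitutions spelled out above (the '10' code to a macron, and c-cedilla to ś). The other AS letter-number codes are not listed in this thread, so they are deliberately left out; restricting '10' to the vowels a, i, u is an assumption, and the function should only be applied to words already identified as Sanskrit.

```python
import re
import unicodedata

MACRON = "\u0304"  # combining macron

def as_to_iast(word):
    """Convert vowel+'10' to a macron vowel and c-cedilla to s-acute.
    Only these two rules are covered; other AS codes pass through."""
    word = unicodedata.normalize("NFC", word)
    word = re.sub(r"([aiuAIU])10", lambda m: m.group(1) + MACRON, word)
    word = word.replace("\u00e7", "\u015b").replace("\u00c7", "\u015a")
    # Compose base letter + combining macron into a single character.
    return unicodedata.normalize("NFC", word)

print(as_to_iast("i10"))
```

Building the macron with a combining character and then normalizing avoids a hand-written table of precomposed letters.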
In contrast to recent conversions for PWG and similar dictionaries, I made no attempt in this conversion of ccs to present modern IAST where no diacritics are involved. The main case that comes to mind is the 'sh' representation of the cerebral sibilant -- I didn't try to identify words (whether with or without other AS encoded diacritics) where 'sh' should be changed to modern IAST ṣ.
This omission is one of the places where this IAST conversion is 'cursory' [see this comment].
When time and interest warrant, someone can complete this detail of the conversion to modern IAST.
Only a few other small changes were made to the digitization.
The double hyphen -- was replaced with the unicode em dash character —, which agrees with the text.
The leading period . used in the original digitization to identify headwords was removed; there is no need for it since we have the meta-line to identify headwords.
Homonym numbers were moved: in the original digitization, the first homonym of headword a is coded as .{#a#}^1. The current digitization codes it as 1. {#a#}
One sign of some progress of convergence of the coding conventions among the dictionaries relates to the main program fragment used in the displays of the dictionaries. This fragment, disp.php, is responsible for converting the xml form of the digitization (e.g., ccs.xml) into html for display in browsers.
In the current dictionary, ccs, I copied the pwg version of disp.php to see if it would work with ccs. And it seems to work just fine. This was a pleasant surprise.
An implication of the 'cursory' nature of this conversion is that some things done in other conversions were intentionally left for another time for ccs.
Probably the most important from the point of view of general user friendliness would be abbreviation markup. There appear to be a relatively small number of frequently used abbreviations. For instance, notice all the juicy abbreviations just in this one entry.
2. {#a,#}¦ {#an#} {#(°—)#} {%negat. Präfix -un, vor Subst.,%}
{%Adj., Adv., Partic. u. Ger.; selten vor%}
{%Inf. und Verb. fin.%}
With likely added abbreviation markup, this entry in ccs.txt might look like
2. {#a,#}¦ {#an#} {#(°—)#} {%<ab>negat.</ab> Präfix -un, vor <ab>Subst.</ab>,%}
{%<ab>Adj.</ab>, <ab>Adv.</ab>, <ab>Partic.</ab> <ab>u.</ab> <ab>Ger.</ab>; selten vor%}
{%<ab>Inf.</ab> und <ab>Verb. fin.</ab>%}
Then an expansion table, quite close to that for PW, could be developed. With these in place, display users would have tooltips for the abbreviations.
The trickiest part of this is the correct placement of the abbreviation tags, taking into account vagaries of punctuation, spelling and capitalization. The task is eminently doable, but will take some time to get the details right. That time requirement is why I omitted this markup enhancement for now.
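To give a feel for the tag placement problem, here is a minimal sketch that wraps a known list of abbreviations in <ab> tags inside {%...%} spans. The abbreviation list is just the set seen in the sample entry, and the function names are hypothetical; a real pass would need a fuller list and handling of spelling and capitalization variants.

```python
import re

# Abbreviations taken from the sample entry only (illustrative list).
ABBREVS = ["negat.", "Subst.", "Adj.", "Adv.", "Partic.",
           "u.", "Ger.", "Inf.", "Verb. fin."]

def build_pattern(abbrevs):
    # Longest first, so 'Verb. fin.' wins over any shorter overlap.
    alts = sorted(abbrevs, key=len, reverse=True)
    body = "|".join(re.escape(a) for a in alts)
    # (?<!\w) keeps 'u.' from matching at the end of a longer word.
    return re.compile(r"(?<!\w)(?:" + body + ")")

def tag_abbreviations(line, pattern):
    """Wrap abbreviations in <ab>...</ab>, only inside {%...%}."""
    def tag_span(m):
        inner = pattern.sub(lambda a: "<ab>" + a.group(0) + "</ab>",
                            m.group(1))
        return "{%" + inner + "%}"
    return re.sub(r"\{%(.*?)%\}", tag_span, line)

pattern = build_pattern(ABBREVS)
print(tag_abbreviations("{%Inf. und Verb. fin.%}", pattern))
```

Trailing punctuation such as the comma after Adj. stays outside the tag because the abbreviation itself ends at its final period. The sketch does not guard against text that is already tagged, so it should be run once on untagged input.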
ccs has about 30,000 headwords. I am surprised to see that cae has 1/3 more, about 40,000 headwords. Nonetheless, it seems likely that a headword comparison of the two dictionaries by Cappeller might be useful. For instance, I would expect every one of the 30k CCS headwords to be among the 40k headwords of CAE -- and deviations from this would likely turn up spelling corrections among the headwords of one or the other dictionary.
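The headword comparison could start from something like the sketch below. It assumes both digitizations are available in meta-line form with the same <L>...<pc>...<k1>...<k2>... layout shown earlier in this thread (an assumption for cae), and the sample cae line is made up for illustration.

```python
import re

# Meta-line form as shown above: <L>1<pc>001-1<k1>a<k2>a<h>1
K1_RE = re.compile(r"^<L>[^<]*<pc>[^<]*<k1>([^<]*)<k2>")

def headwords(lines):
    """Collect the k1 value from every meta-line."""
    found = set()
    for line in lines:
        m = K1_RE.match(line)
        if m:
            found.add(m.group(1))
    return found

def only_in_first(first_lines, second_lines):
    """Headwords present in the first dictionary but not the second."""
    return sorted(headwords(first_lines) - headwords(second_lines))

ccs_lines = ["<L>1<pc>001-1<k1>a<k2>a<h>1", "1. {#a#}¦", "<LEND>",
             "<L>28521.02<pc>511-1<k1>skanDAvAra<k2>skanDAvAra"]
cae_lines = ["<L>1<pc>001-1<k1>a<k2>a"]  # hypothetical cae meta-line
print(only_in_first(ccs_lines, cae_lines))
```

Each headword reported this way is a candidate for a spelling check in one dictionary or the other.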
One additional benefit of adding abbreviation markup to ccs is that it would help identify the German words of the dictionary. A list of likely German words could then be compared to a dictionary of German words, such as that available with pyenchant, to identify likely spelling errors among the German words of ccs. The peculiarity of German related to Old and New German (I'm not sure if this is quite the right way to refer to this distinction) might muddy the comparison between the words in this 1887 dictionary and the words in a modern German dictionary.
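A sketch of that spelling check follows. It assumes abbreviation markup is already in place, treats everything outside {#...#} spans, <ab> elements and meta-lines as candidate German text, and takes the dictionary lookup as a parameter so the logic can be tried without pyenchant. The enchant_checker helper uses pyenchant's Dict(...).check and assumes a de_DE dictionary is installed; all function names are hypothetical.

```python
import re

SANSKRIT_RE = re.compile(r"\{#.*?#\}")
AB_RE = re.compile(r"<ab>.*?</ab>")
TAG_RE = re.compile(r"<[^>]*>")
WORD_RE = re.compile(r"[^\W\d_]+")  # runs of letters, incl. umlauts

def german_candidates(text):
    """Return the set of words that are neither Sanskrit, nor
    abbreviations, nor part of a meta-line."""
    words = set()
    for line in text.splitlines():
        if line.startswith("<L>") or line.startswith("<LEND>"):
            continue
        line = SANSKRIT_RE.sub(" ", line)
        line = AB_RE.sub(" ", line)
        line = TAG_RE.sub(" ", line)
        words.update(WORD_RE.findall(line))
    return words

def suspect_words(words, is_known):
    """Words the supplied dictionary predicate does not recognize."""
    return sorted(w for w in words if not is_known(w))

def enchant_checker(lang="de_DE"):
    # Imported lazily: needs pyenchant plus an installed dictionary.
    import enchant
    return enchant.Dict(lang).check

entry = ("<L>2<pc>001-1<k1>a<k2>a<h>2\n"
         "{#a,#}¦ {#an#} {%<ab>negat.</ab> Präfix -un, vor "
         "<ab>Subst.</ab>,%}\n<LEND>")
known = {"Präfix", "vor"}  # stand-in for a real dictionary
print(suspect_words(german_candidates(entry), known.__contains__))
```

With pyenchant available, suspect_words(words, enchant_checker()) would run the same check against the installed German dictionary.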
These are all the descriptive comments that come to mind regarding the recent ccs conversion.
In the current dictionary, ccs, I copied the pwg version of disp.php to see if it would work with ccs. And it seems to work just fine.
A healthy sign for future maintenance indeed. So standardization has started paying off.
It's 2018, that means the 4th year of public discussions has started. And all thanks to Jim. Hurray!
However, to indicate the break between the first and second columns it uses the markup |<UL>, which is irregular. This irregular markup was changed to [PagePPP-2].
I did not get this: how will I now know which column to look at? Is it visible on the web display? It's not similar to what we have in Apte, MW or PWG, right?
c-cedilla is changed to ś.
Only here, or everywhere you encounter it?
pyenchant, to identify likely spelling errors among the German words of ccs
As seen in the past, it works quite well with Old German orthography.
The pagination is now similar to that of other dictionaries. Here's a screenshot for a word in the second column:
c-cedilla is changed to ś everywhere ?
Yes, provided it is a Sanskrit word. I think c-cedilla occurs in some modern languages (French), so if a French word appears in the text of an entry (such as in Burnouf or Stchoupak), then it should not be changed to ś. The conversion of c-cedilla to ś is intended only to provide modern IAST spelling for Sanskrit words.
The pagination is now similar to that of other dictionaries.
Perfect.
c-cedilla occurs in some modern languages (French)
Indeed and it was taken from French for the transliteration of Sanskrit some 150 years ago.
Begin meta-line conversion for Cappeller Sanskrit Wörterbuch (cologne dictionary id = ccs).