Open funderburkjim opened 10 years ago
The CCS file is very dirty, with literally hundreds of mistakes. Isn't it time to get rid of the initial markup? HK has no advantage over SLP1. For names we can use some symbol such as ~, so the capital-letter metadata is not totally lost.
Shouldn't we correct the existing CCS digitization first?
If you need an SLP1 version, one option is to use ccs.xml, where Sanskrit has been converted from HK to SLP1. Another option is for me to make an SLP1 version of ccs.txt.
German words are not too hard to identify in CCS; they appear either in italics {%...%} or in no {}. If we had a German wordlist, we could match CCS German against the list.
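A minimal sketch of that extraction, assuming a ccs.txt entry looks roughly like the hypothetical `sample_line` below (it is reconstructed from the markup conventions mentioned in this thread, not copied from the file). The function names and the tiny wordlist are illustrative only. Note that italic abbreviations such as "n." and "m." also come out as candidates and would need filtering.

```python
import re

ITALIC = re.compile(r"\{%(.*?)%\}")        # italic spans: {%...%}
BRACED = re.compile(r"\{[^{}]*\}")         # any brace-delimited span, e.g. {#...#}
WORD = re.compile(r"[A-Za-zÄÖÜäöüß]+")     # runs of (German) letters

def german_candidates(line):
    """Return words found in italics {%...%} or outside any {} markup."""
    italic_text = " ".join(ITALIC.findall(line))
    outside_text = BRACED.sub(" ", line)   # drop every braced span
    return WORD.findall(italic_text + " " + outside_text)

def unknown_words(words, wordlist):
    """Words not present (case-insensitively) in a German wordlist."""
    return [w for w in words if w.lower() not in wordlist]

# Hypothetical entry line, for illustration only.
sample_line = "{#aMzuka#}¦ {%n.%} Gewand. {#aMzukAnta#} {%m.%} -zipfel."
candidates = german_candidates(sample_line)
suspects = unknown_words(candidates, {"gewand", "zipfel"})
```

With a real wordlist, `suspects` would be the list to check against the scans. Compound fragments such as "-zipfel" are matched here as plain "zipfel", which may or may not be what is wanted.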
As to the Sanskrit words, perhaps a variant of Dhaval's approach would catch many errors.
I think it makes sense to correct most of the errors before reformatting and adding markup.
1) "Shouldn't we correct the existing CCS digitization first?" That depends on which part you are speaking about. Headwords might be flagged by patterns, but Dhaval seems to be tired of checking those patterns against the scans, so it would be a checklist with nobody to check it.
2) An SLP1 version is wanted. I opened the XML and did not see it; what do you mean by ccs.xml being in SLP1?
3) Can you please extract all the German words, those in {%...%} or in no {}?
4) "makes sense to correct most of the errors before reformatting and adding markup" - you just do not know how dirty it is :) What do you mean by adding markup? If so, it will take too long. As of now I can't even get an SLP1 headword list. Can I?
Re 2) : I downloaded ccsxml.zip from the CCS download page. Extracting the zip leads to a folder 'xml'. This folder contains several files, among which are ccshw2.txt and ccs.xml.
ccshw2.txt has as first few lines:
001-1:a:7,12
001-1:a:13,32
001-1:aMSa:33,43
001-1:aMSakalpanA:44,49
001-1:aMSaBAj:50,52
001-1:aMSaBUta:53,57
Each line has three fields, separated by a colon ':'.
The first field is the scan location; the second field is the headword, coded in SLP1; the third field contains the range of lines in ccs.txt that correspond to the headword.
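The three-field layout can be read with a few lines of Python. This is only a sketch: the names `HeadwordEntry` and `parse_hw2_line` are made up for illustration, and it assumes headwords never contain a colon and that the line range is a pair of integers separated by a comma.

```python
from typing import NamedTuple

class HeadwordEntry(NamedTuple):
    scan_location: str   # e.g. "001-1"
    headword: str        # SLP1
    first_line: int      # start of the range in ccs.txt
    last_line: int       # end of the range in ccs.txt

def parse_hw2_line(line: str) -> HeadwordEntry:
    """Split one ccshw2.txt line into its three colon-separated fields."""
    scan_location, headword, line_range = line.strip().split(":", 2)
    first, last = line_range.split(",")
    return HeadwordEntry(scan_location, headword, int(first), int(last))

sample = """001-1:a:7,12
001-1:a:13,32
001-1:aMSa:33,43
001-1:aMSakalpanA:44,49
001-1:aMSaBAj:50,52
001-1:aMSaBUta:53,57"""

entries = [parse_hw2_line(line) for line in sample.splitlines()]
headwords = [entry.headword for entry in entries]   # the SLP1 headword list
```

Reading the whole file the same way would give the SLP1 headword list asked about above.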
I tried to describe this in the developer documentation.
In ccs.xml, text which the digitization ccs.txt codes as {#X#}, with X in HK, is transformed to <s>Y</s> with Y in SLP1. Also, key1 is in SLP1. key2 is also in SLP1 for CCS (but I think this is not always the case for key2 in other dictionaries).
So, for instance,
<H1><h><key1>aMSuka</key1><key2>aMSuka</key2></h><body><s>aMSuka</s>¦ <i>n.</i> Gewand. <s>aMSukAnta</s> <i>m.</i> -zipfel.</body><tail><L>8</L><pc>001-1</pc></tail></H1>
<key1>aMSuka</key1> SLP1
<key2>aMSuka</key2> SLP1
<s>aMSuka</s> SLP1
<s>aMSukAnta</s> SLP1
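To pull these SLP1 strings out programmatically, something like the following should work. It is a sketch that assumes each H1 record is well-formed XML on its own, as the example above is; it uses only the standard library.

```python
import xml.etree.ElementTree as ET

# The example record from above, as a single string.
record = (
    "<H1><h><key1>aMSuka</key1><key2>aMSuka</key2></h>"
    "<body><s>aMSuka</s>¦ <i>n.</i> Gewand. "
    "<s>aMSukAnta</s> <i>m.</i> -zipfel.</body>"
    "<tail><L>8</L><pc>001-1</pc></tail></H1>"
)

root = ET.fromstring(record)
key1 = root.findtext("h/key1")                 # SLP1 headword
key2 = root.findtext("h/key2")                 # SLP1 for CCS
sanskrit = [s.text for s in root.iter("s")]    # every <s>...</s> in the record
page = root.findtext("tail/pc")                # scan location
```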
In CCS scans, I think Sanskrit words always appear in Devanagari. So, for CCS, the above scheme applies to all Sanskrit words.
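For anyone who wants to reproduce the HK to SLP1 step on {#X#} text directly, here is a minimal sketch using the standard Harvard-Kyoto and SLP1 letter values with greedy longest-match. It is not the conversion code actually used to build ccs.xml, and the HK variant used in ccs.txt may differ in details (vocalic l and accents in particular), so treat the table as an assumption to verify.

```python
# HK sequences whose SLP1 form differs; all other letters map to themselves.
HK_TO_SLP1 = {
    "lRR": "X", "lR": "x", "RR": "F", "R": "f",
    "ai": "E", "au": "O",
    "kh": "K", "gh": "G", "G": "N",
    "ch": "C", "jh": "J", "J": "Y",
    "Th": "W", "T": "w", "Dh": "Q", "D": "q", "N": "R",
    "th": "T", "dh": "D", "ph": "P", "bh": "B",
    "z": "S", "S": "z",
}

def hk_to_slp1(text: str) -> str:
    """Convert Harvard-Kyoto to SLP1, trying the longest HK sequence first."""
    out = []
    i = 0
    while i < len(text):
        for length in (3, 2, 1):
            chunk = text[i:i + length]
            if chunk in HK_TO_SLP1:
                out.append(HK_TO_SLP1[chunk])
                i += len(chunk)
                break
        else:
            out.append(text[i])   # letter is identical in both schemes
            i += 1
    return "".join(out)

converted = [hk_to_slp1(w) for w in ("aMzuka", "aMzabhAj", "skandhas")]
```

Greedy matching reads "ai" and "au" as diphthongs and "lR" as vocalic l, so the rare words where those letters are separate sounds would need manual review.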
However, in some dictionaries, Sanskrit words appear in IAST (of which there are several variations). For IAST words in such dictionaries, the words do NOT appear in SLP1 in the xml file, but appear in the AS encoding of IAST that Thomas invented.
The exception to this last rule is MW; in the original MONIER.ALL, most of the IAST was converted to HK, and appears in the current mw.xml as <s>Y</s> with Y in SLP1. Interestingly, in MW72 the IAST is retained in the mw72.txt digitization, I think.
4) Yes, as mentioned, ccshw2.txt provides SLP1 headwords.
4) 'adding markup will take too long' I agree. That's why correcting first seems sensible.
3) German word extraction: If you'll provide a German word list, I'll extract the German words and work with you (since you know German) to get the German word spelling errors corrected.
@funderburkjim as this thread brought some suggestions regarding CCS which were left for the future, and we are on CCS now - has the time ripened to handle this thread as well?
@drdhaval2785 going too deep into CCS would do no good. It's a middle-sized dictionary and dirtier than any of the others. It's a swamp, and half a year would not be enough to get out of it if all issues were addressed. Even compiling a list of German words is a huge task, because of 1) morphological variations 2) old orthography 3) print errors. Let us leave it at the headword level, because even that would take 3-5 more years with Sampada alone.
@drdhaval2785 As @gasyoun emphasized, he turns up his nose at CCS. I myself do not know enough to have a similarly low opinion of CCS, and I think CCS should be corrected sometime. So I think we should leave this issue open, lest it be forgotten.
Marcis submitted a correction re skanDas (SLP1) in CCS. The display of the last line is §§§§§§MISSING. An examination of the scan versus the digitization (or of the list display) shows that there actually are many headwords missing in the digitization between skanDas and stambin.
From the scan, I typed the headwords in the form needed for the digitization:
If Marcis or someone else will complete the above by filling in the missing definitions, I'll enter them into the digitization. Here are the rules of the CCS digitization: