sanskrit-lexicon / ApteES

Apte English-Sanskrit Dictionary

Normalizing pada-gana 'spelling' in digitization #1

Open funderburkjim opened 9 years ago

funderburkjim commented 9 years ago

This issue was raised by Shalu in an Email. Here is a synopsis of Shalu's comments thus far.

Apte ES shows two standards of entering verb-root detail (denoting the verb class, 1, 2, etc., and its pada, technically called upagraha: A, P, U), such as 8 U or 1 P. Some places have 1P, 8U, etc. without a space, but most have a space. The editors strictly followed the printed dictionary, rather than imposing one standard. Either 'space everywhere' or 'space nowhere' should be chosen strictly. Shall I note it on the Wiki proofreading page? For example:

under accommodate:

<br id="00893N"/> <b>2</b> {#vasatisthAnena upakR#} 8U, {#upakaraNadravyANi
  Note: scan no space = digitization no space

<br id="00894N"/> upaklRp#} c <b>3</b> {#saMdhA#} 3U, {#prazam#} c., (dispute
  Note: scan space != digitization no space

<br id="00892N"/> -praviz#} 6 P ({#svAmino bhAvaM bhRtyo'nupravizet.#})
   Note: scan space = digitization space

under accompany:
<br id="00900N"/> {#vraj#} 1 P, {#saM-i, anusR#} 1 P, {#sahacaro bhU#} 1 P.
  Note: scan space = digitization space

Here are some reasons for preferring consistency in the digitization:

It should be consistent because a digitization is meant for searching. For example, if we want to gather all the dhAtu details Apte gives, we would need two different search parameters, one with a space and one without. And there are 10 ganas and 3 padas (P/A/U), so the variants multiply (see the sketch after these reasons).

The printed book might have the inconsistency because of space constraints: without dropping the space, a word would have had to shift to the next line. The digitized version has no such limitation. So I think we can rectify even other inconsistencies of the printed book, once we complete the typos and other issues.
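
To make Shalu's search point concrete, here is a minimal sketch (hypothetical Python, not part of any existing script): as long as the digitization is inconsistent, a single search pattern has to tolerate both spellings.

```python
import re

# Match a verb class (1-10), an optional space, and a pada letter (P, A, U),
# so that '3U' and '3 U', '8U' and '8 U', etc. all hit.
CLASS_PADA = re.compile(r'\b(10|[1-9]) ?([PAU])\b')

line = '{#vraj#} 1 P, {#saM-i, anusR#} 1 P, {#sahacaro bhU#} 1 P.'
for m in CLASS_PADA.finditer(line):
    print(m.group(1), m.group(2))   # class and pada, however spelled
```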

funderburkjim commented 9 years ago

General digitization principles

  1. Thomas Malten has over the last 20 years or so developed a team in South India (http://www.sanskrit-lexicon.uni-koeln.de/images/Aurorachana_Staff_2006(1).jpg) which does the typing of the digitizations. The overall ideal of a digitization is to accurately reflect in a typed form the scanned images of the text being digitized. The staff does not attempt to interpret what the author meant, or should have meant.
  2. In 'correcting' a digitization, I think we should follow the same principle of adhering to the text as represented in the scanned images.
  3. In the 'earlier' digitizations, this principle was less clear. For instance, in many of the earlier digitizations the individual lines of the scanned image were not represented directly. Also, Thomas did some undocumented post-processing of some earlier digitizations.
  4. In the case of Monier-Williams, the form of the digitization with which we began (in 2006) was one which had been 'post-processed' some by Thomas. Peter Scharf and I spent considerable time with Thomas trying to understand the 'meta' data that Thomas had added. Then, we (and Malcolm Hyman) worked towards developing an xml markup of the text. This xml markup initially reflected the ad-hoc markup that Thomas had made, but later was extended to include markup of several other features. The end result is the current xml form of MW. It is heavily marked up and generally reflects the text, but is imperfect in its representation of all the details of the text. However, the markup is quite useful in allowing links of the dictionary to allied materials such as abbreviations, literary sources, Whitney's roots, and Westergaard's roots; and generation of inflected forms based on the grammatical specifications in MW.
  5. This heavy markup of MW has been applied by us to none of the other digitizations. Indeed, the working principle of the current forms of the displays and materials derived from the non-MW digitizations has been to provide only very minimal markup. Enough markup is provided to allow searching the various dictionaries by headword and linking individual headword information to the corresponding pages of the scanned images.
    Also, there is some markup supplied by the typists. Notably, {#...#} indicates Devanagari text coded in the HK transliteration, {@...@} indicates bold text, and {%...%} indicates italicized text. This markup is converted to corresponding xml elements in the xml form of the digitization, upon which the displays are based (a sketch of this kind of conversion appears after this list). Also, in the earlier editions where Thomas added markup, this is converted to an xml form as well. This is where the literary source markup, as in PWG, comes from.
  6. In all the non-MW dictionaries, the 'primary form' is viewed as the digitization from Thomas. The related xml form is a secondary form, used by the displays. Corrections of the non-MW dictionaries apply to the digitizations from Thomas. In the case of MW, the primary form is the xml form (essentially, monier.xml or the recent mw.xml).
  7. In the process of creating an xml form from the digitizations, certain details of the digitization are recognized as errors and have been corrected. For instance, it might be that in the material for a given headword in a digitization, one finds {#... but without a closing #}. This is viewed as an error, and is corrected.
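
As a rough illustration of points 5 and 7, here is a hypothetical sketch (the xml element names are illustrative, not the actual schema) of converting the typists' markup and flagging an unclosed {#...#}:

```python
import re

# Illustrative mapping; the real xml element names may differ.
PAIRS = {'#': 's', '@': 'b', '%': 'i'}  # {#..#} HK Devanagari, {@..@} bold, {%..%} italic

def to_xml(line):
    for mark, tag in PAIRS.items():
        m = re.escape(mark)
        line = re.sub(rf'\{{{m}(.*?){m}\}}', rf'<{tag}>\1</{tag}>', line)
    return line

def unbalanced(line):
    # an opening '{#' without a closing '#}' (or vice versa) is an error
    return [m for m in PAIRS if line.count('{' + m) != line.count(m + '}')]

line = '{#vasatisthAnena upakR#} 8U, {#upakaraNadravyANi'
print(to_xml(line))      # first pair converted; the trailing {# is left alone
print(unbalanced(line))  # -> ['#'], flagging the missing closing #}
```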
funderburkjim commented 9 years ago

Specific comments regarding the class-pada question in ApteES

1) In accord with the general principle of digitization, the ideal situation would be to have the digitization exactly reflect the scanned image. Applying this to the four illustrative examples would mean that one change is needed; the others need no change as the digitization and scans agree.

<br id="00894N"/> upaklRp#} c <b>3</b> {#saMdhA#} 3U, {#prazam#} c., (dispute
  Note: scan space != digitization no space

 would be changed to

<br id="00894N"/> upaklRp#} c <b>3</b> {#saMdhA#} 3 U, {#prazam#} c., (dispute
  Note: scan space = digitization space (after correction)

2) However, practically speaking, I don't think this kind of correction should have a high priority. It is more important to correct the spellings of the Devanagari words, as Shalu has been doing.

3) In terms of searching, it might be useful to be able to search for all '3 U' (or '3U') instances. But this might be better facilitated by adding markup to the xml file (ae.xml) by some program, rather than searching for the text strings '3 U' or '3U'.

For instance, the xml form could have <rp val="3U">3 U</rp> or <rp val="3U">3U</rp>. (rp = root-pada). Thus, the '3U' is normalized in the markup, while the text (3 U, or 3U) is unchanged.
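A minimal sketch of that normalization step, using the rp element proposed above (hypothetical Python, not an existing script):

```python
import re

# Wrap each class-pada occurrence in an <rp> element whose val attribute
# holds the normalized, space-free form; the printed text is left as-is.
CLASS_PADA = re.compile(r'\b(10|[1-9]) ?([PAU])\b')

def add_rp_markup(text):
    return CLASS_PADA.sub(
        lambda m: '<rp val="{}{}">{}</rp>'.format(m.group(1), m.group(2), m.group(0)),
        text)

print(add_rp_markup('{#saMdhA#} 3U, {#prazam#} c.'))
# -> {#saMdhA#} <rp val="3U">3U</rp>, {#prazam#} c.
print(add_rp_markup("-praviz#} 6 P ({#svAmino bhAvaM bhRtyo'nupravizet.#})"))
# -> -praviz#} <rp val="6P">6 P</rp> ({#svAmino bhAvaM bhRtyo'nupravizet.#})
```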

4) There may be a general principle here: To mine all the information in a particular dictionary, and make it accessible for various purposes, it is best to add markup to the xml form of the dictionary.

5) Adding this markup to the xml form is a separate, non-trivial task. It is non-trivial because it must take into account variations ('3 U' or '3U') in the text. This task will require efforts over many years by many people who are both Sanskritists and programmers. The objective of the 'digitization correction' phase we are in now is to provide digitizations which are as 'clean' as possible, upon which these future contributors can build.

gasyoun commented 9 years ago

'3 U' or '3U' should not make a big difference. I wonder if there are more unusual cases. When you say "best to add markup to the xml form of the dictionary", would you be ready to implement such dhatu-related markup in the official release? Or would it remain for home use only?

funderburkjim commented 9 years ago

We need to focus now on corrections.

Adding markup would be more appropriate once a dictionary is relatively 'clean.'

Many of the ideas mentioned here will need to be deferred.

I am concerned that many good ideas, which cannot be currently addressed adequately, will be lost from our collective memory.

Should there be a separate repository, where unresolved ideas can be mentioned as issues (with reference to issues in other repositories, where they are originally discussed) ?

gasyoun commented 9 years ago

"Adding markup would be more appropriate once a dictionary is relatively 'clean.'" - Jim, is it possible at all? With the things mentioned and the things upcoming, I do not think that even after five years there will be a file with clean headwords, not to speak of the rest. It's a never-ending story, is it not? So I guess no separate repository is needed; just having a thread in the main Cologne repository should be enough. Occam's razor.

funderburkjim commented 9 years ago

I was thinking that we are making good progress with corrections now, and that we can still see specific tasks (such as the errors caught by Dhaval's pattern approach and your 'fuzzy' approach, and developing a spell-checker for German using the dictionary of Old German that you discovered) that are likely to be productive in detecting and correcting more errors. I have thought about trying an n-gram approach as another way to flag likely Sanskrit spelling errors. Something along these 'spell-checking program' lines ought to be able to help us flag many errors in Apte ES Sanskrit words as well. For example, we might analyze the corrections Shalu discovered and develop some 'typical' error patterns.
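
As a sketch of what that n-gram flagging could look like (hypothetical Python; the training list here is a toy stand-in for a corpus of already-verified HK-transliterated words):

```python
from collections import Counter

def trigrams(word):
    w = '^' + word + '$'                      # mark word boundaries
    return [w[i:i+3] for i in range(len(w) - 2)]

def train(words):
    counts = Counter()
    for w in words:
        counts.update(trigrams(w))
    return counts

def suspicious(word, counts, threshold=1):
    # trigrams of `word` attested fewer than `threshold` times in the corpus
    return [t for t in trigrams(word) if counts[t] < threshold]

# Toy stand-in for a corpus of verified words from the digitization.
model = train(['saMdhA', 'prazam', 'upakR', 'anusR', 'vraj', 'bhU'])
print(suspicious('prazam', model))   # -> []  (every trigram attested)
print(suspicious('prasam', model))   # -> ['ras', 'asa', 'sam']  (s/z slip flagged)
```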

So, until we get over the hump of error detection and correction, it seems fruitful to stay focused on that task. The more errors we can feasibly detect and correct, the easier any subsequent endeavor to add helpful markup will be.

gasyoun commented 9 years ago

Even though some fine steps are behind us, I'm sure that until we find five more decent Sanskrit scholars, the quest for the mass elimination of errors will not be over in the next five years. I would love to know your thoughts on the 'typical' error patterns; some statistics would not hurt. So I mean any time is good enough to add markup, because never in our lifetime will there be a day when I am ready to say: done. It's a never-ending story. The more I look at them, the more I see. The more errors I find, the more I value the work that other people have done before me.