funderburkjim opened 9 years ago
General digitization principle
Specific comments regarding the class-pada question in ApteES
1) In accord with the general principle of digitization, the ideal situation would be to have the digitization exactly reflect the scanned image. Applying this to the four illustrative examples would mean that one change is needed; the others need no change as the digitization and scans agree.
<br id="00894N"/> upaklRp#} c <b>3</b> {#saMdhA#} 3U, {#prazam#} c., (dispute
Note: the scan has a space ('3 U') but the digitization does not ('3U'); they disagree.
would be changed to
<br id="00894N"/> upaklRp#} c <b>3</b> {#saMdhA#} 3 U, {#prazam#} c., (dispute
Note: after the correction, the digitization's space agrees with the scan.
2) However, practically speaking, I don't think this kind of correction should have a high priority. It is more important to correct the spellings of the Devanagari words, as Shalu has been doing.
3) In terms of searching, it might be useful to be able to search for all '3 U' (or '3U') instances. But this might better be facilitated by adding markup to the xml file (ae.xml) by some program, rather than searching for the text strings '3 U' or '3U'.
For instance, the xml form could have <rp val="3U">3 U</rp> or <rp val="3U">3U</rp>. (rp = root-pada). Thus, the '3U' is normalized in the markup, while the text (3 U, or 3U) is unchanged.
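A minimal sketch of how such a markup-adding program might work, assuming a simple regex suffices; the pattern and the `<rp>` element here are illustrative, not the actual ae.xml schema:

```python
import re

# Hedged sketch: wrap class-pada strings like '3 U' / '3U' in <rp> markup.
# Classes run 1-10; padas are P, A, U. The <rp> element is an assumed,
# hypothetical addition, not part of the real ae.xml format.
RP_PATTERN = re.compile(r'\b(10|[1-9])\s?([PAU])\b')

def add_rp_markup(line):
    """Normalize each class-pada occurrence into a val attribute,
    while leaving the printed text (with or without space) unchanged."""
    def repl(m):
        normalized = m.group(1) + m.group(2)  # e.g. '3 U' -> '3U'
        return f'<rp val="{normalized}">{m.group(0)}</rp>'
    return RP_PATTERN.sub(repl, line)

print(add_rp_markup('{#saMdhA#} 3 U, {#prazam#} c.'))
# → {#saMdhA#} <rp val="3U">3 U</rp>, {#prazam#} c.
```

A search for the normalized form then needs only one query on the `val` attribute, whichever convention the printed text used.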
4) There may be a general principle here: To mine all the information in a particular dictionary, and make it accessible for various purposes, it is best to add markup to the xml form of the dictionary.
5) Adding this markup to the xml form is a separate, non-trivial task. It is non-trivial because it must take into account variations (3 U, or 3U) in the text. This task will require the efforts, over many years, of many people who are both Sanskritists and programmers. The objective of the 'digitization correction' phase we are in now is to provide digitizations which are as 'clean' as possible, which these future contributors can build upon.
'3 U' vs '3U' should not make a big difference. I wonder if there are more unusual cases. When you say "best to add markup to the xml form of the dictionary", would you be ready to implement such dhatu-related markup in the official release? Or would it remain for home use only?
We need to focus now on corrections.
Adding markup would be more appropriate once a dictionary is relatively 'clean.'
Many of the ideas mentioned here need to be deferred.
I am concerned that many good ideas, which cannot be currently addressed adequately, will be lost from our collective memory.
Should there be a separate repository, where unresolved ideas can be mentioned as issues (with reference to issues in other repositories, where they are originally discussed) ?
> Adding markup would be more appropriate once a dictionary is relatively 'clean.'
- Jim, is it possible at all? Given what has been mentioned and what is still to come, I do not think that even after five years there will be a file with clean headwords, not to speak of the rest. It's a never-ending story. Is it not?
I guess no. Just having a thread in the main Cologne repository should be enough. Occam's razor.
I was thinking that we are making good progress with corrections now, and that we can still see specific tasks (such as the errors caught by Dhaval's pattern approach and your 'fuzzy' approach, and developing a spell-checker for German using the dictionary of Old German that you discovered) that are likely to be productive in detecting and correcting more errors. I have thought about trying an n-gram approach as another way to flag likely Sanskrit spelling errors. Something along these 'spell-checking program' lines ought to be able to help us flag many errors in Apte ES Sanskrit words also. For example, we might analyze the corrections Shalu discovered, and develop some 'typical' error patterns.
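The n-gram idea above could be sketched roughly as follows: learn character-trigram frequencies from the (mostly correct) transliterated headwords, then flag words whose trigrams are improbable. The training list and scoring scheme here are illustrative assumptions, not a tuned implementation:

```python
from collections import Counter
import math

def trigrams(word):
    # Pad with boundary markers so word-initial/final patterns count too.
    padded = '^' + word + '$'
    return [padded[i:i + 3] for i in range(len(padded) - 2)]

def train(words):
    counts = Counter(t for w in words for t in trigrams(w))
    return counts, sum(counts.values())

def score(word, counts, total):
    """Average log-probability per trigram; very negative = suspicious."""
    grams = trigrams(word)
    logp = sum(math.log((counts[t] + 1) / (total + 1)) for t in grams)  # add-one smoothing
    return logp / len(grams)

# Tiny illustrative corpus; a real run would train on all headwords.
corpus = ['saMdhA', 'saMgam', 'saMdeha', 'upaklRp', 'prazam']
counts, total = train(corpus)
# A word sharing trigrams with the corpus scores higher than gibberish:
print(score('saMdhA', counts, total) > score('zzqxv', counts, total))
# → True
```

Headwords falling below some empirically chosen threshold would then be queued for human review, rather than auto-corrected.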
So, until we get over the hump of error detection and correction, it seems fruitful to stay focused on that task. The more errors we can feasibly detect now, the easier any subsequent endeavor to add helpful markup will be.
Even though some fine steps are behind us, I'm sure that unless we find five more decent Sanskrit scholars, the quest for the mass elimination of errors will not be over in the next five years. I would love to know your thoughts on the 'typical' error patterns; some statistics would not hurt. So I mean that any time is good enough to add markup, because never in our lifetime will there be a day when I am ready to say: done. It's a never-ending story. The more I look at them, the more I see. The more errors I find, the more I value the work that other people have done before me.
This issue was raised by Shalu in an Email. Here is a synopsis of Shalu's comments thus far.
Apte ES shows two standards of entering verb-root detail (the verb class, 1, 2, etc., and its "pada", technically called upagraha: A, P, U), like 8 U, 1 P, etc. Some places have 1P, 8U without a space, but mostly with a space. The editors strictly followed the printed dictionary, without settling on one standard. Either a space everywhere or a space nowhere should be chosen strictly. Shall I note it on the Wiki proof-reading page? For example:
Here are some reasons for preferring consistency in the digitization:
It should be consistent because digitization is meant for searching. For example, if we want to extract all the dhAtu details Apte has, we would need two different search parameters: one with a space, one without. And there are 10 ganas and 3 pada-s (P/A/U).
The printed book might have the inconsistency because of space constraints, lest a word need to be shifted to the next line. The digitized version has no such limitation, so I think we can rectify even other inconsistencies in the printed book, once we complete the typos and other issues.