sanskrit-lexicon / COLOGNE

Development of http://www.sanskrit-lexicon.uni-koeln.de/

Headword spelling in Sabda-kalpadruma #36

Closed funderburkjim closed 7 years ago

funderburkjim commented 10 years ago

The corrections submitted for SKD are now processed, and further corrections should not take as long to process. Two details arose in considering these corrections, and this 'sanskrit-lexicon/Cologne/Issues' list seems a reasonable place to mention them.

  1. A question was raised whether an ending visarga should be part of the headword, for example in ghaTotkaca. The text clearly has a visarga, and in the 'definition' part of the displays this visarga is still present. However, the spelling used for looking up the word does NOT have the visarga. This was done intentionally, so that nominal headwords would be spelled using the 'stem' rather than the first-singular (nominative singular) form. This is the headword convention in MW. Similarly, neuter nominals appear in the text with the ending 'm' or 'M', but this letter is not used in the lookup key.
  2. When examining the patnI correction with the List display, I noticed that the second word after patnI had the key spelled as 'pattra'. The text shows 'pa(tra)ttraM'. The 'pattra' spelling bothered me, because it is not in alphabetical order. So, I changed the spelling of the key to 'patra', which does place this word in alphabetical order.

gasyoun commented 10 years ago

How many corrections were approved, and how quickly is the proofreading moving on? On point 1, the topic is (partly) discussed at https://groups.google.com/forum/#!topic/bvparishat/IaCEYDmLmbI On point 2, the topic is discussed at https://groups.google.com/forum/#!topic/bvparishat/Eyz0lSNDk-s What I would go for, and what I did when working on an index of all Sanskrit words from all Cologne dictionaries, is to remove all the final visargas and anusvaras but "remember" where I removed them, for the sake of indexing and searching. The same goes for the double consonants before "r".
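A minimal sketch of the "strip but remember" idea described above, assuming headwords are in SLP1 transliteration (visarga 'H', anusvara 'M'). The names `IndexKey` and `make_index_key` are hypothetical and are not taken from any Cologne program.

```python
import re
from typing import NamedTuple


class IndexKey(NamedTuple):
    key: str           # simplified spelling used for sorting and lookup
    ending: str        # final 'H' or 'M' that was stripped, or '' if none
    degeminated: bool  # True if a doubled consonant before 'r' was reduced


# A doubled SLP1 consonant immediately before 'r', e.g. 'tt' in 'pattra'.
_DOUBLED_BEFORE_R = re.compile(r"([kKgGNcCjJYwWqQRtTdDnpPbBmyrlvSzsh])\1(?=r)")


def make_index_key(headword: str) -> IndexKey:
    """Return a lookup key plus a record of what was removed."""
    key, ending = headword, ""
    if key.endswith(("H", "M")):
        key, ending = key[:-1], key[-1]
    key, count = _DOUBLED_BEFORE_R.subn(r"\1", key)
    return IndexKey(key, ending, count > 0)
```

Because the stripped ending and the degemination flag are kept alongside the key, the spelling printed in the text stays recoverable for display while the simplified key drives sorting and search.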

Shalu411 commented 10 years ago

Namaste. Issue 1: What I found in Apte was this. Where a word exists in all 3 genders, it is left as a prAtipadika (substantive, i.e. base word). E.g. दृष्ट (dRSTa): nothing is mentioned beside the word except an "a" = adjective. We understand it as existing in all 3 genders. Whereas दृष्टिः is given with visarga and marked "f" because it is exclusively feminine. (Images attached of what he says about this issue in the preface, under Directions.) Maybe that will help here as well. What I find in Vachaspatyam is that it always gives the base word, unlike SKD, which always has the first singular form. So there are different standards. For me Apte seems best, but that presupposes some Sanskrit knowledge. When looking into a real, touchable book, it is no issue at all, because you never bother whether the word has a visarga or not as long as you can find it in its alphabetical order. It is only in digital editions that it becomes an issue. I think maybe giving an option between both is better? Searchable both ways, as base word and with the first form too? Is that possible?

Shalu411 commented 10 years ago

Namaste. An interesting observation on this line at http://www.sanskrit-lexicon.uni-koeln.de/scans/SKDScan/2013/web/webtc1/index.php : स्वय [L=41138] [p= 5-474] - स्वयं, [म्] व्य, आत्मना । There is no such word as "svaya"; only "svayaM" exists in the language. Removing the "M" and "ः" at the end of words can bring in disasters like introducing non-existent, ungrammatical words (ghost words) into the vocabulary of a language. So removing "M" and "ः" to give bases is an agreeable trend only for nominal bases; it cannot be extended to indeclinables. So please do not standardize this method. Apte seems much more meaningful in this regard. Thank you.

gasyoun commented 10 years ago

This is an interesting issue. Indeclinables are not so many; can we have a list of those we should not touch? It's a good point, I agree: non-words (apadam) are something we would not want to have. Jim, can we make a RegEx protection for words which contain avyaya markup? I do not see a simple solution. http://research.ijcaonline.org/volume38/number6/pxc3876825.pdf

Shalu411 commented 10 years ago

Namaste. The list one can surely have. There is an avyaya kosha. But a simpler way is to get them out of the already programmed dictionaries themselves, on the basis of some code words. Beside every avyaya word, SKD gives "व्य", e.g. [L=41138] [p= 5-474] स्वयं, [म्] व्य, आत्मना ।...... VCP gives "अव्य", e.g. [L=47570] [p= 5381] स्वयम्¦ अव्य० सु + अय--अमु । आत्मनेत्यर्थे अमरः । Apte gives "ind.", e.g. स्वयम् ind. 1 Oneself, in one's own person... So if there is a way to pull out these words, fine. Otherwise we can have a list prepared from the avyaya kosha. It's an exhaustive, dependable list.
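A rough sketch of pulling out avyaya entries by these code words. The patterns are assumptions based only on the three sample entries quoted above, and the underlying digitization files may encode the markers differently. The dictionary codes used as keys and the function name `is_indeclinable` are hypothetical.

```python
import re

# One marker pattern per dictionary, inferred from the quoted samples only.
# (?<!\S) requires the marker to start a token, so the SKD marker "व्य"
# is not matched inside a longer word such as "अव्य".
AVYAYA_MARKERS = {
    "SKD": re.compile(r"(?<!\S)व्य,"),
    "VCP": re.compile(r"(?<!\S)अव्य०"),
    "AP90": re.compile(r"(?<!\S)ind\."),
}


def is_indeclinable(dictionary: str, entry_text: str) -> bool:
    """Return True if the entry text carries that dictionary's avyaya marker."""
    return AVYAYA_MARKERS[dictionary].search(entry_text) is not None
```

Headwords flagged this way could form a "do not touch" list that is consulted before any final 'M' or 'H' is stripped from a key.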

funderburkjim commented 10 years ago
  1. Regarding a list of indeclinables: the surest list can be obtained from Monier-Williams, since there is markup under the 'lex' tag. Even better, there is a normalization of the 'lex' tag markup. A relatively simple program can filter out the headwords marked as 'ind'.
  2. I recently finished the installation of the digitization of 1890 Apte Sanskrit-English dictionary. In the specification of 'headwords', I made no 'simplifications'. So, for instance 'agniH' and 'svayam' are the spellings used for headwords.

By contrast, in SKD, in one step of the headword 'key' generation, the following simplification was done:

Remove ending 'm','M' and 'H' (for consistency with MW conventions)

In hindsight, this simplification may have been inappropriate.

Question: Should I retrofit SKD, avoiding this simplification?

Then 'agniH' and 'svayam' would be headword keys, as in AP90. (Note, the spelling in SKD is actually 'svayaM'. I think the conversion of final 'M' to 'm' remains appropriate.)
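The two key-generation policies under discussion can be compared side by side in a small sketch. The function name `skd_key` and the `strip_endings` switch are hypothetical; only the two rules themselves (remove final 'm', 'M', 'H', or keep the ending and convert final 'M' to 'm') come from the comment above.

```python
def skd_key(headword: str, strip_endings: bool = True) -> str:
    """Build an SKD lookup key from an SLP1 headword.

    strip_endings=True  : the simplification described above,
                          dropping a final 'm', 'M' or 'H'.
    strip_endings=False : the proposed retrofit, keeping the ending
                          and only converting a final 'M' to 'm'.
    """
    if strip_endings:
        if headword.endswith(("m", "M", "H")):
            return headword[:-1]
        return headword
    if headword.endswith("M"):
        return headword[:-1] + "m"
    return headword


for word in ("agniH", "svayaM"):
    print(word, skd_key(word), skd_key(word, strip_endings=False))
```

Under the first policy 'svayaM' becomes the ghost word 'svaya'; under the retrofit it becomes 'svayam', matching the AP90 spelling.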

gasyoun commented 10 years ago

1) Words with "ind" markup would be a nice starting point. Could you please show a .txt file of them? 2) Retrofit might add even more issues, Shalu? I would love to see all kinds of lists of headwords for further decisions.
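The filter over the MW 'lex' markup mentioned earlier in the thread might look roughly like this. The record layout assumed here (one entry per line, the headword inside `<key1>`, the grammatical category inside `<lex>`, indeclinables written as 'ind.') is an assumption for illustration and should be checked against the actual MW digitization. The function name `indeclinable_keys` is hypothetical.

```python
import re
from typing import Iterable, List

_KEY1 = re.compile(r"<key1>(.*?)</key1>")
_LEX = re.compile(r"<lex\b[^>]*>(.*?)</lex>")


def indeclinable_keys(lines: Iterable[str]) -> List[str]:
    """Collect headword keys whose entry has a <lex> value of 'ind.'."""
    seen = set()
    result = []
    for line in lines:
        key_match = _KEY1.search(line)
        if key_match is None:
            continue  # not a record line in the assumed layout
        lex_values = [v.strip().rstrip(".") for v in _LEX.findall(line)]
        key = key_match.group(1)
        if "ind" in lex_values and key not in seen:
            seen.add(key)
            result.append(key)
    return result
```

Joining the result with newlines and writing it to a file would give the kind of plain .txt list requested here.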

drdhaval2785 commented 7 years ago

hwnorm1 is where this should be handled. From my experience it is too difficult to do this change generically. Better to normalize in a shadow file like hwnorm1c.