Closed drdhaval2785 closed 7 years ago
The letter-number system (AS) is used only as a way to represent printed text that is composed of the Latin alphabet with diacritics. It is language-agnostic: it can represent Sanskrit words, French words, German words, etc.
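As a rough illustration of what a letter-number scheme does, here is a minimal Python sketch that turns a letter followed by digits into the letter plus combining diacritics. The digit-to-diacritic table below is an assumption made up for this example, not the actual AS table used by the dictionaries; the function name `as_to_unicode` is likewise hypothetical.

```python
import re
import unicodedata

# ASSUMPTION: illustrative digit-to-diacritic table only.
# The real AS coding table should be taken from the project's own documentation.
ASSUMED_DIGIT_TO_MARK = {
    "1": "\u0304",  # combining macron
    "2": "\u0323",  # combining dot below
    "3": "\u0307",  # combining dot above
    "4": "\u0301",  # combining acute accent
    "5": "\u0303",  # combining tilde
}

# A Latin letter followed by one or more of the assumed diacritic digits.
_LETTER_DIGITS = re.compile(r"([A-Za-z])([1-5]+)")


def as_to_unicode(text: str) -> str:
    """Replace letter+digit codes with the letter plus combining marks (NFC)."""

    def repl(match: re.Match) -> str:
        marks = "".join(ASSUMED_DIGIT_TO_MARK[d] for d in match.group(2))
        return match.group(1) + marks

    # NFC folds base letter + combining mark into one precomposed character
    # wherever Unicode defines one.
    return unicodedata.normalize("NFC", _LETTER_DIGITS.sub(repl, text))
```

Under the assumed table, `as_to_unicode("kr2s2n2a")` gives `kṛṣṇa`, and the same function handles a French word such as `e4cole`, which is the sense in which the scheme is language-agnostic. Note that this naive pattern would also rewrite genuine digits that directly follow a letter, so real text needs more care.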
When representing Sanskrit words, texts differ in how they use Latin letters with diacritics.
It is reasonable to have the primary form of the dictionary use modern IAST conventions to represent Sanskrit words written in the Latin alphabet with diacritics. That's the principle I'm using with these dictionaries: IAST is better than AS, and modern IAST is better than historical IAST.
In the Monier Williams dictionary, I took an approach closer to what you are suggesting, I think; namely, I coded all the Sanskrit words appearing in Latin-alphabet in two forms: one retaining the original Latin alphabet representation and one converting that to SLP1.
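For readers unfamiliar with the IAST to SLP1 step, a minimal Python sketch follows. It uses the standard IAST/SLP1 letter correspondences with a greedy longest-match scan, so digraphs such as `dh` or `ai` are consumed before single letters. The function name `iast_to_slp1` is hypothetical, and lowercasing the input first is an assumption of this sketch (SLP1 is case-sensitive, so capitalized IAST has to be folded somehow); it is not the conversion code used for MW.

```python
import unicodedata

# Standard IAST -> SLP1 correspondences (lowercase IAST only).
_RAW_MAP = {
    "a": "a", "ā": "A", "i": "i", "ī": "I", "u": "u", "ū": "U",
    "ṛ": "f", "ṝ": "F", "ḷ": "x", "ḹ": "X",
    "e": "e", "ai": "E", "o": "o", "au": "O",
    "ṃ": "M", "ḥ": "H",
    "k": "k", "kh": "K", "g": "g", "gh": "G", "ṅ": "N",
    "c": "c", "ch": "C", "j": "j", "jh": "J", "ñ": "Y",
    "ṭ": "w", "ṭh": "W", "ḍ": "q", "ḍh": "Q", "ṇ": "R",
    "t": "t", "th": "T", "d": "d", "dh": "D", "n": "n",
    "p": "p", "ph": "P", "b": "b", "bh": "B", "m": "m",
    "y": "y", "r": "r", "l": "l", "v": "v",
    "ś": "S", "ṣ": "z", "s": "s", "h": "h",
}
# Normalize keys so they match NFC-normalized input regardless of how
# the source file encodes the diacritics.
IAST_TO_SLP1 = {unicodedata.normalize("NFC", k): v for k, v in _RAW_MAP.items()}
# Longest keys first, so "dh" is tried before "d", "ai" before "a".
_KEYS = sorted(IAST_TO_SLP1, key=len, reverse=True)


def iast_to_slp1(text: str) -> str:
    """Greedy longest-match transliteration; unknown characters pass through."""
    text = unicodedata.normalize("NFC", text.lower())  # ASSUMPTION: fold case
    out = []
    i = 0
    while i < len(text):
        for key in _KEYS:
            if text.startswith(key, i):
                out.append(IAST_TO_SLP1[key])
                i += len(key)
                break
        else:
            out.append(text[i])  # spaces, punctuation, non-IAST letters
            i += 1
    return "".join(out)
```

For example, `iast_to_slp1("kṛṣṇa")` returns `kfzRa` and `iast_to_slp1("dharma")` returns `Darma`. The greedy scan cannot tell a true `ai` diphthong from `a` followed by `i` in hiatus, and it assumes the input really is Sanskrit in IAST, which is exactly the identification problem discussed in this thread.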
Thus it would be possible to produce a version of other dictionaries (say MD, ACC) where the Sanskrit words are further (or dually) represented not only in IAST but also in SLP1. However, this is also a big task. The hard parts are:
Dealing with words that have partially come into the base language, such as pluralized forms like 'Yogas', etc.
I don't plan to undertake this IAST->SLP1 conversion of Sanskrit words any time soon. Conversion to IAST is a big enough task for me now.
But if you want to tackle this for a particular dictionary, I'll work with you if you like.
> Conversion to IAST is a big enough task for me now.
Too big and no actual need. Time will come.
Here is the status of AS/IAST conversion for the various dictionaries:
acc DONE approx. 05/20/2017 Also meta-line conversion
ae done? Not much to do. ae-meta not updated.
ap90 done, with meta-line conversion. 06/30/2017
ap IAST conversion. Done in April. meta-line conversion 07/11/2017
ben done 04/10/2017
bhs todo
bop todo
bor todo
bur IAST April 2017. IAST corrections, and conversion to meta-line form 07/28/2017
cae todo
ccs todo
gra todo
gst todo
ieg todo
inm todo
krm todo not much IAST
mci todo
md done 03/27/2017 . meta-line conversion 07/07/2017
mw72 DONE- Converted, still some non-standard IAST
mwe todo not much IAST
mw todo
pd todo
pe todo
pgn todo
pui todo
pwg todo also line-length adjustment
pw todo also line-length adjustment
sch DONE 04/27/2017 meta-line Done 06/22/2017
shs DONE 08/06/2017 meta-line Done 08/08/2017
skd no IAST conversion required. meta-line Done 08/24/2017
snp todo
stc todo
vcp DONE no IAST conversion required, all Devanagari; meta-line-conversion done 08/16/2017
vei todo
wil DONE 06/20/2017. Also meta-line conversion
yat DONE 05/31/2017. Also meta-line conversion.
A daunting amount of work to do, but it seems worthwhile to convert all the AS coding:
I'll fill in the table above as progress is made.
> A daunting amount of work to do
Yeah, it's a hunt.
Most wanted:
sch todo
gra todo
mw todo
pd todo
pwg todo also line-length adjustment
pw todo
It is a dubious honor to be the only assignee .
Yeah, let's give @drdhaval2785 or @vvasuki a try :+1:
A separate question: All sanskrit words are clearly identified in the xml-s ?
> Yeah, let's give @drdhaval2785 or @vvasuki a try :+1:
(I must decline, @gasyoun - far too occupied by separate sanskrit projects.)
> A separate question: All sanskrit words are clearly identified in the xml-s ?
No and will not be, if you do not invent a regex and do manual cleanup after.
@funderburkjim At the very least, the words you do convert to IAST or SLP1 should be marked up (identifiable as sanskrit words). Reason: downstream users would want to see those words in the script of their convenience for easy reading/ lookup.
> at the very least.
Easier said than done. It would be good to have all the IAST sanskrit words identified as such, which is your suggestion. I'll put this in my todo list.
Words which appear in Devanagari in a printed text are generally identifiable (with an `<s>` tag, coded as SLP1).
Sanskrit words which appear in the text as IAST are the problem. The problem is distinguishing such words as Sanskrit, rather than words in some other language. For instance, consider the words gam, guru, etc. These have no diacritics, so how do we know they are Sanskrit and not English or French or German, Latin, etc? What about Sanskrit words adopted into English, which may appear in plurals such as 'yogas', 'karmas', etc.?
This IAST word classification has been done only for MW.
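To make the difficulty concrete, here is a small Python sketch of the obvious first heuristic: treat a token as likely Sanskrit if it contains a character that is typical of IAST. The character set, the labels, and the function name `classify_token` are assumptions for illustration. The point of the sketch is the second branch: words such as gam or guru fall through as ambiguous, because nothing in their spelling marks them as Sanskrit.

```python
import unicodedata

# Characters typical of IAST, written as escapes to avoid encoding surprises:
# ā ī ū ṛ ṝ ḷ ḹ ṃ ḥ ṅ ṭ ḍ ṇ ś ṣ
# ASSUMPTION: ñ is left out because it is also common in Spanish; several of
# the others also occur in other languages, so this is only a heuristic.
IAST_MARKERS = set(
    "\u0101\u012b\u016b\u1e5b\u1e5d\u1e37\u1e39"
    "\u1e43\u1e25\u1e45\u1e6d\u1e0d\u1e47\u015b\u1e63"
)


def classify_token(token: str) -> str:
    """Return 'likely-sanskrit' if an IAST-typical character occurs, else 'ambiguous'."""
    normalized = unicodedata.normalize("NFC", token.lower())
    if any(ch in IAST_MARKERS for ch in normalized):
        return "likely-sanskrit"
    return "ambiguous"
```

With this, `classify_token("kṛṣṇa")` is `likely-sanskrit`, while `classify_token("guru")` and `classify_token("gam")` are both `ambiguous`, so a diacritic test alone cannot settle the classification.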
> good to have all the IAST sanskrit words identified as such, which is your suggestion. I'll put this in my todo list.
Oh, right, that's feasible.
> Yeah, let's give @drdhaval2785 or @vvasuki a try 👍
I will try it. Will post here if I do something substantial in this direction.
> For instance, consider the words gam, guru, etc. These have no diacritics, so how do we know they are Sanskrit and not English or French or German, Latin, etc?
And there we have the best sanskrit dictionaries to make the distinction :-) .
Even if they're not covered (now / in the near future), it's still OK to mark off just the ones we do end up converting to IAST / SLP as indic; it will help the final user downstream to that extent.
> What about Sanskrit words adopted into English, which may appear in plurals such as 'yogas', 'karmas', etc.?
I'd say check the prefix minus the terminal s and mark it as an indic word. But that's not so important as these are few.
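The two suggestions here (use the dictionaries' own headwords to decide, and strip a terminal s for English plurals) can be sketched in a few lines of Python. The headword set below is a tiny hypothetical sample and the function name `is_known_sanskrit` is made up; in practice the set would be loaded from a dictionary's headword list.

```python
# ASSUMPTION: a tiny hypothetical sample; a real run would load the full
# headword list of one of the dictionaries.
SAMPLE_HEADWORDS = {"yoga", "karma", "guru", "gam"}


def is_known_sanskrit(token: str, headwords: set) -> bool:
    """True if the token, or the token minus a terminal 's', is a known headword."""
    word = token.lower()
    if word in headwords:
        return True
    # English-style plural: check the form without the final 's'.
    if len(word) > 2 and word.endswith("s") and word[:-1] in headwords:
        return True
    return False
```

So `is_known_sanskrit("Yogas", SAMPLE_HEADWORDS)` and `is_known_sanskrit("karmas", SAMPLE_HEADWORDS)` are both true, while `is_known_sanskrit("house", SAMPLE_HEADWORDS)` is false. A lookup like this will still produce false positives wherever an English, French, or German word happens to coincide with a headword, so it only narrows the list that needs manual review.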
> But that's not so important as these are few.
Hundreds, and just regexing will not help, as more variants occur. And as it's of no priority for Jim now, let's leave it. There should be an Indian interested in weeding out the words. All we have tried for the last 3 years, @vvasuki, was at least to clean the headwords. In 2-3 years we will be there. The amount of work done is about 1/10 of what has to be done inside the dictionaries. But as I adore what Jim does a lot, I would not want him to do what others can, only where he is best. That's my take and I'll stand by it. Headwords first. Additional markup: let India wake up and tell us when she is ready for some Sanskrit NLP. It's just about time. Otherwise best research on Sanskrit for last 200 years is done outside India.
> Otherwise best research on Sanskrit for last 200 years is done outside India.
You certainly know how to push the right buttons! But I'll let that pass considering the source, time and place :-)
> And as it's of no priority for Jim now, let's leave it.
Of course, it's up to Jim to decide (and I'm not insisting) - you've made your opinion clear.
> you've made your opinion clear.
In 2006 there was nothing. Not even PWG was online. In 10 years a lot has changed. But it's sad to see that the role of people from India (other than Dhaval) is so small. What I see is that a single person can do as much as an academic institute. It's a pity to see such conditions.
> But it's sad to see that the role of people from India (other than Dhaval) is so small.
There are proper times, places and forms to express such sadness and examine the causes. This certainly isn't it. What's the relevance of these notes to the task at hand? You are not going to "guilt" or irritate Indians into changing their priorities and jumping in by noting such things here (of all places). In any case, do note that all these dictionaries were manually typed in the first place by Indians paid by Europeans.
> What I see is that a single person can do as much as an academic institute. It's a pity to see such conditions.
Again, this is neither relevant to the current issue nor helpful. Which academic institute will change because of your comment here? Or can we look for some engineering insight hidden in this unlikely source? I am all for praising Dhaval, but I think he would take Manu quite seriously: "सम्मानाद्ब्राह्मणो नित्यमुद्विजेत विषादिव । अमृतस्येव चाकाङ्क्षेदवमानस्य सर्वदा ।।" (A brahmin should always shrink from honor as from poison, and should always long for disrespect as for nectar.)
Seriously, if you want to increase Indian participation, write an email to sanskrit-programmers linking to various issues in these projects where they can contribute, and invite contributions from Python programmers (without absurdly insulting Indian scholarship along the way). That's far more likely to be productive.
> email to sanskrit-programmers
I've not seen anything big enough, some small projects and that's it. People code as a hobby and only what they like. These tasks are bigger than just hobby. Can you document what you see, can I ask you for a favor? You know the coders, I do not. If not Thomas, there would be nobody who typed. And I know the picture as you do as well.
> I've not seen anything big enough, some small projects and that's it. People code as a hobby and only what they like.
And people such as those here do not? And people there would not like contributing here? People should indeed do what they truly think is important and enjoyable for themselves. Culture need not be advanced by the miserable. It is quite arrogant to think that others ought to share your priorities (smacks of an extension of the classic "white man's burden" ).
> These tasks are bigger than just hobby.
There are tasks that are bigger than just hobby, but it is false that hobbyists cannot make significant contributions here.
> Can you document what you see, can I ask you for a favor? You know the coders, I do not.
No - I don't know the active coders - same as you. Just shrIvatsa, who is likely to pass. If you're too busy to write the email, I can, of course.
> If not Thomas, there would be nobody who typed.
If it were not for such Indians, there would be nobody who typed as well.
> And I know the picture as you do as well.
I know that you know, and you know that I know that you know. Just bringing the picture into "the picture" and clearing selective amnesia.
Will use table at #177 and retire above table.
I am for SLP1. But if conversion or identification is difficult, IAST will also do.
For all dict.xml files, please. This will bring uniformity across all the XMLs, and the need for separate display tools for different dictionaries can be reduced.