Request of IDs for MW meanings (additional to L for entries)

gasyoun commented 9 years ago

Let me explain https://github.com/sanskrit-lexicon/MWS/issues/7 "24) Can we get a unique ID for every meaning in MW as well?" in more detail. Otherwise it could get lost even for the sake of discussion.

I've been working with MW for ages and have tried a lot to understand what we should do with it in the future (just today I stumbled upon http://www.dialog-21.ru/digests/dialog2006/materials/html/Gasuns.htm). Let me tell you my story of why I need an ID for each "translation" as well. 1) I want to translate ayurvedic literature into Russian. Possible victims:

I've heard that the 2nd German translation took 6.5 years to make. And it was not even a word-by-word translation, so I'm not sure if it's a good idea, but my ayurvedic doctor wants to translate AHS. That means I have to translate it myself before he does, so I can help him in some way. 2) But before it becomes Russian, I need to make an English word-by-word translation. The good news is that Oliver already partly solved the task in 2012 in his DCS. He has assigned an ID to every word (not meaning), and I'm not aware whether the AHS has any misassignments; for that I'll need additional help from an Indian ayurvedic researcher. The bad news: at some point, and he does not remember why, Oliver sorted the meanings inside every dictionary entry alphabetically (see http://samskrtam.ru/dsc-bugs/). That means that the frequent, wanted meanings starting with z will never be there, because the logical order inside the entries is broken. Only because of that, a huge amount of work has to be redone now. A single example: अर्थ has "aim" as its first (= main) meaning. But in Oliver's "new order" it is fifth, and as I take only the first 3 meanings of every entry, it is left out of my automatic word-by-word translation file. 3) If every translation has its own ID, similar to the L numbers, then I will be able to start developing a system for improving the quality of the word-by-word translations. First of all I will include:

translations marked with Med. tag
translations marked with Caraka and other medical authorities
translations in the beginning of the list, those that come first
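The three selection criteria above could be sketched roughly as follows. This is only an illustration, not existing code: the `Meaning` record, the `meaning_id` format, and the tag strings `"Med."` and `"Car."` are assumptions made for the example, and the sample glosses are placeholders rather than real MW data.

```python
from dataclasses import dataclass

@dataclass
class Meaning:
    meaning_id: str                 # hypothetical per-meaning ID
    text: str                       # the English gloss
    position: int                   # 0-based position in the original MW order
    tags: frozenset = frozenset()   # assumed tag strings, e.g. {"Med."}

def rank_meanings(meanings, limit=3):
    """Prefer medical tags, then medical authorities, then original order."""
    def key(m):
        medical = "Med." in m.tags      # assumption: medical tag looks like this
        authority = "Car." in m.tags    # assumption: Caraka is marked like this
        # False sorts before True, so negate the flags we want first.
        return (not medical, not authority, m.position)
    return sorted(meanings, key=key)[:limit]

# Placeholder data, invented for the example.
sample = [
    Meaning("m1", "first gloss", 0),
    Meaning("m2", "second gloss", 1),
    Meaning("m3", "third gloss", 2),
    Meaning("m4", "a medical gloss", 3, frozenset({"Med."})),
]
top = rank_meanings(sample)
print([m.meaning_id for m in top])
```

Because `position` records the original MW order rather than the alphabetical one, a main meaning such as "aim" for अर्थ would stay within the first three unless tagged meanings displace it.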

After that I would make a voting system. An ayurvedic doctor reads the text and clicks on the most appropriate word from the list of 3-5 words. We gather statistics on which meaning they prefer for these words (http://kjc-fs-cluster.kjc.uni-heidelberg.de/dcs/index.php?contents=textdictionary&IDText=10). After that maybe even an auto-correction mode can be made. We "vote" for words in one chapter, and the quality score for other chapters might improve as well. We would need to map which L numbers match which einzelwort&IDWord at Oliver's site (sample: http://kjc-fs-cluster.kjc.uni-heidelberg.de/dcs/index.php?contents=einzelwort&IDWord=104 = L=224, and "without a why or a wherefore, accidentally, suddenly." is split by ","). There are 72 000 words used in Oliver's version of MW. 4) After that I will need to try to match the Russian translations of MW at http://www.yukta.org/download.php?lang=rus and verify them against my Parallel Sanskrit Corpus, like http://samskrtam.ru/parallel-mahabharata/ I do not know if I should try to make a macro to work inside Word, Adobe Acrobat or some web UI. That is not that important now. What matters is that every meaning should have its ID. Does it seem reasonable and possible, @funderburkjim? @drdhaval2785, is it possible to understand what I am talking about?
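To make the request concrete, here is a minimal sketch of per-meaning IDs and vote counting, using the L=224 sample quoted above. The ID format `L224.1` and both function names are hypothetical; nothing like this exists in the MW data yet.

```python
from collections import Counter

def assign_meaning_ids(l_number, definition):
    """Split a definition on commas and give each part an ID derived from L.

    The "L<number>.<n>" format is an assumption made for this sketch.
    """
    parts = [p.strip(" .") for p in definition.split(",")]
    parts = [p for p in parts if p]          # drop empty fragments
    return {f"L{l_number}.{i}": p for i, p in enumerate(parts, start=1)}

def tally_votes(votes):
    """Count how often each meaning ID was chosen, most popular first."""
    return Counter(votes).most_common()

ids = assign_meaning_ids(224, "without a why or a wherefore, accidentally, suddenly.")
print(ids)

# Hypothetical clicks collected from readers of one chapter.
print(tally_votes(["L224.2", "L224.3", "L224.2"]))
```

Splitting on commas is only a starting point: some MW meanings contain commas of their own, so a real scheme would need a more careful rule or hand markup.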

funderburkjim commented 9 years ago

Your links are aesthetically pleasing. Is this your work?

Your suggestion about tagging meanings is a new idea to me. It is definitely a research idea, and should be worked on in a 'dev' environment, and not the main trunk of current MW.

A way for you to begin materializing your thoughts might be to mark up by hand some MW records. That way others can see something specific that they might be able to constructively criticize.

You can also promote interest in your idea by expanding your explanations - I sampled several of the links, and they looked interesting and like someone has spent a lot of time and effort developing them. But, following the train of thought is hard. Just as my readme files are often obscure, these links are also obscure. Unless you know otherwise, assume your reader has absolutely no idea of what you're talking about - don't assume your reader is interested in 'reading your mind'. Be absurdly simple and clear. That will help you get useful feedback.

gasyoun commented 9 years ago

Yes, it's mine. Under my guidance we have developed an .xls macro that takes the .html input and produces a .doc with styles, after which we print to .pdf as well. It's not really research; it has already been there for ten years at http://yukta.org/download.php?lang=rus (esp. the MySQL dump http://yukta.org/download/base_yukta.zip), a partial translation of MW into Russian.

INSERT INTO er VALUES (142252,' an iron chain worn round the loins W. ','железная цепь, носимая вокруг поясницы ','W. ');
INSERT INTO er VALUES (142253,' a partic. measure of length L. ','определ. мера длины ','L. ');
INSERT INTO er VALUES (142254,' ploughing in the regular direction (= %{anuloma-karSaNa}) L. ','пашущий в правильном направлении ','(= %{anuloma-karSaNa}) L. ');
INSERT INTO er VALUES (142255,' the second ploughing of a field W. ','повторная вспашка поля ','W. ');
INSERT INTO er VALUES (142256,' N. of an Asura (cf. %{zambara}) TBr. Sch. ','имя асуры','(cf. %{zambara}) TBr. Sch. ');
INSERT INTO er VALUES (142257,'happy , fortunate L. (cf. %{zaM-vat}) ','счастливый, удачливый ',' L. (cf. %{zaM-vat}) ');
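As a rough illustration of how the numeric first field already works as an ID for each translation, the rows can be loaded into an in-memory SQLite table and looked up by that number. The column names `id`, `en`, `ru` and `ref` are guesses made for this sketch; the real schema is in the dump.

```python
import sqlite3

# Two of the rows quoted above, copied verbatim.
ROWS = """
INSERT INTO er VALUES (142252,' an iron chain worn round the loins W. ','железная цепь, носимая вокруг поясницы ','W. ');
INSERT INTO er VALUES (142253,' a partic. measure of length L. ','определ. мера длины ','L. ');
"""

def load_rows(sql_text):
    """Create a table with assumed column names and run the INSERT statements."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE er (id INTEGER PRIMARY KEY, en TEXT, ru TEXT, ref TEXT)"
    )
    conn.executescript(sql_text)
    return conn

conn = load_rows(ROWS)
en, ru = conn.execute("SELECT en, ru FROM er WHERE id = ?", (142253,)).fetchone()
print(en.strip(), "=>", ru.strip())
```

This works for the simple rows shown here; a full MySQL dump may contain backslash escapes or other syntax that SQLite does not accept as-is.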

To expand the explanations I will need your help, because I need to understand how to relink every entry to what I see at http://kjc-fs-cluster.kjc.uni-heidelberg.de/dcs/index.php?contents=corpus. Do you have an idea how we can know whether, for example, http://kjc-fs-cluster.kjc.uni-heidelberg.de/dcs/index.php?contents=einzelwort&IDWord=325 is akṣa 1 [p= 3,1] [L=424] from http://www.sanskrit-lexicon.uni-koeln.de/scans/MWScan/2014/web/webtc2/index.php? More details about why the relinking is wanted are at http://samskrtam.ru/dsc-bugs/

funderburkjim commented 9 years ago

Under the 'Meanings' section at http://kjc-fs-cluster.kjc.uni-heidelberg.de/dcs/index.php?contents=einzelwort&IDWord=325, there are 8 text definitions. By looking at akza in http://www.sanskrit-lexicon.uni-koeln.de/scans/MWScan/2014/web/webtc/indexcaller.php, the L numbers are attached. There are several other L numbers for akza that do not appear in the list.

    a lawsuit (Monier-Williams, Sir M. (1988))  L=521
    a person born blind (Monier-Williams, Sir M. (1988))  L=522
    knowledge (Monier-Williams, Sir M. (1988))  L=520
    name of a son of Nara (Monier-Williams, Sir M. (1988))  L=525
    name of a son of Rāvaṇa (Monier-Williams, Sir M. (1988))  L=524
    name of Garuḍa (Monier-Williams, Sir M. (1988))  L=523
    religious knowledge (Monier-Williams, Sir M. (1988))  L=520  (Note also 520 above)
    the soul (Monier-Williams, Sir M. (1988))  L=519

Also, the definitions above are sometimes 'expansions' of MW definitions.

If you had (a) the headword (= akza) and (b) the list of definitions, it would probably be possible to find the L numbers of MW which correspond to the definitions.

If you had (b), it might be possible to devise a scheme to deduce (a).
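A minimal sketch of the first case, assuming the MW records for one headword are available as a mapping from L number to definition text. The function names, the similarity threshold, and the sample MW texts are illustrative assumptions, not the actual MW data; the only 'expansion' handled is 'N. of' becoming 'name of'.

```python
from difflib import SequenceMatcher

def normalize(text):
    """Lowercase and undo one known expansion: MW 'N. of' vs. 'name of'."""
    return text.lower().replace("n. of", "name of").strip(" .")

def best_l_number(dcs_definition, mw_records, threshold=0.6):
    """Return the L number whose MW text is most similar, or None.

    mw_records maps L numbers to MW definition text for ONE headword,
    i.e. the headword (a) is assumed to be known already.
    """
    target = normalize(dcs_definition)
    best_l, best_score = None, 0.0
    for l_number, mw_text in mw_records.items():
        score = SequenceMatcher(None, target, normalize(mw_text)).ratio()
        if score > best_score:
            best_l, best_score = l_number, score
    return best_l if best_score >= threshold else None

# Illustrative records only; the real MW text for these L numbers may differ.
mw_akza = {519: "the soul", 521: "a lawsuit", 523: "N. of Garuḍa"}
print(best_l_number("name of Garuḍa", mw_akza))
```

Definitions that fall below the threshold come back as `None`, so they can be collected for hand checking instead of being linked wrongly.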

gasyoun commented 9 years ago

By 'expansions' do you mean the 'religious knowledge' case? I guess it's easier to use (a) for matching the L numbers, but in this case we have (b) as well. Any idea where to start? In addition to the 72 000 MW headwords we have from Oliver's site, he also has some simplified lexical markup that might turn out to be useful.

funderburkjim commented 9 years ago

re: By 'expansions' you mean the religious knowledge case? That's one; another is 'name' in 523-5.

'Any idea where to start?' A download of Oliver's data.

General principle: computer programs have input and output. The input has to be completely specified (simplest case: the input is all the data in a particular computer file).

Application of general principle: You must specify the 'input' for the computer program you have in mind. That seems to me the place to start.

gasyoun commented 9 years ago

@funderburkjim https://github.com/sanskrit-lexicon/DCS/blob/master/DCS-72034-gramm-tag-stats.csv is one of the downloads I've made. Is it enough to serve as (a)?

funderburkjim commented 9 years ago

So that is a list of headwords.

Where are the definitions?

Also, the file of headwords shows many '?' characters: aka??akin, aka??aka ;adj, , aka??hya ;adj .

drdhaval2785 commented 3 years ago

@gasyoun, I am not sure whether I understand you here. Can you rephrase what you want done?

Andhrabharati commented 6 months ago

@funderburkjim / @gasyoun,

Is this issue closable now?

funderburkjim commented 6 months ago

Since this issue is mentioned in @drdhaval2785's list at https://github.com/sanskrit-lexicon/COLOGNE/issues/325, let's close this particular issue.

@drdhaval2785 perhaps your list should be reviewed and status updated?
