sanskrit-lexicon / COLOGNE

Development of http://www.sanskrit-lexicon.uni-koeln.de/

Discussion about metaline for anekArthaka dictionaries (अनेकार्थक कोश) #405

Open drdhaval2785 opened 1 year ago

drdhaval2785 commented 1 year ago

Concept of anekArthaka dictionaries

There is a headword (with or without gender information). For that headword, one or more meanings are given (each with or without gender information). The meanings are not necessarily synonyms of one another; mostly they are not.

Explanation in mathematical terms

If the anekArthaka relationship is denoted by f(n), then A f(n) {B, C} means A = B and A = C, but B is not necessarily equal to C. In most cases B != C.

Sample data

From Amarakosha nAnArthavarga

मारुते वेधसि ब्रध्ने पुंसि कः कं शिरोऽम्बुनोः । स्यात्पुलाकस्तुच्छधान्ये संक्षेपे भक्तसिक्थके ॥ ५ ॥

Problem to be handled

We need to devise a markup standard by which the information is captured without any loss, while encoding. We can use this information later on, for display or otherwise. We can later on generate synsets too.

Proposed markup (Edited per https://github.com/sanskrit-lexicon/COLOGNE/issues/405#issuecomment-1471218634)

<L>1<pc>102
<k1>क-पुं<meanings>मारुत-पुं,वेधस्-पुं,ब्रध्न-पुं
<k1>क-क्ली<meanings>शिरस्-पुं,अम्बु-क्ली
<k1>पुलाक-पुं<meanings>तुच्छधान्य-क्ली,भक्तसिक्थक-क्ली,संक्षेप-पुं
मारुते वेधसि ब्रध्ने पुंसि कः कं शिरोऽम्बुनोः ।
स्यात्पुलाकस्तुच्छधान्ये संक्षेपे भक्तसिक्थके ॥ ५ ॥
<LEND>

If the gender information is absent or ambiguous, do not try too hard to interpret it manually; we can leave it blank. For example, in the following it is not clear whether अनुष्टुभ् and यशस् are neuter, feminine or masculine, so they are left blank. (For later use, this information can be pulled from other dictionaries if required.) It is better not to encode information explicitly when we are not sure about it.

क;पुं<meanings>सूर्य,वेधस्
क;क्ली<meanings>सुख,मस्तक,जल
श्लोक;पुं<meanings>अनुष्टुभ्,यशस्
लोक;पुं<meanings>भुवन,जन
सूर्ये वेधसि वायौ कः कं सुखे मस्तके जले ।
अनुष्टुब्यशसोः श्लोको लोकस्तु भुवने जने ॥ १ ॥

Explanation of metaline

L is the lnum, which would be unique for each headword:meanings pair. pc is the page-column detail that identifies the page. k1 is the headword:gender information. meanings is a comma-separated list of meaning:gender information.
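The k1 / meanings part of the metaline can be read mechanically. The following is only a minimal sketch, not the project's make_xml.py: the names `Word`, `split_word` and `parse_k1_line` are hypothetical, and it assumes the `-` separator between a word and its gender that is settled on later in this thread, with the gender being optional.

```python
import re
from typing import List, NamedTuple, Optional, Tuple


class Word(NamedTuple):
    text: str
    gender: Optional[str]  # None when the gender was left blank


def split_word(token: str, sep: str = "-") -> Word:
    """Split 'word-gender' into its parts; the gender part is optional."""
    text, found, gender = token.strip().partition(sep)
    return Word(text, gender if found and gender else None)


# Assumed line shape: <k1>headword-gender<meanings>m1-gender,m2,...
K1_LINE = re.compile(r"^<k1>(?P<k1>[^<]+)<meanings>(?P<meanings>.*)$")


def parse_k1_line(line: str) -> Tuple[Word, List[Word]]:
    """Return the headword and its list of meanings from one <k1> line."""
    match = K1_LINE.match(line.strip())
    if match is None:
        raise ValueError("not a <k1>...<meanings>... line: %r" % line)
    head = split_word(match.group("k1"))
    meanings = [
        split_word(token)
        for token in match.group("meanings").split(",")
        if token.strip()
    ]
    return head, meanings


if __name__ == "__main__":
    head, meanings = parse_k1_line("<k1>श्लोक-पुं<meanings>अनुष्टुभ्,यशस्")
    print(head, meanings)
```

Keeping a blank gender as `None` rather than guessing a value matches the stated preference of not encoding information we are not sure about.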

drdhaval2785 commented 1 year ago

I need to create a small sample of 100 verses or so of this type and work with Jim to modify the make_xml.py to take this modified metaline structure into account and generate the XMLs which are more in sync with CDSL XML types.

drdhaval2785 commented 1 year ago

Note that I have given the gender and headword details in Devanagari so that non-technical contributors who are not comfortable with SLP can fill in the data more easily. They can work in native Devanagari script.

पुं - masculine
स्त्री - feminine
क्ली - neuter
अ - indeclinable

This list of gender abbreviations can be expanded if needed.

Andhrabharati commented 1 year ago

Also add,

त्रि - adjective
अ क्रि - intransitive verb
स क्रि - transitive verb
धा - root
...

the list can be expanded, as more dictionaries are being included.

drdhaval2785 commented 1 year ago

Typing : requires the use of shift, which is quite tedious for repetitive work such as this. We can use ; instead of : to improve typing speed. It would make no difference to the parsing.

drdhaval2785 commented 1 year ago

<L>१<pc>१४०<k1>क;पुं<meanings>सूर्य;पुं,वेधस्;पुं
<L>२<pc>१४०<k1>क;क्ली<meanings>सुख;क्ली,मस्तक;क्ली,जल;क्ली
<L>३<pc>१४०<k1>श्लोक;पुं<meanings>अनुष्टुभ्;स्त्री,यशस्;क्ली
<L>४<pc>१४०<k1>लोक;पुं<meanings>भुवन;पुं,जन;पुं
सूर्ये वेधसि वायौ कः कं सुखे मस्तके जले ।
अनुष्टुब्यशसोः श्लोको लोकस्तु भुवने जने ॥ १ ॥

In some cases it is not possible to decipher the gender from the case ending alone, and I would not venture to supply gender details myself. For example, the gender of anuzwuB is not clear from the context. Similarly, whether Buvana is masculine or neuter is not clear from the form Buvane, because both genders take the same case ending. In such cases, I think it is better to leave the gender information blank. Whenever there is a clear indication of the gender (such as a headword or meaning in a feminine saptamI form), I will capture it. Otherwise, I will leave it out.

Something like the following.

<L>१<pc>१४०<k1>क;पुं<meanings>सूर्य,वेधस्
<L>२<pc>१४०<k1>क;क्ली<meanings>सुख,मस्तक,जल
<L>३<pc>१४०<k1>श्लोक;पुं<meanings>अनुष्टुभ्,यशस्
<L>४<pc>१४०<k1>लोक;पुं<meanings>भुवन,जन
सूर्ये वेधसि वायौ कः कं सुखे मस्तके जले ।
अनुष्टुब्यशसोः श्लोको लोकस्तु भुवने जने ॥ १ ॥

drdhaval2785 commented 1 year ago

When the gender of a meaning is clear, as in “kambuni”, i.e. the saptamI of “kambu” in the neuter gender, I will keep that information. It is an unambiguous case ending.

Andhrabharati commented 1 year ago

Typing : requires use of shift. It is quite tedious for repetitive work such as this. We can keep ; instead of : to improve typing speed. It would have no difference in the parsing.

In any normal context, the : denotes some relation between the two sides, whereas the ; denotes a separation.

I would suggest typing a simple - instead, which also denotes some kind of relation.

drdhaval2785 commented 1 year ago

Fine. I will use -

drdhaval2785 commented 1 year ago

@funderburkjim

Kindly find attached the file with markup of 120 verses made as per the discussion above. anhk1.txt

Please go through it and suggest what kind of XML we need to create. If you can specify the structure of the XML, I can create a version of make_xml.py to generate it.

gasyoun commented 1 year ago

@drdhaval2785 seems that the call has revived your idea, happy to see it. @Andhrabharati it's good to have you around.

Andhrabharati commented 1 year ago

@gasyoun I think no one else in history would have looked at as many dictionaries (leaving aside all other varieties of works), in as many languages, as I have [and as deeply as I 'delve'].

I am proud enough to say that no one ever can beat me in this; I just wish my experience be used beneficially by others (when I 'talk').

@drdhaval2785 Many Skt. koshas digitised by you (and/or by me earlier) also list the meanings in English or Hindi, in addition to Sanskrit. [The Vaijayanti and Harshakirti's Anekarthanamamala, which you started with as examples, also fall under this category.]

Would it not be a good idea to add them as well, so that it would be further useful to the end-users?

drdhaval2785 commented 1 year ago

I agree that it would be a good addition if we are able to capture the meanings in Hindi / English provided by the editors of the works. The only thing I am concerned about is that they are fresh works and may be in copyright.

gasyoun commented 1 year ago

fresh works and may be in copyright.

Let's make a list of whom @Andhrabharati considers as valuable.

funderburkjim commented 1 year ago

@drdhaval2785

Why multiple 'L' in an entry?

Example

<L>9<pc>140<k1>कटक-क्ली<meanings>कण्ठक,सैन्य,पर्वतनितम्ब
<L>10<pc>140<k1>कण्टक-क्ली<meanings>रोमहर्ष,सूच्यग्र,क्षुद्रवैरिन्
कटकं कण्ठके सैन्ये नितम्बे पर्वतस्य च ।
कण्टकं रोमहर्षे स्यात् सूच्यग्रे क्षुद्रवैरिणि ॥ ३ ॥
<LEND>

Possible alternative:

<L>9<pc>140
<k1>कटक-क्ली<meanings>कण्ठक,सैन्य,पर्वतनितम्ब
<k1>कण्टक-क्ली<meanings>रोमहर्ष,सूच्यग्र,क्षुद्रवैरिन्
कटकं कण्ठके सैन्ये नितम्बे पर्वतस्य च ।
कण्टकं रोमहर्षे स्यात् सूच्यग्रे क्षुद्रवैरिणि ॥ ३ ॥
<LEND>

The value of <L> is primarily an identifier of the 'record' or 'document' or 'entry'.

In this case the 'entry' is the 'document' containing the next 4 lines. (all the lines up to, but not including <LEND>).

L is numeric, and the sqlite database is typically ordered by L.

A basic (or mobile1) display would display the 'document' (these 4 lines, with some html prettification) from user entry of any of the 8 words कटक through क्षुद्रवैरिन्. Right?
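The lookup described above (any of the 8 words leading to the same entry) can be sketched as a word-to-L index. This is only an illustration under assumptions: `build_index` is a hypothetical helper, not part of the CDSL display code, and it assumes the alternative layout shown above, with one `<L>` line per entry, `<k1>` lines beneath it, and `-` between a word and its gender.

```python
from collections import defaultdict
from typing import Dict, Set

SAMPLE = """<L>9<pc>140
<k1>कटक-क्ली<meanings>कण्ठक,सैन्य,पर्वतनितम्ब
<k1>कण्टक-क्ली<meanings>रोमहर्ष,सूच्यग्र,क्षुद्रवैरिन्
कटकं कण्ठके सैन्ये नितम्बे पर्वतस्य च ।
कण्टकं रोमहर्षे स्यात् सूच्यग्रे क्षुद्रवैरिणि ॥ ३ ॥
<LEND>"""


def build_index(text: str) -> Dict[str, Set[int]]:
    """Map every headword and meaning to the L numbers of its entries."""
    index: Dict[str, Set[int]] = defaultdict(set)
    current = None  # L number of the entry being read, if any
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("<L>"):
            current = int(line[3:].split("<pc>")[0])
        elif line == "<LEND>":
            current = None
        elif line.startswith("<k1>") and current is not None:
            head, _, meanings = line[4:].partition("<meanings>")
            for token in [head] + meanings.split(","):
                word = token.partition("-")[0].strip()  # drop the gender
                if word:
                    index[word].add(current)
    return index


if __name__ == "__main__":
    index = build_index(SAMPLE)
    print(sorted(index["क्षुद्रवैरिन्"]))
```

A set of L numbers is used per word because the same word can be a headword or a meaning in more than one entry.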

what is क्ली?

Secondary question: what is क्ली an abbreviation for? I guess some gender information. What are all the possible gender-information abbreviations? Are all the words substantives whose gender information, if present, is some combination of masculine, feminine, neuter and indeclinable (m, f, n, ind.)?

drdhaval2785 commented 1 year ago

klI is shorthand for klIba i.e. neuter gender

drdhaval2785 commented 1 year ago

I agree regarding your suggestion to use L number to refer to the entry. Will update the files accordingly. And your understanding is correct that any of those 8 headwords should lead to this entry.

All possible gender information: I cannot list this in advance. It will depend on the dictionary being added. In some dictionaries, information about roots will be there. For some words, the lexicon also mentions that the word is always used in the plural, e.g. dArAH. As and when I come across such information, I will add a new abbreviation to capture it.

Currently I run a script which captures all the gender information markup and its frequency of occurrence in the dictionary. I will post such info for each dictionary being added. I will also keep a file which will note all abbreviations being used in all koshas. This will help us to create tooltips if needed.
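The script itself is not shown in this thread. A minimal sketch of such a frequency count, assuming `<k1>...<meanings>...` lines with `-` between a word and its gender (the function name `gender_frequencies` is hypothetical), could look like this:

```python
from collections import Counter


def gender_frequencies(text: str) -> Counter:
    """Count each gender abbreviation attached to a headword or a meaning."""
    counts: Counter = Counter()
    for raw in text.splitlines():
        line = raw.strip()
        start = line.find("<k1>")
        if start == -1:
            continue  # verse lines, <L> lines and <LEND> carry no tags
        head, _, meanings = line[start + 4:].partition("<meanings>")
        for token in [head] + meanings.split(","):
            _, found, gender = token.strip().partition("-")
            if found and gender:
                counts[gender] += 1
    return counts


if __name__ == "__main__":
    sample = "\n".join([
        "<L>1<pc>140",
        "<k1>क-पुं<meanings>सूर्य,वेधस्",
        "<k1>क-क्ली<meanings>सुख-क्ली,मस्तक,जल",
        "<LEND>",
    ])
    # most_common() lists (abbreviation, count) pairs, most frequent first.
    print(gender_frequencies(sample).most_common())
```

Words whose gender was left blank are simply not counted, so the totals reflect only explicitly encoded gender information.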

drdhaval2785 commented 1 year ago

Updated the file according to the requirements specified above. anhk1.txt

drdhaval2785 commented 1 year ago

The gender information is as follows for the present work [('पुं', 174), ('स्त्री', 90), ('क्ली', 77), ('अ', 29)]

drdhaval2785 commented 1 year ago

Based on #409, the file anhk1.txt is modified to have a unique identifier per anekArthaka word-meanings set. eid stands for extra id.

anhk1.txt

<L>1<pc>140
<eid>1<k1>क-पुं<meanings>सूर्य,वेधस्
<eid>2<k1>क-क्ली<meanings>सुख,मस्तक,जल
<eid>3<k1>श्लोक-पुं<meanings>अनुष्टुभ्,यशस्
<eid>4<k1>लोक-पुं<meanings>भुवन,जन
सूर्ये वेधसि वायौ कः कं सुखे मस्तके जले ।
अनुष्टुब्यशसोः श्लोको लोकस्तु भुवने जने ॥ १ ॥
<LEND>
<L>2<pc>140
<eid>5<k1>अङ्क-पुं<meanings>उत्सङ्ग,चिह्न
<eid>6<k1>कलङ्क-पुं<meanings>अङ्क,अपवाद
<eid>7<k1>कौशिक-पुं<meanings>इन्द्र,घूक
<eid>8<k1>पृथुक-पुं<meanings>चिपिट,अर्भक
उत्सङ्गचिह्नयोरङ्कः कलङ्कोऽङ्कापवादयोः ।
इन्द्रे घूके कौशिकः स्यात् पृथुकौ चिपिटार्भकौ ॥ २ ॥

drdhaval2785 commented 1 year ago

@funderburkjim When you add this dictionary to CDSL, kindly keep ANHK as the dictionary code. This is the code I have used for this dictionary in my sanskrit-kosha project. Keeping the same code will help me track the lexica across repositories without hassle.

funderburkjim commented 1 year ago

@drdhaval2785 I missed your request to use ANHK. Will make v3 to change from HARSA to ANHK.