Open drdhaval2785 opened 1 year ago
I need to create a small sample of about 100 verses of this type and work with Jim to modify make_xml.py so that it takes this modified metaline structure into account and generates XML that is more in sync with the CDSL XML types.
Note that I have used the gender and headword details in Devanagari so that non-technical people who are not comfortable with SLP can fill in the data more easily. They can work in native Devanagari script.
पुं - masculine
स्त्री - feminine
क्ली - neuter
अ - indeclinable
This gender information can be expanded if needed.
Also add,
त्रि - adjective
अ क्रि - intransitive verb
स क्रि - transitive verb
धा - root
...
The list can be expanded as more dictionaries are included.
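The abbreviations listed so far could be kept in one lookup table, which would also serve for tooltips later. This is only a sketch: the names `GENDER_ABBREVIATIONS` and `expand_abbreviation` are hypothetical and not part of make_xml.py, and the English glosses are simply the ones given in this thread.

```python
# Hypothetical lookup table; the glosses are the ones given in this thread.
GENDER_ABBREVIATIONS = {
    "पुं": "masculine",
    "स्त्री": "feminine",
    "क्ली": "neuter",
    "अ": "indeclinable",
    "त्रि": "adjective",
    "अ क्रि": "intransitive verb",
    "स क्रि": "transitive verb",
    "धा": "root",
}


def expand_abbreviation(abbr: str) -> str:
    """Return the English gloss, or the abbreviation unchanged if it is unknown."""
    key = abbr.strip()
    return GENDER_ABBREVIATIONS.get(key, key)
```

Returning unknown abbreviations unchanged means new markers from later dictionaries would show up as-is until the table is extended.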
Typing : requires use of shift. It is quite tedious for repetitive work such as this. We can keep ; instead of : to improve typing speed. It would make no difference in the parsing.
<L>१<pc>१४०<k1>क;पुं<meanings>सूर्य;पुं,वेधस्;पुं
<L>२<pc>१४०<k1>क;क्ली<meanings>सुख;क्ली,मस्तक;क्ली,जल;क्ली
<L>३<pc>१४०<k1>श्लोक;पुं<meanings>अनुष्टुभ्;स्त्री,यशस्;क्ली
<L>४<pc>१४०<k1>लोक;पुं<meanings>भुवन;पुं,जन;पुं
सूर्ये वेधसि वायौ कः कं सुखे मस्तके जले ।
अनुष्टुब्यशसोः श्लोको लोकस्तु भुवने जने ॥ १ ॥
In some cases it is not possible to decipher the gender from the case ending itself.
I would not venture to provide gender details myself in such cases, e.g. what gender anuzwuB would be is not clear from the context.
Similarly, whether Buvana is masculine or neuter is not clear from the word Buvane. Both take the same case endings. In such cases, I think it is better to keep the gender information blank. Whenever there is a clear indication of the gender (like a headword or meaning in feminine saptamI form), I will capture the details. Otherwise, I will leave it blank.
Something like the following.
<L>१<pc>१४०<k1>क;पुं<meanings>सूर्य,वेधस्
<L>२<pc>१४०<k1>क;क्ली<meanings>सुख,मस्तक,जल
<L>३<pc>१४०<k1>श्लोक;पुं<meanings>अनुष्टुभ्,यशस्
<L>४<pc>१४०<k1>लोक;पुं<meanings>भुवन,जन
सूर्ये वेधसि वायौ कः कं सुखे मस्तके जले ।
अनुष्टुब्यशसोः श्लोको लोकस्तु भुवने जने ॥ १ ॥
When the gender info in a meaning is clear, like “kambuni”, i.e. the saptami of “kambu” in neuter gender, I will keep such info. It is an unambiguous case ending.
In any normal context, the : denotes some relation between the two sides, whereas the ; denotes a separation. I would suggest typing a simple - instead, which also denotes some kind of relation.
Fine. I will use -
@funderburkjim
Kindly find attached the file anhk1.txt, with markup of 120 verses made as per the discussion above.
Please go through it and suggest what kind of XML we need to create. If you can specify the structure of the XML, I can create a version of make_xml.py to generate it.
@drdhaval2785 seems that the call has revived your idea, happy to see it. @Andhrabharati it's good to have you around.
@gasyoun I think no one else in history has looked at as many dictionaries (leaving aside all other varieties of works), in as many languages, as I have [and as deeply as I 'delve'].
I am proud enough to say that no one can ever beat me in this; I just wish that my experience be used beneficially by others (when I 'talk').
@drdhaval2785 Many Skt. koshas that were digitised by you (and/or by me earlier) have the meanings also listed in either English or Hindi, in addition to Sanskrit. [The Vaijayanti and Harshakirti's Anekarthanamamala, which you started with as examples, also fall under this category.]
Would it not be a good idea to add them as well, so that it would be even more useful to the end-users?
I agree that it would be a good addition if we are able to capture the meanings in Hindi / English provided by the editors of the works. The only thing I am concerned about is that they are fresh works and may be in copyright.
Let's make a list of whom @Andhrabharati considers as valuable.
@drdhaval2785
Example
<L>9<pc>140<k1>कटक-क्ली<meanings>कण्ठक,सैन्य,पर्वतनितम्ब
<L>10<pc>140<k1>कण्टक-क्ली<meanings>रोमहर्ष,सूच्यग्र,क्षुद्रवैरिन्
कटकं कण्ठके सैन्ये नितम्बे पर्वतस्य च ।
कण्टकं रोमहर्षे स्यात् सूच्यग्रे क्षुद्रवैरिणि ॥ ३ ॥
<LEND>
Possible alternative:
<L>9<pc>140
<k1>कटक-क्ली<meanings>कण्ठक,सैन्य,पर्वतनितम्ब
<k1>कण्टक-क्ली<meanings>रोमहर्ष,सूच्यग्र,क्षुद्रवैरिन्
कटकं कण्ठके सैन्ये नितम्बे पर्वतस्य च ।
कण्टकं रोमहर्षे स्यात् सूच्यग्रे क्षुद्रवैरिणि ॥ ३ ॥
<LEND>
The value of <L> is primarily an identifier of the 'record' or 'document' or 'entry'.
In this case the 'entry' is the 'document' containing the next 4 lines (all the lines up to, but not including, <LEND>).
L is numeric, and the sqlite database is typically ordered by L.
A basic (or mobile1) display would display the 'document' (these 4 lines, with some html prettification) from user entry of any of the 8 words कटक through क्षुद्रवैरिन्. Right?
Secondary question: what is क्ली an abbreviation for? I guess some gender information. What are all the possible gender-information abbreviations? Are all the words substantives whose gender information, if present, is some collection of masculine, feminine, neuter, indeclinable (m, f, n, ind.)?
klI is shorthand for klIba i.e. neuter gender
I agree regarding your suggestion to use L number to refer to the entry. Will update the files accordingly. And your understanding is correct that any of those 8 headwords should lead to this entry.
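The lookup behaviour agreed on here (any of the 8 words leads to the same entry) could be sketched as an inverted index from word to L number. This is not the actual CDSL display code: `build_word_index` is a hypothetical helper, and it assumes the alternative layout above, with `<L>…<pc>…` on its own line and `-` separating a word from its gender.

```python
import re
from collections import defaultdict

SAMPLE = """<L>9<pc>140
<k1>कटक-क्ली<meanings>कण्ठक,सैन्य,पर्वतनितम्ब
<k1>कण्टक-क्ली<meanings>रोमहर्ष,सूच्यग्र,क्षुद्रवैरिन्
कटकं कण्ठके सैन्ये नितम्बे पर्वतस्य च ।
कण्टकं रोमहर्षे स्यात् सूच्यग्रे क्षुद्रवैरिणि ॥ ३ ॥
<LEND>"""

L_LINE = re.compile(r"^<L>(\d+)<pc>")
K1_LINE = re.compile(r"<k1>(?P<k1>[^<]*)<meanings>(?P<meanings>.*)$")


def build_word_index(text):
    """Map every headword and every meaning to the L numbers of its entries."""
    index = defaultdict(set)
    current_l = None
    for raw in text.splitlines():
        line = raw.strip()
        if line == "<LEND>":
            current_l = None
            continue
        m = L_LINE.match(line)
        if m:
            current_l = int(m.group(1))
            continue
        m = K1_LINE.search(line)
        if m and current_l is not None:
            # Drop the optional "-gender" suffix before indexing the word.
            words = [m.group("k1")] + m.group("meanings").split(",")
            for item in words:
                word = item.split("-", 1)[0].strip()
                if word:
                    index[word].add(current_l)
    return index


index = build_word_index(SAMPLE)
print(sorted(index["सैन्य"]))  # L numbers of the entries that mention सैन्य
```

Keeping a set of L numbers per word allows the same word to point to several entries once more verses are indexed.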
All possible gender information: I am not sure I can know this beforehand. It will depend on the dictionary being added. In some dictionaries, information about roots will be there. For some words, the lexicon also mentions that the word is always used in the plural, e.g. dArAH. As and when I come across such info, I will add a new abbreviation to capture it.
Currently I run a script which captures all the gender information markup and its frequency of occurrence in the dictionary. I will post such info for each dictionary being added. I will also keep a file which will note all abbreviations being used in all koshas. This will help us to create tooltips if needed.
Updated the file according to the requirements specified above. anhk1.txt
The gender information is as follows for the present work [('पुं', 174), ('स्त्री', 90), ('क्ली', 77), ('अ', 29)]
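A frequency list of this shape could be produced along the following lines. This is a guess at the approach, not the script actually used: `gender_frequencies` is a hypothetical name, and the sketch counts only the gender marker attached to `<k1>` with a `-` separator, which may differ from what the real script counts.

```python
import re
from collections import Counter

# Capture the text between the "-" after the headword and the <meanings> tag.
GENDER_IN_K1 = re.compile(r"<k1>[^<-]*-([^<]*)<meanings>")

SAMPLE_LINES = [
    "<eid>1<k1>क-पुं<meanings>सूर्य,वेधस्",
    "<eid>2<k1>क-क्ली<meanings>सुख,मस्तक,जल",
    "<eid>3<k1>श्लोक-पुं<meanings>अनुष्टुभ्,यशस्",
    "<eid>4<k1>लोक-पुं<meanings>भुवन,जन",
]


def gender_frequencies(lines):
    """Return (gender marker, count) pairs, most frequent first."""
    counts = Counter()
    for line in lines:
        m = GENDER_IN_K1.search(line)
        if m and m.group(1).strip():
            counts[m.group(1).strip()] += 1
    return counts.most_common()


print(gender_frequencies(SAMPLE_LINES))  # [('पुं', 3), ('क्ली', 1)]
```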
Based on #409, the file anhk1.txt is modified to have a unique identifier per anekArthaka word-meanings set. eid stands for extra id.
<L>1<pc>140
<eid>1<k1>क-पुं<meanings>सूर्य,वेधस्
<eid>2<k1>क-क्ली<meanings>सुख,मस्तक,जल
<eid>3<k1>श्लोक-पुं<meanings>अनुष्टुभ्,यशस्
<eid>4<k1>लोक-पुं<meanings>भुवन,जन
सूर्ये वेधसि वायौ कः कं सुखे मस्तके जले ।
अनुष्टुब्यशसोः श्लोको लोकस्तु भुवने जने ॥ १ ॥
<LEND>
<L>2<pc>140
<eid>5<k1>अङ्क-पुं<meanings>उत्सङ्ग,चिह्न
<eid>6<k1>कलङ्क-पुं<meanings>अङ्क,अपवाद
<eid>7<k1>कौशिक-पुं<meanings>इन्द्र,घूक
<eid>8<k1>पृथुक-पुं<meanings>चिपिट,अर्भक
उत्सङ्गचिह्नयोरङ्कः कलङ्कोऽङ्कापवादयोः ।
इन्द्रे घूके कौशिकः स्यात् पृथुकौ चिपिटार्भकौ ॥ २ ॥
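A reader for this layout could look roughly like the sketch below. It is not the real make_xml.py: the `Entry` and `Sense` classes and `parse_entries` are hypothetical names. It assumes each entry opens with an `<L>…<pc>…` line and that any line inside an entry that is not an `<eid>` line belongs to the verse. A new `<L>` line also starts a new entry, so a missing `<LEND>` is tolerated.

```python
import re
from dataclasses import dataclass, field

SAMPLE = """<L>1<pc>140
<eid>1<k1>क-पुं<meanings>सूर्य,वेधस्
<eid>2<k1>क-क्ली<meanings>सुख,मस्तक,जल
<eid>3<k1>श्लोक-पुं<meanings>अनुष्टुभ्,यशस्
<eid>4<k1>लोक-पुं<meanings>भुवन,जन
सूर्ये वेधसि वायौ कः कं सुखे मस्तके जले ।
अनुष्टुब्यशसोः श्लोको लोकस्तु भुवने जने ॥ १ ॥
<LEND>"""


@dataclass
class Sense:
    eid: int
    headword: str
    gender: str  # empty string when no gender is marked
    meanings: list


@dataclass
class Entry:
    lnum: int
    pc: str
    senses: list = field(default_factory=list)
    verse: list = field(default_factory=list)


L_LINE = re.compile(r"^<L>(\d+)<pc>(\S+)$")
EID_LINE = re.compile(r"^<eid>(\d+)<k1>([^<]+)<meanings>(.*)$")


def parse_entries(text):
    """Group lines into entries; each <L> line opens one, <LEND> closes it."""
    entries, current = [], None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        m = L_LINE.match(line)
        if m:
            current = Entry(lnum=int(m.group(1)), pc=m.group(2))
            entries.append(current)
            continue
        if line == "<LEND>":
            current = None
            continue
        if current is None:
            continue  # stray text outside any entry is ignored
        m = EID_LINE.match(line)
        if m:
            headword, _, gender = m.group(2).partition("-")
            meanings = [w for w in m.group(3).split(",") if w]
            current.senses.append(Sense(int(m.group(1)), headword, gender, meanings))
        else:
            current.verse.append(line)
    return entries


for sense in parse_entries(SAMPLE)[0].senses:
    print(sense.eid, sense.headword, sense.gender, sense.meanings)
```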
@funderburkjim When you add this dictionary to CDSL, kindly keep ANHK as the dictionary code. This is the code I have used for this dictionary in my sanskrit-kosha project. Keeping the same code will help me track the lexica across repositories without hassle.
@drdhaval2785 I missed your request to use ANHK . Will make v3 to change from HARSA to ANHK.
Concept of anekArthaka dictionaries
There is a headword (with or without gender information). For that headword, one or more meanings are given (with or without gender information). The meanings are not necessarily synonyms of each other; mostly they are not.
Explanation in mathematical terms
If the anekArthaka relationship is denoted by f(n), then A f(n) {B, C} would mean A = B and A = C, but B is not necessarily equal to C. Most probably B != C would hold.
Sample data
From Amarakosha nAnArthavarga
मारुते वेधसि ब्रघ्ने पुंसि कः कं शिरोऽम्बुनोः ।
स्यात्पुलाकस्तुच्छधान्ये संक्षेपे भक्तसिक्थके ॥ ५ ॥
Problem to be handled
We need to devise a markup standard by which the information is captured without any loss, while encoding. We can use this information later on, for display or otherwise. We can later on generate synsets too.
Proposed markup (Edited per https://github.com/sanskrit-lexicon/COLOGNE/issues/405#issuecomment-1471218634)
If the gender information is absent or ambiguous, do not try too hard to interpret it manually; we can leave it blank. For example, in the following it is not clear whether अनुष्टुभ् or यशस् is neuter, feminine, or masculine, so they are kept blank. (This information can be pulled from other dictionaries later if required.) It is better not to encode information explicitly when we are not sure about it.
Explanation of metaline
L is the lnum, which would be unique for each headword:meanings pair.
pc is the page-column number detail to identify the page number.
k1 is headword:gender information.
meanings is a comma-separated list of meaning:gender information.
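The four fields of a single-line metaline could be read as in the sketch below. `parse_metaline` and `split_word_gender` are hypothetical names, not functions from make_xml.py. The sketch assumes that `:`, `;` and `-` are all accepted as the word/gender separator (the three variants discussed in this thread), that headwords themselves contain none of these characters, and that a missing gender is returned as `None`.

```python
import re

METALINE = re.compile(
    r"^<L>(?P<L>\d+)<pc>(?P<pc>[^<]+)<k1>(?P<k1>[^<]+)<meanings>(?P<meanings>.*)$"
)
SEPARATOR = re.compile(r"[:;-]")  # any of the separators proposed in the thread


def split_word_gender(item):
    """Split 'word<sep>gender' into (word, gender); gender is None if absent."""
    parts = SEPARATOR.split(item.strip(), maxsplit=1)
    gender = parts[1] if len(parts) > 1 and parts[1] else None
    return parts[0], gender


def parse_metaline(line):
    """Parse one metaline into its L, pc, k1, gender and meanings fields."""
    m = METALINE.match(line.strip())
    if m is None:
        raise ValueError(f"not a metaline: {line!r}")
    headword, gender = split_word_gender(m.group("k1"))
    meanings = [
        split_word_gender(x) for x in m.group("meanings").split(",") if x.strip()
    ]
    return {
        "L": int(m.group("L")),  # int() also accepts Devanagari digits
        "pc": m.group("pc"),
        "k1": headword,
        "gender": gender,
        "meanings": meanings,
    }


print(parse_metaline("<L>१<pc>१४०<k1>क;पुं<meanings>सूर्य;पुं,वेधस्;पुं"))
```

Because the gender part is optional, the same function would handle lines where the gender of a meaning has been left blank.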