sanskrit-lexicon / COLOGNE

Development of http://www.sanskrit-lexicon.uni-koeln.de/

How do I submit a new dictionary? #336

Open drdhaval2785 opened 3 years ago

drdhaval2785 commented 3 years ago

This issue may serve as documentation for future contributors who wish to contribute a dictionary to the Cologne Sanskrit Dictionaries.

I now have a Sanskrit-Sanskrit dictionary ready with proper tagging of headwords. How can I integrate it at Cologne, @funderburkjim ?

Dictionary name - अभिधानरत्नमाला (Abhidhānaratnamālā)
Author - हलायुध (Halāyudha)

Files

TXT - https://github.com/sanskrit-kosha/kosha/blob/master/abhidhanaratnamala_halayudha/orig/abhidhanaratnamala.txt
BABYLON - https://github.com/sanskrit-kosha/kosha/blob/master/abhidhanaratnamala_halayudha/babylon/abhidhanaratnamala.babylon
XML - https://github.com/sanskrit-kosha/kosha/blob/master/abhidhanaratnamala_halayudha/xml/abhidhanaratnamala.xml
HTML - https://github.com/sanskrit-kosha/kosha/blob/master/abhidhanaratnamala_halayudha/html/abhidhanaratnamala.html
JSON - https://github.com/sanskrit-kosha/kosha/blob/master/abhidhanaratnamala_halayudha/json/abhidhanaratnamala.json
MD - https://github.com/sanskrit-kosha/kosha/tree/master/abhidhanaratnamala_halayudha/md
STARDICT - https://github.com/sanskrit-kosha/kosha/tree/master/abhidhanaratnamala_halayudha/stardict

Metadata about the file:

;METADATA
;title{हलायुधकोशः (अभिधानरत्नमाला)}
;author{हलायुध}
;bookFullName{हलायुधकोशः (अभिधानरत्नमाला)}
;bookSeriesDetails{हिन्दी समिति प्रभाग ग्रन्थमाला - १५०}
;editor{जयशङ्करजोशी}
;editorQualifications{}
;publisher{विनोद चन्द्र पाण्डेय, निदेशक, उत्तर प्रदेश हिन्दी संस्थान, राजर्षि पुरुषोत्तमदास टण्डन हिन्दी भवन, महात्मा गांधी मार्ग, लखनउ - २२६००१}
;pressDetails{शिवम् प्रिन्टर्स, सी. २७/२७३, इण्डियन प्रेस कोलोनी, मलदहिया, वाराणसी - २२१००२}
;dataEntryBy{Dr. Dhaval Patel}
;dataEntryEmail{drdhaval2785@gmail.com}
;proofReadBy{Dr. Dhaval Patel}
;proofReaderEmail{drdhaval2785@gmail.com}
;annotatedBy{}
;annotatorEmail{}
;version{0.3.5}
;projectDetails{This project is aimed at creating a database and related software tools to access Indian koshas, both online and offline. The project is funded by generous donation of Shree Ramkrishna Knowledge Fondation, Surat.}
;projectWebPage{http://github.com/sanskrit-kosha/kosha}
;emailTo{drdhaval2785@gmail.com}
;description{}
;shortCode{ARMH}
;funding{Shree Ramkrishna Knowledge Foundation.}
;licence{GNU GPL v3.0}
;credits{1. SRKKF for funding. 2. www.archive.org for providing us the scanned book to digitize. 3. Dr. Dhaval Patel for spending time to proofread the data.}
;dataFormatDetails{See https://github.com/sanskrit-kosha/kosha/blob/master/docs/annotation_thoughts.md for details.}
;editorialChanges{}
;nymic{mixed}
;pagenum{true}
;linenum{false}
;chapterArrangements{kanda}
;newVerseNumbersAtChangeOf{never}
;newLineNumbersAtChangeOf{never}
;version0.0.1{25 June 2019}
;version0.0.2{25 June 2019}
;version0.0.3{01 July 2019}
;version0.0.4{01 July 2019}
;version0.1.0{02 July 2019}
;version0.2.0{14 January 2021}
;version0.2.1{17 January 2021}
;version0.2.2{17 January 2021}
;version0.3.0{17 January 2021}
;version0.3.1{17 January 2021}
;version0.3.2{17 January 2021}
;version0.3.3{17 January 2021}
;version0.3.4{17 January 2021}
;version0.3.5{17 January 2021}
;version0.3.6{}
;version1.0.0{}
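
The metadata block above uses one `;key{value}` pair per line. A minimal sketch of reading such lines into a dictionary follows; the function name `parse_metadata` and the sample list are illustrative assumptions, not part of the kosha scripts.

```python
import re

# Matches lines such as ";shortCode{ARMH}" or ";version0.0.1{25 June 2019}".
# The pattern is inferred from the metadata block above (an assumption).
META_LINE = re.compile(r"^;([A-Za-z0-9.]+)\{(.*)\}\s*$")

def parse_metadata(lines):
    """Collect ;key{value} lines into a dict; lines of any other shape are ignored."""
    meta = {}
    for line in lines:
        m = META_LINE.match(line)
        if m:
            meta[m.group(1)] = m.group(2)
    return meta

# A few lines copied from the block above.
sample = [
    ";METADATA",
    ";shortCode{ARMH}",
    ";version{0.3.5}",
    ";pagenum{true}",
    ";description{}",
]
meta = parse_metadata(sample)
print(meta["shortCode"], meta["version"])
```

Empty values such as `;description{}` come back as empty strings, and the bare `;METADATA` marker is skipped because it has no braces.
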
drdhaval2785 commented 3 years ago

The TXT file may be treated as the starting point of the whole process. All the other formats are generated by a quick-and-dirty script at https://github.com/sanskrit-kosha/kosha/blob/master/scripts/parse_data.py .

gasyoun commented 3 years ago

TXT file may be treated as the starting point of the whole process.

Oh, so a 3rd Sanskrit-Sanskrit dictionary. Great news we have today.

funderburkjim commented 3 years ago

@gasyoun You asked me a similar question regarding lanman recently. Do you remember where this question and my answer are?

funderburkjim commented 3 years ago

First steps: make dictionary code X (maybe X=ABD)

construct x.txt in the metaline format.

Get a set of scanned images of the dictionary -- one image (pdf) per page.

You will need the page numbers for the <pc> field of the metaline format of x.txt.

Let me take a look at x.txt to see if it looks conformant.
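
To make the steps above concrete, here is a minimal sketch of writing one entry in a metaline layout. It assumes each entry is a metaline of the form `<L>…<pc>…<k1>…<k2>…`, followed by body lines and a closing `<LEND>`, as seen in existing csl-orig files. The helper name, the headword and the page value are hypothetical, and the exact conventions for koshas are still to be decided.

```python
def make_entry(lnum, page, key1, key2, body_lines):
    """Return one entry: a metaline, the body lines, and a closing <LEND>.

    The field order <L>, <pc>, <k1>, <k2> is an assumption based on
    existing csl-orig digitizations; check a real file before relying on it.
    """
    metaline = f"<L>{lnum}<pc>{page}<k1>{key1}<k2>{key2}"
    return "\n".join([metaline, *body_lines, "<LEND>"])

# Hypothetical headword and page number, for illustration only.
entry = make_entry(1, "015", "agni", "agni", ["(body text of the entry)"])
print(entry)
```
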

gasyoun commented 3 years ago

Do you remember where this question and my answer are?

https://github.com/sanskrit-lexicon/hwnorm1/issues/17#issuecomment-753690269 7 places

drdhaval2785 commented 3 years ago

make dictionary code X (maybe X=ABD)

Can I keep it 4 letters please? There are plenty of dicts starting with abhidhAna... / ekAkshara... / nAnArtha.... etc. Empirically I have come to a conclusion that 4 letters are necessary to identify a dictionary unambiguously.

I would keep ARMH for abhidhAnaratnamAlA of Halayudha.

drdhaval2785 commented 3 years ago

construct x.txt in the metaline format.

This is the tricky part. Sanskrit dictionaries differ in structure from Western ones. I will open a couple of separate issues to decide the metaline format for Sanskrit koshas.

drdhaval2785 commented 3 years ago

Get a set of scanned images of the dictionary -- one image (pdf) per page.

Will arrange for it. Not that difficult. Just need to split a pdf into multiple pdfs.

drdhaval2785 commented 3 years ago

You will need the page numbers for the <pc> field of the metaline format of x.txt.

Each new page is explicitly coded. For example, ;p{0015} denotes the start of page 15. So this should be doable.
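
A minimal sketch of how the `;p{0015}` markers could be turned into a page number for each line, so that it can later feed the `<pc>` field. The function name and the sample lines are illustrative assumptions.

```python
import re

# A page marker such as ";p{0015}" at the start of a line.
PAGE_MARK = re.compile(r"^;p\{(\d+)\}")

def lines_with_pages(lines):
    """Yield (page, line) pairs; page is taken from the most recent ;p{NNNN} marker.

    Marker lines themselves are not yielded. Lines before the first
    marker get page None.
    """
    page = None
    for line in lines:
        m = PAGE_MARK.match(line)
        if m:
            page = int(m.group(1))
            continue
        yield page, line

# Hypothetical input: two pages with placeholder verse lines.
sample = [";p{0015}", "verse a", "verse b", ";p{0016}", "verse c"]
pages = list(lines_with_pages(sample))
print(pages)
```
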

funderburkjim commented 3 years ago

4 letters?

Yes, sure. E.g., we have MW72 and AP90.

funderburkjim commented 3 years ago

split a pdf into multiple pdfs

You will need to name the pdfs in some systematic way. At a later point, you will need to create a pdffiles.txt file that links the pages as in <pc>X to relative file names. For instance, lan pdffiles.txt.
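
A minimal sketch of one systematic naming scheme and a matching page-to-file listing. Both the `pg_NNNN.pdf` file names and the `page:filename` line layout are assumptions made for illustration; compare with an existing pdffiles.txt (such as the lan one mentioned above) for the layout actually expected.

```python
def pdffiles_lines(first_page, last_page, prefix="pg_"):
    """Return one 'page:filename' line per page, with zero-padded page labels.

    The line layout and the file-name prefix are hypothetical.
    """
    lines = []
    for page in range(first_page, last_page + 1):
        label = f"{page:04d}"
        lines.append(f"{label}:{prefix}{label}.pdf")
    return lines

listing = pdffiles_lines(1, 3)
print("\n".join(listing))
```

Zero-padding the labels keeps the files in page order when sorted by name, and matches the four-digit style of the `;p{0015}` markers.
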

funderburkjim commented 3 years ago

Do you remember where ...

Thanks - that hwnorm1 issue was the one of interest.

drdhaval2785 commented 3 years ago

https://github.com/sanskrit-kosha/kosha/blob/master/abhidhanaratnamala_halayudha/cologne/abhidhanaratnamala.txt is the file of the new dictionary in Cologne-compliant format. I have also split the PDF into single-page PDFs. I am not sure where to put the scanned page PDFs.

funderburkjim commented 3 years ago

Is your dictionary code ABDR? I'll assume it is X below.

Where to put images for dictionary X:

Three places:
1) In the sanskrit-lexicon-scans organization, in a new repository 'x' (lower case).
2) At Cologne. The location must agree with the $cologne_pdfpages_urls table in the dictinfo.php file, which appears in two places:

drdhaval2785 commented 3 years ago

Is your dictionary code ABDR?

It is ARMH.
The uppercase letters show how the dictionary code is derived: AbhidhanaRatnaMala_of_Halayudha -> ARMH.
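
The derivation can be sketched in a couple of lines. The function name is hypothetical, and the approach only works when the name is written with exactly the intended capitals.

```python
def short_code(name):
    """Build a dictionary code from the uppercase letters of a name."""
    return "".join(ch for ch in name if ch.isupper())

code = short_code("AbhidhanaRatnaMala_of_Halayudha")
print(code)
```
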

drdhaval2785 commented 3 years ago

I went through the process as I understood it. My local machine has the dictionary ARMH up and running.

I have noted the whole process for local installation of a new dictionary in https://github.com/sanskrit-lexicon/COLOGNE/blob/master/readme_new_dict_addition.md

I am not sure what additional steps are required to host it on the Cologne servers. @funderburkjim may update the instructions further.

Then, we will have a new Sanskrit-Sanskrit dictionary at Cologne.

gasyoun commented 3 years ago

Then, we will have a new Sanskrit-Sanskrit dictionary at Cologne.

Hurray, the 3rd one.

drdhaval2785 commented 3 years ago

Yes. And further dictionaries may keep pouring in. It is only a matter of identifying headwords in the verses. 25 pages per day seems a reasonable goal for annotating dictionaries. I could complete the annotation of the 100-page Halayudhakosha in 4 days or so, working part time.

gasyoun commented 3 years ago

I could complete a 100 page dictionary annotation of Halayudhakosha in 4 days or so, doing part time.

Sounds reasonable. You've done all you can for API and asked all the questions, right?

funderburkjim commented 3 years ago

I don't see the new ARMH in csl-orig.

There are still several steps to complete to get armh completely installed.

Are you planning to completely install armh before proceeding with other dictionaries?

funderburkjim commented 3 years ago

Good to have your documentation.

How did you handle metaline in your local implementation? Please send me a link to your local csl-orig/v02/armh.txt so I can duplicate it on my machine.

There are a few more steps to get full installation.
But let's defer these steps until we agree on metaline issues as discussed in #338 .

funderburkjim commented 3 years ago

Here are some additional files that need to be updated (and have been so updated) for new dictionary armh:

csl-websanlexicon/v02/redo_xampp_all.sh
csl-websanlexicon/v02/redo_cologne_all.sh
csl-websanlexicon/v02/makotemplates/web/webtc/dictinfo.php
csl-apidev/dictinfo.php
csl-apidev/sample/dictnames.js
csl-apidev/simple-search/v1.1/parse_uri.php
hwnorm1/sanhw1/sanhw1.py

And various redo scripts need to be run.
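
As a rough aid for the list above, the following sketch reports which of those files are absent or do not yet mention a given dictionary code. It assumes all the repositories are checked out side by side under one root directory. The function name is hypothetical, and a textual mention is only a heuristic: it does not prove the file was updated correctly.

```python
from pathlib import Path

# Paths copied from the list above, relative to a common root (an assumption).
FILES_TO_UPDATE = (
    "csl-websanlexicon/v02/redo_xampp_all.sh",
    "csl-websanlexicon/v02/redo_cologne_all.sh",
    "csl-websanlexicon/v02/makotemplates/web/webtc/dictinfo.php",
    "csl-apidev/dictinfo.php",
    "csl-apidev/sample/dictnames.js",
    "csl-apidev/simple-search/v1.1/parse_uri.php",
    "hwnorm1/sanhw1/sanhw1.py",
)

def files_missing_code(root, code, files=FILES_TO_UPDATE):
    """Return the listed files that are absent or do not mention the code."""
    missing = []
    for rel in files:
        path = Path(root) / rel
        if not path.is_file():
            missing.append(rel)
            continue
        text = path.read_text(encoding="utf-8", errors="replace")
        # Case-insensitive search, since both 'armh' and 'ARMH' occur in this thread.
        if code.lower() not in text.lower():
            missing.append(rel)
    return missing

# Run from the directory that contains the checked-out repositories.
print(files_missing_code(".", "armh"))
```
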

funderburkjim commented 3 years ago

To get armh on the homepage (not yet done by me):

Edit csl-homepage/index_cologne.py and csl-homepage/index_xampp.py.

Then, to install, run sh redo_xampp.sh for a local xampp installation, or sh redo_cologne.sh for the Cologne installation.

funderburkjim commented 3 years ago

csl-orig/v02/armh/armhheader.xml needs to be filled out.

I've also got an empty file for lanheader.xml that needs to be filled out.

funderburkjim commented 3 years ago

If there are any 'Front matter' pages for armh, they need to be added via csl-doc, and the csl-doc rebuilt with sphinx.

Another step regards scans -
I pulled your images (from sanskrit-lexicon-scans/armh/ -- where you put them) into cologne at scans/ARMHScan/2020/web/ and put them into the pdfpages directory.

As mentioned above, this location has to be added in several places for the displays to know where to look.

gasyoun commented 3 years ago

Thanks to @Andhrabharati as a humble gift to @drdhaval2785

If there are any 'Front matter' pages for armh

Abhidhānaratnamālā by Halāyudha (c. 950 AD)

The Abhidhānaratnamālā is a vocabulary of small extent containing about 900 stanzas and is divided into five kāṇḍas or sections as follows: 1. svarkāṇḍa, 2. bhūmikāṇḍa, 3. pātālakāṇḍa, 4. sāmānyakāṇḍa, and 5. anekārthakāṇḍa. The first four of these deal with synonyms while the last is devoted to homonyms and the indeclinables. The genders are indicated by giving the declensional forms. The work does not treat of the genders so strictly as the Amarakośa although in other respects it generally follows the latter, and is composed in a variety of metres.

Halāyudha, the author of the present lexicon, is said to have flourished by the middle of the tenth century. R.G. Bhandarkar [1] identified him with the author of the Kavirahasya, a grammatical work written in honour of king Kṛṣṇa III (c. AD 940-56) of the Rāṣṭrakūṭa family. [2] Halāyudha is also supposed to be the author of the three works viz. 1. Abhidhānaratnamālā, 2. Kavirahasya, and 3. Mṛtasañjīvanī, a commentary on the Chandaḥsūtras of Piṅgala. The last is said to have been written in the reign of king Muñja Vākpati of Dhārā. [3] It must be noted here that Aufrecht [4] regards the three Halāyudhas as quite distinct and separate persons; while in the India Office Catalogue [5] the authors of the Abhidhānaratnamālā and the Kavirahasya are regarded as identical and the author of Mṛtasañjīvanī as a different person. Weber, on the other hand, places the author of the present lexicon at the end of the eleventh century.

The divergent views regarding the date of Halāyudha and his works are recorded above. For want of detailed information it is not possible at this stage to come to any definite conclusion. In the light of the evidence which is available to us we have to agree with R.G. Bhandarkar and other scholars who place Halāyudha in the middle of the tenth century and ascribe to him the authorship of the three works mentioned above.

[1] Report in Search of Manuscripts for 1883-84, p. 9.
[2] Keith, History of Sanskrit Literature, 133.
[3] Kalpadrukośa, Introduction, xxvi.
[4] Cat. Cat., i, 764b.
[5] II, pt. ii, p. 1840.

Among his authorities Halāyudha mentions Amaradatta, Vararuci, Bhāguri, and Vopālita. [1] So far no commentaries on the Abhidhānaratnamālā are available either in print or in manuscript form. Aufrecht, [2] however, records one commentary by Ajaḍa, which is also recorded by Bühler in his Catalogue of Manuscripts from Gujarat, III (1872), p. 34. One Halāyudhaṭīkā is cited by Vallabhagaṇi in his Sāroddhāra, which is itself a commentary on the Abhidhānacintāmaṇi of Hemacandra. [3] It is, however, doubtful whether the reference to the Halāyudhaṭīkā in Vallabhagaṇi's commentary is to the commentary on the Abhidhānaratnamālā.

Cat. Cat., i, 24; ii, 5; iii, 6; AISM (Madras), nos. 891-5. (nos. 894-5 are said to be the works on medicine. This appears to be doubtful).

funderburkjim commented 3 years ago

Is the worldcat reference the edition that @drdhaval2785 used for scans?

There are several mentions of 'halayudha's kosha' in archive.org.

In all the other dictionaries at Cologne, Thomas started with scanned images from a particular print edition. Then he and his 'Sanskrit typists' made the digitization from the scanned images. And the 'Front matter' consists of the front matter in the particular print edition.

I suspect Dhaval's process was different for armh.

What is the original source of the Devanagari digitization?

Is this source clearly tied to a particular print edition?

The insert above (from history of Sanskrit Lexicography) would also be a good item to put in ARMH's 'front matter' in csl-doc, even though it is not exactly 'front matter'.

gasyoun commented 3 years ago

The insert above (from history of Sanskrit Lexicography) would also be a good item to put in ARMH's 'front matter' in csl-doc, even though it is not exactly 'front matter'.

Exactly, but all of Dhaval's Indian Sanskrit-Sanskrit dictionaries will need one.

Andhrabharati commented 3 years ago

history

@gasyoun Good to see that this book is put to use so quickly.

Andhrabharati commented 3 years ago

And may I mention here that I am in possession of a vast collection on the subject matter (I would say THE single place; it is not available like this anywhere else)!!

drdhaval2785 commented 3 years ago

csl-orig/v02/armh/armhheader.xml needs to be filled out.

I know. That was not essential, so kept it blank. Will fill it soon.

Is the worldcat reference the edition that @drdhaval2785 used for scans?

No. https://archive.org/details/halayudhakoshajayasankarjoshi1957_134_L/mode/2up is the scan I used for digitization. I could not locate the worldcat reference to this edition of Halayudhakosha.

I suspect Dhaval's process was different for armh. What is the original source of the Devanagari digitization? Is this source clearly tied to a particular print edition?

Dhaval's process was not different. It was based on the print edition linked above. The digitization was done by 'Dhaval and his Sanskrit typists / volunteers'.

csl-doc

For the time being, I have the pdf scan available for the front matter, but I do not have a text file for the same. I have not digitized the prefaces yet. If that is necessary, it can be typed in. Not a very long one. Two pages only.

history of Sanskrit Lexicography

It is a good addition to have, over and above the regular prefaces.

drdhaval2785 commented 3 years ago

Great to see ARMH working at https://www.sanskrit-lexicon.uni-koeln.de/scans/ARMHScan/2020/web/webtc2/index.php

Still to make it to the home page, but great to see that it is working.

gasyoun commented 3 years ago

I am in possession of a vast collection on the subject matter

I'm ready to share the burden with you ))

If that is necessary, it can be typed in. Not a very long one. Two pages only.

Sure. And an English translation of it?

https://www.sanskrit-lexicon.uni-koeln.de/scans/ARMHScan/2020/web/webtc2/index.php

kṛpīṭayonirdamunāḥ kṛṣṇavartmāśuśukṣaṇiḥ . vibhāvasurapāṃpittaṃ jātavedāstanūnapāt .. 63 ..

Do we really want to see the .. in IAST mode?

drdhaval2785 commented 3 years ago

I do not think that double periods would have any issue.
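
Should the periods ever be judged distracting, a display-side substitution would be one option. The sketch below assumes that a stand-alone `.` and `..` in the IAST line stand for the daṇḍa and double daṇḍa, and replaces them with `|` and `||`. The function name is hypothetical, and nothing in this thread says such a step is planned.

```python
import re

def show_dandas(text):
    """Replace stand-alone '..' with '||' and stand-alone '.' with '|'.

    'Stand-alone' means surrounded by whitespace or the ends of the string,
    so periods inside words or abbreviations are left untouched.
    """
    text = re.sub(r"(?<!\S)\.\.(?!\S)", "||", text)
    return re.sub(r"(?<!\S)\.(?!\S)", "|", text)

# The verse quoted above.
verse = ("kṛpīṭayonirdamunāḥ kṛṣṇavartmāśuśukṣaṇiḥ . "
         "vibhāvasurapāṃpittaṃ jātavedāstanūnapāt .. 63 ..")
print(show_dandas(verse))
```
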

drdhaval2785 commented 3 years ago

some additional files that need to be updated

May I request @funderburkjim to write it down in https://github.com/sanskrit-lexicon/COLOGNE/blob/master/readme_new_dict_addition.md so that the documentation gets updated at a single place.

Andhrabharati commented 3 years ago

I would suggest adding the preface from the original Aufrecht edition (1861) as well.

This is in line with my comment elsewhere, to have all the "related" information at one place (to the maximum extent possible).

Andhrabharati commented 3 years ago

@drdhaval2785 Probably I should help you in this exercise of "adding new dictionaries at Cologne" as well.

I have seen some errors in the data, such as viprasUna (in the list) and visaprasUna (in the original verse), both for bisaprasUna.

Together we could see to it that "good data" is made available to the public (I see no one else who could join hands in the process).

One dictionary a month (or maybe every other month, so that our other work is not much affected) would be an achievable target, with the original digitised texts in our possession (yours in the public domain and mine in a "closed box" as of now).

drdhaval2785 commented 3 years ago

I would be highly obliged if you can help us in the data correction. One dict a month seems a reasonable goal.

Andhrabharati commented 3 years ago

Good, it's a deal now.

Will look at your (raw) data file in your own repo, once I post MW Annexure 1st phase.

And I guess this correction work can be done separately at your repo, and you can then process the files further to bring them here under the Cologne "framework".

In essence, yours would be the place for "data warehousing" & Cologne's would be the place for public presentation.

drdhaval2785 commented 3 years ago

Yes. All corrections happen in my GitHub repo, of which you are also a member. The Cologne files will be updated via scripts.

gasyoun commented 3 years ago

In essence, yours would be the place for "data warehousing" & Cologne's would be the place for public presentation.

Makes sense.

funderburkjim commented 3 years ago

Dhaval and his Sanskrit typists / volunteers

Good to know about that resource.

I have not digitized the prefaces yet.

csl-doc (based on sphinx) can handle image files (e.g. for BOR).

Probably csl-doc (sphinx) can NOT handle PDF pages.

The syntax for handling images in sphinx is a bit awkward.

request documentation gets updated at a single place.

Items 12-18 added to readme_new_dict_addition as requested

gasyoun commented 1 year ago

@funderburkjim so to add the Russian dictionaries, should I try the steps Dhaval followed myself? Still, one of them is a trilingual dictionary, and I do not understand what the markup should actually look like.

funderburkjim commented 1 year ago

what the markup should actually look like.

The first step is to create the 'xxx.txt' file which would go into csl-orig; this is step 2 in 'readme_new_dict_addition' link mentioned above.

The format of xxx.txt can be seen 'by example'.

Take a look at some of the digitizations in csl-orig repository. For example md.txt.

You will need to convert the current form of your dictionary into the xxx.txt form.
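
Once a candidate xxx.txt exists, a rough structural check can catch obvious problems before anyone reviews it by eye. The sketch below assumes each entry starts with a metaline beginning with `<L>` and containing `<pc>`, `<k1>` and `<k2>`, and ends with a `<LEND>` line, as in existing csl-orig files. The function name and sample data are hypothetical, and passing this check does not make a file conformant.

```python
def check_entries(lines):
    """Return a list of (line_number, message) problems; empty if none are found."""
    problems = []
    open_at = None  # line number of a metaline still waiting for its <LEND>
    for n, line in enumerate(lines, start=1):
        if line.startswith("<L>"):
            if open_at is not None:
                problems.append((n, f"new <L> before <LEND> for entry at line {open_at}"))
            for field in ("<pc>", "<k1>", "<k2>"):
                if field not in line:
                    problems.append((n, f"metaline lacks {field}"))
            open_at = n
        elif line.startswith("<LEND>"):
            if open_at is None:
                problems.append((n, "<LEND> without a preceding <L>"))
            open_at = None
    if open_at is not None:
        problems.append((open_at, "entry never closed with <LEND>"))
    return problems

# Hypothetical entries: one well formed, one missing <pc> and <LEND>.
good = ["<L>1<pc>015<k1>agni<k2>agni", "body", "<LEND>"]
bad = ["<L>1<k1>agni<k2>agni", "body"]
print(check_entries(good))
print(check_entries(bad))
```
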

This will probably require some guidance from @drdhaval2785 or me.

First, choose which dictionary you want to focus on.

Please provide a link to a pdf of this dictionary.

Also, provide a link to the text file which contains the current form of your digitization. For sake of this discussion, let's call this xxx_orig.txt. Then the first task is to convert xxx_orig.txt to xxx.txt.

If you don't yet have xxx_orig.txt, then typing that xxx_orig.txt is the first step.

Since each dictionary has its own peculiarities, we will need to see a particular dictionary in order to make specific suggestions.