sanskrit-lexicon / COLOGNE

Development of http://www.sanskrit-lexicon.uni-koeln.de/

Integrate Cologne dictionaries with stardict-updater #98

Closed drdhaval2785 closed 3 years ago

drdhaval2785 commented 7 years ago

It is a bit ambitious. It would be great if we can convert all dictionaries (of course in descending order of importance) to babylon format. We will then be able to use them as stardict files. I personally use it on mobile on colordict and am deeply in love with the ease. https://github.com/sanskrit-coders/stardict-sanskrit/

Babylon format is a simple enough format.

Line 1 - headwords separated by |
Line 2 - Dictionary entry
Line 3 - line break
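
To make the three-line layout concrete, here is a minimal Python sketch of writing such records. The function names are hypothetical, and replacing newlines inside an entry with `<br>` is an assumption made to keep each entry on one line, not something stated above.

```python
def babylon_entry(headwords, definition):
    """Return one record: headwords joined by '|', the entry, then a blank line."""
    # Assumption: an entry must fit on one line, so embedded newlines
    # are replaced with an HTML line break.
    body = definition.replace("\n", "<br>")
    return "|".join(headwords) + "\n" + body + "\n\n"


def write_babylon(entries, path):
    """Write (headwords, definition) pairs to a Babylon-style text file."""
    with open(path, "w", encoding="utf-8") as handle:
        for headwords, definition in entries:
            handle.write(babylon_entry(headwords, definition))


record = babylon_entry(["dharma", "Darma"], "law, duty\nvirtue")
```

Here `record` holds a single three-line record with two alternative headwords; `write_babylon` would simply concatenate such records into one file.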

Currently stardict-dictionary-updater project uses some of the dictionary data from Cologne (scraped maybe some years back). This means their users are not benefiting from corrections being made day in and day out at Cologne servers.

If we integrate seamlessly, both will benefit.

  1. Dictionaries will have wider circulation (also on android / mobiles etc)
  2. Dictionaries will be better updated.

@gasyoun and @funderburkjim What do you think about this?

Not much work compared to the benefit.

drdhaval2785 commented 7 years ago

We can make this babylon generation stuff a shell script which can regenerate babylon files based on current DICT.xml file from our 36 dictionaries.

funderburkjim commented 7 years ago

I like the idea.

Are there some details regarding the format of the dictionary entry ?

Not sure what 'page break' means --- given that it pertains to headword entries, it probably does not literally refer to 'page' breaks.

Coincidentally, I got this email from someone who is a Mac user:

I saw someone who had loaded the MW data into the resident Mac dictionary.  What file would I use to load into it?  What format?  There are a number of formats available online on the website.

If you have a set of instructions on how to do this, could you pass them onto me?

Would the conversion you are talking about apply ?

drdhaval2785 commented 7 years ago

On 24 Feb 2017 05:57, "funderburkjim" notifications@github.com wrote:

I like the idea.

Are there some details regarding the format of the dictionary entry ?

Stardict format is what it is called. Vishvas has a converter which converts babylon files to stardict files, with some Sanskrit-specific preprocessing such as also converting headwords to ITRANS transliteration, so that search via both Devanagari and roman letters is possible.

Not sure what 'page break' means --- given that it pertains to headword entries, it probably does not literally refer to 'page' breaks.

My error. Read line break.

Coincidentally, I got this email from someone who is a Mac user:

I saw someone who had loaded the MW data into the resident Mac dictionary. What file would I use to load into it? What format? There are a number of formats available online on the website.

If you have a set of instructions on how to do this, could you pass them onto me?

Would the conversion you are talking about apply ?

Yes. All he needs to have is a stardict dictionary reader, such as Colordict or Goldendict. Currently I have nearly 44 dictionaries etc. loaded on my mobile. It is damn fast. And the best thing is that it works without internet too.

This format is so versatile that it can be used for non-dictionary books too. E.g. Paninian grammar books can have sUtra|sUtranumber instead of headwords and the explanation instead of the dictionary entry. Then I can type 1.3.1 and see all commentaries on that sUtra.

If you agree, I shall try to create some generic script to generate babylon files from the .xml files of Cologne. Mostly the xpaths would vary. They can be passed as arguments.
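
A rough Python sketch of what such a generic script could look like, with the paths passed in as parameters. The element names in the sample (`H1`, `h`, `key1`, `body`) and the function name are assumptions for illustration only; each real Cologne XML file would need its own paths. Note that `xml.etree.ElementTree` supports only a limited XPath subset.

```python
import xml.etree.ElementTree as ET


def xml_to_babylon(xml_text, entry_path, headword_path, body_path):
    """Convert dictionary XML to Babylon text using caller-supplied paths."""
    root = ET.fromstring(xml_text)
    records = []
    for entry in root.findall(entry_path):
        headwords = [h.text.strip() for h in entry.findall(headword_path) if h.text]
        body = entry.find(body_path)
        if not headwords or body is None:
            continue  # skip entries that lack a headword or a body
        # Flatten nested markup to plain text and normalise whitespace.
        text = " ".join("".join(body.itertext()).split())
        records.append("|".join(headwords) + "\n" + text + "\n\n")
    return "".join(records)


# Hypothetical sample; real files may be structured differently.
SAMPLE = """<dict>
<H1><h><key1>Da</key1></h><body>aspirate of the preceding letter</body></H1>
<H1><h><key1>DakAra</key1></h><body>m. the letter or sound <s>D</s></body></H1>
</dict>"""

result = xml_to_babylon(SAMPLE, "./H1", "./h/key1", "./body")
```

In a command-line version the three paths would come from the arguments (for instance via `argparse`), so that one script could serve several dictionaries.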

funderburkjim commented 7 years ago

If you agree, I shall try to create some generic script to generate babylon files from the .xml files of Cologne

I definitely agree. Maybe start with a 'small' dictionary, say McDonnell. Don't worry too much about generic at first. You obviously know a lot more about this than I do. Seeing the details of a specific example will help me get up to speed.

funderburkjim commented 7 years ago

Some time ago, Gerard Huet mentioned that he had done something along these lines.

Here is a link to his GoldenDict page.

drdhaval2785 commented 7 years ago

Yes, his dictionary is available in the updater.

funderburkjim commented 7 years ago

Q: what is 'updater'?

funderburkjim commented 7 years ago

A search for 'python stardict' gives several resources. Most appear to be for parsing a stardict file.

This one appears to deal with creating stardict file:

https://pypi.python.org/pypi/penelope/3.0.0.1

It might require an ubuntu-type setup to fully use.

drdhaval2785 commented 7 years ago

https://play.google.com/store/apps/details?id=sanskritcode.sanskritdictionaryupdater&hl=en

Gave the link in first post.

In a way, it updates the stardict files on your local device whenever you click on it.

gasyoun commented 7 years ago

Not much work compared to the benefit.

Agree. Do not see many possible issues.

It is damn fast.

Sure that's important. Is word order devanagarish?

I shall try to create some generic script to generate babylon files from the .xml files of Cologne. Mostly the xpaths would vary. They can be passed as arguments.

Makes sense to me.

funderburkjim commented 7 years ago

I came across a project ca April 2008 where an mdict version of MW was created. This was for the now obsolete windows mobile phone. If anyone thinks it might be of interest as a model for current conversions, here is mdict.zip. It is about 19mb. The programs are written in Perl.

funderburkjim commented 7 years ago

Here is readme.txt for mdict project.

gasyoun commented 7 years ago

Regarding the Babylon format, @drdhaval2785, I found out that there are people who use an MW version from 2004 that is split as documented below. It is similar to Babylon in some way, I guess?


------------------------------------
===> [dha]1[dha]1 aspirate of the preceding letter , 
------------------------------------
===> [dhakAra]3[dha--kAra] m. the letter or sound [dh] 
------------------------------------
===> [dha]1[dha]2 mf([A])n. (root 1. [dhA] 

---> cf.2. [dhA]) ifc. placing , putting 

---> holding , possessing , having 

---> bestowing , granting , causing &c. (cf. [a-doma-dha] , [garbha-dha]) 

---> m. N. of Brahmaa or Kubera cf. L 

---> (in music) the 6th note of the gamut 

---> virtue , merit cf. L 

---> n. wealth , property cf. L 

---> ([A]) , f. in 2. [tiro-dhA] 

---> [dur-dhA] (qq. vv.) 
------------------------------------
===> [dhak]1[dhak]1 nom. fr. [dagh] or [dah] (cf. [dakSiNa-dagh] and [uza-dah]) 
------------------------------------
===> [dhak]1[dhak]2 an exclamation of wrath cf. Uttarar. iv , 23 
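
For comparison, here is a minimal Python sketch that reads this split layout into (headword, senses) pairs, which could then be written out in Babylon form. The reading of the markers is an assumption drawn only from the sample above: a line of dashes separates records, `===>` opens a record, and `--->` continues it.

```python
import re

# The first bracketed token on the '===>' line is taken as the headword.
HEAD = re.compile(r"^===> \[([^\]]+)\]")


def parse_split_mw(text):
    """Return a list of (headword, [sense lines]) from the dashed layout."""
    records = []
    for chunk in re.split(r"^-{5,}[ \t]*$", text, flags=re.MULTILINE):
        lines = [ln.strip() for ln in chunk.splitlines() if ln.strip()]
        if not lines:
            continue
        match = HEAD.match(lines[0])
        if not match:
            continue  # not a record in the expected shape
        senses = [re.sub(r"^(===>|--->)\s*", "", ln) for ln in lines]
        records.append((match.group(1), senses))
    return records


SAMPLE = (
    "------------------------------------\n"
    "===> [dha]1[dha]1 aspirate of the preceding letter , \n"
    "------------------------------------\n"
    "===> [dha]1[dha]2 mf([A])n. (root 1. [dhA] \n"
    "\n"
    "---> cf.2. [dhA]) ifc. placing , putting \n"
    "\n"
    "---> holding , possessing , having \n"
)

records = parse_split_mw(SAMPLE)
```

Each pair maps onto a Babylon record by putting the headword on the first line and joining the senses on the second.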

funderburkjim commented 7 years ago

Is there an official documentation of the stardict format?

I have done several searches but have not found a definitive specification.

drdhaval2785 commented 7 years ago

Automatized way

If you prefer to use the auto-updater on an android phone, follow these steps. The example presumes that you want to download the MacDonell dictionary.

  1. Install Stardict Dictionary Updater from play store.
  2. Install colordict from play store.
  3. Open Stardict Dictionary Updater on phone.
  4. Select only sanskrit-coders/stardict/sanskrit/master/sa-head/en-entries. Deselect others.
  5. Click PROCEED.
  6. In the next step, keep macdonell__timestamp__size.tar.gz selected. Deselect others.
  7. Click PROCEED.
  8. Check the log. Usually it should say - Succeeded on ......
  9. QUIT the application.
  10. Open Colordict.
  11. Click on Folder structure in the upper right corner.
  12. Click on the three vertical dots in the upper right corner.
  13. Click on Reindex.
  14. It will reindex dictionary for you.
  15. Once it is over, the dictionary is ready for you to use. Write the headword you want to see.

Manual way

https://github.com/sanskrit-coders/stardict-sanskrit/tree/master/sa-head/en-entries/macdonell has all the files, if you want to test them out separately in stardict / colordict for computers.

Where are other dictionaries

Navigate to various subfolders of https://github.com/sanskrit-coders/stardict-sanskrit/.

  1. sa-head holds dictionaries which have Sanskrit headword.
  2. en-head holds dictionaries which have English headword.
  3. sa-Ayurveda, sa-vyAkaraNa hold the dictionaries / books from those specialities.

gasyoun commented 7 years ago

I am able to see the dictionary working well on my android phone colordict app.

Hurray, congratulations, Dhaval! Let me test it and make a video even, so others can follow as well.

drdhaval2785 commented 7 years ago

@funderburkjim

NOTE

MD, BEN, AP90 and ACC are added in the stardict updater. Please follow the above steps and check out the results. There are a lot of improvements possible. Post an issue for any bug / feature.

funderburkjim commented 7 years ago

@drdhaval2785 I got 'Failed on macdonell__2017-04-01_10-43-35.tar'

Do I need to have an sd card (I'm using a ZenPad android tablet)?

drdhaval2785 commented 7 years ago

Check in

Internal storage/dictdata/macdonell

Do you have any files or not?

If there are files there, try reindexing. It sometimes gets resolved on reindexing.

funderburkjim commented 7 years ago

I see internal storage >root>sdcard>dictdata, but no files.

Since the Instructions for updater mentioned SDCARD, it must be that an SDCARD is needed.

I'll have to get one, and try again.

drdhaval2785 commented 7 years ago

I don't have an SD card and it functions.

drdhaval2785 commented 7 years ago

BHS, BOP, BOR, BUR added now.

drdhaval2785 commented 7 years ago

CAE, CCS, GRA, GST, IEG, INM, MD, MW72, MWE added.

KRM, MCI, MW, PE were already there. Will need to examine the converted version v/s current cologne version. Will handle them later on.

drdhaval2785 commented 7 years ago

PD can't be converted / uploaded because of copyright issue.

gasyoun commented 7 years ago

PD can't be converted

Can be converted, but not for public access.

funderburkjim commented 7 years ago

Similarly for 'AP' as for PD - not publicly available.

funderburkjim commented 7 years ago

I finally got an sdcard. However, still get 'Failed on macdonell...' 😢

I was successful in getting the English wordnet dictionary in Colordict.

When I do the download step for MD, it takes a while (WORKING), and then gives the message "We're done. Get a stardict... Then reindex all your dictionaries", followed by "Failed on Macdonell...".

In the file system, root/sdcard/dictdata has a bunch of files, which all seem to be referring to English wordnet dictionary.

drdhaval2785 commented 7 years ago

Try something other than MD. Try AP90.

drdhaval2785 commented 7 years ago

I tried on my phone. Downloaded well.

drdhaval2785 commented 7 years ago

Also update your stardict dictionary updater to the latest version.

funderburkjim commented 7 years ago

There may be some subtle difference in my particular device - I have tried another (benfey) with same results.

Could I download the files to my pc, and then somehow transfer them to the ZenPad --- is such a workaround possible? If so, what would be the url for the file(s)?

funderburkjim commented 7 years ago

How do I update the stardict dictionary updater --- I'm ignorant of how to do things in Android.

drdhaval2785 commented 7 years ago

How do I update the stardict dictionary updater --- I'm ignorant of how to do things in Android.

  1. Go to play store
  2. Search for stardict dictionary updater.
  3. Click update.

drdhaval2785 commented 7 years ago

The reason I want you to update is that the latest version generates an error log for debugging. See https://github.com/sanskrit-coders/stardict-dictionary-updater/issues/11 for details.

You can send a crash report like this: https://m.youtube.com/watch?feature=youtu.be&v=whpoaqRbb-A

This will help improve things for similar device holders.

drdhaval2785 commented 7 years ago

https://github.com/sanskrit-coders/stardict-sanskrit?files=1

This is the place where data is stored. Most data is in sa-head and en-head subdirectory.

The structure is visible in stardict updater interface also.

https://github.com/sanskrit-coders/stardict-sanskrit/tree/master/sa-head/en-entries/macdonell is specific for MacDonell.

Download the dict.dz, idx, ifo and syn files.
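
When transferring the files by hand, a quick sanity check is to read the .ifo file, which is plain text: a fixed first line followed by key=value pairs. A minimal Python sketch follows; the function name is hypothetical and the values in the sample are made up for illustration, not the real MacDonell figures.

```python
def parse_ifo(text):
    """Parse the text of a StarDict .ifo file into a dict of its fields."""
    lines = text.splitlines()
    # The first line of an .ifo file is a fixed magic string.
    if not lines or lines[0].strip() != "StarDict's dict ifo file":
        raise ValueError("not a StarDict .ifo file")
    info = {}
    for line in lines[1:]:
        if "=" in line:
            key, value = line.split("=", 1)
            info[key.strip()] = value.strip()
    return info


# Illustrative content only; the numbers are invented.
SAMPLE_IFO = """StarDict's dict ifo file
version=2.4.2
bookname=macdonell
wordcount=20000
idxfilesize=400000
sametypesequence=h
"""

info = parse_ifo(SAMPLE_IFO)
```

If `bookname` and `wordcount` come back as expected, the download is probably intact, and the files can be copied into the dictdata folder on the device and reindexed.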

funderburkjim commented 7 years ago

Updated the updater. - Got the one dated today.

Now Benfey loads successfully. Hurray!

BTW. Saw something that looks like an ad on the Colordict page ('phone is in danger viruses found...blah blah..'). Is that normal with Colordict? With all Android apps?

drdhaval2785 commented 7 years ago

It is an ad. Ignore the bottom ads. That is the only drawback of colordict: in-app ads.

drdhaval2785 commented 7 years ago

The best way to overcome ads is to turn off internet. I do it often.

Goldendict free supports only 5 dicts. Goldendict paid is paid. So colordict remains a free option with unlimited dict support, with the small hitch of ads.

funderburkjim commented 7 years ago

Got it -- may get the paid version --

But this is just a detail. Main thing is that the dictionaries are there.

Way to go, Dhaval !!

One thing to keep in mind is that it would be possible to make an xml form that is optimized for stardict. Don't know whether this will be necessary, but we should be aware it is an option.

drdhaval2785 commented 7 years ago

I am planning to convert all existing dictionaries to stardict versions first. Then we will examine every dictionary systematically for the drawbacks or features required to modify it to look better or be more user friendly. At that time we can decide whether to do it in the XML step (your side) or the babylon step (my side) depending on feasibility.

The best scenario I envisage is to have a uniform DTD for all cologne dictionaries, so that XMLs are all predictable. Then I will not have to maintain separate patches for separate dictionaries. But that is currently way far off. So I am happy having different patches for different dictionaries.

Also the encoding should be uniform in dictionaries. XYZ tag will hold information in SLP1 only. ABC tag will hold information in IAST only etc.

funderburkjim commented 7 years ago

I'm definitely in agreement with the standardization goals you mention. This would help not only in your stardict constructions, but would also simplify the maintenance of the Cologne system.

One DTD to rule them all ! (goofy reference to Lord of the rings 'One ring to rule them all') :)

vvasuki commented 7 years ago

The best scenario I envisage is to have a uniform DTD for all cologne dictionaries, so that XMLs are all predictable.

+1!

gasyoun commented 7 years ago

But that is currently way far off. So I am happy having different patches for different dictionaries.

Exactly.

I'm definitely in agreement with the standardization goals you mention

It's about 8-10 months away at best, and some unique dictionary features will get lost. But sure, I'm for it, as long as we have a list of what has become what, before and after.

vvasuki commented 7 years ago

The best scenario I envisage is to have a uniform DTD for all cologne dictionaries, so that XMLs are all predictable.

A few suggestions while doing this:

drdhaval2785 commented 7 years ago

@funderburkjim and @gasyoun This comment https://github.com/sanskrit-coders/stardict-sanskrit/issues/18#issue-209950422 summarizes present state of affairs of stardict conversion from Cologne dictionaries.

Stats
Total - 36
Added - 26
Copyrighted - 02
Ready - 06 (There are visually enhanced versions already in use in stardict format.)
Pending - 02

vvasuki commented 7 years ago

Dream come true! Thanks, @drdhaval2785

vvasuki commented 7 years ago

Ready - 06 (There are visually enhanced versions already in use in stardict format.)

@drdhaval2785 Could you list these, so that we can be sure?

vvasuki commented 7 years ago

@drdhaval2785 Could you list these, so that we can be sure?

Never mind - just saw the table here: https://github.com/sanskrit-coders/stardict-sanskrit/issues/18

drdhaval2785 commented 7 years ago

@gasyoun and @funderburkjim

Just to give you an idea about the popularity of the dictionary downloads: they were uploaded on 17-Apr-2017, and as of 23-Apr-2017 the download counts are as below.

apte-1890__2017-04-02_23-18-23.tar.gz: 48 downloads
aufrecht-catalogus-catalogorum__2017-04-02_21-28-09.tar.gz: 23 downloads
benfey__2017-04-03_10-26-12.tar.gz: 35 downloads
Bohtlingk-and-Roth-Grosses-Petersburger-Worterbuch__2017-04-17_00-33-12__15MB.tar.gz: 22 downloads
Bohtlingk-Sanskrit-Worterbuch-in-kurzerer-Fassung__2017-04-17_07-13-34__7MB.tar.gz: 27 downloads
bopp__2017-04-04_08-43-21.tar.gz: 23 downloads
borooah__2017-04-03_23-51-58.tar.gz: 27 downloads
burnouf__2017-04-04_08-43-21.tar.gz: 22 downloads
capeller-sanskrit-english__2017-04-05_08-58-12.tar.gz: 26 downloads
capeller-sanskrit-german__2017-04-05_08-58-12.tar.gz: 23 downloads
edgerton-buddhist-hybrid__2017-04-04_08-43-21.tar.gz: 23 downloads
goldstucker__2017-04-05_08-58-12.tar.gz: 24 downloads
grassman-sanskrit-german__2017-04-05_08-58-12.tar.gz: 22 downloads
index-names-mahabharata__2017-04-05_18-52-42.tar.gz: 3 downloads
indian-epigraphical-glossary__2017-04-05_08-58-12.tar.gz: 39 downloads
kalpadruma-sa__2017-04-02_20-23-08.tar.gz: 3 downloads
macdonell__2017-04-05_08-58-12.tar.gz: 24 downloads
Meulenbeld-Sanskrit-Names-of-Plants__2017-04-17_08-29-39__0MB.tar.gz: 32 downloads
mw-1872__2017-04-05_08-58-12.tar.gz: 43 downloads
Personal-and-Geographical-Names-in-the-Gupta-Inscriptions__2017-04-16_22-21-43__0.tar.gz: 25 downloads
Schmidt-Nachtrage-zum-Sanskrit-Worterbuch__2017-04-17_07-16-11__1MB.tar.gz: 22 downloads
shabda-sAgara__2017-04-16_23-42-35__2.tar.gz: 47 downloads
Stchoupak-Sanscrit-French__2017-04-17_09-00-19__1MB.tar.gz: 24 downloads
vAchaspatyam-sa__2017-04-17_07-29-20__12MB.tar.gz: 22 downloads
Vedic-Index-of-Names-and-Subjects__2017-04-17_00-20-32__0.tar.gz: 26 downloads
wilson__2017-04-17_00-07-31__2.tar.gz: 25 downloads
yates__2017-04-17_00-12-54__1.tar.gz: 26 downloads
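
To rank or total a list like this, a small Python sketch is enough. It assumes each line has the shape name__timestamp[...]: N downloads, as in the listing above; the function name is hypothetical and only three of the lines are used as a sample.

```python
import re

# name, then '__', then a YYYY-MM-DD_HH-MM-SS stamp, then anything, then the count
LINE = re.compile(
    r"^(?P<name>.+?)__(?P<stamp>\d{4}-\d{2}-\d{2}_\d{2}-\d{2}-\d{2})"
    r".*: (?P<count>\d+) downloads$"
)


def parse_counts(text):
    """Map dictionary name to download count for each matching line."""
    counts = {}
    for line in text.splitlines():
        match = LINE.match(line.strip())
        if match:
            counts[match.group("name")] = int(match.group("count"))
    return counts


SAMPLE = """apte-1890__2017-04-02_23-18-23.tar.gz: 48 downloads
index-names-mahabharata__2017-04-05_18-52-42.tar.gz: 3 downloads
shabda-sAgara__2017-04-16_23-42-35__2.tar.gz: 47 downloads"""

counts = parse_counts(SAMPLE)
# Most downloaded first.
ranked = sorted(counts.items(), key=lambda item: item[1], reverse=True)
total = sum(counts.values())
```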

gasyoun commented 7 years ago

Just to give you an idea about the popularity of the dictionary downloads

Thanks, there seem to be 30 more junkies around :100: