sanskrit-lexicon / PWG

Boehtlingk und Roth Sanskrit Wörterbuch, 7 Bände Petersburg 1855-1875

[INDOLOGY] Upgrade to online Koln Bohtlink-Roth dictionary? #72

Open gasyoun opened 3 months ago

gasyoun commented 3 months ago

@funderburkjim @Andhrabharati the work is starting to get noticed! So I have a question: can we batch OCR the scans on our end with https://ocr.sanskritdictionary.com, with a little help from @martingluckman?

"Does anyone know if this is an ongoing project to make all the references in the B-R Grosse Worterbuch live (i.e. point to the actual page of the work referenced), and if this project also extends to other of the Koln on-line dictionaries?" What is the plan, and at what URL as of now? What is already covered? Even I miss part of the changelog.

Harry Spier via INDOLOGY
indology@list.indology.info

Dear list members,
I just looked up cint in the Koln online Bohtlink-Roth Grosses Worterbuch.
https://www.sanskrit-lexicon.uni-koeln.de/scans/PWGScan/2020/web/webtc2/index.php 
and I noticed that, among the references given for the different formations of cint listed, those to the Mahabharata, the Harivamsa, the Ramayana, and the Kathasaritsagara (but not references to other works) let you download a PDF of the image of the actual page of the work with the reference, just by clicking on the reference. Very impressive! I had not noticed that before.

Does anyone know if this is an ongoing project to make all the references in the B-R Grosse Worterbuch live (i.e. point to the actual page of the work referenced), and if this project also extends to other of the Koln on-line dictionaries?

What makes this especially useful is that the images are good enough to put into "Sanskrit OCR" (https://ocr.sanskritdictionary.com/) and get almost flawless digitization.

Andhrabharati commented 3 months ago

I don't think that's worth spending our time on at CDSL; getting the links to the scan pages is itself a big task, and Jim has taken it up with some support from my side.

That OCRing is best left to the interested people (if any!!).

I really doubt if anyone would venture the task and complete even a single book; people are just "making use" of the texts provided by open sources like GRETIL, Sanskritdocuments etc. (with whatever quality/drawbacks that they possess). No further improvement, nor any independent work!!

Andhrabharati commented 3 months ago

And I recall that not even a single step has been taken (at your end, @gasyoun) for "getting" the text out of the front pages of the CDSL works [which is a very practical & achievable task] that was talked about a few years back!!

gasyoun commented 3 months ago

"I don't think that's worth spending our time on at CDSL"

Disagree. Would want to discuss it with @martingluckman at a later stage.

funderburkjim commented 2 months ago

Encouraging to see that this feature of links to references is noticed by Gluckman.

@gasyoun I agree with AB that OCRing (getting the text out of) the Documentation Frontmatter scans would be an upgrade to that section at Cologne.

Andhrabharati commented 2 months ago

"Encouraging to see that this feature of links to references is noticed by Gluckman."

@funderburkjim

What Marcis said is that Harry Spier had noticed the linking feature, not Gluckman (whom Marcis wants to approach for helping in OCRing the full-works!)

funderburkjim commented 2 months ago

"if this is an ongoing project to make all the references in the B-R Grosse Worterbuch live"

Yes, at least for the 'major' PWG references.

I think I've put all of the 'link targets' here: https://github.com/orgs/sanskrit-lexicon-scans/repositories

This repo also contains copies of the scanned images for the dictionaries.

So someone interested in OCRing any of the link targets could clone one of these repos to get images of the individual pages.
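For anyone who wants to try this, here is a minimal sketch of collecting the page images from a local clone before feeding them to an OCR tool. It assumes one of the repos has already been cloned with `git clone`; the function name `list_page_files` and the set of file extensions are illustrative assumptions, not something taken from the repos themselves, so adjust them to whatever the chosen repo actually contains.

```python
from pathlib import Path

# Assumed extensions for scanned pages; the actual repos may use only some of these.
IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg", ".tif", ".tiff", ".pdf"}


def list_page_files(clone_dir, suffixes=IMAGE_SUFFIXES):
    """Return the page files found under a local clone, sorted by path.

    Files inside the .git directory are skipped.
    """
    root = Path(clone_dir)
    pages = []
    for path in root.rglob("*"):
        relative = path.relative_to(root)
        if ".git" in relative.parts:
            continue  # ignore git internals
        if path.is_file() and path.suffix.lower() in suffixes:
            pages.append(path)
    return sorted(pages)
```

Calling `list_page_files("path/to/clone")` gives an ordered list that can be handed, page by page, to whichever OCR service is chosen.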

funderburkjim commented 2 months ago

In fact, these GitHub repos are also used by the CDSL displays (e.g. of PWG) to 'serve' the images.

Andhrabharati commented 2 months ago

Just OCRing can be done in practically no time these days (courtesy Google); but the next phase, i.e. proofing the OCRed text to match the print, is the REAL task.

Andhrabharati commented 2 months ago

"if this is an ongoing project to make all the references in the B-R Grosse Worterbuch live"

"Yes, at least for the 'major' PWG references."

Is it not worth doing this, while we are at it, for all the works that exceed a count of 10k references?

And @funderburkjim should update the lsextract_pwg file (which seems to have been last updated on 13th Jan. 2023) again, which will have further members (extending the list that I mentioned at the KSS issue) joining the 10k+ club!

PS. If the Skt. lexicons are also to be covered, I can prepare 'the index files' for those as well (taking Jim's indexing for AK. as "done").

Andhrabharati commented 2 months ago

And also link the Indische Sprüche (1st ed.) scans, though the 2nd ed. has been already linked as a digital text.

Andhrabharati commented 2 months ago

"I don't think that's worth spending our time on at CDSL"

"Disagree. Would want to discuss it with @martingluckman at a later stage."

I am sure Jim cannot spend any time for this, and I WILL NOT (though I can do the proofing also, iff I take up the work); so you are welcome to get it done by any interested party, @gasyoun !!

gasyoun commented 2 months ago

@Andhrabharati I'm speaking of a dirty OCR, nonproofed

"proofing the OCRed text to match the print is the REAL task."

Andhrabharati commented 2 months ago

A simple script will do it, @gasyoun!

[And quite many of them are floating across the net.]

Andhrabharati commented 2 months ago

Looks like Suśruta, 1835-6 is the only other candidate coming into the 10k+ club!

Once this 'bound book' is split into its two constituent volumes [Vol.1 (1835): 378pp and Vol.2 (1836): 562pp, leaving the front 4 "title" pages in each volume], there is no need for any indexing for this work, as the references are just in the (volume, page, line) manner.

Very easy for Jim, just like in the case of the Verz. d. Oxf. H.!!
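To illustrate why no index file is needed, here is a rough sketch of the page arithmetic. It rests on assumptions that are not confirmed in this thread: that the 378 and 562 page counts exclude the 4 title pages, that the bound book holds Vol.1 followed by Vol.2, and that there is one scan per printed page. The names `TITLE_PAGES`, `VOLUME_PAGES` and `bound_scan_index` are made up for the example.

```python
# Assumed layout of the bound book:
#   4 title pages + 378 pages (Vol.1), then 4 title pages + 562 pages (Vol.2).
TITLE_PAGES = 4
VOLUME_PAGES = {1: 378, 2: 562}


def bound_scan_index(volume, page):
    """Map a (volume, page) reference to a 1-based position in the bound book."""
    if volume not in VOLUME_PAGES:
        raise ValueError(f"unknown volume: {volume}")
    if not 1 <= page <= VOLUME_PAGES[volume]:
        raise ValueError(f"page {page} is outside volume {volume}")
    offset = 0
    for vol in sorted(VOLUME_PAGES):
        offset += TITLE_PAGES  # skip this volume's title pages
        if vol == volume:
            return offset + page
        offset += VOLUME_PAGES[vol]  # skip the whole earlier volume
```

Under these assumptions, a reference to Vol.2, page 1 lands on position 387 of the bound book (4 + 378 + 4 + 1); the line number in the reference plays no part in locating the scan.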

gasyoun commented 2 months ago

"[And quite many of them are floating across the net.]"

Never seen one @Andhrabharati

"Looks like Suśruta, 1835-6 is the only other candidate coming into the 10k+ club!"

Where are the others?

Andhrabharati commented 2 months ago

"[And quite many of them are floating across the net.]"

"Never seen one @Andhrabharati"

Well, not everyone needs to know everything! You may just use places like Wikisource, ocr.sanskritdictionary.com, ambuda.org etc.

"Looks like Suśruta, 1835-6 is the only other candidate coming into the 10k+ club!"

"Where are the others?"

You mean the list of names? Look at my post above! If it is about the scans, they would come when Jim starts working for them!