sanskrit / raw_etexts

https://sanskrit.github.io/projects/text/

haravijaya proofreading #38

Open ppasedach opened 3 years ago

ppasedach commented 3 years ago

I would like to correct a few cantos of the Haravijaya OCR text, if that is welcome. I would start with canto 49, which is the last one missing in the incomplete e-text produced many years ago by Diwakar and Rabi Acharya. If it goes well, I would in due course correct a few more cantos. I then plan to convert them to IAST and integrate them into my digital critical edition of the Haravijaya (a work in progress, of course). I have started with the first few verses, but I am not sure how to encode the footnote markers, which, I am afraid, usually break in OCR. Should I just leave them out, so that whoever is interested in the variant readings has to consult the scan of the edition, or later the electronic critical edition? Adding the numbers to the running text would make the affected words effectively ungreppable. Or do you have a convention for that?

[Image: Selection_087]

वक्रारविन्दनिमितं करगाढसैद्ध-
पार्श्वद्वयं युधि बभार स पाञ्चजन्यम् ।
वैरिञ्चमण्डमिव निःश्वसितानिलोल-
पर्यस्यमानमुदरान्तरतो विनिर्यत् ॥ २ ॥

Here the ru (रु), together with footnote marker 3, was OCRed as sai (सै).

vvasuki commented 3 years ago

Namaste!

Very happy to know that you are proofreading the text. I've converted the text to markdown and separated chapters into sections for convenience. Please keep sending periodic pull requests.

Of course, leaving out variant readings is an option. Otherwise you can do one of the following:

I personally prefer this modification of the convention followed on the sanskritdocuments website:

स-शङ्ख-चक्रं सकिरीट-कुण्डलं  
सपीत-वस्त्रं सरसी-रुहेक्षणम् ।  
सहारवक्षःस्थल-कौस्तुभश्रियं+++(var  स्थलशोभिकौस्तुभं)+++    
नमामि विष्णुं शिरसा चतुर्-भुजम् ॥ ६॥  

This renders as: [image]
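
On the grep concern raised above: a minimal sketch of how a script could strip or collect such inline annotations, assuming exactly the `+++(var …)+++` form shown in the example. The function names are hypothetical and not part of any existing tool in this repository.

```python
import re

# Matches inline annotations of the form +++(var <reading>)+++ .
# Assumption: annotations do not nest and a reading never contains ")+++".
VARIANT = re.compile(r"\+\+\+\(var\s+(.*?)\)\+\+\+")


def strip_variants(line: str) -> str:
    """Remove all variant annotations; the rest of the line is left untouched."""
    return VARIANT.sub("", line)


def extract_variants(line: str) -> list:
    """Return the variant readings recorded on the line, in order."""
    return [reading.strip() for reading in VARIANT.findall(line)]


if __name__ == "__main__":
    line = "सहारवक्षःस्थल-कौस्तुभश्रियं+++(var  स्थलशोभिकौस्तुभं)+++"
    print(strip_variants(line))    # main reading only, searchable as plain text
    print(extract_variants(line))  # the variant reading(s)
```

With something like this, the main text stays searchable after a strip pass, while the readings remain in the source file.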

vvasuki commented 3 years ago

Looks like TEI was adapted and that the text is available at - https://github.com/ppasedach/ratnakara-tei.git

ppasedach commented 3 years ago

> Looks like TEI was adapted and that the text is available at - https://github.com/ppasedach/ratnakara-tei.git

No, this is not what has happened. I have not yet got around to working further on your OCRed text. What you see in the ratnakara-tei repository (or, properly displayed, using Charles Li's Upama engine) is for the most part an old e-text produced by Diwakar and Rabi Acharya. I converted it from Velthuis encoding to IAST and added TEI markup. But it lacks the commentary and a few cantos. Some other cantos have recently been typed in from various manuscript sources, which is an ongoing process.
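
For readers unfamiliar with the conversion mentioned here, a rough sketch of a Velthuis-to-IAST pass might look like the following. This is not the script actually used for the ratnakara-tei repository: the table is deliberately partial (only the common digraphs), the names are hypothetical, and it assumes that "." and '"' occur in the input only as part of Velthuis digraphs.

```python
# Partial Velthuis -> IAST table (assumption: covers only the common digraphs;
# avagraha, accents and other extensions are not handled).
VELTHUIS_TO_IAST = {
    "aa": "ā", "ii": "ī", "uu": "ū",
    ".rr": "ṝ", ".r": "ṛ", ".l": "ḷ",
    '"n': "ṅ", "~n": "ñ", ".n": "ṇ",
    ".t": "ṭ", ".d": "ḍ",
    '"s': "ś", ".s": "ṣ",
    ".m": "ṃ", ".h": "ḥ",
}

# Try longer keys first so that ".rr" wins over ".r".
KEYS = sorted(VELTHUIS_TO_IAST, key=len, reverse=True)


def velthuis_to_iast(text: str) -> str:
    """Convert Velthuis-encoded text to IAST by greedy longest-match scanning."""
    out = []
    i = 0
    while i < len(text):
        for key in KEYS:
            if text.startswith(key, i):
                out.append(VELTHUIS_TO_IAST[key])
                i += len(key)
                break
        else:
            # No digraph starts here: copy the character through unchanged.
            out.append(text[i])
            i += 1
    return "".join(out)


if __name__ == "__main__":
    print(velthuis_to_iast("paa~ncajanyam"))
    print(velthuis_to_iast("vi.s.nu.m"))
```

Scanning left to right, rather than chaining `str.replace` calls, avoids one replacement accidentally feeding into another.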

Particularly for those cantos missing in the old e-text I should sometime soon create something similar from your raw file, and I'd then like to do that in such a way that corrections which are made can then be reintegrated into your repository, which is one reason that has stopped me from doing it so far. It is much easier to just perform some conversions and corrections on a piece of text, and forget about the original source. If one wants to incorporate the changes to the original, one will need a more thought-out approach.

vvasuki commented 3 years ago

> Particularly for those cantos missing in the old e-text I should sometime soon create something similar from your raw file, and I'd then like to do that in such a way that corrections which are made can then be reintegrated into your repository, which is one reason that has stopped me from doing it so far.

Ah I see - so I presume that you will add the missing canto-s to your TEI repo, and we can then use our regular TEI-to-markdown scripts to update our text. Please update this thread to notify me once this can be done. Curious to know your name, BTW.

ppasedach commented 3 years ago

You can call me Peter. https://www.aai.uni-hamburg.de/indtib/personen/pasedach.html . Yes, that would probably be an easier approach, at least on my end. But my TEI will be encoded as IAST, if that's not a problem for you? In Upama you can switch to Devanāgarī display though, but I'm afraid not for export. Do you actually train your OCR with corrections?

vvasuki commented 3 years ago

> You can call me Peter. https://www.aai.uni-hamburg.de/indtib/personen/pasedach.html .

Pleased to e-meet you!

> Yes, that would probably be an easier approach, at least on my end. But my TEI will be encoded as IAST, if that's not a problem for you?

No problem - my script will transliterate.

> Do you actually train your OCR with corrections?

No - just whatever I get with Google Vision or Google Drive.