suttacentral / pali

Pali Source Files

Source of the files #1

Open dxcore35 opened 6 years ago

dxcore35 commented 6 years ago

worldtipitaka.org

The page does not exist anymore. What is the source of the Tipitaka on worldtipitaka.org? Without an explicit origin for this source, the data are not trustworthy.

Is it the version from:

or is it a different romanized source? Please add it to the description.

sujato commented 6 years ago

In our Pali text files, we have the following information:

Pali text from the Mahāsaṅgīti Tipiṭaka Buddhavasse 2500: World Tipiṭaka Edition in Roman Script. Edited and published by The M.L. Maniratana Bunnag Dhamma Society Fund, 2005. Based on the digital edition of the Chaṭṭha Saṅgāyana published by the Vipassana Research Institute, with corrections and proofreading by the Dhamma Society.

We also have a short discussion on our "Methodology" page:

https://suttacentral.net/methodology#item4

The Dhamma Society had an extensive digital infrastructure; however, most of that disappeared when they fell apart. There are, however, some digital relics of their presence if you dig around a little:

https://en.wikipedia.org/wiki/Dhamma_Society_Fund
https://www.flickr.com/photos/dhammasociety/
https://buddhavasse.blogspot.com/

Excerpts from their volumes and various essays and so on may still be found on Scribd:

https://www.scribd.com/user/22522924/dhammasociety/uploads
https://www.scribd.com/user/18550688/tipitakastudies/uploads

dxcore35 commented 6 years ago

Thank you for the reply. I'm always amazed how many resources and how much energy are spent on a project, and later the project just disappears... We really cannot rely on that data if they just turn off the website with all the data just like that. What can one expect inside the book if they behave so strangely?

The Vipassana Research Institute cannot be trusted either; I know this as a Theravada insider. Also, I found this blog, and it is worse than I previously thought: http://blog.tipitaka.de/2016/03/

For me, at the moment, there isn't any trustworthy electronic Tipitaka.

So I continue to rely on:

Chatthasangiti Pitaka Romanized from Myanmar version printed in 200x © BuddhasAsana Society First published in 200x by Ministry of Religious Affairs, Myanmar

And I'm planning to digitize it to HTML, EPUB, and MOBI by myself.

sujato commented 6 years ago

It's true, there are many such projects. Sometimes it seems that the main object is to generate prestige and fame, and creating a lasting text is merely incidental. My understanding—and it is no more than an informed guess—is that the Dhamma Society fell apart due to internal tensions, possibly linked with political shifts in the Thai royal family.

Thanks for the link, it is an interesting discussion. Glad to see that others are using GitHub for version control of the Tipitaka!

I have made a small comparison between the VRI and Mahasangiti editions, and, while neither is perfect, the Mahasangiti is better.

Could you give me some more details about the edition you are interested in? Please keep us informed as to your progress!

Also, FYI, I believe that we should be taking an entirely different approach to Pali editions. Rather than digitizing printed editions, and inherently relying on the editors of those books, we should be digitizing manuscripts, the older the better. In digital texts there is no need to reconcile readings or create one unified edition; we can simply digitize each manuscript and use diffing software to reveal any differences. The texts themselves should be entirely plain text, without any markup. We have completed one small edition on these lines—fragments of the Cullavagga from a 9th century manuscript—and are putting together a more ambitious edition, a 13th century Cullavagga manuscript.
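As a minimal sketch of the diffing idea above: assuming each manuscript is transcribed as plain text with one line per passage, Python's standard `difflib` module can surface differing readings. The function name, the labels, and the two sample witnesses are invented placeholders for illustration, not readings from any real manuscript.

```python
import difflib


def diff_witnesses(text_a: str, text_b: str,
                   label_a: str = "ms_a", label_b: str = "ms_b") -> list[str]:
    """Return unified-diff lines between two plain-text transcriptions."""
    lines_a = text_a.splitlines()
    lines_b = text_b.splitlines()
    return list(difflib.unified_diff(lines_a, lines_b,
                                     fromfile=label_a, tofile=label_b,
                                     lineterm=""))


# Hypothetical sample transcriptions (placeholders, not real manuscript readings).
ms_a = "evaṁ me sutaṁ\nekaṁ samayaṁ bhagavā"
ms_b = "evam me sutaṁ\nekaṁ samayaṁ bhagavā"

for line in diff_witnesses(ms_a, ms_b):
    print(line)
```

Lines prefixed with `-` appear only in the first witness and lines prefixed with `+` only in the second; identical transcriptions produce no output at all.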

dxcore35 commented 6 years ago

As I said, I'm planning to digitize the existing PDFs into HTML and then into MOBI, EPUB, etc. One trustworthy Bhante told me that the Burmese government was a little bit disgusted with the existing versions of the Tipitaka in Romanized script (PTS, Sri Lanka project, VRS), so they Romanized their own Burmese version. If you look inside the book, the Burmese are so meticulous and have such respect for the Dharma that if there is any inconsistency between the Burmese version and other versions (e.g. Thai, Sri Lanka, Laos, etc.) they record the difference as a footnote!

01ViT01.pdf

This cannot be seen in any other existing project. And they did it without any fanfare or big international conferences and, frankly, without any wish for fame... Many Western Buddhists are not even aware that these transcripts exist :D But they blindly use the VRS or PTS without blinking an eye.

I have all the Tipitaka books (romanized Burmese edition), more than 100 PDFs, so finishing this project as one person takes time :/ I have finally managed to get the OCR software to recognize the Pali characters, so I expect it will take much less time than I estimated. But footnotes and static page number information are trickier to handle. I would also like to recreate the PDF from the HTML and compare it against the original romanized version, but that takes a lot of time.

Scanning the old manuscripts would of course be a noble task, but I think both of us lack the human resources. And I would like to create something that can be used in the near future.

sujato commented 6 years ago

Oh, very nice. It looks like a great initiative. Getting reliable OCR is not easy! Congratulations!

I'd encourage you to join our forum: https://discourse.suttacentral.net/ I'm sure people will be interested, and you may well find volunteers to help. We do!

If you're interested, have a look at my ideas for modern digital editions:

https://discourse.suttacentral.net/t/on-the-very-idea-of-a-critical-edition-of-the-pali-canon/3511
https://discourse.suttacentral.net/t/how-to-use-git-for-digitizing-manuscripts-rather-than-creating-a-critical-edition/3547

The key to this is using numbered semantic segments. Once texts are segmented the same way, the texts can be kept entirely separate from metadata like markup, notes, variant readings and the like. The same body of metadata can then be easily applied to any edition. This is basically the only clean way of solving the problems that you're encountering, with footnotes and the like. HTML/XML can present these things, but since the metadata is mixed with the main content, it is always unsatisfactory.
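To make the segment idea concrete, here is a rough sketch in Python. It assumes each edition is stored as a mapping from segment ID to plain text, with footnotes held in a separate mapping keyed by the same IDs. The ID scheme (`seg:1.1`), the sample contents, and the `render` helper are all hypothetical and are not the format SuttaCentral actually uses.

```python
# Two editions segmented the same way; the text carries no markup.
# Segment IDs and contents are hypothetical placeholders.
text_edition_a = {
    "seg:1.1": "evaṁ me sutaṁ",
    "seg:1.2": "ekaṁ samayaṁ bhagavā",
}
text_edition_b = {
    "seg:1.1": "evam me sutaṁ",
    "seg:1.2": "ekaṁ samayaṁ bhagavā",
}

# Metadata lives apart from the text and refers to it only by segment ID.
footnotes = {
    "seg:1.1": ["placeholder note on this segment"],
}


def render(text: dict[str, str], notes: dict[str, list[str]]) -> list[str]:
    """Combine plain-text segments with separately stored notes for display."""
    rendered = []
    for seg_id, content in text.items():
        line = f"[{seg_id}] {content}"
        for note in notes.get(seg_id, []):
            line += f" (note: {note})"
        rendered.append(line)
    return rendered


for line in render(text_edition_a, footnotes):
    print(line)
```

Because the notes refer only to segment IDs, `render(text_edition_b, footnotes)` attaches the same notes to the second edition without touching either text.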

benlawraus commented 5 years ago

Hi dxcore35, looking at your attached PDF, it looks like it is searchable. This means you can use existing Python tools to extract the text without OCR. Am I mistaken? I’m out of the country at the moment, but when I get back I’ll have a go at writing a simple script for doing this.
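One possible shape for such a script, offered only as a sketch: it assumes the third-party `pypdf` package is installed and that the PDF really does carry an embedded text layer. The helper names and the `min_chars` threshold are illustrative choices, not anything taken from the attached file.

```python
def extract_pages(pdf_path: str) -> list[str]:
    """Return the embedded text of each page (assumes pypdf is installed)."""
    # Imported lazily so the rest of the module works without pypdf.
    from pypdf import PdfReader

    reader = PdfReader(pdf_path)
    # extract_text() may return an empty string for image-only pages.
    return [page.extract_text() or "" for page in reader.pages]


def has_text_layer(pages: list[str], min_chars: int = 20) -> bool:
    """Heuristic: the PDF counts as searchable if enough non-space text was extracted."""
    total = sum(len("".join(page.split())) for page in pages)
    return total >= min_chars
```

For example, `pages = extract_pages("01ViT01.pdf")` followed by `has_text_layer(pages)` would indicate whether OCR can be skipped for that volume. Even when it can, the diacritics in the extracted text would still need checking, since an embedded text layer is not guaranteed to encode Pali characters correctly.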