suttacentral / legacy-suttacentral

Source code and related files (CSS, images, etc.) for SuttaCentral
http://suttacentral.net/

CBETA2SC #93

Closed: sujato closed this issue 8 years ago

sujato commented 9 years ago

It would be nice to have an automatic CBETA to SuttaCentral converter. We have the main CBETA texts on SC, but some are missing: the Abhidhamma and the Mula-sarv Vinaya, mostly. In addition, CBETA corrects and improves its texts from time to time, so it would be nice if we had a semi-automatic process for adopting their updates.

CBETA is in nice, clean TEI XML, so the conversion should not be too hard. There are a few details that are a little tricky; for example, I have been substituting the correct full-width Chinese punctuation for CBETA's punctuation, which uses standard Western style. Getting the various notes, special characters, and variant readings right is a little nuanced.
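A minimal sketch of that kind of substitution in Python; the mapping below is only illustrative, not the actual table used for SuttaCentral, and a real converter would also need to handle quotation marks and context-sensitive cases:

```python
# Illustrative mapping from Western-style punctuation to full-width Chinese
# punctuation (an assumed set of characters, not SuttaCentral's actual table).
WESTERN_TO_FULLWIDTH = str.maketrans({
    ",": "，",
    ".": "。",
    ";": "；",
    ":": "：",
    "?": "？",
    "!": "！",
    "(": "（",
    ")": "）",
})

def fix_punctuation(text):
    """Replace Western punctuation marks with their full-width equivalents."""
    return text.translate(WESTERN_TO_FULLWIDTH)

print(fix_punctuation("如是我聞,一時佛在舍衛國."))  # -> 如是我聞，一時佛在舍衛國。
```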

But the real killer is the structure. CBETA structures texts according to the juan (folio), which is useless for us. I've been breaking all the texts by hand into their proper semantic structures, but CBETA, unaccountably, doesn't have any clear markup for doing this. I will be suggesting that they introduce it, but until they do it is unclear how this could be automated.

yapcheahshen commented 9 years ago

Dear Bhante, I had a hard time processing CBETA 2014 (TEI P5); it has been converted to a simpler format for searching on iOS and Android. https://github.com/ksanaforge/cbeta2014 CBETA has <p> tags with unique ids for about 30% of the texts, which could serve as semantic units, but some are too small (like <p>evam-me-suttam</p>) and some are too big. I don't know why, and I didn't get a clear answer from them.

Now I am trying to do automatic sentence-level alignment of the Tibetan Kangyur with the Chinese Tripitaka. First we need to break the texts down into the smallest possible sentence units (in Chinese we have the full-width comma and full stop; in Tibetan we have the shad), then we compare the key terms of each sentence and guess the best possible alignment. But we need to make sure that the translated text is faithful to the source text, so we need to go back to the Sanskrit and Pali anyway.
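A minimal sketch of that first splitting step in Python; the break characters and sample strings are just illustrative assumptions, and the key-term alignment step is not shown:

```python
import re

def split_chinese(text):
    """Break Chinese text after each full-width comma (，) or full stop (。)."""
    # Python 3.7+ allows splitting on zero-width (lookbehind-only) matches.
    return [u for u in re.split(r"(?<=[，。])", text) if u.strip()]

def split_tibetan(text):
    """Break Tibetan text after each shad (།)."""
    return [u.strip() for u in re.split(r"(?<=།)", text) if u.strip()]

print(split_chinese("如是我聞，一時佛在舍衛國。"))
print(split_tibetan("འདི་སྐད་བདག་གིས་ཐོས་པ་དུས་གཅིག་ན། བཅོམ་ལྡན་འདས་མཉན་ཡོད་ན་བཞུགས་སོ།"))
```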

sujato commented 9 years ago

A couple of questions:

  1. With the simplified format, what information have you lost? More generally, how did you do the conversion, and to what end?
  2. It would be incredible if we could match at a sentence level across all texts. Blake has written (and this is an evolving project) a parser for segmenting the Pali text at a sentence level (more or less; it includes other punctuation breaks too). This is easily tweaked. We'll be using the segmented text to produce a translation, hopefully a set of translations, all segmented in the same way, using gettext PO markup (see the sketch below). I hadn't even given any thought to matching by sentence across Pali/Tibetan/Chinese/Sanskrit, but we would definitely love to help out with developing and implementing this.
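A minimal sketch of what one segment could look like as a gettext PO entry, generated here with the polib library; the segment ID, the Pali line, and the English rendering are hypothetical placeholders rather than SuttaCentral's actual data:

```python
import polib  # third-party library: pip install polib

# One PO entry per segment; a (hypothetical) segment ID travels in the
# extracted comment so that every translation of the same text stays
# aligned segment by segment.
po = polib.POFile()
po.append(polib.POEntry(
    comment="mn1:1.1",          # hypothetical segment ID
    msgid="Evaṁ me sutaṁ.",     # the Pali segment
    msgstr="So I have heard.",  # a placeholder translation of that segment
))
po.save("mn1_translation.po")
```
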
yapcheahshen commented 9 years ago

The pure null-tags format and TEI are fully interchangeable; no information is lost. But it has two "defects", to a certain extent: a) you cannot apply CSS to it directly; you need to convert back to normal XML before rendering. b) It will pass through all validation (an end tag without an open tag, an open tag without an end tag, wrongly nested tags, an end tag with attributes), so a well-formed null-tags XML file can convert back into an ill-formed TEI/XML file. The conversion is quite simple and can be done in one pass (using SAX): convert <p> to <p/> and </p> to <p_end/>. If there are no "holes" between <p> elements, which means the next sibling of a <p> is always another <p>, then the <p_end/> tag is not even required.
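A minimal one-pass sketch of that conversion using Python's xml.sax; the <p_end/> naming follows the example above, and this is not the actual ksanaforge converter:

```python
import sys
import xml.sax
from xml.sax.saxutils import escape, quoteattr

class NullTagHandler(xml.sax.ContentHandler):
    """Flatten paired tags into 'null' (milestone) tags in a single pass:
    <p id="x">text</p>  becomes  <p id="x"/>text<p_end/>.
    Nesting is then encoded purely by the order of the milestones."""

    def __init__(self, out):
        super().__init__()
        self.out = out

    def startElement(self, name, attrs):
        attr_str = "".join(f" {k}={quoteattr(v)}" for k, v in attrs.items())
        self.out.write(f"<{name}{attr_str}/>")

    def endElement(self, name):
        self.out.write(f"<{name}_end/>")

    def characters(self, content):
        self.out.write(escape(content))

if __name__ == "__main__":
    # usage: python nulltags.py input.xml > output.xml
    xml.sax.parse(sys.argv[1], NullTagHandler(sys.stdout))
```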

I use null tags as an interchange format between XML and explicit markup, which enables users to add annotations, amendments, and inter-textual links (links between any ranges of text) without touching the underlying text.

2) Is it on GitHub? It would be really great if we could do machine semi-translation, or show the definition, declension, and conjugation beneath unfamiliar words (with the reader providing a growing list of known words). I discussed the idea with Bhante Anandajoti, but we were not able to make progress, as extensive groundwork is required, e.g. expanding the peyyalas in the different versions, unified semantic paragraph ids, and so on.
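A toy sketch of that reader aid in Python; the dictionary entry and known-word list below are made-up placeholders, not existing resources:

```python
# Hypothetical mini-dictionary; a real tool would query a full Pali dictionary
# and a morphological analyser instead.
GLOSSES = {"sutaṁ": "heard (past participle of suṇāti)"}

def annotate(words, known_words):
    """Yield (word, gloss) pairs, glossing only words the reader
    has not yet added to their personal known-word list."""
    for word in words:
        gloss = None if word.lower() in known_words else GLOSSES.get(word.lower(), "?")
        yield word, gloss

for word, gloss in annotate(["Evaṁ", "me", "sutaṁ"], known_words={"evaṁ", "me"}):
    print(word, gloss or "")
```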

sujato commented 9 years ago

The Python scripts are in suttacentral/utility/bin. You can see the discussion here: https://discourse.suttacentral.net/t/wishlist-for-virtaal/329/21

Currently our primary purpose is for me to create a new translation of the nikayas, so our development is focussed on that. The other things you mention would all be good, but if we try to do too much we will get bogged down.

We will be trying to get some sort of dictionary lookup working, with grammar analysis. However this is not such a priority, as I know Pali fairly well so mostly I don't need it. In the longer term it would be wonderful.

As for expanding peyyalas, again this would be for the longer term. It's too complex to work out on a case by case basis; Anandajoti spent a lot of time just sorting out the Satipatthanavibhanga, for example. Doing something like the Patthana would be endless, and anyway, a fully expanded Patthana would probably fill the internet! For the most part I will simply be translating the text as I find it, and leaving further enhancements for now.

As for paragraph IDs we are using the Mahasangiti system, which is very systematic and well thought out. Of course we will need a way of labeling each segment as well.

yapcheahshen commented 9 years ago

My main concern is to provide a computer-aided learning environment for the core texts for lay Buddhists, especially Chinese readers with a Mahayana background like me. I can foresee that much of the groundwork can be shared even though we have quite different goals. Do you have more explanation of the Mahasangiti system? What are the rules when you perform further segmentation?

sujato commented 9 years ago

To understand the Mahasangiti system, best just open up one of our Pali text files and have a look. Not all of their reference data is displayed on SC.

In the longer term, we would love to introduce more capacities for helping people from different backgrounds. Of course, as our background is English, we look to the needs of people from a similar background, but this is a regrettable bias, not a policy!

The segmenting is discussed further in the Discourse thread I posted earlier. Basically we segment on block-level HTML elements, as well as on the following punctuation: . ; : — ? ! Since the Mahasangiti text is extremely well proofread, we can get a consistent result easily.
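A rough sketch of that rule in Python; this is an approximation assuming the block-level splitting has already been done, not the actual script from suttacentral/utility/bin:

```python
import re

# Break after each of the punctuation marks listed above: . ; : — ? !
# (block-level HTML elements are assumed to have been split off already).
BREAKS = re.compile(r"(?<=[.;:—?!])\s+")

def segment(paragraph):
    """Split one block-level run of Pali text into sentence-level segments."""
    return [s for s in BREAKS.split(paragraph) if s]

print(segment("Evaṁ me sutaṁ. Ekaṁ samayaṁ bhagavā sāvatthiyaṁ viharati."))
# -> ['Evaṁ me sutaṁ.', 'Ekaṁ samayaṁ bhagavā sāvatthiyaṁ viharati.']
```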