JSON full data structure

sujato commented 9 years ago

We've discussed and pretty much agreed that we should switch SC's data format to JSON. Part of this will be developing a master-hierarchy, with all the vaggas, pannasas, and whatever included. In some cases this will include the Chinese texts.

SC is built from the bottom up, with our main structural unit being the sutta with parallels. Most Tipitaka sites are built from the top down, starting with the structures (eg. pitaka>nikaya>pannasa>vagga, etc.) and then the sutta at the end. Much of this canonical structure is hidden on SC. It should be explicit, and provide a further aid for navigation.

We usually reference suttas simply by their name or ID. However, in traditional Pali studies it is conventional to identify a sutta by giving the whole structural set. The difference makes sense if you think of the difference between a book/manuscript and a website. We have built SC so that we minimize the navigation levels to get to a sutta. But with a book, you first identify the pitaka or nikaya, then the volume that you want (pannasa), then the vagga, and finally the sutta.

This structure should be embedded consistently in SC, whereas currently it is not. In the MN division table, for example, we have vaggas, but not pannasas. For SN we have the samyutta and the small vaggas, but not the big vaggas. And so on.
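To make the idea of a master hierarchy concrete, here is a minimal sketch of how the MN levels could be nested in JSON. The key names ("type", "name", "uid", "children") and the helper `list_suttas` are assumptions made for illustration, not an agreed SC schema.

```python
import json

# Hypothetical node layout: each level carries "type", "name" and "children";
# suttas are leaves identified by a "uid".
hierarchy = {
    "type": "nikaya",
    "name": "Majjhima Nikaya",
    "children": [
        {
            "type": "pannasa",
            "name": "Mulapannasa",
            "children": [
                {
                    "type": "vagga",
                    "name": "Mulapariyaya Vagga",
                    "children": [
                        {"type": "sutta", "uid": "mn1", "name": "Mulapariyaya Sutta"},
                        {"type": "sutta", "uid": "mn2", "name": "Sabbasava Sutta"},
                    ],
                }
            ],
        }
    ],
}


def list_suttas(node):
    """Yield the uid of every sutta below a node, in document order."""
    if node["type"] == "sutta":
        yield node["uid"]
    for child in node.get("children", []):
        yield from list_suttas(child)


# Round-trip through JSON to show the structure serializes cleanly.
serialized = json.dumps(hierarchy, indent=2)
sutta_uids = list(list_suttas(json.loads(serialized)))
print(sutta_uids)
```

With a layout like this, the pannasa level that is currently missing from the MN division table is just one more node type in the same tree.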

As well as being available on the division table, we can also consider making this structure available as a breadcrumbs list (compare AtI, http://www.accesstoinsight.org/tipitaka/an/an03/an03.099.than.html). I'd rather keep it off the page, as it will not be of interest for most people, but it could go in the navigation sidebar, or something.
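A breadcrumbs list falls out of an explicit hierarchy almost for free. The sketch below assumes a nested structure with hypothetical "name", "uid" and "children" keys, and a hypothetical helper `breadcrumb` that walks down to a given sutta.

```python
# Hypothetical nested hierarchy; the key names are illustrative only.
tree = {
    "name": "Majjhima Nikaya",
    "children": [
        {
            "name": "Mulapannasa",
            "children": [
                {
                    "name": "Mulapariyaya Vagga",
                    "children": [
                        {"uid": "mn1", "name": "Mulapariyaya Sutta"},
                    ],
                }
            ],
        }
    ],
}


def breadcrumb(node, uid, trail=()):
    """Return the list of names from the root down to the sutta with this uid."""
    trail = trail + (node["name"],)
    if node.get("uid") == uid:
        return list(trail)
    for child in node.get("children", []):
        found = breadcrumb(child, uid, trail)
        if found:
            return found
    return None  # uid does not occur below this node


crumbs = breadcrumb(tree, "mn1")
print(" > ".join(crumbs))
```

The same trail could be rendered in a navigation sidebar rather than on the page itself.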

We can start by adding the structure for the Pali texts, and can ask Rod's guidance for the Chinese.

One complication is that there are sometimes two "flows" of structure: the semantic structure of the text, which is of course the basic element of SC, and the conventional reciter/text-editor divisions. In the Pali, for example, we have the "bhanavaras" to mark the end of recitation sections. More significant is the Mahavagga/Culavagga of the Vinaya. In CBETA they use the juan.

In any case, the basic semantic structure should be straightforward enough.

yapcheahshen commented 9 years ago

In normal TEI practice, semantic structure uses nested tags (div, p). Since XML doesn't allow overlapping tags (＜a＞＜b＞＜/a＞＜/b＞ is not well-formed), media structure, which varies from version to version, has to use null tags (pb, cb, lb).

I found that if we use null tags for all types of structural markup, in short converting ＜vagga id="xx"＞....huge chunk of text....＜/vagga＞ to ＜vagga id="xx"/＞....＜vagga_end id="xx"/＞, things become much simpler, and we can easily combine different semantic structures in the same XML, just as we put the page numbers of 4 versions in the VRI CSCD XML files. Null tags are also less error-prone and friendlier to humans than deeply nested tags.
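As a rough sketch of how start/end null tags could be consumed, the code below strips the tags and records each span as character offsets into the plain text. The tag names, ids and sample text are invented for illustration, and a regular expression stands in for a real XML parser, so this is only a demonstration of the idea.

```python
import re

# Matches <name id="..."/> and <name_end id="..."/> (ASCII angle brackets).
MILESTONE = re.compile(r'<(\w+?)(_end)?\s+id="([^"]+)"\s*/>')


def extract_spans(marked_up):
    """Return (plain_text, spans); each span is (tag, id, start, end)."""
    plain = []     # text fragments with the tags removed
    length = 0     # length of the plain text collected so far
    open_at = {}   # (tag, id) -> start offset in the plain text
    spans = []
    pos = 0
    for m in MILESTONE.finditer(marked_up):
        chunk = marked_up[pos:m.start()]
        plain.append(chunk)
        length += len(chunk)
        pos = m.end()
        tag, is_end, ident = m.group(1), m.group(2), m.group(3)
        if is_end:
            start = open_at.pop((tag, ident))
            spans.append((tag, ident, start, length))
        else:
            open_at[(tag, ident)] = length
    plain.append(marked_up[pos:])
    return "".join(plain), spans


# Hypothetical sample: a bhanavara that crosses a vagga boundary,
# which nested tags could not express.
sample = (
    '<vagga id="v1"/>aaa<bhanavara id="b1"/>bbb<vagga_end id="v1"/>'
    '<vagga id="v2"/>ccc<bhanavara_end id="b1"/>ddd<vagga_end id="v2"/>'
)
text, spans = extract_spans(sample)
print(text)
print(spans)
```

Because every span is just a pair of offsets, the two flows of structure can overlap freely without either one having to nest inside the other.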

blake-sc commented 9 years ago

Interesting idea. So far we've been lazily using HTML instead of XML, although we could adapt this using the tag with a form like

Partial overlap is, however, an incredibly niche thing; most of the canonical structure is cleanly hierarchical.

Our solution to the nesting problem is more to use multiple files. If a file needs to be nested excessively, it's probably better off being multiple smaller files contained in appropriately named subdirectories. One of the main things in implementing full hierarchical data will be further leveraging the power of file systems for representing hierarchical data. We're actually moving away from flatter data structures to more fundamentally tree-like ones, as trees are much better suited to cases where some branches nest much more deeply than others.
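A minimal sketch of reading a hierarchy straight off the file system follows. The directory and file names are hypothetical and are created in a temporary directory, so nothing here reflects the actual SC data layout.

```python
import tempfile
from pathlib import Path


def tree_from_dir(path):
    """Turn a directory into a nested dict mirroring its structure."""
    if path.is_file():
        return {"name": path.stem, "type": "file"}
    return {
        "name": path.name,
        "type": "dir",
        "children": [tree_from_dir(child) for child in sorted(path.iterdir())],
    }


def depth(node):
    """Number of levels from this node down to its deepest leaf."""
    return 1 + max((depth(c) for c in node.get("children", [])), default=0)


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp) / "pi"
    # One deeply nested branch and one shallow branch (hypothetical paths).
    deep = root / "mn" / "mulapannasa" / "mulapariyayavagga"
    deep.mkdir(parents=True)
    (deep / "mn1.json").write_text("{}")
    shallow = root / "dhp"
    shallow.mkdir()
    (shallow / "dhp1.json").write_text("{}")

    tree = tree_from_dir(root)

depths = {child["name"]: depth(child) for child in tree["children"]}
print(depths)
```

Branches of different depth sit side by side here without any padding levels, which is the property a flat table struggles to provide.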

sujato commented 8 years ago

Discussed elsewhere; see Discourse.

suttacentral / legacy-suttacentral

JSON full data structure #102