a virtuous AI cycle

draft proposal: integrate Bilara with AI models to expand translations and improve segmenting

Let us imagine a better world.

data

In this world, all the Buddhist texts participate in a unified data system which allows for continuous integration and feedback at all levels, while respecting the domain knowledge of specialists.

This is based on the simple system of bilara-data:

everything is a key:value
keys are unique in the entire corpus
each key:value pair represents a semantically meaningful linguistic unit, such as a sentence, a phrase, a heading, a doctrinal statement, or a line of verse.
a value may be anything (root text, translation, comment, references, markup, etc.)
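
The rules above can be sketched in a few lines of Python. This is only an illustration: the key format (`ex1:1.1`), the file names, and the helper functions `find_duplicate_keys` and `layers_for` are all hypothetical, and the values are placeholders rather than real corpus data.

```python
# Hypothetical segment files for one layer (root text).
# Keys and values are placeholders, not real corpus data.
root_files = {
    "ex1_root.json": {
        "ex1:0.1": "Heading of example text one",
        "ex1:1.1": "First sentence.",
        "ex1:1.2": "Second sentence.",
    },
    "ex2_root.json": {
        "ex2:0.1": "Heading of example text two",
        "ex2:1.1": "A line of verse.",
    },
}


def find_duplicate_keys(files: dict) -> dict:
    """Return every key that occurs in more than one file, with the file names."""
    seen = {}
    for filename, segments in files.items():
        for key in segments:
            seen.setdefault(key, []).append(filename)
    return {key: names for key, names in seen.items() if len(names) > 1}


def layers_for(key: str, layers: dict) -> dict:
    """Collect every value (root, translation, comment, ...) stored under one key."""
    return {name: segments[key] for name, segments in layers.items() if key in segments}


# The same key ties together different kinds of value for one segment.
layers = {
    "root": root_files["ex1_root.json"],
    "translation": {"ex1:1.1": "First sentence, translated."},
    "comment": {"ex1:1.1": "A note on the first sentence."},
}

print(find_duplicate_keys(root_files))
print(layers_for("ex1:1.1", layers))
```

A check like `find_duplicate_keys` is the kind of thing a rigorous spec could require of every data repo before it is merged.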

Bilara also incorporates some other functions, such as the ability to define publications with their own metadata.

Bilara has evolved to suit SuttaCentral's needs, so it doesn't have a rigorous spec. Such a spec would be required to give assurance to application developers.

a virtuous cycle

segments

segmented text is added to data model
segments are matched by AI based on linguistic or semantic similarity
revisions to segments are proposed by AI
translators or editors accept or reject suggestions
revised segments feed back into data model
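
As a rough sketch of the segment cycle above, the following uses plain string similarity from Python's standard `difflib` module as a stand-in for whatever AI model would do the matching. The `Suggestion` class, the function names, the threshold of 0.8, and the sample segments are assumptions made for illustration, not part of Bilara.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher


@dataclass
class Suggestion:
    key: str        # key of the segment the suggestion applies to
    current: str    # text currently stored under that key
    proposed: str   # revised text proposed by the matcher
    score: float    # similarity between current and proposed text


def similarity(a: str, b: str) -> float:
    """Character-level similarity in [0, 1]; a stand-in for a semantic model."""
    return SequenceMatcher(None, a, b).ratio()


def propose_revisions(segments: dict, parallels: dict, threshold: float = 0.8) -> list:
    """For each segment, find its closest parallel and propose it if it is close but not identical."""
    suggestions = []
    for key, text in segments.items():
        if not parallels:
            break
        best = max(parallels.values(), key=lambda p: similarity(text, p))
        score = similarity(text, best)
        if best != text and score >= threshold:
            suggestions.append(Suggestion(key, text, best, score))
    return suggestions


def apply_review(segments: dict, suggestions: list, accepted_keys: set) -> dict:
    """Return a new data model containing only the suggestions the editor accepted."""
    revised = dict(segments)
    for s in suggestions:
        if s.key in accepted_keys:
            revised[s.key] = s.proposed
    return revised


# Placeholder data: one segment differs from its parallel only in punctuation.
segments = {"ex1:1.1": "So I have heard", "ex1:1.2": "At one time the teacher was staying"}
parallels = {"ex2:1.1": "So I have heard.", "ex2:1.2": "In the park."}

suggestions = propose_revisions(segments, parallels)
revised = apply_review(segments, suggestions, accepted_keys={"ex1:1.1"})
print(revised)
```

Nothing is written back unless a translator or editor accepts it, which keeps the specialist in control of the data.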

translations

translated texts based on segments are added to data model
AI understands that "this" translation renders "that" root text, because the key is the same.
ML suggests translations for segments
translator accepts, rejects, or modifies them
these feed back into data model
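
The translation cycle relies on the shared key to align a translation with its root text. A minimal sketch of that alignment and of the accept, reject, or modify step might look like this. The function names, the decision strings, and the sample data are hypothetical, and the suggested translation is hard-coded because no particular model is assumed.

```python
def aligned_pairs(root: dict, translation: dict) -> list:
    """Pair root and translation by shared key: the examples a model could learn from."""
    return [(key, root[key], translation[key]) for key in root if key in translation]


def untranslated(root: dict, translation: dict) -> list:
    """Keys that have root text but no translation yet: candidates for suggestions."""
    return [key for key in root if key not in translation]


def review(translation: dict, key: str, suggestion: str, decision: str, edited=None) -> dict:
    """Apply the translator's decision and return the updated translation layer."""
    updated = dict(translation)
    if decision == "accept":
        updated[key] = suggestion
    elif decision == "modify":
        if edited is None:
            raise ValueError("a modified suggestion needs the edited text")
        updated[key] = edited
    elif decision != "reject":
        raise ValueError(f"unknown decision: {decision!r}")
    return updated


# Placeholder data, not real corpus text.
root = {"ex1:1.1": "root sentence one", "ex1:1.2": "root sentence two"}
translation = {"ex1:1.1": "translated sentence one"}

print(aligned_pairs(root, translation))
print(untranslated(root, translation))

# A suggestion for the missing segment, adjusted by the translator before it is stored.
translation = review(
    translation, "ex1:1.2", "translated sentense two",
    decision="modify", edited="translated sentence two",
)
print(aligned_pairs(root, translation))
```

Each accepted or modified segment immediately becomes a new aligned pair, which is what closes the loop.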

cross disciplines: common data, uncommon applications

We need a way for the work to coordinate across disciplines, so that we can improve and cross-fertilize each other's work, while respecting the specific expertise of domain specialists.

For example, what happens if we have a segmented Tibetan text on SuttaCentral, then a Tibetan expert determines that a specific segment should be changed?

One approach would be to keep a common store as a Git repo. All the relevant data is kept there. The data can be pulled into different applications as needed. Domain specialists would have write privileges for their domains, typically assigned by language.

If this is unwieldy, another approach would be to keep the data repos separate, but with well-defined scopes. For example, SuttaCentral could have canonical Pali, another repo might have post-canonical Pali, another Tibetan, and so on. That way each project could be managed independently, so long as the data was kept to spec. Again, projects could pull data as they needed.
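
One way to picture "well-defined scopes" is to let each repo declare the key prefixes it owns, then check that its data stays inside them and that separate repos can be combined without key collisions. This is a sketch under that assumption only: scoping by key prefix, the prefixes themselves, the repo names, and the functions `out_of_scope` and `merge_repos` are all hypothetical.

```python
def out_of_scope(segments: dict, allowed_prefixes: list) -> list:
    """Return keys in a repo that fall outside the prefixes the repo is scoped to."""
    prefixes = tuple(allowed_prefixes)
    return [key for key in segments if not key.startswith(prefixes)]


def merge_repos(repos: dict) -> dict:
    """Pull several repos into one view, refusing to merge if any key collides."""
    merged = {}
    owner = {}
    for repo_name, segments in repos.items():
        for key, value in segments.items():
            if key in merged:
                raise ValueError(f"key {key!r} is in both {owner[key]!r} and {repo_name!r}")
            merged[key] = value
            owner[key] = repo_name
    return merged


# Hypothetical scopes and placeholder data for two independently managed repos.
scopes = {"canonical-pali": ["pli-"], "tibetan": ["bo-"]}
repos = {
    "canonical-pali": {"pli-ex1:1.1": "placeholder Pali segment"},
    "tibetan": {"bo-ex1:1.1": "placeholder Tibetan segment"},
}

for name, segments in repos.items():
    print(name, out_of_scope(segments, scopes[name]))

unified = merge_repos(repos)
print(sorted(unified))
```

An application that needs only Tibetan would pull the one repo, while a search index could build on the merged view.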

A website might, for example, present Tibetan translations, but could still draw on the unified data model of the AI for, say, search.
