blake-sc commented 4 years ago

At the moment we identify the translations by a URL like:

    "source_url": "https://github.com/suttacentral/bilara-data/tree/master/translation/en/sujato/sutta/kn/thag",
OR
    "source_url": "https://github.com/sc-voice/bilara-data/tree/master/translation/de/sabbamitta/sutta/mn",

And we identify the root edition simply with this:

    "root_lang": "pli",

The server uses those two key:value pairs to try and identify the translation files in the repository, and to make an educated guess about what the root files are.
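As a rough illustration of the kind of derivation involved (not the actual Bilara server code), a relative repository path can be recovered from a GitHub tree URL roughly as follows. The function name and the `branch` default are assumptions made for this sketch.

```python
from urllib.parse import urlparse


def relative_path_from_source_url(source_url: str, branch: str = "master") -> str:
    """Return the part of a GitHub tree URL that follows /tree/<branch>/.

    Hypothetical helper: it assumes the URL has the shape
    https://github.com/<owner>/<repo>/tree/<branch>/<path>.
    """
    url_path = urlparse(source_url).path
    marker = f"/tree/{branch}/"
    _head, sep, tail = url_path.partition(marker)
    if not sep:
        raise ValueError(f"not a GitHub tree URL for branch {branch!r}: {source_url}")
    return tail.strip("/")
```

Note that the owner and repository name are discarded, which is part of why two forks (suttacentral and sc-voice above) resolve to the same kind of relative path.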

Proposal

To reduce the hackiness in the code, eliminate the URL aspect and make an explicit source using relative paths:

    "root_path": "root/pli/ms/sutta/kn/thag",
    "translation_path": "translation/en/sujato/sutta/kn/thag",

The server would use those two key:value pairs to unambiguously identify the translations and the root files within the working repository. This proposal does not require changing the "source_url"; the bilara server simply would not use it.

It is worth noting, though, that root_path and translation_path would implicitly define root_lang, translation_lang, text_uid and pitaka, so those keys aren't needed unless it's desired to have them explicitly.
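A minimal sketch of how those implicit values could be read off a path, assuming the layout `<kind>/<lang>/<author-or-edition>/<pitaka>/.../<text_uid>` seen in the examples above. The `PathInfo` type and `parse_bilara_path` function are hypothetical names, not part of Bilara.

```python
from typing import NamedTuple


class PathInfo(NamedTuple):
    kind: str      # "root" or "translation"
    lang: str      # e.g. "pli" or "en"
    author: str    # edition for root texts (e.g. "ms"), translator otherwise
    pitaka: str    # e.g. "sutta"
    text_uid: str  # assumed here to be the last path segment


def parse_bilara_path(path: str) -> PathInfo:
    """Split a repository-relative path into the fields it implies."""
    parts = path.strip("/").split("/")
    if len(parts) < 5:
        raise ValueError(f"path too short to carry all fields: {path!r}")
    kind, lang, author, pitaka = parts[:4]
    if kind not in ("root", "translation"):
        raise ValueError(f"unexpected top-level directory: {kind!r}")
    return PathInfo(kind, lang, author, pitaka, parts[-1])
```

For example, `parse_bilara_path("root/pli/ms/sutta/kn/thag")` gives `lang="pli"`, `author="ms"`, `pitaka="sutta"` and `text_uid="thag"`.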

Along with reducing code complexity, the use of an explicit root path would allow multiple editions to co-exist in harmony while using the same language code (at the moment, since the language code is the ONLY thing which informs the server what the root files are, multiple editions under the same code confuse it).

Extended Proposal

The above change would be relatively trivial to make. A less trivial change would be to virtualize projects and translation files. That is to say, simply by virtue of adding the root_path and translation_path to an entry in _publication.json, the server would automatically create virtual translation files, which would appear in the browse view but wouldn't exist in the repository until touched by a translator. This would replace the use of the "create_translations.py" script to add empty stub files.

Should it be desired to explicitly limit the scope of a project, such as when a translator intends to translate only a specific part of a collection of texts, this could be done using a scope, such as:

    "root_path": "root/pli/ms/sutta/kn",
    "translation_scope": ["thig", "thag"],
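To make the virtualization and scope ideas concrete, here is a rough sketch, not the server's implementation. It walks the root files, optionally restricted to the scoped subdirectories, and reports the translation file each one would map to, along with whether that file exists yet. The file naming (`<segment>_root-pli-ms.json` mapping to `<segment>_translation-en-sujato.json`) and the function names are assumptions for illustration.

```python
from pathlib import Path
from typing import Iterator, Optional, Sequence, Tuple


def muid(path: str) -> str:
    """Join the first three path segments, e.g. 'root/pli/ms/...' -> 'root-pli-ms'."""
    return "-".join(path.strip("/").split("/")[:3])


def virtual_translation_files(
    repo: Path,
    root_path: str,
    translation_path: str,
    scope: Optional[Sequence[str]] = None,
) -> Iterator[Tuple[Path, bool]]:
    """Yield (translation_file, exists) for every root file in scope.

    Files with exists=False are the "virtual" ones: they would be shown in
    the browse view but are not in the repository yet.
    """
    root_dir = repo / root_path
    subdirs = [root_dir / name for name in scope] if scope else [root_dir]
    root_suffix = f"_{muid(root_path)}.json"
    translation_suffix = f"_{muid(translation_path)}.json"
    for subdir in subdirs:
        if not subdir.is_dir():
            continue  # a scoped collection with no root files contributes nothing
        for root_file in sorted(subdir.rglob(f"*{root_suffix}")):
            rel = root_file.relative_to(root_dir)
            segment = rel.name[: -len(root_suffix)]
            target = repo / translation_path / rel.with_name(segment + translation_suffix)
            yield target, target.exists()
```

With no `scope`, every root file under `root_path` is considered; with `scope=["thig", "thag"]`, only those two subdirectories are walked.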

sabbamitta commented 4 years ago

AN10.121:1.1: “Developers, the dawn is the forerunner and precursor of the sunrise. In the same way a new issue proposed by Blake is the forerunner and precursor of skillful innovations. A new issue proposed by Blake gives rise to all devs thinking about it. All devs thinking about it gives rise to discussing it. Discussing it gives rise to making decisions. Making decisions gives rise to making a clear plan. Making a clear plan gives rise to writing good code. Writing good code gives rise to a new version to test. A new version to test gives rise to helpful feedback. Helpful feedback gives rise to further improvement. Further improvement gives rise to an awesome new feature!”

:smile_cat:

firepick1 commented 4 years ago

Excellent suggestion. Voice uses source_url currently and basically derives what this proposal lays out. Having the fields suggested would indeed streamline Voice robots and lead to an eventual deprecation of source_url.

sujato commented 4 years ago

Sorry, but I don't like this idea, for two reasons.

The root text is strictly unknown. Yes, users will normally rely on the MS edition when using Bilara. However, what happens when there are two editions and they wish to see them both in Bilara? Or when, as with the Parivara that Brahmali is currently working on, the original translation used the PTS edition and the revision uses MS? Or if someone simply pastes a legacy translation into Bilara? It is, in fact, the normal situation that a translator relies on multiple editions when making a translation, not on one source only. In addition, there will be many translators who do not use a Pali text as their source at all, much less a specific edition: they will translate from translations. We note the root language so we know what the "root" is, which is the "Pali Tipitaka", but this need not be instantiated as a specific edition, and it is misleading to suggest that this is the case. The nature of how translators use their sources, since it is so widely variable, belongs in translation_process as a descriptive field.
The proposal mixes application-level concerns and data. We have discussed this before and I have been at pains to eliminate it in our navigation data. The concerns of _publication.json are not with the application: it is a record of publication data. We should anticipate that it will be used in a wide variety of contexts, not only by developers, but by publishers, editors, readers, and the like. Mixing in application-level concerns adds complexity and will prove brittle in terms of maintenance and future development.

FYI, the source URL is not intended for application use, although obviously applications can use it. It's intended as a handy way to find the canonical location of the translation, and thus must be absolute, not relative.

I suggest:

Root Pali text is determined at the application level. If applications are confused by multiple editions, tell them to use ms as default. Then let the user select a different one if they wish. A translation must not be tied to a specific root text. (This applies equally to SC and Bilara).
To identify a translation unambiguously requires only three things: lang, author, UID. Directory structure is not needed and should, in fact, be assumed to be variable and subject to change. Maintainers of _publication.json should be expected to know what their text is, but not where it is located in the directory structure.
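One way an application might take on that lookup, sketched under assumptions: files are found purely by name, wherever they sit in the tree. The naming pattern `<segment>_translation-<lang>-<author>.json`, the rule that a segment belongs to a UID when it equals the UID or continues it with a digit (so `thag1.1` belongs to `thag`), and the function name are all illustrative, not confirmed Bilara behaviour.

```python
import re
from pathlib import Path
from typing import List


def find_translation_files(repo: Path, lang: str, author: str, uid: str) -> List[Path]:
    """Locate translation files by lang, author and UID, ignoring directory layout."""
    suffix = f"_translation-{lang}-{author}.json"
    # The UID itself, or the UID followed by a numbered segment such as "1.1".
    uid_pattern = re.compile(re.escape(uid) + r"(\d.*)?")
    matches = []
    for path in repo.rglob(f"*{suffix}"):
        segment = path.name[: -len(suffix)]
        if uid_pattern.fullmatch(segment):
            matches.append(path)
    return sorted(matches)
```

Because the search depends only on file names, moving files to a different directory does not change the result.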

I understand that this shifts a certain degree of complexity from the data to the applications, but that is where it belongs. The data should be 100% pure and atomic. Consider: our Pali text was originally keyed in by the VRI in the mid-90s. We are still using the data that was created then, which has since been used for the CSCD, the VRI website, the Mahasangiti website, the DPR, and SC old and new (among others). Applications come and go; data remains.

As for the virtualization proposal, I don't understand it well enough to have an opinion. Not yet anyway!

sujato commented 4 years ago

Just to say, I forgot to mention: if you want to use data in this form, that's not an issue. Just keep it somewhere else. Make a _paths.json or something and keep it somewhere like /bilara-data/.helpers/

suttacentral / bilara

Change how _publication.json works to be more saner and more explicit for our robotic friends #73
