Closed sujato closed 2 years ago
I think the tool should assist the translator in the Bilara app in real time (like an auto-correction tool). This way the translator can make decisions that are too difficult for a tool to make post-translation. For example, https://github.com/Zemke/instant-smart-quotes supports multiple languages. Perhaps we could auto-select the language from _project.json data. The tool should be ON by default to ensure a consistent approach to punctuation across Bilara projects, but the translator could have the option to toggle the tool OFF if 'manual override' is required.
Detection is far easier than correction. The original idea of lint is detection, allowing the user to decide what works for them. Correction, on the other hand, is only really effective in systems that have strict syntax. Human languages lack that invariant rigor so automated correction of prose is a bit perilous.
Even detection is fraught with peril. I regularly wish to obliterate the obnoxious spelling detectors that underline words correctly spelled but unknown to the spelling detector.
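To make the detection-versus-correction distinction concrete, here is a minimal lint-style sketch in Python. It only reports findings and never modifies the text, leaving the decision to the translator. The `Finding` type, the `lint` function, and the two rules are hypothetical illustrations, not part of Bilara, and a real rule set would need review for each language.

```python
import re
from typing import List, NamedTuple


class Finding(NamedTuple):
    start: int    # offset of the first matched character
    text: str     # the matched text
    message: str  # human-readable description of the issue


# Illustrative rules only (assumptions, not an agreed Bilara rule set).
RULES = [
    (re.compile(r"(?<!\.)\.\.\.(?!\.)"), "three full stops; consider the ellipsis character"),
    (re.compile(r" {2,}"), "multiple consecutive spaces"),
]


def lint(text: str) -> List[Finding]:
    """Report possible punctuation issues without changing the text."""
    findings = []
    for pattern, message in RULES:
        for match in pattern.finditer(text):
            findings.append(Finding(match.start(), match.group(), message))
    # NamedTuples sort by their first field, so results are ordered by offset.
    return sorted(findings)
```

Calling `lint` on a segment returns a list of findings ordered by position; an empty list means none of the illustrative rules matched.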
I think we are going to leave this aside for the time being. It is probably better to handle it via a per-language browser plugin. We have implemented trailing space correction, and this could be extended to cover some unambiguous cases on a language-agnostic basis, e.g. ... -> …
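As a rough illustration of one such unambiguous, language-agnostic case, the following Python sketch replaces exactly three consecutive full stops with the single ellipsis character. The function name `normalize_ellipsis` is hypothetical and not part of Bilara. Whitespace is deliberately left untouched, since the sketch makes no assumption about how the existing trailing space correction behaves.

```python
import re

# Exactly three ASCII full stops that are not part of a longer run of dots.
_THREE_DOTS = re.compile(r"(?<!\.)\.\.\.(?!\.)")


def normalize_ellipsis(text: str) -> str:
    """Replace '...' with the ellipsis character U+2026 (hypothetical helper)."""
    return _THREE_DOTS.sub("\u2026", text)
```

Runs of four or more dots are skipped on purpose, because it is unclear what the author intended in that case.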
It is very common in raw texts for punctuation and spacing to be used in inconsistent and incorrect ways. We should process the text in bilara-data so that it forces correct and consistent usage as far as possible.
These transformations should be applied in the published branch. Any updates to published texts should trigger the corrections again.
Certain transformations may not be universally applicable. For example, inserting soft-hyphens (to allow long Pali words to break) should be per-application. These should not be done in bilara-data.
Following is a list of proposed changes. Not all are necessary; if any prove difficult or buggy, then leave them out.
Global changes
.,:;?!—-
. – 
→ – 
Quote marks
Quotation marks are complex, and vary a lot between languages. Moreover, they are very hard to get right in the Suttas, due to the presence of many multi-segment quotes and multiple levels of nesting. Perhaps we should not try to do this automatically, and let it be up to the authors.
Possibly useful libraries