suttacentral / bilara-data

Content for Bilara translation webapp.
https://bilara.suttacentral.net

SuttaLint: Force correct punctuation and spacing #351

Closed: sujato closed this issue 2 years ago

sujato commented 3 years ago

It is very common in raw texts for punctuation and spacing to be used in inconsistent and incorrect ways. We should process the text in bilara-data so that it forces correct and consistent usage as far as possible.

These transformations should be applied in the published branch. Any updates to published texts should trigger the corrections again.

Certain transformations may not be universally applicable. For example, inserting soft-hyphens (to allow long Pali words to break) should be per-application. These should not be done in bilara-data.

Following is a list of proposed changes. Not all are necessary; if any prove difficult or buggy, leave them out.

Global changes

Quote marks

Quotation marks are complex, and vary a lot between languages. Moreover, they are very hard to get right in the Suttas, due to the presence of many multi-segment quotes and multiple levels of nesting. Perhaps we should not try to do this automatically, and let it be up to the authors.

Possibly useful libraries

ccronje commented 3 years ago

I think the tool should assist the translator in the Bilara app in real time (like an auto-correction tool). This way the translator can make decisions that are too difficult for a tool to make post-translation. For example, https://github.com/Zemke/instant-smart-quotes supports multiple languages. Perhaps we could auto-select the language from _project.json data. The tool should be ON by default to ensure a consistent approach to punctuation across Bilara projects, but the translator could have the option to toggle the tool OFF if 'manual override' is required.

firepick1 commented 3 years ago

Detection is far easier than correction. The original idea of lint is detection, allowing the user to decide what works for them. Correction, on the other hand, is only really effective in systems that have strict syntax. Human languages lack that invariant rigor, so automated correction of prose is a bit perilous.

Even detection is fraught with peril. I regularly wish to obliterate the obnoxious spelling detectors that underline words correctly spelled but unknown to the spelling detector.
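
A minimal sketch of the detect-only approach described above: it reports possible problems and changes nothing, leaving the decision to the translator. It assumes segments are available as a mapping of segment ID to text; the rule names, the patterns, and the `lint_segments` function are hypothetical illustrations, not an existing SuttaLint API.

```python
import re

# Hypothetical, deliberately small rule set: (rule name, compiled pattern).
# These are illustrative and language-agnostic; real rules would likely
# need to vary per language.
RULES = [
    ("double-space", re.compile(r" {2,}")),
    ("space-before-comma", re.compile(r" ,")),
    ("ascii-ellipsis", re.compile(r"\.{3}")),
]


def lint_segments(segments):
    """Return a list of (segment_id, rule_name, offset) findings.

    Nothing is modified: the caller decides what, if anything, to fix.
    """
    findings = []
    for seg_id, text in segments.items():
        for name, pattern in RULES:
            for match in pattern.finditer(text):
                findings.append((seg_id, name, match.start()))
    return findings


if __name__ == "__main__":
    # Assumed input shape: segment ID -> segment text.
    sample = {
        "mn1:1.1": "So  I have heard ...",
        "mn1:1.2": "Fine text.",
    }
    for finding in lint_segments(sample):
        print(finding)
```

Because the output is only a list of findings, a false positive costs the translator a glance rather than a silently altered text.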

sujato commented 2 years ago

I think we are going to leave this aside for the time being. It is probably better to handle it via a browser plugin, per language. We have implemented trailing space correction, and this could be extended to cover some unambiguous cases on a language-agnostic basis. E.g. ... -> …
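
As a rough illustration of one such unambiguous, language-agnostic case, the sketch below replaces a run of exactly three ASCII dots with the single ellipsis character. It is not the existing trailing space correction, whose implementation is not shown here; the function names and the segment ID to text mapping are assumptions made for the example.

```python
import re

# Exactly three dots, not part of a longer run of dots.
# Runs of two or of four and more are left alone, since their intent is unclear.
_THREE_DOTS = re.compile(r"(?<!\.)\.{3}(?!\.)")


def normalize_ellipsis(text):
    """Replace '...' with the single character U+2026 (horizontal ellipsis)."""
    return _THREE_DOTS.sub("\u2026", text)


def normalize_segments(segments):
    """Apply the correction to every segment, returning a new mapping."""
    return {seg_id: normalize_ellipsis(text) for seg_id, text in segments.items()}


if __name__ == "__main__":
    # Assumed input shape: segment ID -> segment text.
    sample = {"mn1:1.1": "and so on ...", "mn1:1.2": "wait...."}
    print(normalize_segments(sample))
```

Leaving longer or shorter dot runs untouched keeps the rule conservative: anything that is not plainly an ellipsis stays as the author wrote it.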