SuttaLint: Force correct punctuation and spacing

sujato commented 3 years ago

It is very common in raw texts for punctuation and spacing to be used in inconsistent and incorrect ways. We should process the text in bilara-data so that it forces correct and consistent usage as far as possible.

Multiple languages
Deeply nested quotes (up to four or five levels)
Widely separated open and close quotes (up to a whole sutta)
There will surely, in our massive corpus, be exceptions to any rule, so there should be no invisible changes: changes should be visible to the author.
Certain issues are peculiar to us; for example Greek Ᾱ to Latin Ā.
Has to work with Bilara.

These transformations should be applied in the published branch. Any updates to published texts should trigger the corrections again.

Certain transformations may not be universally applicable. For example, inserting soft-hyphens (to allow long Pali words to break) should be per-application. These should not be done in bilara-data.

[ ] Transform the text in bilara-data when the change is applicable to all applications.
[ ] Document the specific changes that are made.

Following is a list of proposed changes. Not all are necessary; if any prove difficult or buggy, leave them out.

Global changes

double hyphen to en-dash: -- → –
minus to en-dash: − → –
triple hyphen to em-dash: --- → —
triple dots to ellipsis: ... → …
no line breaks in segments
Greek Ᾱ to Latin Ā (except for, well, Greek!)
No double-space.
No no-break space (except before en-dash).
No space after em-dash: —
One trailing space at the end of the segment except following em-dash.
No space before the following: .,:;?!—-.
If en-dash is surrounded by spaces, make the preceding space a no-break-space:  –  →  –
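
As a rough illustration of the global changes listed above, here is a minimal Python sketch. The function name `normalize_segment` and the `is_greek` flag are hypothetical and not part of bilara-data, and the order in which the rules run is an assumption (the triple hyphen is handled before the double hyphen so that the longer pattern wins). It handles only the capital Ᾱ, as in the list.

```python
import re

NBSP = "\u00a0"
EN_DASH = "\u2013"
EM_DASH = "\u2014"


def normalize_segment(text: str, is_greek: bool = False) -> str:
    """Apply the proposed global punctuation and spacing rules to one segment."""
    # Longest pattern first, so "---" is not read as "--" plus "-".
    text = text.replace("---", EM_DASH)
    text = text.replace("--", EN_DASH)
    text = text.replace("\u2212", EN_DASH)   # minus sign to en-dash
    text = text.replace("...", "\u2026")     # three dots to ellipsis
    if not is_greek:
        text = text.replace("\u1fb9", "\u0100")  # Greek alpha with macron to Latin A with macron
    # Line breaks and no-break spaces become ordinary spaces.
    text = re.sub(r"[\r\n\u00a0]+", " ", text)
    # No double spaces.
    text = re.sub(r" {2,}", " ", text)
    # No space before . , : ; ? ! em-dash or hyphen.
    text = re.sub(r" +([.,:;?!\u2014-])", r"\1", text)
    # No space after em-dash.
    text = re.sub(r"\u2014 +", EM_DASH, text)
    # En-dash surrounded by spaces: the preceding space becomes a no-break space.
    text = text.replace(" " + EN_DASH + " ", NBSP + EN_DASH + " ")
    # Exactly one trailing space, except after an em-dash.
    text = text.rstrip(" ")
    if text and not text.endswith(EM_DASH):
        text += " "
    return text
```

Note that the "no space before hyphen" rule, applied literally, also closes up a spaced hyphen used as a dash (`a - b` becomes `a- b`), which is one reason the changes should remain visible to the author.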

Quote marks

Quotation marks are complex, and vary a lot between languages. Moreover, they are very hard to get right in the Suttas, due to the presence of many multi-segment quotes and multiple levels of nesting. Perhaps we should not try to do this automatically, and let it be up to the authors.
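
If automatic correction is off the table, a check that only reports unbalanced quotes across segments may still be useful. The following is a minimal sketch under several assumptions: segments arrive as an ordered mapping of segment ID to text, the IDs shown in the usage are made up for illustration, and only curly double quotes are tracked, because the closing single quote is the same character as the apostrophe. The function name `check_double_quotes` is hypothetical.

```python
OPEN_DOUBLE = "\u201c"   # left double quotation mark
CLOSE_DOUBLE = "\u201d"  # right double quotation mark


def check_double_quotes(segments: dict[str, str]) -> list[str]:
    """Report unbalanced curly double quotes across an ordered set of segments."""
    problems = []
    open_stack = []  # segment IDs where a still-open quote began
    for seg_id, text in segments.items():
        for ch in text:
            if ch == OPEN_DOUBLE:
                open_stack.append(seg_id)
            elif ch == CLOSE_DOUBLE:
                if open_stack:
                    open_stack.pop()
                else:
                    problems.append(f"{seg_id}: closing quote with no opener")
    # Anything left on the stack was opened but never closed.
    problems.extend(f"{seg_id}: opening quote never closed" for seg_id in open_stack)
    return problems


# Usage with made-up segment IDs: the quote opens in one segment and closes in the next.
example = {
    "x:1.1": "He said, \u201cListen. ",
    "x:1.2": "\u2018Yes,\u2019 they replied.\u201d ",
}
print(check_double_quotes(example))
```

Because the stack persists across segments, a quote that opens in one segment and closes much later is treated as balanced. Conventions that reopen a quote at each new paragraph without closing the previous one would be flagged, so the output is a prompt for the author to look, not a verdict.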

Possibly useful libraries

https://github.com/errata-ai/vale
https://textlint.github.io/
https://github.com/Zemke/instant-smart-quotes
Discourses has a quote-corrector, I think based on smartypants.

ccronje commented 3 years ago

I think the tool should assist the translator in the Bilara app in real time (like an auto-correction tool). This way the translator can make decisions that are too difficult for a tool to make post-translation. For example, https://github.com/Zemke/instant-smart-quotes supports multiple languages. Perhaps we could auto-select the language from _project.json data. The tool should be ON by default to ensure a consistent approach to punctuation across Bilara projects, but the translator could have the option to toggle the tool OFF if 'manual override' is required.

firepick1 commented 3 years ago

Detection is far easier than correction. The original idea of lint is detection, allowing the user to decide what works for them. Correction, on the other hand, is only really effective in systems that have strict syntax. Human languages lack that invariant rigor so automated correction of prose is a bit perilous.

Even detection is fraught with peril. I regularly wish to obliterate the obnoxious spelling detectors that underline words correctly spelled but unknown to the spelling detector.
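
To make the detection-only approach concrete, here is a minimal sketch of a linter that reports findings and never modifies the text. The rule names, the `Finding` type, and `lint_segments` are all hypothetical, and the rule set is a small sample of the proposed changes, not a complete list.

```python
import re
from typing import NamedTuple


class Finding(NamedTuple):
    segment_id: str
    rule: str
    offset: int  # character offset of the match within the segment


# Each rule is a name plus a pattern that flags a suspect span.
RULES = [
    ("double-space", re.compile(r" {2,}")),
    ("three-dots", re.compile(r"\.\.\.")),
    ("space-before-punctuation", re.compile(r" [.,:;?!]")),
    ("line-break", re.compile(r"[\r\n]")),
]


def lint_segments(segments: dict[str, str]) -> list[Finding]:
    """Return every rule match; the text itself is left unchanged."""
    findings = []
    for seg_id, text in segments.items():
        for rule, pattern in RULES:
            for match in pattern.finditer(text):
                findings.append(Finding(seg_id, rule, match.start()))
    return findings
```

Since the output is just a list of findings, the author decides what to act on, and a rule that produces too many false positives can be dropped from `RULES` without affecting the others.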

sujato commented 2 years ago

I think we are going to leave this aside for the time being. It is probably better to handle it via a per-language browser plugin. We have implemented trailing space correction, and this could be extended to cover some unambiguous cases on a language-agnostic basis, e.g. ... → …

suttacentral / bilara-data