suttacentral / bilara

Our Computer Aided Translation software

forbid input of strings with straight quotes and other infelicities #113

Open sujato opened 2 years ago

sujato commented 2 years ago

We have a recurring problem with translators inputting "straight quotes" in Bilara. We have effectively dismissed the idea of automatically correcting quotes; there are too many variables.

Perhaps, instead, we can make the Bilara front end simply reject any straight quotes? When the translator inputs a straight quote and presses enter, the whole screen flashes bright red and a siren screams, and a warning pops up: "Danger, inappropriate quote mark!" Or maybe something less dramatic!

Anyway, the point is that the translator can't enter straight quotes even by accident, and they get some kind of indication of what the problem is.
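A minimal sketch of that check, in Python for illustration only. The function names and the message text are hypothetical, not taken from Bilara, and treating the straight apostrophe as forbidden alongside the double quote is an assumption.

```python
# Hypothetical helper names; nothing here is Bilara's actual code.
# Assumption: both the straight double quote and the straight apostrophe are rejected.
STRAIGHT_QUOTES = {
    '"': "straight double quote",
    "'": "straight single quote",
}


def find_straight_quotes(text):
    """Return (index, description) for every straight quote found in text."""
    return [(i, STRAIGHT_QUOTES[ch]) for i, ch in enumerate(text) if ch in STRAIGHT_QUOTES]


def validate_segment(text):
    """Return (ok, message). The message names the first problem when ok is False."""
    problems = find_straight_quotes(text)
    if not problems:
        return True, ""
    index, description = problems[0]
    return False, f"Inappropriate quote mark: {description} at position {index}"
```

The front end could run a check like this when the translator presses enter and show the message instead of saving. Siren optional.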

Perhaps some other things could trigger this, like obviously incorrect punctuation:

..
,.
,,

Etc.
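As a rough sketch, the three sequences listed above could be caught with one regular expression. The names are hypothetical, and the pattern is deliberately naive: it would also flag an ellipsis typed as three full stops, and it knows nothing about punctuation rules in other languages.

```python
import re

# Only the three sequences from the list above: "..", ",." and ",,".
# Illustrative pattern, not a vetted rule set.
SUSPECT_PUNCTUATION = re.compile(r"\.\.|,\.|,,")


def suspect_punctuation(text):
    """Return (index, sequence) for each suspicious punctuation run in text."""
    return [(m.start(), m.group()) for m in SUSPECT_PUNCTUATION.finditer(text)]
```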

sabbamitta commented 2 years ago

Sounds like a good idea to exclude straight quote marks from being entered. (I especially like the siren!)

But I'd be careful with other sorts of punctuation, given the many languages that exist in the world. Maybe still exclude curly brackets, as they are also part of JSON code.

For example, in German you may find the combination "–," (a dash directly followed by a comma), which you'd probably not expect either, but it's correct punctuation. Here is an English sentence with German punctuation:

If you come in winter – which is currently the case –, it will be cold.

A relative clause has to be set off by a comma, and if that relative clause contains a phrase that is surrounded by dashes, the comma rule still applies. The comma even overrides the rule that dashes are surrounded by spaces.

Who knows what unexpected combinations other languages may have!

sujato commented 2 years ago

Indeed, we should probably stick to just blocking the quote marks. Maybe we can keep an eye out for other cases if they arise.

I think curly and straight brackets are okay, though, because if they are inside the JSON they are just treated as a part of the string. The problem with the quote marks is that they say "end the string here".
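The difference can be seen with Python's standard `json` module. This is only an illustration of how JSON strings behave; the key name `segment` is made up, and whether a raw quote ever reaches Bilara's files unescaped is an assumption.

```python
import json


def parses(raw):
    """Return True if raw is valid JSON, False otherwise."""
    try:
        json.loads(raw)
        return True
    except json.JSONDecodeError:
        return False


# Brackets inside a JSON string are ordinary characters.
with_brackets = '{"segment": "a {curly} and [square] aside"}'

# An unescaped straight double quote ends the string early, so the rest is a syntax error.
with_raw_quote = '{"segment": "he said "hello" to me"}'

# The same text is valid only when each inner quote is escaped with a backslash.
with_escaped_quote = '{"segment": "he said \\"hello\\" to me"}'
```

Here `parses(with_brackets)` and `parses(with_escaped_quote)` are true, while `parses(with_raw_quote)` is false.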

sujato commented 1 year ago

Perhaps we could blacklist input, then whitelist per language.

For example it is quite common in the wild to find Ᾱnanda instead of Ānanda. See the difference? The first one is Greek. So this could be forbidden for all languages except Greek.
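One way this could look, sketched in Python. The tables and the function name are hypothetical, and taking the first word of a character's Unicode name as its script is a rough heuristic rather than a proper script lookup.

```python
import unicodedata

# Hypothetical configuration: scripts blocked by default, and the languages
# that may use them anyway. "el" is the ISO 639-1 code for Greek.
BLOCKED_SCRIPTS = {"GREEK"}
ALLOWED_SCRIPTS_BY_LANG = {"el": {"GREEK"}}


def blocked_characters(text, lang):
    """Return (character, unicode_name) for each character not allowed in lang."""
    allowed = ALLOWED_SCRIPTS_BY_LANG.get(lang, set())
    found = []
    for ch in text:
        name = unicodedata.name(ch, "")
        # Heuristic: the Unicode name starts with the script, e.g. "GREEK CAPITAL LETTER ...".
        script = name.split(" ", 1)[0]
        if script in BLOCKED_SCRIPTS and script not in allowed:
            found.append((ch, name))
    return found
```

With these tables, a word starting with U+1FB9 (Greek capital alpha with macron) is reported for `lang="en"` but accepted for `lang="el"`, while U+0100 (Latin capital A with macron) passes everywhere.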