mbutterick / pollen

book-publishing system [mirror of main repo at https://git.matthewbutterick.com/mbutterick/pollen]
MIT License

Incorrect conversion of quotes in an Urdu string #266

Closed (closed by saadatm 2 years ago)

saadatm commented 2 years ago

Found a case in which smart-quotes incorrectly converts straight quotes of an Urdu string. It happens when the closing quote is immediately followed by an Urdu full stop (or some other Urdu punctuation mark):

; Correct
> (define str-en "This is \"a sentence\".")
> (display (smart-quotes str-en))
This is “a sentence”.

; Incorrect
> (define str-ur "یہ ایک \"جملہ ہے\"۔")
> (display (smart-quotes str-ur #:double-open "”" #:double-close "“"))
یہ ایک ”جملہ ہے”۔

The result should have been یہ ایک ”جملہ ہے“۔. Note that str-ur ends with U+06D4 ARABIC FULL STOP, which is the character for ending sentences in Urdu.

Interestingly, if we end the Urdu string with the English full stop (i.e. U+002E FULL STOP), the result is correct:

> (define str-ur-2 "یہ ایک \"جملہ ہے\".")
> (display (smart-quotes str-ur-2 #:double-open "”" #:double-close "“"))
یہ ایک ”جملہ ہے“.

(Sidenote: The output appears a bit weird due to the directionality of the characters. Here it is after applying the right-to-left direction: یہ ایک ”جملہ ہے“.)

I expected that ending the English string with the Urdu full stop would also produce incorrect output, but the result turned out to be correct:

> (define str-en-2 "This is \"a sentence\"۔")
> (display (smart-quotes str-en-2))
This is “a sentence”۔

So I tested every combination of these strings and punctuation marks:

  1. an English string,
  2. an Urdu string,
  3. English full stop, comma, question mark, and semicolon, and
  4. Urdu full stop, comma, question mark, and semicolon

(For simplicity, [o] and [c] stand in for the opening and closing curly quotes.)

> (for* ([str '("This is \"a sentence\"" "یہ ایک \"جملہ ہے\"")]
         [punctuation '("." "," "?" ";" "۔" "،" "؟" "؛")])
      (display (smart-quotes (string-append str punctuation "\n")
                             #:double-open "[o]" #:double-close "[c]")))
This is [o]a sentence[c].
This is [o]a sentence[c],
This is [o]a sentence[c]?
This is [o]a sentence[c];
This is [o]a sentence[c]۔
This is [o]a sentence[c]،
This is [o]a sentence[c]؟
This is [o]a sentence[c]؛
یہ ایک [o]جملہ ہے[c].
یہ ایک [o]جملہ ہے[c],
یہ ایک [o]جملہ ہے[c]?
یہ ایک [o]جملہ ہے[c];
یہ ایک [o]جملہ ہے[o]۔
یہ ایک [o]جملہ ہے[o]،
یہ ایک [o]جملہ ہے[o]؟
یہ ایک [o]جملہ ہے[o]؛

As the results show, the output is incorrect only when the Urdu string ends with Urdu punctuation. I'm not sure why this happens, though.
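One possible explanation can be sketched under stated assumptions. In Racket pregexps, `\w` matches only ASCII word characters, so an Urdu letter just before a quote does not count as a word character. If the opening-quote rule is roughly "not preceded by `\w`, and followed by non-whitespace that is not in a short list of ASCII punctuation", then a closing quote that sits between an Urdu letter and an Urdu punctuation mark satisfies it. The pattern `opening-rx` and the helper `classify-quotes` below are hypothetical simplifications, not copied from Pollen's source.

```racket
#lang racket/base

;; Hypothetical opening-quote rule (an assumption, not Pollen's actual regexp):
;; a straight double quote that is NOT preceded by an ASCII word character,
;; IS followed by non-whitespace, and is NOT followed by listed ASCII punctuation.
(define opening-rx #px"(?<!\\w)\"(?=\\S)(?![,.:;?!])")

;; Label every straight double quote in str as 'open or 'close,
;; depending on whether the hypothetical opening rule matches at that position.
(define (classify-quotes str)
  (define opening-positions (regexp-match-positions* opening-rx str))
  (for/list ([pos (in-list (regexp-match-positions* #rx"\"" str))])
    (if (member pos opening-positions) 'open 'close)))

(classify-quotes "This is \"a sentence\".")  ; => '(open close)
(classify-quotes "This is \"a sentence\"۔")  ; => '(open close)
(classify-quotes "یہ ایک \"جملہ ہے\".")       ; => '(open close)
(classify-quotes "یہ ایک \"جملہ ہے\"۔")       ; => '(open open)

;; \w is ASCII-only in Racket pregexps, so an Urdu letter does not match it.
(regexp-match? #px"\\w" "ے")                 ; => #f
```

Under these assumptions, the sketch gives the same pattern as the table above: only the Urdu string followed by Urdu punctuation gets two opening quotes.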

mbutterick commented 2 years ago

smart-quotes isn’t part of the supported public interface for Pollen because it makes no attempt to behave well beyond the easy cases. (That’s why it’s in the unstable directory.)

I suggest taking the existing code as a starting point and making a function that works better for your project.

saadatm commented 2 years ago

Thanks. I fiddled with the code, and adding the Urdu punctuation marks to sentence-ender-exceptions fixed the issue.

I'll be using the modified version in my project, but how does adding a new keyword argument to smart-quotes for passing additional punctuation marks sound to you? It could default to an empty string, and whatever the user passes in would be appended to the regex being used in sentence-ender-exceptions. I'll be glad to open a PR if you think it's a good idea. As far as I can tell, sentence-ender-exceptions is not used anywhere other than smart-quotes.
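A minimal sketch of what such a keyword argument might look like, handling double quotes only. Everything here is an assumption made for illustration: the function name `sketch-double-quotes`, the keyword `#:extra-enders`, the value of `base-enders`, and the shape of the regexp are hypothetical and do not reflect Pollen's actual code. The extra characters are assumed to be plain punctuation marks.

```racket
#lang racket/base
(require racket/string)

;; Hypothetical stand-in for the list the thread calls
;; sentence-ender-exceptions; Pollen's actual value may differ.
(define base-enders ",.:;?!")

;; Turn a string of characters into a regexp alternation such as "a|b",
;; escaping each character so it is matched literally.
(define (chars->alternation chars)
  (string-join (for/list ([c (in-string chars)])
                 (regexp-quote (string c)))
               "|"))

(define (sketch-double-quotes str
                              #:double-open [open-str "“"]
                              #:double-close [close-str "”"]
                              #:extra-enders [extra ""]) ; hypothetical keyword
  ;; A quote counts as opening when it is not preceded by an ASCII word
  ;; character, is followed by non-whitespace, and is not followed by any
  ;; of the listed punctuation marks.
  (define opening-rx
    (pregexp (string-append "(?<!\\w)\"(?=\\S)(?!(?:"
                            (chars->alternation (string-append base-enders extra))
                            "))")))
  ;; Convert the opening quotes first; every remaining quote is a closer.
  (regexp-replace* #rx"\""
                   (regexp-replace* opening-rx str open-str)
                   close-str))

;; With the default (empty) extra list, the Urdu full stop is not recognized:
(sketch-double-quotes "یہ ایک \"جملہ ہے\"۔"
                      #:double-open "[o]" #:double-close "[c]")

;; Passing the Urdu marks makes the second quote a closer:
(sketch-double-quotes "یہ ایک \"جملہ ہے\"۔"
                      #:double-open "[o]" #:double-close "[c]"
                      #:extra-enders "۔،؟؛")
```

Defaulting the keyword to an empty string keeps the behavior unchanged for callers who do not pass it.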

mbutterick commented 2 years ago

1) For your purposes, would it fix smart-quotes to add the Urdu full stop to the current value of sentence-ender-exceptions?

2) Is there some Unicode character class that covers what sentence-ender-exceptions is attempting to be?

I’m not averse to the PR you propose, but I think sentence-ender-exceptions is the wrong way to do things, and building a public interface around it doesn’t make it less wrong.

saadatm commented 2 years ago

  1. Yes, but not just the Urdu full stop: the Urdu comma (،), Urdu question mark (؟), and Urdu semicolon (؛) too.
  2. I am not sure. Maybe Punctuation, Other (Po) (minus the straight quotes); Punctuation, Open (Ps); and Punctuation, Close (Pe)?

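The category idea in item 2 could be sketched with Racket's `char-general-category`, which reports a character's Unicode general category as a symbol such as `'po`. The helpers `sentence-ender-like?`, `ascii-word-char?`, and `sketch-double-quotes` are hypothetical names, and the surrounding rule (ASCII-only word check before the quote, non-whitespace after it) is an assumption about the general approach, not Pollen's implementation.

```racket
#lang racket/base

;; Hypothetical predicate: is c in Po, Ps, or Pe, excluding the straight quotes?
(define (sentence-ender-like? c)
  (and (memq (char-general-category c) '(po ps pe))
       (not (memv c '(#\" #\')))
       #t))

;; ASCII-only word test, mirroring how \w behaves in Racket pregexps.
(define (ascii-word-char? c)
  (regexp-match? #px"\\w" (string c)))

(define (sketch-double-quotes str
                              #:double-open [open-str "“"]
                              #:double-close [close-str "”"])
  (define n (string-length str))
  (define out (open-output-string))
  (for ([c (in-string str)] [i (in-naturals)])
    (cond
      [(char=? c #\")
       (define prev (and (> i 0) (string-ref str (sub1 i))))
       (define next (and (< (add1 i) n) (string-ref str (add1 i))))
       ;; Opening quote: something non-blank follows that is not
       ;; punctuation-like, and no ASCII word character precedes it.
       (define opener?
         (and next
              (not (char-whitespace? next))
              (not (sentence-ender-like? next))
              (not (and prev (ascii-word-char? prev)))))
       (write-string (if opener? open-str close-str) out)]
      [else (write-char c out)]))
  (get-output-string out))

(sketch-double-quotes "یہ ایک \"جملہ ہے\"۔"
                      #:double-open "[o]" #:double-close "[c]")
;; => "یہ ایک [o]جملہ ہے[c]۔"
```

One consequence of including Ps: a quote placed directly before an opening bracket, as in `"(like this)"`, would be treated as a closing quote, so the category set may need trimming for a given project.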
saadatm commented 2 years ago

I have added a custom smart quotes function (based on the original) in my project. Thanks for the discussion. :-)