saadatm closed this 2 years ago
`smart-quotes` isn’t part of the supported public interface for Pollen because it makes no attempt to behave well beyond the easy cases. (That’s why it’s in the `unstable` directory.)
I suggest taking the existing code as a starting point and making a function that works better for your project.
Thanks. I fiddled with the code, and adding the Urdu punctuation marks to `sentence-ender-exceptions` fixed the issue.
I'll be using the modified version in my project, but how does adding a new keyword argument to `smart-quotes` for passing additional punctuation marks sound to you? It could be an empty string by default, and whatever the user passes in would be appended to the regex used in `sentence-ender-exceptions`. I'll be glad to open a PR if you think it's a good idea. As far as I can tell, `sentence-ender-exceptions` is not used anywhere other than `smart-quotes`.
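To make the proposal concrete, here is a minimal sketch of what such a keyword argument could look like. It is not Pollen's actual code: the function name `curl-double-quotes`, the keyword `#:extra-enders`, and the three simplified rules are assumptions for illustration, and `[o]` / `[c]` stand in for the opening and closing curly quotes.

```racket
#lang racket/base

;; Hypothetical sketch, not Pollen's implementation.
;; #:extra-enders is the proposed (assumed) keyword argument.
(define (curl-double-quotes str #:extra-enders [extra-enders ""])
  ;; Punctuation that may directly follow a closing quote.
  ;; Caller-supplied marks are appended to the ASCII defaults.
  (define enders (regexp-quote (string-append ",.:;?!" extra-enders)))
  (define rules
    (list
     ;; Quote followed by listed punctuation: closing.
     (cons (pregexp (format "(?<!\\w)\"(?=[~a])" enders)) "[c]")
     ;; Quote not preceded by an ASCII word character and followed
     ;; by a non-space character: opening.
     (cons #px"(?<!\\w)\"(?=\\S)" "[o]")
     ;; Quote preceded by a non-space character and not followed
     ;; by an ASCII word character: closing.
     (cons #px"(?<=\\S)\"(?!\\w)" "[c]")))
  ;; Apply the rules in order; earlier rules win.
  (for/fold ([s str]) ([rule (in-list rules)])
    (regexp-replace* (car rule) s (cdr rule))))

;; Default behavior: the Urdu full stop (U+06D4) is not a known
;; ender, so the closing quote is also marked [o].
(curl-double-quotes "یہ ایک \"جملہ ہے\"\u06D4")

;; With the Urdu full stop, comma, question mark, and semicolon
;; passed in, the closing quote is marked [c].
(curl-double-quotes "یہ ایک \"جملہ ہے\"\u06D4"
                    #:extra-enders "\u06D4\u060C\u061F\u061B")
```

Defaulting to an empty string keeps existing callers unaffected, which is the main appeal of the keyword approach.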
1) For your purposes, would it fix `smart-quotes` to add the Urdu full stop to the current value of `sentence-ender-exceptions`?
2) Is there some Unicode character class that covers what `sentence-ender-exceptions` is attempting to be?
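On the second question, one candidate worth testing is the Unicode general categories that Racket's `#px` syntax exposes through `\p{...}`. The sketch below checks a few characters against `Po` (other punctuation) and `Pe` (close punctuation). The name `closing-punct` is made up for illustration, and whether these categories are a good fit is an open question: `Po` is much broader than "sentence enders".

```racket
#lang racket/base

;; Hypothetical pattern: a single character in Unicode category
;; Po (other punctuation) or Pe (close punctuation).
(define closing-punct #px"^(?:\\p{Po}|\\p{Pe})$")

;; ASCII enders and the Urdu full stop, comma, question mark,
;; and semicolon all fall into Po or Pe.
(for ([ch (in-list '("." "?" ")" "\u06D4" "\u060C" "\u061F" "\u061B"))])
  (printf "~a ~a\n" ch (regexp-match? closing-punct ch)))

;; Caveat: Po also contains characters that are not sentence
;; enders, including # and the straight double quote itself.
(regexp-match? closing-punct "#")
(regexp-match? closing-punct "\"")
```

Because the straight quote characters are themselves in `Po`, a rule built on this class would need to exclude them explicitly.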
I’m not averse to the PR you propose, but I think `sentence-ender-exceptions` is the wrong way to do things, and building a public interface around it doesn’t make it less wrong.
… Urdu comma (،), Urdu question mark (؟), and Urdu semicolon (؛) too.
I have added a custom smart quotes function (based on the original) in my project. Thanks for the discussion. :-)
Found a case in which `smart-quotes` incorrectly converts straight quotes of an Urdu string. It happens when the closing quote is immediately followed by an Urdu full stop (or some other Urdu punctuation mark):
The result should have been `یہ ایک ”جملہ ہے“۔`. Note that `str-ur` ends with `U+06D4 ARABIC FULL STOP`, which is the character for ending sentences in Urdu.
Interestingly, if we end the Urdu string with the English full stop (i.e. `U+002E FULL STOP`), the result is correct:
(Sidenote: The output appears a bit weird due to the directionality of the characters. Here it is after applying the right-to-left direction: `یہ ایک ”جملہ ہے“.`)
I was expecting that if we ended the English string with the Urdu full stop, then the output would be incorrect, but it actually turned out to be correct:
So I tested the combinations of:
... (with `[o]` and `[c]` acting as opening and closing curly quotes respectively for simplification):
As the results show, the output is incorrect only when the Urdu string ends with Urdu punctuation. Not sure why this is happening, though.
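One plausible explanation, offered as an assumption and not confirmed against the Pollen source: if the quote rules rely on `\w`, Racket's `#px` syntax treats `\w` as ASCII-only, so an Urdu letter before the closing quote does not count as a word character. The sketch below uses a made-up `opening-rule` that marks a quote as opening when it is not preceded by a word character and is followed by a non-space character.

```racket
#lang racket/base

;; In #px syntax, \w covers only ASCII letters, digits, and _.
(regexp-match? #px"\\w" "e")   ; #t: ASCII letter
(regexp-match? #px"\\w" "ے")   ; #f: Urdu letter

;; Assumed "opening quote" rule, for illustration only.
(define opening-rule #px"(?<!\\w)\"(?=\\S)")

;; English word + quote + Urdu full stop (U+06D4): the quote is
;; preceded by an ASCII word character, so the rule does not fire.
(regexp-match? opening-rule "sentence\"\u06D4")   ; #f

;; Urdu word + quote + Urdu full stop: the Urdu letter is not \w
;; and the full stop is non-space, so the closing quote is
;; mistaken for an opening one.
(regexp-match? opening-rule "ہے\"\u06D4")          ; #t
```

Under that assumption, the English full stop works after an Urdu string only because it is already in the list of ender exceptions, which would match all four observed combinations.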