Closed rgov closed 2 months ago
Heya! Definitely possible. It‘s been ages since I looked into this algorithm.
Here’s the code:
The SmartyPants pl code seems to be around L442 there.
As you have that running, perhaps you are able to add some print
statements there, to debug which steps are doing the transform, and then see what’s missing in the code here?
Thanks. Normally, if you write say ("This is a test")
, the closing quote is converted with this case:
But, in my example ("This is a test.")
, the conditions aren't met and it falls through to the default case (opening quote). The condition isn't met because the previous node is a PunctuationNode.
{
type: 'PunctuationNode',
value: '.',
position: {
start: { line: 1, column: 17, offset: 16 },
end: { line: 1, column: 18, offset: 17 }
}
}
The part of the original Perl that transforms that character is here:
my $close_class = qr![^\ \t\r\n\[\{\(\-]!;
# Double closing quotes:
s {
($close_class)?
"
(?(1)|(?=\s)) # If $1 captured, then do nothing;
# if not, then make sure the next char is whitespace.
} {$1”}xg;
The logic between the two is not equivalent.
$close_class
is any character excluding whitespace, hyphens, and opening brackets ([{
.$close_class
, then the next character must be whitespace.Note there is a second place where the Perl code outputs a double closing quote, which is:
# Special case if the very first character is a quote
# followed by punctuation at a non-word-break. Close the quotes by brute force:
s/^"(?=$punct_class\B)/”/;
In retext-smartypants that's here, but again the translation differs. I don't think nextNext.type !== 'WordNode'
is semantically equivalent to \B
.
I think anywhere the Perl code uses $punct_class
you're clear to use node.type === 'PunctuationNode' || node.type === 'SymbolNode'
but where the Perl code uses other classes it's not equivalent.
nice research!
want to work on a PR, test out whether that change is enough?
I interpret the meaning of \B
as basically "the following character is the same type as this one" where "type" means "word" or "non-word":
if (
next &&
nextNext &&
(next.type === 'PunctuationNode' || next.type === 'SymbolNode') &&
((next.type === 'WordNode' && nextNext.type === 'WordNode') ||
(next.type !== 'WordNode' && nextNext.type !=='WordNode'))
)
But there's a weird hitch where non-word at the end of the string matches ("%" =~ /.\B/
), because we pretend there's a non-word at the end of the string:
\b
and\B
assume there's a non-word character before the beginning and after the end of the source string; so\b
will match at the beginning (or end) of the source string if the source string begins (or ends) with a word character. Otherwise,\B
will match.
Translating that, I think it would be:
if (
next &&
(next.type === 'PunctuationNode' || next.type === 'SymbolNode') &&
(!nextNext || nextNext.type !== 'WordNode')
)
Here are some test cases:
Test case | Explanation | SmartyPants 1.5.1 | retext-smartypants 6.1.0 | Suggested fix |
---|---|---|---|---|
"~a |
Quote, punctuation, word break | “~a |
✅ “~a |
✅ “~a |
"~* |
Quote, punctuation, non-word-break, different symbols | ”~* |
✅ ”~* |
✅ ”~* |
"~~a |
Quote, punctuation, non-word-break, same symbol | ”~~a |
❌ “~~a |
❌ “~~a |
"~ |
Quote, punctuation, end-of-string | ”~ |
❌ “~ |
✅ ”~ |
The third test case still fails because repeated occurrences of the same symbol are combined into a single node. How do you recommend we get the desired behavior? Something with next.value.length
?
thanks, released! https://github.com/retextjs/retext-smartypants/releases/tag/6.1.1
I don’t really care for the "~~a
difference. SmartyPants is buggy. This uses ASTs and can be smarter. If there was a good real case, then maybe.
Initial checklist
Affected packages and versions
6.1.0
Steps to reproduce
I am transforming text that contains a parenthetical quote written in quotation marks, like:
("This is a test.")
. The output contains two opening double quotes. The second quote should be a closing double quote.As a more minimal input
("A.")
reproduces the issue. Interestingly, removing the period causes it to output correctly.I've encoded the resulting output as hex-encoded-UTF-8 to minimize the chance of macOS helpfully correcting anything for me.
The above code outputs
28e2809c412ee2809c29
, showing that both the quotation marks were transformed to the same sequence,e2809c
, an opening double quote.SmartyPants 1.5.1 transforms the text correctly:
Affected runtime and version
node@22.5.1