jgm / pandoc

Universal markup converter
https://pandoc.org

unicode character seems to swallow other characters during a round-trip conversion from HTML to RTF and back #8264

Closed huyz closed 2 years ago

huyz commented 2 years ago

Explain the problem.

When I have a non-breaking space in HTML and convert to RTF and then back to HTML, this causes adjoining characters to be swallowed:

❯ printf '<span>\xC2\xA0</span>curious\n' | pandoc -f html -t rtf | pandoc -f rtf -t html
<p> urious</p>

I would expect the c of curious to survive this round-trip rather than be dropped.

Pandoc version?

pandoc 2.19.2, macOS 12.5.1 (Monterey), Apple M1 Max (ARM)

jgm commented 2 years ago

Simple repro:

% pandoc -f rtf -t html
{\pard \ql \f0 \sa180 \li0 \fi0 a\u160?c\par}
<p>a </p>

The c disappears.

jgm commented 2 years ago

Relevant part of tokenization:

Tok (line 1, column 33) (UnformattedText "a")
Tok (line 1, column 34) (ControlWord "u" (Just 160))
Tok (line 1, column 40) (UnformattedText "c")
Tok (line 1, column 41) (ControlWord "par" Nothing)

jgm commented 2 years ago

Relevant parts of the spec:

\uN This keyword represents a single Unicode character which has no equivalent ANSI representation based on the current ANSI code page. N represents the Unicode character value expressed as a decimal number.

This keyword is followed immediately by equivalent character(s) in ANSI representation. In this way, old readers will ignore the \uN keyword and pick up the ANSI representation properly. When this keyword is encountered, the reader should ignore the next N characters, where N corresponds to the last \ucN value encountered.

As with all RTF keywords, a keyword-terminating space may be present (before the ANSI characters) which is not counted in the characters to skip. While this is not likely to occur (or recommended), a \bin keyword, its argument, and the binary data that follows are considered one character for skipping purposes. If an RTF scope delimiter character (that is, an opening or closing brace) is encountered while scanning skippable data, the skippable data is considered to be ended before the delimiter. This makes it possible for a reader to perform some rudimentary error recovery. To include an RTF delimiter in skippable data, it must be represented using the appropriate control symbol (that is, escaped with a backslash,) as in plain text. Any RTF control word or symbol is considered a single character for the purposes of counting skippable characters.

An RTF writer, when it encounters a Unicode character with no corresponding ANSI character, should output \uN followed by the best ANSI representation it can manage. Also, if the Unicode character translates into an ANSI character stream with count of bytes differing from the current Unicode Character Byte Count, it should emit the \ucN keyword prior to the \uN keyword to notify the reader of the change.

RTF control words generally accept signed 16-bit numbers as arguments. For this reason, Unicode values greater than 32767 must be expressed as negative numbers.

\ucN This keyword represents the number of bytes corresponding to a given \uN Unicode character. This keyword may be used at any time, and values are scoped like character properties. That is, a \ucN keyword applies only to text following the keyword, and within the same (or deeper) nested braces. On exiting the group, the previous \uc value is restored. The reader must keep a stack of counts seen and use the most recent one to skip the appropriate number of characters when it encounters a \uN keyword. When leaving an RTF group which specified a \uc value, the reader must revert to the previous value. A default of 1 should be assumed if no \uc keyword has been seen in the current or outer scopes.

A common practice is to emit no ANSI representation for Unicode characters within a Unicode destination context (that is, inside a \ud destination.). Typically, the destination will contain a \uc0 control sequence. There is no need to reset the count on leaving the \ud destination as the scoping rules will ensure the previous value is restored.
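
The skipping rule in these excerpts can be sketched in a few lines. This is a minimal illustration in Python, not pandoc's Haskell reader: the tuple-shaped tokens (`("text", s)`, `("word", name, param)`, `("open",)`, `("close",)`) and the function name `render` are assumptions made up for this example. It keeps a stack of \uc counts, skips that many fallback characters after each \uN, and maps negative parameters back to their unsigned 16-bit value.

```python
def render(tokens):
    # Turn a list of (hypothetical) RTF tokens into plain text,
    # honoring \uN and \ucN as described in the spec excerpts.
    out = []
    uc_stack = [1]  # the spec's default skip count is 1
    skip = 0        # fallback characters still to be ignored
    for tok in tokens:
        kind = tok[0]
        if kind in ("open", "close"):
            skip = 0  # a brace ends any skippable data
            if kind == "open":
                uc_stack.append(uc_stack[-1])  # inherit the outer \uc value
            elif len(uc_stack) > 1:
                uc_stack.pop()                 # restore the previous \uc value
        elif kind == "word":
            _, name, param = tok
            if skip > 0:
                skip -= 1  # a control word counts as one skipped character
            elif name == "uc" and param is not None:
                uc_stack[-1] = param
            elif name == "u" and param is not None:
                # Negative parameters encode values above 32767.
                # Surrogate pairs are not recombined in this sketch.
                out.append(chr(param % 65536))
                skip = uc_stack[-1]
            # every other control word is ignored here
        elif kind == "text":
            text = tok[1]
            eaten = min(skip, len(text))
            skip -= eaten
            out.append(text[eaten:])
    return "".join(out)


# The "?" fallback is present, so it is skipped and "c" survives.
good = [("text", "a"), ("word", "u", 160), ("text", "?c")]
# The token stream from this issue: no "?" token, so "c" is skipped instead.
bad = [("text", "a"), ("word", "u", 160), ("text", "c")]
print(repr(render(good)), repr(render(bad)))
```

With the token stream reported above, the skip count of 1 has nothing to consume except the `c`, which matches the observed output.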

jgm commented 2 years ago

In this case sEatChars is being set to 1.

jgm commented 2 years ago

So it's eating the UnformattedText "c". I think we should have an UnformattedText "?" in there for it to eat. So the problem may be in tokenization.
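
For comparison, here is a minimal sketch of how control words could be tokenized so that the `?` survives as text. It is written in Python for illustration and is not pandoc's tokenizer; the token tuples and the name `tokenize` are hypothetical. It relies on the RTF rule that a single space after a control word is part of the control word, while any other non-alphanumeric delimiter is left in the stream. Control symbols such as `\'hh` and newline handling are deliberately simplified.

```python
import re

# Backslash, letters, optional signed numeric parameter, optional single space.
# The space (if any) is consumed; any other delimiter is left for the next token.
CONTROL_WORD = re.compile(r"\\([A-Za-z]+)(-?\d+)? ?")


def tokenize(rtf):
    tokens = []
    pos = 0
    while pos < len(rtf):
        ch = rtf[pos]
        if ch == "{":
            tokens.append(("open",))
            pos += 1
        elif ch == "}":
            tokens.append(("close",))
            pos += 1
        elif ch == "\\":
            m = CONTROL_WORD.match(rtf, pos)
            if m:
                param = int(m.group(2)) if m.group(2) else None
                tokens.append(("word", m.group(1), param))
                pos = m.end()
            else:
                # Simplification: treat a control symbol as the escaped character.
                tokens.append(("text", rtf[pos + 1:pos + 2]))
                pos += 2
        else:
            # Plain text runs until the next brace or backslash.
            end = pos
            while end < len(rtf) and rtf[end] not in "{}\\":
                end += 1
            tokens.append(("text", rtf[pos:end]))
            pos = end
    return tokens


print(tokenize(r"a\u160?c\par"))
print(tokenize(r"a\u160 ?c\par"))
```

Under these assumptions both inputs produce the same tokens, with `?c` kept together as text after the `u` control word, so a reader that skips one character after \u160 drops the `?` and keeps the `c`.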

jgm commented 2 years ago

Maybe it's an RTF writer issue? The parameter (160) is supposed to be followed by a delimiter, which is either a space or a nonalphabetic, nonnumeric character. Here the delimiter is the '?', which I think is actually meant to be the fallback that stands in for the character when a reader can't render the Unicode character. Putting a space before the ? in the RTF code fixes the issue.
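
On the writer side, the workaround mentioned here (a space before the `?`) could look like the following. This is a hedged Python sketch, not pandoc's RTF writer; the function name `rtf_escape` is made up for the example. It assumes the default \uc1, emits a keyword-terminating space after every \uN so the `?` fallback cannot be mistaken for the delimiter, writes values above 32767 as negative numbers, and splits characters outside the Basic Multilingual Plane into UTF-16 surrogate pairs.

```python
def rtf_escape(text):
    out = []
    for ch in text:
        if ch in "\\{}":
            out.append("\\" + ch)  # escape RTF delimiters
        elif ord(ch) < 128:
            out.append(ch)         # plain ASCII passes through
        else:
            # Work on UTF-16 code units so non-BMP characters
            # become two \uN keywords (a surrogate pair).
            data = ch.encode("utf-16-le")
            for i in range(0, len(data), 2):
                unit = int.from_bytes(data[i:i + 2], "little")
                if unit > 32767:
                    unit -= 65536  # signed 16-bit parameter
                # Space terminates the keyword; "?" is the one-character fallback.
                out.append("\\u%d ?" % unit)
    return "".join(out)


print(rtf_escape("\u00a0curious"))
```

The trailing space is not counted among the characters to skip, per the spec excerpt quoted earlier in this thread, so a conforming reader still skips exactly the `?`.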

huyz commented 2 years ago

Wow that was quick. Thanks!