vsch / flexmark-java

CommonMark/Markdown Java parser with source level AST. CommonMark 0.28, emulation of: pegdown, kramdown, markdown.pl, MultiMarkdown. With HTML to MD, MD to PDF, MD to DOCX conversion modules.
BSD 2-Clause "Simplified" License

Typographic Extension silently dropping characters #547

Open garretwilson opened 1 year ago

garretwilson commented 1 year ago

Using Flexmark 0.64.0 with Java 17 I'm expecting the Typographic Extension to turn ' into ’, but instead it seems merely to drop the character. The same happens e.g. with ---.

I'm setting up my parser and HTML renderer like this:

MutableDataHolder parserOptions = new MutableDataSet()
    //emoji; see https://www.webfx.com/tools/emoji-cheat-sheet/
    .set(EmojiExtension.USE_IMAGE_TYPE, EmojiImageType.UNICODE_ONLY)
    //GFM tables
    .set(TablesExtension.COLUMN_SPANS, false).set(TablesExtension.APPEND_MISSING_COLUMNS, true).set(TablesExtension.DISCARD_EXTRA_COLUMNS, true)
    .set(TablesExtension.HEADER_SEPARATOR_COLUMN_MATCH, true)
    //extensions
    .set(Parser.EXTENSIONS, List.of(DefinitionExtension.create(), EmojiExtension.create(), SuperscriptExtension.create(), TablesExtension.create(),
        TypographicExtension.create(), YamlFrontMatterExtension.create()));
parser = Parser.builder(parserOptions).build();
htmlRenderer = HtmlRenderer.builder().build();

Note that I just use TypographicExtension.create(). Maybe further configuration is needed, but I wouldn't expect the extension to drop characters by default.

I use the parser like this:

com.vladsch.flexmark.util.ast.Document markdownDocument = parser.parse("it's working");
System.out.println(htmlRenderer.render(markdownDocument));

I expect:

<p>it&rsquo;s working</p>

Instead I get:

<p>its working</p>

Why is the extension dropping the characters altogether? Or is the problem in the renderer, that needs some configuration to show the character references? Wherever the problem is, I would expect some notification instead of simply dropping the characters altogether. Silently deleting content is never welcome.
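For reference, the substitution expected above can be sketched in plain Java without flexmark. This is only an illustration of the desired input/output mapping, not how the Typographic Extension is implemented: the class name `SmartPunctuationSketch` and the method `smarten` are hypothetical, and the rules are an assumption that covers just an apostrophe between two letters and the `---`/`--` dash runs.

```java
import java.util.regex.Pattern;

public class SmartPunctuationSketch {
    // Hypothetical rule: an apostrophe with a letter on both sides, as in "it's".
    private static final Pattern INNER_APOSTROPHE =
            Pattern.compile("(?<=\\p{L})'(?=\\p{L})");

    // Replaces inner apostrophes and dash runs with HTML character references.
    public static String smarten(String text) {
        String result = INNER_APOSTROPHE.matcher(text).replaceAll("&rsquo;");
        // Handle the longer run first so "---" is not consumed as "--" plus "-".
        result = result.replace("---", "&mdash;");
        result = result.replace("--", "&ndash;");
        return result;
    }

    public static void main(String[] args) {
        System.out.println(smarten("it's working"));   // it&rsquo;s working
        System.out.println(smarten("wait --- what"));  // wait &mdash; what
    }
}
```

The point of the sketch is that every input character is either kept or replaced; nothing is removed, which is the behavior the report expects from the extension.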

garretwilson commented 1 year ago

I tried mucking with the settings, using a literal character instead of a character reference:

.set(TypographicExtension.ENABLE_QUOTES, true)
.set(TypographicExtension.SINGLE_QUOTE_UNMATCHED, "x")

Nothing changed. The character still simply disappeared.

However I was able to disable quote processing altogether:

.set(TypographicExtension.ENABLE_QUOTES, false)

Then I got my original string back. (Of course this defeats the purpose of the extension altogether.)

Does this mean the Typographic Extension is simply broken, or am I missing some additional configuration? (In any case, I certainly wouldn't expect the default configuration to silently discard content.)

ghost commented 1 year ago

For KeenWrite, I use KeenQuotes to perform typographic changes.

Essentially:

final var document = "'Hello there, Garret! What's up?'";
final var contractions = new Contractions.Builder().build();
final var typographer = new Curler( contractions, FILTER_PLAIN, true );
final var curled = typographer.apply( document );
System.out.println( curled );

Produces:

&lsquo;Hello there, Garret! What&apos;s up?&rsquo;

Take a look at the unit tests (here and here) to see what quoting scenarios can be resolved by KeenQuotes.

For a more complex example, see the demo page.

Note that KeenQuotes has an XML filter, so you can run it either before flexmark using the plain text filter, or on the resulting HTML document afterwards using the XML filter (provided JSoup is used to make the HTML well-formed).

ghost commented 1 year ago

As a side note, the following is semantically incorrect:

<p>it&rsquo;s working</p>

What needs to be produced is:

<p>it&apos;s working</p>

The issue is that most fonts don't curl the apostrophe, but treat it as a straight quote. KeenQuotes produces semantically correct entities, leaving it to another layer to figure out how to display the apostrophe. For example, KeenWrite Themes instructs the typesetting software to curl the apostrophes using the following command:

\definefontfeature[default][default][trep=yes]