gjtorikian / commonmarker

Ruby wrapper for the comrak (CommonMark parser) Rust crate
MIT License

Newlines mysteriously inserted if url is sufficiently long? #241

Closed duhaime closed 1 year ago

duhaime commented 1 year ago

Hello, we're seeing some strange behavior that I'm hoping you can help us diagnose. We have Markdown strings that we want to convert to plaintext, while ensuring that the URL destinations of links are preserved in the rendered output:

md_string = "- [a](https://aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa.com)"

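require "commonmarker"

doc = CommonMarker.render_doc(md_string, :DEFAULT, [:autolink])

# Replace every link node with a text node of the form "link text (url)".
doc.walk do |node|
  if node.type == :link
    # MarkdownRenderer.text_node is not part of commonmarker; presumably a
    # local helper that returns the link's text node.
    text_node = MarkdownRenderer.text_node(node)
    text_node.string_content = "#{text_node.string_content} (#{node.url})"
    node.insert_before(text_node)
    node.delete
  end
end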

doc.to_plaintext

This returns:

"  - a (https://aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa.com)\n"

There's more leading and trailing whitespace than I'd expect here, but that's fine. What's puzzling is that if we add just one more "a" to the URL, we get:

"  - a\n    (https://aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa.com)\n"

(NB: the newline between the link text and the link destination).

Is there a way to prevent this behavior?
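One possible explanation, offered as an assumption rather than something confirmed in this thread: the plaintext renderer may soft-wrap output at a fixed column width, so a line breaks at the last space once it grows one character past the limit. The sketch below is a hypothetical greedy wrapper in plain Ruby (it is not the cmark-gfm implementation, and the name `soft_wrap` and the 120-column limit are illustrative assumptions) that produces the same kind of one-character threshold:

```ruby
# Hypothetical greedy soft-wrap, for illustration only.
# This is NOT the cmark-gfm code; the 120-column default is an assumption.
# `prefix` is the list marker (e.g. "  - "); `words` are the space-separated
# chunks of the item. A width of 0 disables wrapping.
def soft_wrap(prefix, words, width: 120)
  indent = " " * prefix.length
  lines = [prefix + words.first]
  words.drop(1).each do |word|
    if width.positive? && lines.last.length + 1 + word.length > width
      lines << indent + word      # would overflow: start a continuation line
    else
      lines.last << " " << word   # still fits: stay on the current line
    end
  end
  lines.join("\n") + "\n"
end

fits      = "(https://#{'a' * 100}.com)" # whole line is exactly 120 columns
overflows = "(https://#{'a' * 101}.com)" # whole line would be 121 columns

p soft_wrap("  - ", ["a", fits])                # single line
p soft_wrap("  - ", ["a", overflows])           # newline before the URL
p soft_wrap("  - ", ["a", overflows], width: 0) # wrapping disabled
```

If the renderer in use exposes a width setting, passing a value that disables wrapping (or a much larger limit) would be a quick way to test this hypothesis.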

gjtorikian commented 1 year ago

To be honest, I would expect this to be a core issue with https://github.com/github/cmark-gfm, as the Ruby library itself does no actual Markdown -> HTML conversion -- it just provides an interface to accept a string, then passes it over to the C lib to do the actual work.

Would you be able to write a test case against the C lib to verify if the same behavior appears there, too? The string_content function is defined here: https://github.com/github/cmark-gfm/blob/2d65cd3c4bfbbdddc7accefc76392c16bb0cfb6d/src/node.c#L421-L424

gjtorikian commented 1 year ago

Closed due to lack of further info. Feel free to reopen.