Guidance on splicing in raw HTML

crertel commented 2 years ago

Hello again!

I hate to be a bother--would you happen to know offhand an easy way of splicing HTML back in during AST traversal? I have a blob, and my current best guess was to feed it into Floki (which has a slightly different AST structure, one fewer field on the node tuples).

RobertDober commented 2 years ago

No problem. So you would need to do some rendering before finishing the traversal?

Maybe an example?

crertel commented 2 years ago

So, what I'm doing is baking in some syntax highlighting using the highlight tool. This tool takes some text and emits HTML; I use Rambo to run that tool and feed it the source code that I'd like to have converted to syntax-highlighted HTML. I get the HTML back, and I'd like to replace my <code> block with it.

Things I've tried:

Adding it as a child node directly: doesn't work, the HTML gets escaped.
Feeding the highlighted HTML to Earmark.as_ast and splicing that in: doesn't work, the markdown parser tweaks the HTML too much.
Feeding the highlighted HTML to Earmark.as_ast after wrapping it in a <div>...</div> string: doesn't work, somehow this breaks the CSS.

So, at this point, my next guess is to feed it to Floki to get an AST back, and then augment the AST nodes with the meta term that Earmark uses.
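A minimal sketch of that Floki idea, assuming the only difference between the two ASTs is the missing fourth tuple element: walk the Floki tree and append an empty meta map to each element node. `FlokiToEarmark` and `convert/1` are hypothetical names, not part of Floki or Earmark, and the input is written out by hand in the three-element shape described above, so the snippet runs without either library.

```elixir
defmodule FlokiToEarmark do
  # Hypothetical helper (not part of Floki or Earmark).
  # Floki element:   {tag, attrs, children}
  # Earmark element: {tag, attrs, children, meta}

  # A list of sibling nodes: convert each one.
  def convert(nodes) when is_list(nodes), do: Enum.map(nodes, &convert/1)

  # An element node: recurse into the children and append an empty meta map.
  def convert({tag, attrs, children}) when is_binary(tag) do
    {tag, attrs, convert(children), %{}}
  end

  # A text node: returned unchanged.
  def convert(text) when is_binary(text), do: text
end

floki_ast = [{"span", [{"class", "k"}], ["def"]}, " ", {"span", [], ["foo"]}]
earmark_ast = FlokiToEarmark.convert(floki_ast)
IO.inspect(earmark_ast)
```

Comment and doctype nodes are not handled in this sketch and would need their own clauses.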

If we had a way of injecting/specifying raw HTML during traversal, that would work too. Maybe adding {:replace_raw, "<string of HTML>"} or similar, with semantics where the node is simply replaced by the given HTML? What do you think?

crertel commented 2 years ago

The current approach looks basically like this:

  # Handles <pre><code class="lang">...</code></pre> nodes during traversal.
  def render_code_node({"pre", _attrs, [{"code", _innerattrs, [body], meta} = node], _}) do
    classes = Earmark.AstTools.find_att_in_node(node, "class") || ""

    cond do
      # No class, or an inline class: return the <code> node unchanged.
      classes == "" || classes =~ "inline" ->
        node

      true ->
        language = classes
        # Run the external highlighter, then re-parse its HTML output.
        {:ok, src_html} = HighlightUtil.highlight_source_to_html("#{body}", language)
        {:ok, src_html_ast, []} = EarmarkParser.as_ast(src_html)
        {:replace, {"pre", [], [src_html_ast], meta}}
    end
  end

(I also still have no idea what the meta/4th tuple element is for!)

RobertDober commented 2 years ago

Just a quick thought:

In your first approach, when you inject the HTML as a child node, have you tried setting meta to %{verbose: true} (or similar; I need to check what EarmarkParser.as_ast returns for HTML tags)?
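A sketch of what that suggestion could look like, assuming the meta key is `verbatim`, which is the key EarmarkParser puts on the HTML blocks it parses (the comment above hedges on the exact name, so check it against your version). `RawHtml.wrap/3` is a hypothetical helper, and the expectation that the renderer emits the children of such a node unescaped is likewise an assumption to verify.

```elixir
defmodule RawHtml do
  # Hypothetical helper (not part of Earmark).
  # Builds an Earmark-style 4-tuple whose single child is a raw HTML string
  # and whose meta marks the node as verbatim (assumed key: :verbatim).
  def wrap(tag, attrs, html) when is_binary(html) do
    {tag, attrs, [html], %{verbatim: true}}
  end
end

highlighted = ~s(<span class="k">def</span> <span class="n">foo</span>)
raw_node = RawHtml.wrap("pre", [], highlighted)
IO.inspect(raw_node)
```

Inside a traversal callback, this would be the node returned in `{:replace, node}`, as in the render_code_node snippet earlier in the thread.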

RobertDober commented 2 years ago

> (I also still have no idea what the meta/4th tuple element is for!)

For exactly that reason, plus annotations, and so that you can add your own custom: key, which, as a contract, will never be used by the parser.

sodapopcan commented 2 years ago

So I just had this exact issue. I was finally able to solve it though it's convoluted and not battle-tested.

Here's the whole thing:

  def parse(markdown) do
    {result, _} =
      markdown
      |> Earmark.as_ast!()

      # Ensure that we are only looking for HTML within code blocks.
      # Whenever we hit a code block, we flip the accumulator to `true` so that the next
      # matching text node can match on `true` meaning it's inside a `code` tag.
      # This assumes that our code tag has one class, which is the name of the language.
      # We skip this if the code we're trying to show is HTML.
      |> Earmark.Transform.map_ast_with(false, fn
        {"code", [{"class", class}], _, meta}, _ when class != "html" ->
          {{"code", [{"class", class}], nil, meta}, true}

        html, true ->
          # Once we match on a text node we want to parse as HTML, this is where we do it!
          {ast, _} =
            html
            |> Floki.parse_fragment!()

            # Convert Floki's AST to Earmark's AST
            # I explain why we convert `span`s to `em`s below
            |> Floki.traverse_and_update(fn
              {"span", args, children} ->
                {"em", args, children, %{}}
            end)

            # So this part is a giant hack and a bit hard to explain.
            #
            # Once parsed, `span`s get multi-lined and we end up with:
            #
            #   <span class="k">def</span>
            #   <span class="k">foo</span>
            #
            # which means we get:
            #
            #   def
            #   foo
            #
            # We fix this by converting to `em`s however they also have a problem of
            # getting squished together.
            #
            #   <em>def</em><em>foo</em>
            #
            # leaving us with
            #
            #   deffoo
            #
            # The following convoluted code adds a space on the left of any `em` tag's
            # text node that immediately follows another `em`. I'm hoping it can be
            # simplified a bit, but this is what I came up with that works.
            |> Earmark.Transform.map_ast_with(nil, fn
              {"em", args, _, meta}, nil ->
                {{"em", args, nil, meta}, :em_first}

              {"em", args, _, meta}, :em_next ->
                {{"em", args, nil, meta}, :em_text}

              {tag, args, _, meta}, _ ->
                {{tag, args, nil, meta}, nil}

              text, :em_first ->
                {text, :em_next}

              text, :em_text ->
                {" " <> text, :em_next}

              node, _ ->
                {node, nil}
            end)

          {ast, false}

        # This is the catch-all from the outer iteration that resets the accumulator
        # i.e., it's saying we are no longer inside a code block.
        node, _ ->
          {node, false}
      end)

    Earmark.transform(result)
  end

I hope that was somewhat coherent!

@RobertDober, do you feel this is something that could belong in Earmark (hopefully with nicer code) or would the complexity not be worth it? I feel it would be nice to be able to seamlessly integrate with highlighting tools (like makeup!) but in the short time I've spent on this, I ran into a few edge-cases and I'm sure there are probably more.

RobertDober commented 2 years ago

My guess would be that you need:

Earmark.Transform.intersperse(ast, node, predicate_fn)

which will insert a node between any two ast nodes for which predicate_fn holds?
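To make the proposal concrete, here is a rough sketch of such a function for a single list of sibling nodes. Earmark does not provide `intersperse`; the module name, the argument order, and the reading of "holds" as "the predicate is true for both neighbours" are all assumptions.

```elixir
defmodule AstSketch do
  # Hypothetical: Earmark has no `intersperse`; this only illustrates the idea.
  # Inserts `sep` between two adjacent siblings when `pred` is true for both.
  def intersperse([left, right | rest], sep, pred) do
    tail = intersperse([right | rest], sep, pred)

    if pred.(left) and pred.(right) do
      [left, sep | tail]
    else
      [left | tail]
    end
  end

  # Zero or one node left: nothing to separate.
  def intersperse(nodes, _sep, _pred), do: nodes
end

is_em = fn node -> match?({"em", _, _, _}, node) end
ast = [{"em", [], ["def"], %{}}, {"em", [], ["foo"], %{}}, "bar"]
IO.inspect(AstSketch.intersperse(ast, " ", is_em))
```

This only looks at one level of siblings; a real version would also have to decide whether to recurse into children.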

I am not sure this is a good idea, as it would open a can of worms. My idea in exposing the transformation functions was to encourage the creation of libraries at a higher level of abstraction, rather than to bloat the Earmark library for all users. That would maybe also allow me to add more functionality, but on the same level of abstraction (and maybe your code is on that same level).

That said, I have way too little time to allocate to Earmark right now, and for the next three weeks especially I'll probably not even be able to look at GitHub :cry: If that were not the case, I would probably also have created an EarmarkAddOns project :wink:

However, I will keep this open, and feel free to explain why I might be wrong (but my replies will be sparse).

May I share some observations too:

1. Why is "a\nb" rendered on two different lines? I mean, yes, the resulting HTML is on two lines, but I fail to understand why this is a problem.
2. I still think this is a bug in Floki though:
```
Floki.parse_fragment!("<em>a</em> <em>b</em>")
[{"em", [], ["a"]}, {"em", [], ["b"]}]
```
At least EarmarkParser parses the markdown above correctly.

sodapopcan commented 2 years ago

I'm sorry, re-reading it, I worded my question horribly. I merely meant solving the tag-spacing issue to enable integration with highlighters, not actually integrating with them explicitly! I haven't come back to this since I got things working, but you've given me some good info to look into when I do. I can try to verify whether this is on Floki's side or not. I wasn't trying to create work for you :)

RobertDober commented 2 years ago

No worries, I think our exchange is cool. I too got confused about the missing-space issue, but I am quite sure it is in Floki. Maybe it is not an issue in HTML, though I doubt it. Anyway, I kept this open to encourage you to pitch your ideas; it's just that I do not have a lot of time for this :(

Bye for now

RobertDober commented 1 year ago

Closing, as I will not really be available for a potential follow-up.

pragdave / earmark

Guidance on splicing in raw HTML #447