remarkjs / remark-rehype

plugin that turns markdown into HTML to support rehype
https://remark.js.org
MIT License
258 stars 18 forks source link

Assign more fields from `definition` node into `imageReference` node #37

Closed ben519 closed 2 months ago

ben519 commented 3 months ago

Initial checklist

Problem

I have some markdown like this 👇

![A2][hello]

[hello]: myvid.mov

I turn it into mdast like this 👇

{
  "type": "root",
  "children": [
    {
      "type": "paragraph",
      "children": [
        {
          "type": "imageReference",
          "alt": "A2",
          "label": "hello",
          "identifier": "hello",
          "referenceType": "full"
        }
      ],
    },
    {
      "type": "definition",
      "identifier": "hello",
      "label": "hello",
      "title": null,
      "url": "http://127.0.0.1:9199/myvid.mov",
      "data": {
        "hName": "video",
        "hProperties": {
          "controls": true
        }
      }
    }
  ],
}

Note the data property of the definition node, particularly the line "hName": "video". My hope was that this would render a <video> tag in the HTML...

remarkRehype turns this mdast into the following hast 👇

{
  "type": "root",
  "children": [
    {
      "type": "element",
      "tagName": "p",
      "properties": {},
      "children": [
        {
          "type": "element",
          "tagName": "img",
          "properties": {
            "src": "http://127.0.0.1:9199/myvid.mov",
            "alt": "A2"
          },
          "children": [],
        }
      ],
    }
  ],
}

The resulting HTML element is a <img>, NOT a <video>. I understand why this happens, but I think the attributes in the definition node should (mostly) overwrite the attributes in the imageReference node.

Solution

The relevant code is in /lib/handlers/image-reference.js

import {normalizeUri} from 'micromark-util-sanitize-uri'
import {revert} from '../revert.js'

/**
 * Turn an mdast `imageReference` node into hast.
 *
 * @param {State} state
 *   Info passed around.
 * @param {ImageReference} node
 *   mdast node.
 * @returns {Array<ElementContent> | ElementContent}
 *   hast node.
 */
export function imageReference(state, node) {
  const id = String(node.identifier).toUpperCase()
  const def = state.definitionById.get(id)

  if (!def) {
    return revert(state, node)
  }

  /** @type {Properties} */
  const properties = {src: normalizeUri(def.url || ''), alt: node.alt}

  if (def.title !== null && def.title !== undefined) {
    properties.title = def.title
  }

  /** @type {Element} */
  const result = {type: 'element', tagName: 'img', properties, children: []}
  state.patch(node, result)
  return state.applyData(node, result)
}

Need to think about the appropriate solution here. Would be curious to hear your thoughts.?

Alternatives

I guess I could make my own handler, but I think this behavior should be the default.

wooorm commented 3 months ago

My hope was that this would render a <video> tag in the HTML...

It’s an understandable hope!

But, given that you make your own transform, where you tried to set things on definitions and then found it didn’t work, is it then not rather simple to change that transform? To work on link references?

This to me feels a bit 🤷‍♂️ You’d at least expect that the h* stuff on the references works. Right? And, also that it takes precedence over other potential things?

I guess h* stuff on definitions is ignored now. Correct? Which has some sense.

But perhaps even more sensical would be, to treat it like other nodes: if you define a data.hName: 'div' on it, that will render one <div> for that definition? 🤔

Any other unknown node with h* fields gets rendered as what they describe, so why not definition?

What about footnote definitions? What if theoretical other syntax extensions define similar things? 🤔 Say, abbreviations?

ben519 commented 3 months ago

But, given that you make your own transform, where you tried to set things on definitions and then found it didn’t work, is it then not rather simple to change that transform? To work on link references?

I believe it would be possible, but it would require additional expensive tree scanning. When I visit each definition node, I'd then have to search for every corresponding reference node. The imageReference handler is already doing the expensive lookup operation, so it made sense for me to put the logic there.

You bring up some fair questions. I do have a solution for now which is to overwrite the imageReference handler with a slight tweak.

export function imageReference(state, node) {
  const id = String(node.identifier).toUpperCase()
  const def = state.definitionById.get(id)

  if (!def) {
    return revert(state, node)
  }

  /** @type {Properties} */
  const properties = { src: normalizeUri(def.url || ""), alt: node.alt }

  if (def.title !== null && def.title !== undefined) {
    properties.title = def.title
  }

  /** @type {Element} */
  const result = { type: "element", tagName: "img", properties, children: [] }
  state.patch(node, result)

  // Apply data, first from node and then from def
  // This is the only difference from the original version of this function
  // https://github.com/syntax-tree/mdast-util-to-hast/blob/main/lib/handlers/image-reference.js
  return state.applyData(def, state.applyData(node, result))
}

The downside for me is that now I have to keep track of any changes you make to imageReference() and update my handler accordingly. But I can live with that.

Thanks for the quick reply as always.

remcohaszing commented 3 months ago

I like the idea of supporting video URLs like that in markdown. This needs to support link references, and at first this request made sense. However https://github.com/remarkjs/remark-rehype/issues/37#issuecomment-2140431865 convinced me it’s probably not a good idea.

Then I realized, this is an HTML transform, not a markdown transform. The functionality you want is probably best implemented as a rehype plugin that transforms:

<img
  alt="Video alt"
  src="./video.mp4"
  title="Video title"
  other-attribute="for-example class or id"
>

into:

<video title="Video title" other-attribute="for-example class or id">
  <source src="./video.mp4" type="video/mp4"></source>
  <p>Video alt</p>
</video>

In that case markdown link references become an implementation detail you no longer have to deal with.

wooorm commented 2 months ago

@ben519 Does a rehype plugin work for you? Remco’s post seems convincing to me.

Then we can close this issue for now—we can always revisit when there’s a more practical example of data on definitions when staying in markdown?

ben519 commented 2 months ago

Not really because the HTML <img> tag will be missing properties that are included on the mdast link reference (e.g. height, width, etc.). But feel free to close the issue. My override of imageReference is a fine patch for the time being.

wooorm commented 2 months ago

Hmm, can you elaborate? Those weren’t in your OP here. Data should be carried over from mdast through to hast (around here: https://github.com/syntax-tree/mdast-util-to-hast/blob/f511a93817b131fb73419bf7d24d73a5b8b0f0c2/lib/handlers/image.js#L35)

ben519 commented 2 months ago

If I recall correctly, the situation goes like this.. There are two MDAST nodes (a link and a link-reference) that essentially get merged into a single HAST node (the <img>). This logic happens inside imageReference() where the link reference node is called def.

Given that premise, how should the nodes' properties be merged? For example, what if the link has class A and the link reference has class B? What if they each define a height and width property? Which properties should the resulting <img> have?

My understanding is that the current merge behavior is to ONLY use the url and title from the link reference node. Any other properties on the link reference node will be completely ignored. (My argument is that they shouldn't be ignored, but I can see both sides..)

Does that make more sense?

ChristianMurphy commented 2 months ago

My understanding is that the current merge behavior is to ONLY use the url and title from the link reference node. Any other properties on the link reference node will be completely ignored. (My argument is that they shouldn't be ignored, but I can see both sides..)

References only have url and title properties, because that is all that can exist in markdown. The only way other properties could exist on it is if they are being added by a plugin, and that plugin could just as easily target the link.

In my mind there is no merging, and probably shouldn't be.

ben519 commented 2 months ago

@ChristianMurphy

Yes and in my case I am adding properties via a plugin. I'm reluctant transfer the properties between nodes inside that plugin because doing so requires an expensive lookup operating (i.e. finding all the matching link references.) This lookup is happens inside the imageReference() handler, so it makes sense for me to "piggy-back" it to add the properties there.

ChristianMurphy commented 2 months ago

I'm reluctant transfer the properties between nodes inside that plugin because doing so requires an expensive lookup operating (i.e. finding all the matching link references.)

It's not expensive, it can be done in a single pass of the AST and a very efficient Map data structure. Something along the lines of

import { Plugin } from 'unified';
import { Node } from 'unist';
import { visit } from 'unist-util-visit';
import { Parent, LinkReference, Definition } from 'mdast';

// Define an interface for link references
interface LinkReferenceWithParent {
  node: LinkReference;
  index: number;
  parent: Parent;
  definition?: Definition;
}

const remarkExampleForBen: Plugin = () => {
  return (tree: Node) => {
    const definitions = new Map<string, Definition>();
    const linkReferences: LinkReferenceWithParent[] = [];

    // Single pass: collect all definitions and link references
    visit(tree, (node, index, parent) => {
      if (node.type === 'definition') {
        const definition = node as Definition;
        definitions.set(definition.identifier.toLowerCase(), definition);
      } else if (node.type === 'linkReference') {
        const linkReference = node as LinkReference;
        if (parent && typeof index === 'number') {
          linkReferences.push({ node: linkReference, index, parent });
        }
      }
    });

    // Fill in the definition from the map
    const mappedLinkReferences = linkReferences.map(reference => {
      const definition = definitions.get(reference.node.identifier.toLowerCase());
      if (definition) {
        reference.definition = definition;
      }
      return reference;
    });

    mappedLinkReferences.forEach(({ node, index, parent, definition }) => {
      // do any processing you need
    });
  };
};

export default remarkExampleForBen;
ben519 commented 2 months ago

@ChristianMurphy I understand what your saying and really appreciate the implementation example. I think our disagreement lies here

It's not expensive, it can be done in a single pass of the AST

I thought that scanning the entire AST, even just once, is expensive. And if it can be avoided, it should be avoided.

To be fair, you might be right and maybe it's not as expensive as I'm guessing. I think we'd need to run some timings on example data to know for sure. I don't have the time or motivation to do it right now; maybe another day.

Anyways, as long as I can override the imageReference() handler, I should be fine. I don't want to take up your time with this when I have a solution that works for me. I really appreciate all the help y'all give here. 👍

ChristianMurphy commented 2 months ago

Premature optimization is the root of all evil 😅

Visiting the AST is fast on a reasonably sized document. Scaling laws still apply, large documents or excessive repeated visits can add up to become slow. If you don't have real users complaining about slowness or numbers that concern you, it's probably not a problem.

wooorm commented 2 months ago

If I recall correctly, the situation goes like this.. There are two MDAST nodes (a link and a link-reference) that essentially get merged into a single HAST node (the <img>). This logic happens inside imageReference() where the link reference node is called def.

Basically, a bit nit picky but I’d say: the first sentence is correct but does not conclude to the second sentence. There are 2 mdast nodes. 1 mdast node is indeed mapped there. The other mdast node is mapped somewhere else.

Given that premise, how should the nodes' properties be merged? For example, what if the link has class A and the link reference has class B? What if they each define a height and width property? Which properties should the resulting <img> have?

I don’t think this is relevant: you are talking about custom h* data fields that you set here instead of regular fields, which in that one place in imageReference(), are all passed through to the <img>.

If you can set h* fields on mdast imageReference, you can also set fields on hast img elements?

My understanding is that the current merge behavior is to ONLY use the url and title from the link reference node. Any other properties on the link reference node will be completely ignored. (My argument is that they shouldn't be ignored, but I can see both sides..)

There are arguments to be made for several potential changes: your proposed change, or turning a definition with h* data fields itself into a div, or doing nothing. There are different ways to go. If your current code is working with mdast/remark when it should be better/cleaner/simpler/faster in hast/rehype, we should likely first solve this root problem before making changes that nobody needs.

Hence my question: how are you getting height/width from image (not imageReference) and also definition? Why is this done in mdast/remark, would it work in hast/rehype?

I have a feeling you want as few plugins as possible. But then you already have a plugin as you are setting data.h*? Why is changing one remark plugin to a rehype plugin not an option?

ben519 commented 2 months ago

To provide a bit more context, I wanted to support user-defined properties for links and images like this

![](myimg.png){ height=100 width=100 .border .border-white}

and this

![][dog]{ height=100 width=100 .border .border-white}

<!-- References -->
[dog]: myimg.png

(My users are already familiar with this curly-brace syntax because that's how you pass properties into directives, and I use a lot of them.)

I made the following remark plugin to handle these types of attributes:

import { SKIP, visit } from "unist-util-visit"

const nodeTest = (node) => {
  return (
    node.type === "text" &&
    node.value.startsWith("{") &&
    node.value.includes("}")
  )
}

export default function remarkAttributes() {
  // This plugin handles non-standard markdown attributes
  // The best example is when we want to support CSS classes on images
  // The input markdown looks like this
  //
  // ![](myimg.png){.w-96}
  //
  // This input is parsed into mdast nodes like this
  // {
  //   type: 'paragraph',
  //   children: [
  //     {
  //       type: 'image',
  //       url: 'myimg.png',
  //       position: {...}
  //     },
  //     {
  //       type: 'text',
  //       value: '{.w-96}',
  //       position: {...}
  //     }
  //   ],
  //   position: {...}
  // },
  //
  // This plugin parses the attributes in the image's sibling node and assigns them as
  // hProperties of the image node, like this
  //
  //     {
  //       type: 'image',
  //       url: 'myimg.png',
  //       data: {hProperties: {className: ['.w96']}}
  //       position: {...}
  //     }
  //
  // remarkRehype will then apply the classNames to the html element

  return (tree) => {
    visit(tree, nodeTest, (node, index, parent) => {
      // Exit early if this node is not the next sibling of an image node
      if (index === 0) return

      // Get the previous sibling
      const prevSibling = parent.children[index - 1]

      // Exit early if the "{...}" isn't on the same line and immediately after the image/link/etc.
      if (
        prevSibling.position.end.line !== node.position.start.line ||
        prevSibling.position.end.column !== node.position.start.column
      ) {
        return
      }

      // Exit early if the previous sibling isn't an image or a link
      if (
        !["image", "imageReference", "link", "linkReference"].includes(
          prevSibling.type
        )
      )
        return

      // Get everything between the first set of braces {...}
      const value = node.value
      const innerValue = value.match(/^{(.*?)}/)[1]

      // Parse the properties
      const tokens = innerValue.split(" ").filter((x) => x !== "")
      if (tokens.length === 0) return

      // Apply them to the image node as hProperties
      const data = prevSibling.data || (prevSibling.data = {})
      const hProperties = data.hProperties || (data.hProperties = {})
      const className = hProperties.className || (hProperties.className = [])

      tokens.forEach((x) => {
        // We expect three forms of tokens:
        // .class     (e.g. .text-red-500)
        // key=value  (e.g. target="_blank")
        // key        (e.g. autoplay)

        if (/^\.[\w\.\/-]+/.test(x)) {
          className.push(x.substring(1))
        } else if (/^[^=]+=[^=]+$/.test(x)) {
          const { key, val } =
            /^(?<key>[^=]+)=(?:['"]+)?(?<val>[^=]+?)(?:['"]+)?$/.exec(x).groups
          hProperties[key] = val
        } else {
          hProperties[x] = true
        }
      })

      if (hProperties.className.length === 0) delete hProperties["className"]

      // Update this node's value by chopping off the "{...}" bit
      node.value = node.value.slice(innerValue.length + 2)

      // If the remaining string is empty, delete this node
      if (node.value === "") {
        parent.children.splice(index, 1)
        return [SKIP, index]
      }
    })
  }
}

I think it's clear at this point that this is some custom functionality that only I need. So, feel free to close the issue and stop wasting brain cells on it 😂

wooorm commented 2 months ago

Your code looks fine, what I don’t understand is how it leads to this issue? The issue is about definitions. Your code does not handle definitions? Your users don’t annotate definitions?

ben519 commented 2 months ago

Are you suggesting that instead of this...

![][dog]{ height=100 width=100 .border .border-white }

<!-- References -->
[dog]: myimg.png

I do this

![][dog]

<!-- References -->
[dog]: myimg.png { height=100 width=100 .border .border-white }

?

I think I may have tried something like this a while back and this format prevents the parser from recognizing the definition. (I could be wrong). It's also a weird UI design, in my opinion, to tack the attributes onto the definition in this way. In some cases, it could be hard to tell where the URL ends and the properties begin.

wooorm commented 2 months ago

No, I am saying that I don‘t understand why there is any issue. Everything already works?

Your markdown:

![][dog]{ height=100 width=100 .border .border-white }

[dog]: myimg.png

Plus your plugin. Without any changes to this project. Is fine?

ben519 commented 2 months ago

Yes, my code works thanks to my override of the imageReference() handler that I mentioned above.

wooorm commented 2 months ago

I don’t see how.

Your plugin does not touch definitions. Your custom renderer takes things from definitions. There is nothing to take from definition if you never set anything?

Your plugin touches imageReference. It sets data on them. That data is already handled by the default handler: https://github.com/syntax-tree/mdast-util-to-hast/blob/main/lib/handlers/image-reference.js

wooorm commented 2 months ago

Could you perhaps please make a minimum viable reproduction of the exact dependencies and code you have?

ben519 commented 2 months ago

Alrighty, I posted a reprex here. During the process, I remembered some of the reasoning behind why I did what I did, and I also understand your confusion better..

In my example, the input markdown looks like this

![][dog]{ height=100 }

[dog]: foo.mp4

My desire is to have the output look like this

<p><video src="http://127.0.0.1:9199/foo.mp4" alt="" height="100" controls></video></p>

However, there are a few things that I also need...

  1. I need to take inventory of all the images & videos referenced in the markdown. I have a remark plugin that does this (remarkFiles) and it stores the results on the VFile.
  2. My users can reference file names like "foo.png" in their Markdown (as shown in the example above). I need to turn these into fully qualified urls. I do this in the same plugin, remarkFiles, by passing through a map of filename -> url pairs. (Technically it's just a JavaScript object, not a Map type.)
  3. Not included in my example is another remark plugin that allows users to put content behind a paywall. The paywall essentially deletes a chunk of content. Users declare paywalls in Markdown like this
# My football picks!

:::paywall
1. Manchester
2. Real Madrid
3. Richmond
:::

So, I think I decided to place my remarkFiles plugin before my remarkPaywall plugin, since I need to take inventory of all the referenced images and some of them might get chopped off by the paywall. But I may have a different solution here..

This is all very complex and specific to my needs, haha. You sure you want to go down this rabbit hole? Anyways, let me know if you need help reading my code.

Cheers

Oh, and just as an aside, if you want to see what my platform looks like and how it works, you can sign up for free or check out these youtube videos I made on it.

wooorm commented 2 months ago

Thanks for explaining more of the macro problem you’re facing, and providing all the code!

So, I think I decided to place my remarkFiles plugin before my remarkPaywall plugin, since I need to take inventory of all the referenced images and some of them might get chopped off by the paywall. But I may have a different solution here..

I recommend:

  1. running your remark-attributes.js plugin
  2. turning the mdast for :::paywall into hast for <my-paywall>; you can do this in the handler for remark-rehype, turn the directive node into a custom element node
  3. as a rehype plugin, do 50% of what is currently remark-files.js, namely the <img> -> <video> part; for inspiration, see the similar https://github.com/rehypejs/rehype-picture/blob/main/lib/index.js
  4. as a rehype plugin, do the other 50% of what is currently remark-files.js, namely the proxy/asset/url rewriting. For that, take a look at https://github.com/rehypejs/rehype-minify/tree/main/packages/html-url-attributes, which lists the properties and elements which constitute URLs; also, take a look at https://github.com/rehypejs/rehype-minify/blob/main/packages/rehype-minify-url/lib/index.js, which uses the former html-url-attributes package, you’d want to do something similar
  5. finally, handle the <my-paywall> elements somewhere; with client side JS/CSS for example; or if you’d use something like MDX or rehype-react at the end, you can overwrite it with a component; or another rehype plugin to transform it into different HTML elements; Or, alternatively, why not generate a <details> at the start instead?

This is all very complex and specific to my needs, haha. You sure you want to go down this rabbit hole? Anyways, let me know if you need help reading my code.

Not that complex! ;) There are some things I (and remco/christian probably too) would do differently in your code. It looks quite :+1: though. But feel free to open a discussion if you want a code review! Also, look at the existing plugins by us, they’re well-typed and battle tested!


I’ll close this for now. I think your root problems are better solved differently (moving to rehype). There are also viable ways to work around this (a custom handler). There could be a change, but there are 3 different changes we could make. To choose one, I’d want a good concrete problem that would be solved well by it. If one such problem surfaces, we can reopen!