syntax-tree / hast-util-sanitize

utility to sanitize hast nodes
https://unifiedjs.com
MIT License
48 stars 20 forks source link

id property is always removed when using <h2> tag #27

Closed ben519 closed 8 months ago

ben519 commented 10 months ago

Initial checklist

Affected packages and versions

5.0.0

Link to runnable example

No response

Steps to reproduce

This is a weird one as <h1>, <h3>, <h4>, ... tags all work. It's just <h2> that has a problem.

In the below code, I'm attempting to sanitize the html string <h2 id="foo">Hello, world!</h2>. I want the id to be retained (or at least sanitized into user-content-foo). However, it is removed entirely.

import deepmerge from "deepmerge"
import { defaultSchema } from "hast-util-sanitize"
import rehypeParse from "rehype-parse"
import rehypeSanitize from "rehype-sanitize"
import rehypeStringify from "rehype-stringify"
import { unified } from "unified"

const schema = deepmerge(defaultSchema, { attributes: { "*": ["id"] } })

const file = await unified()
  .use(rehypeParse)
  .use(rehypeSanitize, schema)
  .use(rehypeStringify)
  .process('<h2 id="foo">Hello, world!</h2>')

console.log(String(file))  // <h2>Hello, world!</h2>

Expected behavior

<h2 id="user-content-foo">Hello, world!</h2>
(The id property should not be removed.)

Actual behavior

<h2>Hello, world!</h2>
(The id property is removed.)

Affected runtime and version

node v20.5.1

Affected package manager and version

npm 10.0.0

Affected OS and version

mac os 13.5.2 (22G91)

Build and bundle tools

No response

wooorm commented 10 months ago

ah, it’s because https://github.com/syntax-tree/hast-util-sanitize/blame/fafff8ac1073414ef7babb6063ed9bec9b2f2cb6/lib/schema.js#L39 is now used by GH. Interesting.

ben519 commented 10 months ago

Oh, interesting. So, <h2>s can only have the id "footnote-label"? Seems oddly restrictive. Can you propose a fix or a workaround for me? I need to allow any id. Thanks!

wooorm commented 10 months ago

You can change the h2 stuff in your schema?

The default schema is that way because that's the default on github.

ben519 commented 10 months ago

Oh I see what you're saying. Something like this

const schema = deepmerge(defaultSchema, { attributes: { "*": ["id"] } })
schema["attributes"]["h2"] = ["id", "className"]
wooorm commented 10 months ago

@tommyvn responding here from the other issue, where you said:

Am I reading that correctly, that h2 is restricted to a specific id because github uses it in this way? that's wild. anyway, this answers my question, sounds like it's a feature (for GH at least) not a bug, so i'm closing this one, I'll follow the advice on that issue to use my own schema. thanks for the link :)

The scheme, the default handling, is the feature indeed! I do think this should be solved though. But not 100% on how yet? It probably makes sense that if a h2 only allows certain IDs, and a * allows all IDs, that practically on h2 also all IDs are allowed. (right?1) But it all has to be secure, and I might be missing a case where that wouldn’t be the case?

So: there’s a nice workaround as @ben519 shows above, though it probably should be fixed, so if someone wants to do the work and spend some time on it, that’d be appreciated!

ben519 commented 10 months ago

I agree. It's extra confusing that <h2> doesn't work but <h1>, <h3>, ... do.

I'm not sure what the optimal solution is, but I think it'd be great to call out https://github.com/syntax-tree/hast-util-sanitize/blob/main/lib/schema.js in the docs (README). Would've saved me a lot of headaches debugging.

wooorm commented 10 months ago

I’d appreciate a PR linking it directly in the docs! Probably here https://github.com/syntax-tree/hast-util-sanitize?tab=readme-ov-file#defaultschema

ben519 commented 10 months ago

Will do, but give me a few days to get to it.

github-actions[bot] commented 8 months ago

Hi! This was closed. Team: If this was fixed, please add phase/solved. Otherwise, please add one of the no/* labels.