HTML character entities throw "Expected valid tag name" error

vwkd commented 2 years ago

Description

When using HTML character entities like < in a MDsveX page with SvelteKit, Vite throws an Expected valid tag name error

(Note, before Vite 2.5.4 and SvelteKit 1.0.0-next.165 the error was a cryptic offset is longer than source length!. Thanks to https://github.com/vitejs/vite/pull/4782 this is now fixed.)

Reproduction

https://github.com/vwkd/sveltekit-bugs/tree/mdsvex-character-references

Run npm run dev
Observe error Expected valid tag name

System Info

  Binaries:
    Node: 16.6.0 - ~/.nvm/versions/node/v16.6.0/bin/node
    Yarn: 1.22.11 - ~/.nvm/versions/node/v16.6.0/bin/yarn
    npm: 7.19.1 - ~/.nvm/versions/node/v16.6.0/bin/npm
  npmPackages:
    @sveltejs/adapter-static: ^1.0.0-next.11 => 1.0.0-next.18
    @sveltejs/kit: next => 1.0.0-next.165
    mdsvex: ^0.9.8 => 0.9.8 
    svelte: ^3.34.0 => 3.42.5

Additional context

vwkd commented 2 years ago

Ok, I might have found a possible cause. After testing multiple pages with such symbols, I sometimes got the more specific error Expected valid tag name. Looking at the error message, the character entities were converted. It seems like they were interpreted as HTML, which in the case of < and >would lead to the invalid tag. This seems like it is a bug.

Though there might be more causes to the "offset is longer than source length" error. I also got it when writing some LaTeX on a page, e.g.\{0,1,\ldots,9\}. Sometimes I got the more specific error Expecting Unicode escape sequence \uXXXX. Of course, LaTeX is invalid markup and an error isn't a bug here, but again the cryptic "offset is longer than source length" doesn't help much.

Both examples seems to highlight the same issue that invalid markup isn't reported properly as error, then it progresses down the chain until it eventually ends up being served by Vite which throws the cryptic "offset is longer than source length" error.

Could it be that remark fails to throw an error during parsing, or that MDsveX swallows it?

milahu commented 2 years ago

one problem is that vite throws these usless stack traces

Error: offset is longer than source length!
    at numberToPos (node_modules/vite/dist/node/chunks/dep-972722fa.js:4247:15)
    at formatError (node_modules/vite/dist/node/chunks/dep-972722fa.js:51169:24)
    at TransformContext.error (node_modules/vite/dist/node/chunks/dep-972722fa.js:51149:19)
    at Object.transform (node_modules/vite/dist/node/chunks/dep-972722fa.js:51357:25)
    at async transformRequest (node_modules/vite/dist/node/chunks/dep-972722fa.js:67098:29)
    at async instantiateModule (node_modules/vite/dist/node/chunks/dep-972722fa.js:73732:10)

with https://github.com/vitejs/vite/pull/4782 i get

formatError: error in numberToPos (Error: offset is longer than source length! offset 126 > length 48) while handling the error:
Expected valid tag name
4: </script>
5: 
6: <p>Lorem ipsum < dolor sit</p>
                   ^
ParseError: Expected valid tag name
    at error (node_modules/svelte/compiler.mjs:17698:19)
    at Parser$1.error (node_modules/svelte/compiler.mjs:17774:9)
    [...]

milahu commented 2 years ago

digging deeper

node_modules/mdsvex/dist/main.cjs.js

```js // line 8722 if (namedEntity) { entityCharacters = characters; entity = namedEntity; } // DEBUG console.log(`\n\nmdsvex cjs 8727: found entity in ${characters}. named ${namedEntity}. type ${type}`); //console.dir(new Error()); ``` ```js // line 8815 // Found it! // First eat the queued characters as normal text, then eat an entity. if (reference) { flush(); prev = now(); index = end - 1; column += end - start + 1; result.push(reference); // DEBUG console.log(`mdsvex cjs 8827: push reference ${reference}`); ```

output

mdsvex cjs 8727: found entity in lt. named <. type named
mdsvex cjs 8827: push reference <

where does this code come from? lets ask github - looks like parse-entities some traces later, this is called by remark-parse.markdown

so the problem is ...

    const toMDAST = unified()
        .use(markdown) // HERE

simple but wrong: escaping < with &lt; and > with &gt; would give false positives for example

&lt;
<div title="&lt;"/>

would become

&lt;
<div title="&amp;lt;"/>

so either set options via .use(markdown, options) or patch the markdown parser

milahu commented 2 years ago

quickfix

// cjs js line 8716

      if (terminated) {
        end++;

        namedEntity = type === name$1 ? decodeEntity_1(characters) : false;

        // QUICKFIX
        if (namedEntity == '<') namedEntity = '&lt;';
        if (namedEntity == '>') namedEntity = '&gt;';

        if (namedEntity) {
          entityCharacters = characters;
          entity = namedEntity;
        }

much faster than fixing all the dependencies : ) how to apply with patch-package

git clone --depth 1 https://github.com/milahu/sveltekit-bugs.git --branch quickfix-with-patch-package
cd sveltekit-bugs
pnpm install
npm run dev

pngwn commented 2 years ago

Late coming to this but I don't think < or > are really supported in the underlying parsers, entities seem to be decoded when parsed.

I think there are two issues about this now, can you check to see if this issue is reporting the same problem?

milahu commented 2 years ago

can you check to see if this issue is reporting the same problem?

yes, its a duplicate

source here

Lorem ipsum &lt; dolor sit

source there

A paragraph started with &lt;h2&gt;

with my quickfix, this is correctly rendered as

A paragraph started with <h2>

pngwn commented 2 years ago

Thank you!

In v1, I have a plan to make the use of <, > character entities unnecessary, supporting < and > directly in text. But the entity situation will be resolve when mdsvex moves to a different parser.

pngwn commented 2 years ago

This issue has more information, I'll close the other.

milahu commented 2 years ago

when mdsvex moves to a different parser.

in the mean time, you could use patch-package to patch the dependency remark-parse.markdown so that your consumers dont need to patch mdsvex

vwkd commented 2 years ago

@milahu Thanks you for digging into this and also for fixing the error message in Vite!

I'm thinking it may be a good idea to submit a PR upstream to remark-parse.markdown. I have the feeling it will take some time for MDsveX to move to a new parser, so it's waiting either way. Fixing upstream would be the right thing to do even if we have to wait more. This way MDsveX doesn't need to maintain a custom patch and also other projects would benefit from it.

EDIT: Are you sure it's remark-parse? It seems it just a wrapper around mdast-util-from-markdown. Looking into mdast-util-from-markdown, it has a single reference to parse-entities at https://github.com/syntax-tree/mdast-util-from-markdown/blob/c79a430b8f6b0e3aa4c0981459510283ffe037a0/dev/lib/index.js#L960. Might that be it?

milahu commented 2 years ago

Might that be it?

yepp

$ cd MDsveX/packages/mdsvex
$ sed -i 's/sourcemap: false/sourcemap: true/' rollup.config.js
$ npm run build
$ npm i -g source-map-cli
$ source-map resolve dist/main.cjs.js.map 8722 0
Maps to node_modules/parse-entities/index.js:254:0 

        if (namedEntity) {
^

thats the problem with fixing upstream: the code is hidden in some dependency, and we would have to pass our option across all the interfaces

do you have so much time to waste? ; )

pngwn commented 2 years ago

I'm not sure an upstream PR would be accepted anyway, I'm pretty sure this is by design from previous conversations I've had.

I'll Patch this dependency for the short term when I find the time (if anyone wants to submit a PR, I'll happily accept it).

pngwn / MDsveX