improve script/style extract

dominikg commented 9 months ago

Describe the problem

svelte uses regex to extract information about script and style nodes in .svelte files

Recent changes increased the complexity of these regex to improve their accurary and allow for ankle brackets in attributes eg generics="<...>"

https://github.com/sveltejs/svelte/blob/9926347ad9dbdd0f3324d5538e25dcb7f5e442f8/packages/svelte/src/compiler/preprocess/index.js#L257

These regexes for script and style tags also extract the content to be passed to svelte.preprocess.

Due to their complexity, their performance degrades when processing larger files and they are hard to read/maintain (case in point, https://github.com/sveltejs/svelte/pull/9551 )

They are also replicated in other svelte tooling that also needs access to this information without running a full svelte.parse.

Describe the proposed solution

I'd like to suggest a few improvements.

1) extract these into a utility either exported from svelte or a new utility library

2) add more tests that check all possible combinations. Eg the current implementation does not allow whitespaces other than space after <script or any whitespace preceding the closing bracket in </script >

Fixing those would make the regex slower and more complex again, but i think we can make reasonable limitations to what a svelte component should be allowed to use for declaring script and style blocks. Which opens up an entirely different avenue:

3) replace regex with a string walking extractor that explicitly only seeks top level scripts at the beginning of .svelte and style blocks at the end. This allows us to skip the entire template block in the middle.

The resulting extract code would be a bit larger than a regex, but much safer and with better performance characteristics (speed and memory). A super simple start can be found here https://jsben.ch/ODOHu

For backwards compatibility, we could offer a fallback to either full parse or the previous regex approach. but most projects using prettier-plugin-svelte with script-template-style ordering should have no issue with it.

Alternatives considered

rethink parsing entirely and make this extract no longer needed
leave as is

Importance

would make my life easier

flakolefluk commented 8 months ago

Might be related. I've found this bug today, where parsing the script section breaks when declaring a string with a closing script tag.

dominikg commented 8 months ago

no, thats just how html parsing works

flakolefluk commented 8 months ago

Oh got it, when I saw that extraction in the title and the description I thought that the script was extracted with a regex and then there was JS parsing on the content of the script tag. That would've made the content valid for parsing, but the component still broke. Thanks for clarifying.

dominikg commented 6 months ago

Here is a basic library that implements what i outlined in the initial post

https://github.com/svitejs/sfcsplit

You can tell it the tags you are interested in (useful for custom tags like template used in svelte-preprocess for multi-file-components) and it will parse from start and end, so parse speed is not affected by the size of the template.

Its algorithm works for valid html syntax, including whitespace in tags, linebreaks between attributes and self-closing tags. It returns the tags, their position and content as well as all parsed attributes while being at least 2x faster than the current regex in a benchmark parsing meltui and carbon components.

One caveat is that blocks parsed at the end of the file cannot contain their own opening tag as comment or string literal, eg

<div>red text<div>
<style>
div { color: red }
/* <style>  this is bad */
</style>

or

<div>{openingScriptTag}<div>
<script>
const openingScriptTag = "<script>"; // this is bad too
</script>

The benefits of not having to look at all the template nodes in the middle of the file outweigh this limitation in my opinion, code like that should be exceedingly rare and it is a smaller limitation than the svelte4 regex approach that has more false positives for nested script blocks.

vojtechsimetka commented 2 months ago

no, thats just how html parsing works

Please correct me if I am wrong - I believe the content between script in svelte is not HTML, but much closer to nodejs where const foo = "</script>" is valid literal string. Shouldn't this be considered svelte compiler bug?

Given following example:

<script lang="ts">
    const content = `
       <script>
          import Button from '$lib/Button.svelte';
        </script>
        <Button>Button</Button>
    `
</script>

<h1>Usage</h1>
<pre>{content}</pre>

I get a svelte preprocessing error:

Internal server error: Error while preprocessing /src/routes/+page.svelte - Transform failed with 1 error:
/src/routes/+page.svelte:5:8: ERROR: Unterminated string literal
  Plugin: vite-plugin-svelte
  File: /src/routes/+page.svelte

   Unterminated string literal
   3  |         <script>
   4  |            import Button from '$lib/Button.svelte';
   5  |          
      |          ^

{
  name: 'Error',
  id: '/src/routes/+page.svelte',
  message: 'Error while preprocessing /src/routes/+page.svelte - Transform failed with 1 error:\n' +
    '/src/routes/+page.svelte:5:8: ERROR: Unterminated string literal',
  frame: ' \n' +
    ' \x1B[33mUnterminated string literal\x1B[39m\n' +
    ' 3  |         <script>\n' +
    " 4  |            import Button from '$lib/Button.svelte';\n" +
    ' 5  |          \n' +
    '    |          ^\n' +
    ' ',
  code: undefined,
  stack: '',
  plugin: 'vite-plugin-svelte',
  pluginCode: '<script lang="ts">\n' +
    '\tconst content = `\n' +
    '       <script>\n' +
    "          import Button from '$lib/Button.svelte';\n" +
    '        </script>\n' +
    '        <Button>Button</Button>\n' +
    '    `;\n' +
    '</script>\n' +
    '\n' +
    '<h1>Usage</h1>\n' +
    '<pre>{content}</pre>\n'
}

The issue can be resolved by escaping the closing script tag like this </script\> (or <\/script>). But then linter with default sveltekit rules complains.


  5:17  error  Unnecessary escape character: \>  no-useless-escape

✖ 1 problem (1 error, 0 warnings)

 ELIFECYCLE  Command failed with exit code 1.
 ```

MotionlessTrain commented 2 months ago

If you would try this in vanilla html and JavaScript, it breaks the same way

(I tested this by adding the following to a regular html pagina, not in a svelte project)

<script type="text/javascript">
  const content = `
    <script>
      import Button from 'Button.svelte'
    </script>
  `
  document.addEventListener('DOMContentLoaded' () => document.getElement
ById('pre').innerText = content
</script>
<pre id="pre"></pre>

This gives (in chromium): Uncaught syntax error: unexpected end of input (pointing at the in the template literal) So Svelte is consistent with how html parsing works in the browser

vojtechsimetka commented 2 months ago

Thanks for the example. It's not consistent in a way that it breaks at compile time and not runtime. The expectation on my side is that svelte pre-processes without any issue and then fails at runtime. However, I do agree that failing during compilation is better DEX.

Then the follow up question is, how to properly escape such code? Should it be added as a rule to eslint-plugin-svelte?

MotionlessTrain commented 2 months ago

E.g. the following could work:

const script = 'script'

const content = `<${script}/>`

dominikg commented 2 months ago

please stick to the issue at hand, how html works can be discussed elsewhere.

sveltejs / svelte