mdx-js / mdx

Markdown for the component era
https://mdxjs.com
MIT License

Ability to compile MDAST (instead of Markdown) to JSX #2529

Closed: slorber closed this issue 2 months ago

slorber commented 2 months ago

Problem

Compiling many MDX documents can be expensive, in particular for an MDX-based static site generator like Docusaurus.

According to profiling, 50-70% of that time is spent on remark-parse, which as far as I understand has been reimplemented in Rust and could help us improve performance.

It looks like the Rust port also supports popular syntax extensions like math and GFM (but not yet directives), so the Rust ecosystem probably has all it needs (apart from directives) for parsing Markdown for most Docusaurus users.

Problem: for flexibility and backward compatibility, we'd like to keep the ability to write remark/rehype plugins in JS, but they are not supported in Rust.

Solution

Use Rust for parsing, and JS for the rest?

What I'd like to be able to do is something like this:

async function compileDoc(mdxString) {
  const mdast = await parseMdastInRust(mdxString);
  const jsx = await compileMdx(mdast, {noParse: true});
  return jsx;
}

Does it make sense?

Would this improve performance, or is it not worth it (due to JS<->Rust cost)?

Could it work if we added a noParse option that removes remark-parse from the pipeline?
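
To make the handoff more concrete, here is a rough sketch of the kind of glue I have in mind (parseMdastToJsonInRust, the binding module, and the noParse option are all hypothetical; an mdast tree is plain JSON-serializable data, so it could cross the boundary as a string):

import {compile} from '@mdx-js/mdx';
// Hypothetical native binding: parse Markdown in Rust and return the mdast
// serialized as a JSON string.
import {parseMdastToJsonInRust} from 'some-rust-binding';

async function compileDoc(mdxString) {
  // Only a JSON string crosses the Rust <-> JS boundary.
  const mdast = JSON.parse(await parseMdastToJsonInRust(mdxString));
  // Hypothetical noParse option: start from the given mdast and skip
  // remark-parse (not something @mdx-js/mdx supports today).
  const file = await compile(mdast, {noParse: true});
  return String(file);
}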


Not 100% related but we are also compiling each MDX doc twice currently during the bundling phase in Webpack loaders, for client/server environments (and maybe soon RSC env too?).

Problems:

I'd also like to:

- create a cross-loader cache to avoid duplicate compilation (probably reviving the deprecated cache-loader)
- add some code to pre-parse MDX files ahead of time so that when bundling the cache is already pre-filled with MDX parsing result

Something like this:

async function build() {

  // ... many other things
  const mdxFiles = await getMdxFiles();

  // warmup mdx files parsing/compilation
  // prepopulate caches earlier, before bundling / mdx-loader
  await Promise.all(mdxFiles.map(warmup));

  // ... many other things
  await compile([clientConfig,serverConfig]);
  // ... many other things
}

I hope it makes sense?

Alternatives

Keep using the MDX JS implementation, but it will likely become the main Docusaurus bottleneck once we migrate to modern tools (Rspack, LightningCSS...) according to my profiling.

wooorm commented 2 months ago

According to profiling, 50-70% of that time is spent on remark-parse, which as far as I understand has been reimplemented in Rust and could help us improve performance.

I doubt 70% of Docusaurus time from string -> rendered React is parsing markdown. I’d be rather sure that it’s instead spent in lots of AST transforms, or other work that has alternatives, e.g., is rehype-raw in there? Can you cut directives, since you have JSX already?

Use Rust for parsing, and JS for the rest?

I don’t believe in this — that it’s faster. Rust being theoretically faster will be lost by using two languages.

Would this improve performance, or is it not worth it (due to JS<->Rust cost)?

Indeed.

Could it work if we added a noParse option that removes remark-parse from the pipeline?

Everything’s possible — I seriously doubt it’s a good idea though.

Not 100% related but we are also compiling each MDX doc twice currently during the bundling phase in Webpack loaders, for client/server environments (and maybe soon RSC env too?).

Looks like you already know how to improve performance: not do useless work. Why are you doing things twice?

create a cross-loader cache to avoid duplicate compilation (probably reviving the deprecated cache-loader)

What? What is the cache-loader?

add some code to pre-parse MDX files ahead of time so that when bundling the cache is already pre-filled with MDX parsing result

Of course, makes lots of sense to cache things if you do repeated work!


It’s the same as TS? You compile TS to JS. You don’t compile things every time you run them?

wooorm commented 2 months ago

Also, you can use createProcessor(options?).

Something along the following pseudocode:

import {createProcessor} from '@mdx-js/mdx'

// Start from an existing mdast instead of parsing a string.
const tree = getMdastRootFromSomewhere()

const processor = createProcessor()
// Run all attached transforms on the given tree…
const transformedTree = await processor.run(tree)
// …then compile the transformed tree to code; parsing never happens.
const result = processor.stringify(transformedTree)

console.log(result)

ChristianMurphy commented 2 months ago

Hey @slorber! Good to hear from you!

According to profiling, 50-70% of that time is spent on remark-parse

There are still more performance optimizations to be realized in remark-parse. If you're interested in diving into those profiles, there are likely hotspots that can be improved.

which as far as I understand has been reimplemented in Rust and could help us improve performance

Potentially. If Docusaurus is considering leveraging more Rust, it could be ideal to consider something like Rspack or the next generation of Vite, so that the MDX compilation, JS transpiling, and bundling all happen in Rust. From your later comment that already seems to be planned:

once we migrate to modern tools (Rspack, LightningCSS...) according to my profiling.


Would this improve performance, or is it not worth it (due to JS<->Rust cost)?

Crossing the boundary between Rust and JS has trade-offs, depending on how much data crosses it and how often.

My gut feeling is that crossing right after parsing markdown happens early enough, and with a large enough payload, that it may not realize much benefit. But I don't have data to back that up; it would take benchmarking.

we'd like to keep the ability to write remark/rehype plugins in JS, but they are not supported in Rust.

If you're interested, plugin support in the Rust implementation is on the roadmap (https://github.com/wooorm/markdown-rs/issues/32); insights and assistance are welcome!

slorber commented 2 months ago

Thanks for your feedback @wooorm. I'll try to profile more carefully to see if I was wrong, but I trust you that it's not worth it. I didn't know it was already possible to pass an mdast to the processor.

Looks like you already know how to improve performance: not do useless work. Why are you doing things twice?

This has always been like that historically.

We bundle 2 apps with Webpack, mostly the same but with differences here and there on the output, one for the client (served to browsers) and one for the server. The server app is then run during SSG in Node.js to generate the static files.

Both apps bundle the MDX files with a Webpack loader to render them, but the subtle differences between the client/server outputs do not really affect the MDX compilation, so it's wasteful to do this work multiple times.

I plan to explore whether it's possible to run the Node SSG phase against the client app (and remove the need for a server app), but afaik it's quite common in modern React frameworks to compile things for each env (client/server/RSC). (I suspect we do not have the same constraints as Next.js, so it might work.)

Webpack's cache-loader was a quite popular loader (now deprecated, superseded by Webpack 5's global persistent caching) that allowed caching loader results on the file system (by default), with the ability to customize it: https://v4.webpack.js.org/loaders/cache-loader/
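
To illustrate the cross-loader cache idea, here is a minimal sketch (hypothetical code, not what Docusaurus or cache-loader actually do): a webpack loader that memoizes the MDX compilation by content hash in memory, so the client and server compilers running in the same Node.js process share the work.

const crypto = require('node:crypto');

// Module-level, so it is shared by every compiler that loads this module.
const cache = new Map();

module.exports = function cachedMdxLoader(source) {
  const callback = this.async();
  // Real code would also include the relevant compiler options in the key.
  const key = crypto.createHash('sha256').update(source).digest('hex');

  if (!cache.has(key)) {
    // Store the promise so concurrent requests from both compilers reuse it.
    cache.set(
      key,
      import('@mdx-js/mdx').then(({compile}) =>
        compile({value: source, path: this.resourcePath})
      )
    );
  }

  cache.get(key).then(
    (file) => callback(null, String(file)),
    (error) => callback(error)
  );
};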


It’s the same as TS? You compile TS to JS. You don’t compile things every time you run them?

I kind of see this from a different angle:

For TS code in your own app, compiled via loaders, you re-transpile it every time you restart your bundler, unless your bundler has a caching system. And the Webpack cache is not "shared" between different configs/compilers, because those configs may eventually affect the transformation and lead to different results.

To give a more concrete example based on this repo's mdx loader:

const clientConfig = {
  name: "client",
  entry: "./some-doc.mdx",
  module: { rules: [{ test: /\.mdx?$/, use: ["@mdx-js/loader"] }] },
};

const serverConfig = {
  name: "server",
  entry: "./some-doc.mdx",
  output: {libraryTarget: 'commonjs2'}, // Example of difference
  module: { rules: [{ test: /\.mdx?$/, use: ["@mdx-js/loader"] }] },
};

await compile([clientConfig, serverConfig]);

You can see in the simplified example above that we compile the same app twice with a different output target (in practice the client/server entry points are slightly different). In this case the MDX file "./some-doc.mdx" will be compiled twice by MDX. Even if you turn on Webpack memory or persistent caching, it won't prevent that.

I see 2 options:

Do you have a better alternative to suggest in this situation?

slorber commented 2 months ago

@ChristianMurphy I already have a POC of running our own website with Rspack:

https://github.com/facebook/docusaurus/pull/10402

We see some perf benefits, but it could improve even more. Rsdoctor shows that the MDX and PostCSS loaders are the remaining bottlenecks.

If we migrated to a full Rust build toolchain, I'm afraid our community wouldn't be happy if they couldn't write remark plugins in JS anymore, so I'd prefer to only move certain parts to Rust (by the way, I don't know Rust yet 😅).

I may be reading that cpuprofile wrong, but to me mdast-util-from-markdown looks like an expensive task here relative to the others:

CPU.20240816.154753.69358.0.001.cpuprofile


Crossing the boundary between Rust and JS has trade-offs

Agree that there are tradeoffs to consider.

Instead of exchanging large string payloads, what about sending smaller ones and letting Rust read the Markdown from the file system, process it, and write the result to the file system instead of returning it?

We could even batch the pre-parsing of all the MDX files we find to pre-populate the caches in one pass, which the loaders could then use.

ChristianMurphy commented 2 months ago

If we migrated to a full Rust build toolchain, I'm afraid our community wouldn't be happy if they couldn't write remark plugins in JS anymore

And that's fair. I'd come back to

There are still more performance optimizations to be realized in remark-parse. If you're interested in diving into those profiles, there are likely hotspots that can be improved.

I think there's still a lot that can be done to improve parsing time in JS. There is a theoretical limit to how fast parsing can be, and both Rust and JS can get pretty close to it. Rust is faster right now, less because of any magical quality of the language and more because Titus applied new parsing techniques there that haven't yet been applied/backported to the JS implementation.

If your team could invest resources, whether that be developer time or investing in Titus' time, progress toward parser improvements (both in JS and Rust) could be accelerated.

Instead of exchanging large string payloads, what about sending smaller ones and letting Rust read the Markdown from the file system, process it, and write the result to the file system instead of returning it?

Crossing to the operating system layer, and especially the file system layer, would be more expensive than communicating directly between the JS and Rust processes. Going process to process could potentially happen within the CPU cache (very fast); worst case it might drop to RAM (medium fast), but writing to a file drops down to a drive (not fast).

slorber commented 2 months ago

Titus applied new parsing techniques there that haven't yet been applied/backported to the JS implementation.

Thanks, I didn't know about that. Looking forward to these techniques being backported.

If your team could invest resources, whether that be developer time or investing in Titus' time, progress toward parser improvements (both in JS and Rust) could be accelerated.

My team is mostly me working 2d/week on Docusaurus 😅.

I'd be happy to support MDX financially if I were a Meta stakeholder, but I'm just a freelancer, and spending Meta's money on someone other than me involves bureaucracy that I'm not really familiar with. We have money on Open Collective, and the last time I tried to unlock it for one of our contributors my request stalled. I could try again though. As far as I remember, it would require documenting a clear goal/reason for spending that money.


Crossing to the operating system layer, and especially the file system layer, would be more expensive than communicating directly between the JS and Rust processes. Going process to process could potentially happen within the CPU cache (very fast); worst case it might drop to RAM (medium fast), but writing to a file drops down to a drive (not fast).

The way I see it, the loader will read from the file system anyway. And if we have a cache and make it persistent, that also means writing to the file system; it's just potentially done earlier, with a different timing.

Even if something is more expensive in terms of IO, it might still be worth it. After all, at the bundling phase we have to read an MDX file from the file system anyway. Instead of reading it, we could read another file: the pre-parsed one.

What I mean is that we would be increasing the work done before bundling (but this extra work can be done in parallel with other work we already do) and decreasing the time it takes to bundle. We might turn 4s of CPU work into 5s of IO+CPU work, and it might still be worth it thanks to parallelization and avoiding a waterfall.
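
For illustration, a rough sketch of that warmup step (hypothetical paths and helper names; a real implementation would pass the same remark/rehype options as the loader and have the loader read from the same cache):

import crypto from 'node:crypto';
import fs from 'node:fs/promises';
import path from 'node:path';
import {compile} from '@mdx-js/mdx';

// Hypothetical cache location that the mdx loader would check first.
const CACHE_DIR = '.docusaurus/mdx-cache';

async function warmup(filePath) {
  const source = await fs.readFile(filePath, 'utf8');
  // Real code would also hash the compiler options / plugin versions.
  const key = crypto.createHash('sha256').update(source).digest('hex');
  const cachePath = path.join(CACHE_DIR, `${key}.js`);

  try {
    await fs.access(cachePath);
    return; // already warmed by a previous build
  } catch {}

  const compiled = await compile({value: source, path: filePath});
  await fs.mkdir(CACHE_DIR, {recursive: true});
  await fs.writeFile(cachePath, String(compiled));
}

// Called from build(), in parallel with other pre-bundling work:
// await Promise.all(mdxFiles.map(warmup));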

wooorm commented 2 months ago

I could try again though. As far as I remember, it would require documenting a clear goal/reason for spending that money.

That would be cool! “Continued development and support of MDX”? 🤷‍♂️ I sometimes do contracting as well btw. So if there’s a feature / plugins / wrappers to be made, that could work. micromark and mdxjs-rs were both made on the payroll of companies.

Do you have a better alternative to suggest in this situation?

Nope, indeed. There’s also: but maybe you want/have to compile it twice, perhaps there are differences between the server and the client.

slorber commented 2 months ago

I could try again though. As far as I remember, it would require documenting a clear goal/reason for spending that money.

That would be cool! “Continued development and support of MDX”? 🤷‍♂️ I sometimes do contracting as well btw. So if there’s a feature / plugins / wrappers to be made, that could work. micromark and mdxjs-rs were both made on the payroll of companies.

I will ask and see what answer I get 👍 I'd be happy if we could support your work. We could probably find some projects to work on; for example, porting remark-directive to Rust could be one, or enhancing the speed of the JS remark-parse?

Do you have a better alternative to suggest in this situation?

Nope, indeed. There’s also: but maybe you want/have to compile it twice, perhaps there are differences between the server and the client.

The only use case we have for different MDX processing between the client/server compilers is this 😅

[Screenshot: CleanShot 2024-08-23 at 14 26 45, showing the logUnusedDirectivesWarning call that only runs for the client compiler]

Once we share compilation between client/server, this will go away.

I don't think our users have different behaviors in client/server considering we only introduced this file.data.compilerName recently and it's not documented anywhere.

wooorm commented 2 months ago

porting remark-directive to Rust could be one, or enhancing the speed of the JS remark-parse?

Yes! :+1:

I don't think our users have different behaviors in client/server considering we only introduced this file.data.compilerName recently and it's not documented anywhere.

I wonder about RSC? As I understand it, components will sometimes run on the server, sometimes on the client, hydrating the earlier server stuff. And then webpack is already making “server” bundles, and different “client” bundles, which might both include the same original .js file? If this was slow, then given meta’s resources, I’d assume the webpack RSC loader would cache work too, just like you are thinking about with MDX here now. Right? If they don’t cache, that’s interesting. Why not? Perhaps there are good reasons to not cache .mdx/.md too. If they do cache, is .mdx/.md in that cache just like .js too? Why/why not?

logUnusedDirectivesWarning

🤔 our file interface has places for messages. Many plugins also emit warnings. It seems like this code could also call file.message(), and then other places could report the messages. And you could do that always; no need to only do it on the client?
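
For example, a rough sketch (simplified, not the actual Docusaurus code) of a remark plugin that attaches messages to the file; whoever holds the files can then decide where and how often to report them:

import {visit} from 'unist-util-visit'

// Simplified: real code would know which directives were actually handled.
export default function remarkWarnUnhandledDirectives() {
  return (tree, file) => {
    visit(
      tree,
      ['textDirective', 'leafDirective', 'containerDirective'],
      (node) => {
        // Attach a warning to the file instead of logging directly.
        file.message(`Unhandled directive "${node.name}"`, node)
      }
    )
  }
}

// Somewhere central, once per file, e.g. with vfile-reporter:
// console.warn(reporter(files))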

slorber commented 2 months ago

I don't think it's related to RSC but as soon as you do SSR or SSG, you do need multiple bundles that are not exactly the same.

In our case, the 2 bundles are quite similar, but with the introduction of RSC the 2 bundles could become significantly different, because server components are only in the server bundle and client components are only in client bundles (or eventually both).

I don't know about Meta's resources or their RSC bundling performance problems, but I think having multiple different bundles is required. I'm not sure what you want to cache here: we compile the same code in 2 very different ways, so I'm not sure what kind of cache could be shared between them.

But to me that's not really the subject for us here because we don't use RSCs yet.

In an RSC app using MDX, you'd probably render MDX docs on the server only because they are static, and only include a serialized JSON version in the client app. I doubt there's a good use case for adding "use client". Afaik MDX docs do not support the "use client" directive, and client components are expected to be in separate files, so this is not expected to work; MDX docs are expected to be packaged as JSX only in the server app:

"use client"

# Hello

content

import {useState} from "react"

export function MyClientComponent() {
  "use client"

  const [state, setState] = useState(...)
  // ...
}

In a non-RSC app using MDX, you compile and bundle the same MDX doc as JSX for both apps. Even if the 2 apps are different, the JSX version of the MDX doc for both apps is very likely to be similar. That's what I want to optimize.

Note: we already use Webpack persistent caching, so rebuilding Docusaurus (including MDX) is faster. But that's not the point here: I'm looking to make cold builds (with an empty Webpack cache) faster by sharing the MDX compilation result across the 2 loaders.

slorber commented 2 months ago

logUnusedDirectivesWarning

🤔 our file interface has places for messages. Many plugins also emit warnings. It seems like this code could also call file.message(), and then other places could report the messages.

Yes, historically we haven't used the message infrastructure from Unified, but I plan to do it in the future. Similarly I could use processor data instead of file data.

And you could do that always; no need to only do it on the client?

The same remark plugins are run against each MDX file for both client/server compilers (that's the actual problem!). If we remove that condition, the message will be logged twice for each file.
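
If we removed that condition, we would need something like this instead (hypothetical, simplified) to deduplicate the warning across compilers:

// Module-level, shared by the client and server compilers in one process.
const reportedFiles = new Set();

function logUnusedDirectivesWarningOnce(filePath, directiveNames) {
  if (reportedFiles.has(filePath)) {
    return; // the other compiler already warned about this file
  }
  reportedFiles.add(filePath);
  console.warn(
    `[mdx-loader] Unused directives in ${filePath}: ${directiveNames.join(', ')}`
  );
}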

wooorm commented 2 months ago

with the introduction of RSC the 2 bundles could become significantly different

Right, that’s what I’m going after: there will be different bundles, where webpack will process files differently. If it processes 1 javascript file in 2 ways, then it makes sense that it processes 1 mdx file in 2 ways too.

But to me that's not really the subject for us here because we don't use RSCs yet.

I am wondering about this from the premature optimization point of view: you can do work to cache MDX now. Then you add RSC (is that not inevitable?), and perhaps then you cannot cache anymore?

you'd probably render MDX docs on the server only because they are static

Why should MDX be static?

Afaik MDX docs do not support the "use client" directive, and client components are expected to be in separate files, so this is not expected to work; MDX docs are expected to be packaged as JSX only in the server app:

Not yet supported, indeed. But I can imagine someone asking for this feature?

the JSX version of the MDX doc for both apps is very likely to be similar. That's what I want to optimize.

I wonder about this. Perhaps!

slorber commented 2 months ago

I am wondering about this from the premature optimization point of view: you can do work to cache MDX now. Then you add RSC (is that not inevitable?), and perhaps then you cannot cache anymore?

Even if the 2 bundles are different, I believe the mdx => jsx conversion will remain the same in both cases. Just that one will be evaluated and not the other.

Why should MDX be static?

MDX is content. Even if the RSC payload of an MDX file is static, it can still reference interactive React components. You can render MDX docs only on a server and still make interactive content. That's assuming that users do not use client-only hooks inlined directly into MDX docs. But even if they do, the mdx => jsx compilation input/output will probably remain the same, no?

Not yet supported, indeed. But I can imagine someone asking for this feature?

That's possible, but we could also decide on purpose not to support it in Docusaurus and require client components to always be in separate files.