syntax-tree / mdast-util-gfm-autolink-literal

mdast extension to parse and serialize GFM autolink literals
https://unifiedjs.com
MIT License

Performance issue with long lines #8

Closed. llimllib closed this issue 2 months ago.

llimllib commented 2 months ago

Initial checklist

Affected packages and versions

at least the latest version

Link to runnable example

https://gist.github.com/llimllib/4b87fd4042359b4812350038562dba03

Steps to reproduce

  1. Run the test.mjs file from the gist above in the root of this repository. Its output on my machine is:
    $ node test.mjs
    parsing a file with breaks: 32.809ms
    parsing a file without breaks: 10.127s

In this test case, the "file with breaks" is 8,000 lines of 100 'a' characters each. The "file without breaks" is 800,000 'a' characters with no line breaks; it is actually a slightly smaller file, since it omits the newline characters.

Here's an Observable notebook that demonstrates the behavior; you can play around with the regular expression there and see how the performance looks.

Expected behavior

Parsing a file with a long line should not have such severe performance effects.

(In the terms that affect my team, our customers should not be able to easily DoS us by providing markdown files with long lines)

Actual behavior

The findEmail regular expression here takes a very long time to parse a long line.

(I think this is due to using nested quantifiers, but I am not an expert here)
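The shape of that slowdown is easy to reproduce with the bare regex (a sketch; the pattern below is reconstructed from the bounded variant quoted later in this thread, so treat it as an assumption about the exact findEmail expression):

```javascript
// Sketch: with no "@" in the input, the engine greedily matches
// [-.\w+]+ to the end of the string, backtracks through every split
// looking for "@", then restarts one character later, so total work
// grows roughly quadratically with the length of the line.
const findEmail = /([-.\w+]+)@([-\w]+(?:\.[-\w]+)+)/g

for (const n of [5000, 10000, 20000]) {
  const line = 'a'.repeat(n)
  const t0 = performance.now()
  findEmail.lastIndex = 0
  findEmail.exec(line)
  console.log(`${n} chars without a break: ${(performance.now() - t0).toFixed(1)}ms`)
}
```

Each doubling of the input should roughly quadruple the time, consistent with the seconds-long parse reported above for the long unbroken file.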

I noticed this when investigating a slowdown in our app, developed a test case, and profiled it. This regular expression jumped out:

[screenshot: profiler output pointing at the findEmail regular expression]

Affected runtime and version

node 20.16.0

Affected package manager and version

all

Affected OS and version

all

Build and bundle tools

No response

llimllib commented 2 months ago

For what it's worth, I tried finding a regular expression that performed better and was not successful; perhaps it would be sensible to do something like check the line for an @ before running the regular expression?

That would still leave a long line full of @ as a DDoS vector, but it'd be an improvement at least?
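That pre-check could look something like this (a sketch; findEmailsGuarded is a hypothetical helper, not part of the library):

```javascript
// Sketch of the proposed guard: a plain substring scan is linear and
// cheap, so lines that cannot possibly contain an email skip the
// expensive, backtracking-prone regex entirely.
const findEmail = /([-.\w+]+)@([-\w]+(?:\.[-\w]+)+)/g

function findEmailsGuarded(line) {
  if (!line.includes('@')) return [] // cheap bail-out for @-free lines
  return Array.from(line.matchAll(findEmail), (match) => match[0])
}

console.log(findEmailsGuarded('write to bill@billmill.org')) // [ 'bill@billmill.org' ]
console.log(findEmailsGuarded('a'.repeat(100000))) // [] -- no regex work at all
```

As noted above, this only narrows the exposure: a long line full of @ characters still hits the slow path.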

ChristianMurphy commented 2 months ago

Welcome @llimllib! 👋 Performance improvements are welcome!

as a DDoS vector

Respectfully, a majority of "DDoS vectors", including this one, are a nothingburger. Longer content takes longer to process; if you let users put in content of unrestricted length, it can be slow.

llimllib commented 2 months ago

@ChristianMurphy we render mdx on our servers for our clients; if somebody were to start spamming us with long strings, it would cost a lot of money and slow down our system for everybody

llimllib commented 2 months ago

Limiting the length of + matches to 1024 speeds up the test file, but it also makes tests fail, and I don't quite have a handle on why:

// Use {0,1024} instead of `+` to avoid pathological regular expression
// behavior; see
// https://github.com/syntax-tree/mdast-util-gfm-autolink-literal/issues/8
[
  /([-.\w+]{0,1024})@([-\w]{0,1024}(?:\.[-\w]{0,1024}){0,1024})/g,
  findEmail
]

That expression matches emails the same as the current one:

> "bill@billmill.org".match(/([-.\w+]{0,1024})@([-\w]{0,1024}(?:\.[-\w]{0,1024}){0,1024})/g)
Array [ "bill@billmill.org" ]

Ah ha! It's matching things it ought not to, and of course the proper replacement for + is {1,1024}, not {0,1024}: one or more is what + means.
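The difference is easy to see side by side (a sketch; the 1024 bound is the one proposed above, not the library's):

```javascript
// {0,1024} allows an empty local part and empty domain labels, so it
// matches strings the original `+` (one or more) quantifiers reject.
const zeroMin = /([-.\w+]{0,1024})@([-\w]{0,1024}(?:\.[-\w]{0,1024}){0,1024})/g
const oneMin = /([-.\w+]{1,1024})@([-\w]{1,1024}(?:\.[-\w]{1,1024}){1,1024})/g

console.log('@example.com'.match(zeroMin)) // [ '@example.com' ] -- empty local part accepted
console.log('@example.com'.match(oneMin)) // null, like the original regex
console.log('bill@billmill.org'.match(oneMin)) // [ 'bill@billmill.org' ]
```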

ChristianMurphy commented 2 months ago

we render mdx on our servers for our clients; if somebody were to start spamming us with long strings, it would cost a lot of money and slow down our system for everybody

Yes, that is how programming languages work. Please read the MDX docs (https://mdxjs.com/docs/getting-started/#security); this has already been addressed.

llimllib commented 2 months ago

@ChristianMurphy I truly have no idea why you're being mean here

ChristianMurphy commented 2 months ago

MDX is a full programming language; it compiles to JavaScript. If you are, as you imply, letting untrusted users run content without any restrictions, you are letting them run arbitrary JS on your server, with all the security and performance risks that go along with it. That is your choice, but it isn't a "vulnerability" or a "security issue" in MDX.


To use MDX safely you either:

  1. trust your users
  2. or, sandbox and timeout content

In both circumstances, the performance issue you bring up is a non-issue from a security perspective.


Again, happy to discuss improving parser speed. I'm mostly bringing this up because you seem to fundamentally misunderstand what is and is not a security concern.

llimllib commented 2 months ago

Yes, I know that! I do not see why you are spamming this issue with unrelated stuff.

llimllib commented 2 months ago

@ChristianMurphy I presented a fix in #9

ChristianMurphy commented 2 months ago

This parser matches GFM. GFM allows emails to include more than 1024 characters; for example, this one has 1500 characters before the @ and 1500 after:

aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa@aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa.com

yes I know that!

Then you are aware the premise of this issue is a non-issue.

just trying to make it so that long lines process in a reasonable period of time so we're not vulnerable to CPU usage attacks

I sympathize and agree with this. It can, and should, be done. You can do it by isolating the parse and render process and giving it a timeout.

llimllib commented 2 months ago

The maximum legal email address is 254 characters

An email address must not exceed 254 characters.

This was accepted by the IETF following a submitted erratum. A full diagnosis of any given address is available online. The original version of RFC 3696 described 320 as the maximum length, but John Klensin subsequently accepted an incorrect value, since a Path is defined as

You can do this by isolating the the parse and render process and giving it a timeout.

We do that already! But I would like the parser not to fail when my customers include a long line.

ChristianMurphy commented 2 months ago

Valid emails are defined by https://datatracker.ietf.org/doc/html/rfc5322 and https://datatracker.ietf.org/doc/html/rfc6854. These have no such limitation.

https://datatracker.ietf.org/doc/html/rfc3696 is a collection of recommendations for validators, not a specification of a valid email. From the introduction:

This document draws summaries of the applicable rules together in one place and supplies references to the actual standards. It does not add anything to those standards; it merely draws the information together into a form that may be more accessible.

llimllib commented 2 months ago

I have closed my PR, you're clearly not interested.

I feel that you behaved very poorly towards me, so I'll just wish you an excellent day - I feel quite terrible after this interaction.

ChristianMurphy commented 2 months ago

Thank you for your input and contributions.

I understand your concerns about performance, but it's essential that any solution aligns with GFM specifications. This project must remain compliant with those standards, and it's not designed to cater to specific performance needs of individual organizations.

I’m open to discussing performance improvements, but they must adhere to GFM. If you'd like to explore this further within those constraints, I'm happy to collaborate. Otherwise, I respect your decision to step back.

wooorm commented 2 months ago

if somebody were to start spamming us with long strings, it would cost a lot of money and slow down our system for everybody

a) MDX is dangerous in itself; if you let people write MDX, you are vulnerable. This is extensively documented.

b) The point you are making is indeed something that you should prevent, but it is not something that we should prevent. People can always pass 100mb documents to you, and that is dangerous and you should prevent it; but there are also 100mb documents possible that are (slow but) safe.


@llimllib I think you are being disrespectful. This is an open source project that owes you nothing. You owe us everything. You have never helped us. You have never given us money. You misunderstand what MDX is, what our goals are, and what this problem is. Your PR breaks GFM.

wooorm commented 2 months ago

I’ve played with a couple things, and the below code gives the same problem:

const find = /([-.a-z\d+]+)@([-a-z\d]+(?:\.[-a-z\d]+)+)/gi

const hundredA = 'a'.repeat(100)
const fileWithBreaks = `${hundredA}\n`.repeat(800)
const fileWithoutBreaks = `${hundredA}`.repeat(800)

console.time('parsing a file with breaks')
find.lastIndex = 0
find.exec(fileWithBreaks)
console.timeEnd('parsing a file with breaks')

console.time('parsing a file without breaks')
find.lastIndex = 0
find.exec(fileWithoutBreaks)
console.timeEnd('parsing a file without breaks')

So, it has nothing to do with the rest of the project: this regex, on those giant documents, has this problem. It's also noticeable in Safari, by the way, so it's not just V8. And, on changing the regex: changing the other +s doesn't do anything; the slowdown happens because of the first +.

Also, important to note, we’re talking about 80kb of data here.

wooorm commented 2 months ago

Found the limit! copy('a'.repeat(32758) + '@b.c') does not link on GH, while copy('a'.repeat(32757) + '@b.c') does. Weird how it's close to, but different from, 2**15 (32768)? Regardless, using {1,32757} instead of + just makes things slower 🤷‍♂️

wooorm commented 2 months ago

We can make the regex incredibly fast by adding (?<=^|\p{P}|\p{S}|\s) (and /u) to the regex.

parsing a file with breaks: 0.951ms
parsing a file without breaks: 0.593ms

Without it, it was:

parsing a file with breaks: 12.837ms
parsing a file without breaks: 8.809s

So that’s a giant difference.

The actual set of disallowed previous characters is a bit more complex:

https://github.com/syntax-tree/mdast-util-gfm-autolink-literal/blob/c990a627672842e8ba52fc27532c886e2491e112/lib/index.js#L274-L277

Particularly around the /. Not sure how best to create this regex: nothing, or any whitespace, punctuation, or symbol, but not /?

github-actions[bot] commented 2 months ago

Hi! This was closed. Team: If this was fixed, please add phase/solved. Otherwise, please add one of the no/* labels.

wooorm commented 2 months ago

@erunion @bnikom @ilias-t @kellyjosephprice @rafegoldberg @llimllib I will block you if you don’t change your childish ways. @llimllib proposed a bad change. I came up with a good change. And then you act unprofessional? This reflects poorly on your company, its culture.

kellyjosephprice commented 2 months ago

I'm so sorry that this thread has gone astray. @llimllib is only trying to help. Is there some place in the code of conduct or other supporting documentation that would describe how @llimllib or others have acted inappropriately?

wooorm commented 2 months ago

I’d recommend not getting hung up on particular solutions. Users know their problems well; maintainers have a better idea of solutions. I believe Bill ran into an XY problem. Some places on the internet mention 254, and 1024 was found somewhere, but those are not numbers GFM uses. Regexes can be improved in other ways.

I’d also recommend a thicker skin on the internet, particularly in dry technical places like GH. Christian is correct in saying that if you allow attackers to write in a programming language, things will get slow and it will cost you. It’s not mean to point that out.

I don’t worry too much about that though. Maybe Christian feels different.

What I found annoying is that despite that, I researched and presented a solution, a good regex, which was ignored. The URL was shared internally, everyone upvoted their friend, downvoted our feedback. That makes me feel like my work is not appreciated.

As for community info, see contributing.md, and code-of-conduct.md.

ChristianMurphy commented 2 months ago

The recent conversation crossed a professional line.

  1. Mischaracterizing the Issue: @llimllib, by persistently framing a performance issue as a security vulnerability, despite multiple clarifications, you diverted the discussion from productive problem-solving. This behavior didn’t align with the code of conduct’s emphasis on focusing on what is best for the community and gracefully accepting constructive criticism.

  2. Argumentative Behavior: @llimllib, your approach in the discussion was not in line with the collaborative and respectful environment we strive to maintain. By dismissing the maintainers' technical explanations and continuing to argue a point that had already been addressed, you created unnecessary conflict and hindered the problem-solving process. This behavior goes against our code of conduct’s standard of maintaining a constructive and respectful dialogue.

  3. Escalation: Closing the PR and expressing that you felt mistreated without further discussion was counterproductive and wasted time that could have been better spent resolving the issue. This action goes against the spirit of our contributing guide, which emphasizes collaboration and open discussion before making decisions.

@kellyjosephprice, while I understand the desire to support a coworker, it’s essential to encourage constructive behavior, especially in challenging situations. Backing unproductive actions reflects poorly on the README team and undermines the professionalism we strive to maintain.

This project serves millions of adopters, and both @wooorm and I have been generous with our time and energy to ensure it runs smoothly. @llimllib, your behavior in this situation was unprofessional and disruptive, which is not acceptable under our code of conduct.

Additionally, I want to express broader frustration with README's approach. Relying heavily on Unified, MDX, and Remark to build an entire product without contributing back, either through code or financially, and then participating in issues like this with a lack of good faith shows a disregard for the principles of open source. It’s frustrating and disappointing to see adopters, whose livelihoods depend on these tools, engage with us in such a way.

Let’s all aim to maintain a respectful and constructive tone in future interactions, ensuring that we adhere to both the contributing guide and the code of conduct.