For what it's worth, I tried finding a regular expression that worked better and was not successful; perhaps it is sensible to do something like check the line for an `@` before running the regular expression?

That would still leave a long line full of `@` as a DDoS vector, but it'd be an improvement at least?
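For illustration, a rough sketch of that pre-check idea (not the package's actual code; `findEmailPattern` here stands in for the email regex discussed in this thread):

```js
// Skip the expensive email regex entirely when a line contains no `@`.
const findEmailPattern = /([-.\w+]+)@([-\w]+(?:\.[-\w]+)+)/g

function emailsInLine(line) {
  // `includes` is a single linear scan with no backtracking.
  if (!line.includes('@')) return null
  return line.match(findEmailPattern)
}

console.log(emailsInLine('contact bill@billmill.org please')) // [ 'bill@billmill.org' ]
console.log(emailsInLine('a'.repeat(100000)))                 // null, returned quickly
```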
Welcome @llimllib! 👋 Performance improvements are welcome!
> as a DDoS vector
Respectfully, a majority of "DDoS vectors", including this one, are a nothingburger. Longer content takes longer to process; if you let users put in content of unrestricted length, it can be slow.
@ChristianMurphy we render mdx on our servers for our clients; if somebody were to start spamming us with long strings, it would cost a lot of money and slow down our system for everybody
Limiting the length of `+` matches to 1024 speeds up the test file, but it also makes tests fail and I don't quite have a handle on why:
```js
// Use {0,1024} instead of `+` to avoid pathological regular expression
// behavior; see
// https://github.com/syntax-tree/mdast-util-gfm-autolink-literal/issues/8
[
  /([-.\w+]{0,1024})@([-\w]{0,1024}(?:\.[-\w]{0,1024}){0,1024})/g,
  findEmail
]
```
That expression matches emails the same as the current one:
> "bill@billmill.org".match(/([-.\w+]{0,1024})@([-\w]{0,1024}(?:\.[-\w]{0,1024}){0,1024})/g)
Array [ "bill@billmill.org" ]
Ah ha! It's matching things it ought not to, and of course the proper replacement for `+` is `{1,1024}`, not `{0,...}`; that's what `+` means.
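A quick check of that (my own sketch, not from the thread): the `{0,…}` version can match an address with an empty local part, which `+` never allowed and which is likely part of why the tests failed.

```js
// {0,1024} allows an empty local part before the `@`; {1,1024} does not.
const zeroBound = /([-.\w+]{0,1024})@([-\w]{0,1024}(?:\.[-\w]{0,1024}){0,1024})/g
const oneBound = /([-.\w+]{1,1024})@([-\w]{1,1024}(?:\.[-\w]{1,1024}){1,1024})/g

console.log('@example.com'.match(zeroBound))     // [ '@example.com' ] -- too loose
console.log('@example.com'.match(oneBound))      // null
console.log('bill@billmill.org'.match(oneBound)) // [ 'bill@billmill.org' ]
```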
> we render mdx on our servers for our clients; if somebody were to start spamming us with long strings, it would cost a lot of money and slow down our system for everybody
Yes, that is how programming languages work. Please read the MDX docs (https://mdxjs.com/docs/getting-started/#security); this has already been addressed.
@ChristianMurphy I truly have no idea why you're being mean here
MDX is a full programming language; it compiles to JavaScript. If you are, as you imply, letting untrusted users run content without any restrictions, you are letting them run arbitrary JS on your server, with all the security and performance risks that go along with it. That is your choice, but that isn't a "vulnerability" or a "security issue" of MDX.
To use MDX safely you either:
In both circumstances, this perf issue you bring up is a non-issue from a security perspective.
Again, happy to discuss improving parser speed. I'm mostly bringing this up because you seem to fundamentally misunderstand what is and is not a security concern.
yes I know that! I do not see why you are spamming this issue with unrelated stuff
@ChristianMurphy I presented a fix in #9
This parser matches GFM. GFM allows emails to include more than 1024 characters; for example, this one has 1500 characters before the `@` and 1500 after:
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa@aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa.com
> yes I know that!
Then you are aware the premise of this issue is a non-issue.
just trying to make it so that long lines process in a reasonable period of time so we're not vulnerable to CPU usage attacks
I sympathize and agree with this. This can, and should, be done. You can do this by isolating the parse and render process and giving it a timeout.
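For anyone reading along, a minimal sketch of that isolate-and-timeout approach using Node's worker_threads (the `parse-worker.mjs` file and the message shape are hypothetical, not code from this project):

```js
import { Worker } from 'node:worker_threads'

// Run a (hypothetical) ./parse-worker.mjs in its own thread and give up after
// `ms` milliseconds, so one pathological document can't stall the server.
function parseWithTimeout(doc, ms = 1000) {
  return new Promise((resolve, reject) => {
    const worker = new Worker(new URL('./parse-worker.mjs', import.meta.url), {
      workerData: doc
    })
    const timer = setTimeout(() => {
      worker.terminate()
      reject(new Error(`parse timed out after ${ms}ms`))
    }, ms)
    worker.once('message', (result) => {
      clearTimeout(timer)
      worker.terminate()
      resolve(result)
    })
    worker.once('error', (error) => {
      clearTimeout(timer)
      reject(error)
    })
  })
}
```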
The maximum legal email address is 254 characters:

> An email address must not exceed 254 characters.
>
> This was accepted by the IETF following submitted erratum. A full diagnosis of any given address is available online. The original version of RFC 3696 described 320 as the maximum length, but John Klensin subsequently accepted an incorrect value, since a Path is defined as […]
> You can do this by isolating the parse and render process and giving it a timeout.
We do that already! But I would like the parser not to have to fail if my customers include a long line in a document.
Valid emails are defined by https://datatracker.ietf.org/doc/html/rfc5322 and https://datatracker.ietf.org/doc/html/rfc6854. These have no such limitation.

https://datatracker.ietf.org/doc/html/rfc3696 is a collection of recommendations for validators, not a specification of a valid email. From the introduction:

> This document draws summaries of the applicable rules together in one place and supplies references to the actual standards. It does not add anything to those standards; it merely draws the information together into a form that may be more accessible.
I have closed my PR, you're clearly not interested.
I feel that you behaved very poorly towards me, so I'll just wish you an excellent day - I feel quite terrible after this interaction.
Thank you for your input and contributions.
I understand your concerns about performance, but it's essential that any solution aligns with GFM specifications. This project must remain compliant with those standards, and it's not designed to cater to specific performance needs of individual organizations.
I’m open to discussing performance improvements, but they must adhere to GFM. If you'd like to explore this further within those constraints, I'm happy to collaborate. Otherwise, I respect your decision to step back.
> if somebody were to start spamming us with long strings, it would cost a lot of money and slow down our system for everybody
a) MDX is dangerous itself: if you let people write MDX, you are vulnerable; this is extensively documented.
b) The point you are making is indeed something that you should prevent, but it is not something that we should prevent; people can always pass 100mb documents to you, and that is dangerous and you should prevent it; but there are also 100mb documents possible that are (slow but) safe.
@llimllib I think you are being disrespectful. This is an open source project that owes you nothing. You owe us everything. You have never helped us. You have never given us money. You misunderstand what MDX is, what our goals are, and what this problem is. Your PR breaks GFM.
I’ve played with a couple of things, and the code below shows the same problem:
```js
// Run an email regex like the one in this package against ~80kb of `a`
// characters, once with line breaks and once without.
const find = /([-.a-z\d+]+)@([-a-z\d]+(?:\.[-a-z\d]+)+)/gi
const hundredA = 'a'.repeat(100)
const fileWithBreaks = `${hundredA}\n`.repeat(800)
const fileWithoutBreaks = `${hundredA}`.repeat(800)

console.time('parsing a file with breaks')
find.lastIndex = 0
find.exec(fileWithBreaks)
console.timeEnd('parsing a file with breaks')

console.time('parsing a file without breaks')
find.lastIndex = 0
find.exec(fileWithoutBreaks)
console.timeEnd('parsing a file without breaks')
```
So, it has nothing to do with the surrounding projects: this regex on its own, given those giant documents, has this problem.
It’s also noticeable on Safari btw, so not just V8.
And, changing the regex: changing the other `+`s doesn’t do anything. The slowdown happens because of the first `+`.
Also, important to note, we’re talking about 80kb of data here.
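To see the growth directly, here is a variation of the snippet above (my own; no timings claimed) that runs the same regex against `a`-only inputs of doubling size; if the cost is roughly quadratic, each step should take about four times as long:

```js
// Same regex as above, applied to documents of `a` characters with no `@`.
const find = /([-.a-z\d+]+)@([-a-z\d]+(?:\.[-a-z\d]+)+)/gi

for (const n of [20_000, 40_000, 80_000]) {
  const doc = 'a'.repeat(n)
  console.time(`n=${n}`)
  find.lastIndex = 0
  find.exec(doc)
  console.timeEnd(`n=${n}`)
}
```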
Found the limit! `copy('a'.repeat(32758) + '@b.c')` does not link on GH, `copy('a'.repeat(32757) + '@b.c')` does.
Weird how it’s close to, but different from, `2**15` (`32768`)? Regardless, using `{1,32757}` instead of `+` just makes things slower 🤷♂️
We can make the regex incredibly fast by adding `(?<=^|\p{P}|\p{S}|\s)` (and `/u`) to the regex.
parsing a file with breaks: 0.951ms
parsing a file without breaks: 0.593ms
Without it, it was:
parsing a file with breaks: 12.837ms
parsing a file without breaks: 8.809s
So that’s a giant difference.
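Presumably the win is that the lookbehind rejects almost every start position in a long run of word characters before the expensive part of the pattern ever runs. A sketch of the change applied to the worst case above (the exact pattern the package ends up shipping may differ):

```js
// The earlier no-breaks worst case, with the proposed lookbehind and `u` flag.
const fast = /(?<=^|\p{P}|\p{S}|\s)([-.a-z\d+]+)@([-a-z\d]+(?:\.[-a-z\d]+)+)/giu
const fileWithoutBreaks = 'a'.repeat(80_000)

console.time('parsing a file without breaks, with lookbehind')
fast.lastIndex = 0
fast.exec(fileWithoutBreaks)
console.timeEnd('parsing a file without breaks, with lookbehind')
```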
The actual set of disallowed previous characters is a bit more complex, particularly around the `/`. Not sure how to best create this regex: nothing, all whitespace, all punctuation, all symbols, but not `/`?
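One way to express that without listing every character (my own sketch, not necessarily what the package adopted): a negated class `[^\P{P}/]`, i.e. "not (non-punctuation or `/`)", which is exactly punctuation other than `/`.

```js
// Lookbehind allowing: start of string, whitespace, a symbol, or punctuation
// other than `/`. `[^\P{P}\/]` reads as "punctuation, but not a slash".
const re = /(?<=^|\s|\p{S}|[^\P{P}\/])x/u

for (const prev of ['', '.', ',', '+', '/', ' ', 'a']) {
  console.log(JSON.stringify(prev), re.test(prev + 'x'))
}
// ""  true   (start of string)
// "." true   "," true   (punctuation)
// "+" true   (symbol)
// "/" false  (excluded on purpose)
// " " true   (whitespace)
// "a" false  (a word character may not precede)
```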
Hi! This was closed. Team: If this was fixed, please add `phase/solved`. Otherwise, please add one of the `no/*` labels.
@erunion @bnikom @ilias-t @kellyjosephprice @rafegoldberg @llimllib I will block you if you don’t change your childish ways. @llimllib proposed a bad change. I came up with a good change. And then you act unprofessionally? This reflects poorly on your company and its culture.
I'm so sorry that this thread has gone astray. @llimllib is only trying to help. Is there some place in the code of conduct or other supporting documentation that would describe how @llimllib or others have acted inappropriately?
I’d recommend not getting hung up on particular solutions. Users know their problems well. Maintainers have a better idea of solutions. I believe Bill ran into an XY problem. Some places on the internet mention `254`, and `1024` was found somewhere, but those are not numbers GFM uses. Regexes can be improved in other ways. I’d also recommend a thicker skin on the internet, particularly in dry technical places like GH. Christian is correct in saying that if you allow attackers to write in a programming language, things will get slow and it will cost you. It’s not mean to point that out.
I don’t worry too much about that though. Maybe Christian feels different.
What I found annoying is that despite that, I researched and presented a solution, a good regex, which was ignored. The URL was shared internally, everyone upvoted their friend and downvoted our feedback. That makes me feel like my work is not appreciated.
As for community info, see `contributing.md` and `code-of-conduct.md`.
The recent conversation crossed a professional line.
Mischaracterizing the Issue: @llimllib, by persistently framing a performance issue as a security vulnerability, despite multiple clarifications, you diverted the discussion from productive problem-solving. This behavior didn’t align with the code of conduct’s emphasis on focusing on what is best for the community and gracefully accepting constructive criticism.
Argumentative Behavior: @llimllib, your approach in the discussion was not in line with the collaborative and respectful environment we strive to maintain. By dismissing the maintainers' technical explanations and continuing to argue a point that had already been addressed, you created unnecessary conflict and hindered the problem-solving process. This behavior goes against our code of conduct’s standard of maintaining a constructive and respectful dialogue.
Escalation: Closing the PR and expressing that you felt mistreated without further discussion was counterproductive and wasted time that could have been better spent resolving the issue. This action goes against the spirit of our contributing guide, which emphasizes collaboration and open discussion before making decisions.
@kellyjosephprice, while I understand the desire to support a coworker, it’s essential to encourage constructive behavior, especially in challenging situations. Backing unproductive actions reflects poorly on the README team and undermines the professionalism we strive to maintain.
This project serves millions of adopters, and both @wooorm and I have been generous with our time and energy to ensure it runs smoothly. @llimllib, your behavior in this situation was unprofessional and disruptive, which is not acceptable under our code of conduct.
Additionally, I want to express broader frustration with README's approach. Relying heavily on Unified, MDX, and Remark to build an entire product without contributing back, either through code or financially, and then participating in issues like this with a lack of good faith shows a disregard for the principles of open source. It’s frustrating and disappointing to see adopters, whose livelihoods depend on these tools, engage with us in such a way.
Let’s all aim to maintain a respectful and constructive tone in future interactions, ensuring that we adhere to both the contributing guide and the code of conduct.
Initial checklist
Affected packages and versions
at least the latest version
Link to runnable example
https://gist.github.com/llimllib/4b87fd4042359b4812350038562dba03
Steps to reproduce
Run the `test.mjs` file from the gist above in the root of this repository. Its output on my machine is:

In this test case, the "file with breaks" is 8000 lines of 100 `a` characters. The "file without breaks" is 800,000 'a' characters without line breaks, so actually a slightly smaller file but with no line breaks.

Here's an observable notebook that demonstrates the behavior; you can play around with the regular expression there and see how performance looks.
Expected behavior
Parsing a file with a long line should not have such severe performance effects.

(In terms that affect my team: our customers should not be able to easily DoS us by providing markdown files with long lines.)
Actual behavior
The `findEmail` regular expression here takes a very long time to parse a long line. (I think this is due to using nested quantifiers, but I am not an expert here.)
I noticed this when investigating a slowdown in our app, developed a test case, and profiled it. This regular expression jumped out:
Affected runtime and version
node 20.16.0
Affected package manager and version
all
Affected OS and version
all
Build and bundle tools
No response