Vulnerable Regular Expressions

cristianstaicu commented 7 years ago

The following regular expressions used in underscore and unescapeHTML methods are vulnerable to ReDoS:

/([A-Z\d]+)([A-Z][a-z])/g
/\&([^;]+);/g

The slowdown is moderately low (for 50,000 characters around 2 seconds matching time). I would suggest one of the following:

remove the regex,
anchor the regex,
limit the number of characters that can be matched by the repetition,
limit the input size.

If needed, I can provide an actual example showing the slowdown.

kvanbere commented 7 years ago

anchor the regex

Could you please elaborate on the fix, I'm curious. Thank you!

cristianstaicu commented 7 years ago

Sorry, that was a standard comment I sent around. In this case it would not work! The anchoring solution is well suited when the regex is used for validation purposes. Let us say you want to check an input string is an email address. Instead of having a simple regex like:

/[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]/

you should have something like:

/^[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]$/

Word bounadries can also be used as explained here: https://www.regular-expressions.info/email.html

This will decrease the complexity of the matching since the engine will only try to match the regex once, not n (length of input string) times like in the case of the first regex.

kvanbere commented 7 years ago

Thanks for the explanation!

I hope this gets fixed soon, it's giving my package red lights :(

burt202 commented 7 years ago

This issue now seems to being picked up by an nsp check making our builds red too. See https://nodesecurity.io/advisories/536

danzGentrack commented 7 years ago

The slowdown is moderately low (for 50,000 characters around 2 seconds matching time).

Just wondering why it is a problem? For example, the string "underscore" algorithm could be a very complex one and may need lots of seconds to process a 50,000 characters long string.

cristianstaicu commented 7 years ago

I recommend the following two articles for understanding the issue: https://blog.liftsecurity.io/2014/11/03/regular-expression-dos-and-node.js/ https://dl.acm.org/citation.cfm?id=3065916

Basically, the problem comes from the fact that the regex matching is done in the main loop of Node.js and thus blocking this loop for 2 seconds is equivalent to blocking all the incoming requests for 2 seconds.

danzGentrack commented 7 years ago

Thanks for the info. The example in the first link is a classic evil regex and it can be written in a more efficient way. var emailExpression = /^([a-zA-Z0-9_.-])+\@(([a-zA-Z0-9-])+.)+([a-zA-Z0-9]{2,4})+$/;

The regex used in the "underscore" case is fine: /([A-Z\d]+)([A-Z][a-z])/g , which may not incur much performance penalty compared to an alternative implementation. I understand the impact of slowing regex matching to an application. My point is that if a regular expression is used correctly, then it is not a problem to be slow.

revelt commented 7 years ago

For posterity, Snyk also reports it: https://snyk.io/vuln/npm:string:20170907

I'm looking forward to solution. The html-stripping features, for example, at the competition are very poor and I can't find the alternative for String.js..

ericnorris commented 7 years ago

Hey there! I haven't looked too closely at this issue but it was linked by @revelt above. I'd definitely suggest using the striptags library for removing HTML. It is based on the PHP function strip_tags, which is widely known to be safe.

In the README I link to two articles, one indicating the regular expressions are not safe for removing dangerous HTML (here), and the second demonstrating that PHP's strip_tags function is not vulnerable (here).

https://github.com/ericnorris/striptags take a look to see if that works for you - it uses a state machine very similar to the one that is in the PHP internals instead of a regular expression. As a result it is super fast and just as secure!

stevenvachon commented 6 years ago

This is a critical issue. The author and the contributors don't seem to care at all. Kill this project with fire.

revelt commented 6 years ago

@steve come on, it's not that bad :) team is just regrouping, I'm sure guys will fix that eventually

yieme commented 6 years ago

@jprichardson, string.js is a nice library and I would prefer to continue using it; however, such a vulnerability isn't something that can be used in production. Are you aware of this issue or has the project been abandoned as the last update was about 11 months ago?

nconf-base commented 6 years ago

@revelt, I can appreciate the concerns raised by @steve and @yieme correctly points out lack of update activity for almost a year. This vulnerability has been live for over a month. At what point should we consider "eventually" unacceptable?

revelt commented 6 years ago

@nconf-base I think now, because if you have string.js in production, your code is vulnerable now. Personally, I'm going to switch to alternatives asap and equally come back asap when it's fixed.

As far as my stuff is concerned, I already replaced every string.js methods with alternatives on my libraries except I can't find a good quality, nothing-assumed tool to strip HTML tags from strings. Within October I plan to release the alternative, string-only, no-regex library which strips HTML, which will untie me completely from string.js.

I'm thinking, it can't be that all the methods of string.js are affected, can it? If some are OK, maybe we could split string.js into separate libs like lodash does with lodash.* and so on? This way, affected methods/libraries would be isolated and it would be easier to pinpoint the culprits. You know, divide and conquer approach...

nconf-base commented 6 years ago

@revelt, I think you are correct the issue only affects a couple of the methods, specifically underscore and unescapeHTML, as per https://nodesecurity.io/advisories/536

I think your idea of decomposing the larger string.js library into smaller pieces might help those who use the library without those particular methods. Splitting out the problem methods on a fork might be a start.

How does your strip tags method differ from https://github.com/ericnorris/striptags that eric pointed out?

revelt commented 6 years ago

@nconf-base it's assuming too much what results in bunch of false positives. I just scratched the surface with that issue, many of my unit tests fail because of striptags if I switch. Same with other similarly-named libraries from the first npm results page, I raised issues on at least two others.

For the record, string.js copes fine stripping HTML from stuff like aaa<bbb.

ericnorris commented 6 years ago

@revelt, I'm not sure I would agree that striptags results in "false positives". Any function that ostensibly strips HTML must assume all input is HTML, and as such must remove tags, incomplete or not.

Leaving aaa<bbb may result in unsafe HTML if it is concatenated with stuff on the page. A more concrete example is if it is fed aaa<script src="evil.js". The tag is incomplete, but when placed on a page (or concatenated with other safe HTML) may end up producing valid HTML that results in evil.js executing. That's not acceptable for a library responsible for removing HTML.

If the input is "safe", then < should be HTML encoded, e.g. aaa<bbb.

@nconf-base I would strongly suggest using striptags! I am, of course, biased, but it is based on the very popular function from PHP (http://php.net/manual/en/function.strip-tags.php) and is battle-tested.

revelt commented 6 years ago

@ericnorris I see your point. Basically, there are two kinds of HTML-stripping cases and we both are proponents of each.

My case is Detergent.js, email text copy cleaning app - HTML can come in the input, there is high risk of false positives, but zero chance of rogue code being executed and code if does come, it comes unescaped. This way, code a < b and c > d is very likely and legit (consider even b as valid tag name).

Your case is a regular web-dev ops where you clean HTML in order to defuse potentially malicious code. Prioritising security over risk of false positives. All code assumed to be escaped and unescaped-one a no-no.

String.js was this middle-ground solution, being able to detect aaa<bbb correctly, yet, at the same time, reliably strip malicious HTML for years in everybody's apps. It tended both sides. I'd gladly use striptags if it tended both sides as well.

Do you see my point?

ghost commented 6 years ago

@revelt that's not tending both sides. < is never safe in HTML and under some legacy parsing modes I think the browser may even auto-complete what it thinks is an incorrectly truncated tag with >. Any library that leaves any < inside something to be embedded into HTML is simply unsafe to use for HTML output, plain and simple. (since < outside of a HTML tag is never valid)

Edit: If you refer to the use case of stripping first and then replacing any remaining < and > afterwards with &lt and &gt anyway, I suppose that's fair enough.

revelt commented 6 years ago

@JonasT Yes, you are right, it's not safe and not valid in HTML. But... Detergent accepts both HTML and pure text and outputs text with sprinkled HTML or without. There is a thin path in here where > is valid both as input and output. For example, if people want to remove invisible characters from some text copy. Like a < b and c > d\u0003 (invisible ETX character in the end). That's a legit request and Detergent will do that. I want to produce a < b and c > d, not a d\u0003 or a d. Do you see my point?

Since user pastes the copy into Detergent, input is assumed to be safe, so I see no problems letting them output unencoded < if they wish (encoding is turned on by default, but it can be turned off).

Having said that, I'm about to publish my type of HTML stripping tool which behaves like that.

StorytellerCZ commented 6 years ago

@az7arul Are you currently maintaining this or who is in charge here?

az7arul commented 6 years ago

Hi @StorytellerCZ, I wasn't maintaining this for a while. But for this I can take a look this weekend, meanwhile any PR that address this is welcome.

ghost commented 6 years ago

// This function converts a binary string (can be HTML or JavaScript) to a safe string that can be used in a HTML file within JavaScript tags. function EncodeHTML(S) { var i, C; S = S.split(""); for (i = 0; i < S.length; i++) { C = S[i].charCodeAt(0); if (C == 7) S[i] = "\t"; else if (C == 10) S[i] = "\r"; else if (C == 13) S[i] = "\n"; else if (C == 92) S[i] = "\\"; else if (C == 34 || C == 38 || C == 39 || C == 60 || C == 62 || C < 32 || C > 126) S[i] = "\x" + toHex(C); } return '"' + S.join("") + '"'; }

// This function converts a Js safe string with quotation marks around it to its original source. It's the opposite of EncodeHTML() function DecodeHTML(S) { S += ""; if (S.length > 1) { if (S.charAt(0) == '"' || S.charAt(0) == "'") S = S.slice(1, -1); } S = S.split("\t").join("\t").split("\r").join("\r").split("\n").join("\n").split("\\").join("\").split("\x"); for (var i = 1; i < S.length; i++) S[i] = String.fromCharCode(parseInt(S[i].substr(0, 2), 16)) + S[i].slice(2); return S.join(""); }

ghost commented 6 years ago

Oops. I just noticed that the above code didn't come out right. I should have put that code within CODE section, not as plain text, because the double backslash characters have been replaced with single. :/

najibk commented 6 years ago

Hello, any news about this ? Thanks

angleman commented 6 years ago

@az7arul, thank you for looking into this. We have nsp check as part of our deploy process and this security issue breaks the build.

az7arul commented 6 years ago

I looked into this briefly and used https://github.com/jagracey/RegEx-DoS to scan for other vulnerabilities and more were reported by the scanner. I am afraid I have no quick fix for this right now.

I don't have a timeframe when this is going to be solved though.

commenthol commented 6 years ago

Hi all,

This should fix the 1st vulnerability:

function underscore (str) {
  var s = str.trim().replace(/\s*([A-Z])/g, function (_, m) {
    return '_' + m.toLowerCase()
  })
  return s
}
let s = underscore('oneAtATime AnotherWordAtATime')
assert.equal(s, 'one_at_a_time_another_word_at_a_time')

The 2nd one should be gone with limiting the the number of chars an entity might have. To my knowledge <blockquote> or <figcaption> are the longest HTML elements.

/\&([^;]+);/ --> /&([^;]{1,20});/

BTW there is no need to escape on &

@cristianstaicu et all would be nice if you could review these thoughts? Thanks.

ankurloriya commented 6 years ago

Is there any alternative package for this library.? We are using few functions mostly humanize. We are resolving security concern to the our project.

revelt commented 6 years ago

@ankurloriya No, the all-in-one monolithic replacement doesn't exist.

But you can replace its methods one-by-one with libraries from npm. For example, I replaced replaceAll() and with replace-string, lines() with split-lines. There were no suitable HTML tag stripping libraries to replace stripTags() so I coded my own. If your unit tests are thorough they will pick up any inconsistencies (or their absence) of a replacement library. I found it difficult to judge replacements by their readmes, - my unit tests were the judge.

Performance-wise, switching from monolithic string.js to separate libs will reduce your bundle's footprint too. You might even get some of them as Rollup'ed 3-in-1 builds today, where your Webpack/Rollup will tap their ES module's build directly. On other hand, it will take time for string.js to migrate to Rollup, it doesn't have an ES Module build at the moment.

yieme commented 6 years ago

We changed to using various other libraries due to this security issue. Appears the project may be abandoned with #218 the http://stringjs.com/ domain apparently going away. Sad, it was a good library to have a lot of useful tools in one place. Hopefully it gets turned around for people. We just couldn't wait any longer for the security issue to be resolved. Cheers.

jprichardson / string.js

Vulnerable Regular Expressions #212