rgrove / crass

A Ruby CSS parser that's fully compliant with the CSS Syntax Level 3 specification.
MIT License
138 stars 14 forks source link

Odd number parsing #10

Closed veesahni closed 4 years ago

veesahni commented 4 years ago

Environment: MRI Ruby 2.6.5 Crass 1.05

Real excerpt from an email sent by Outlook:

p.5e1367490fa5f06927cafe55msonormal {
  mso-style-name:5e1367490fa5f06927cafe55msonormal;
}

Test code:

require 'crass'
require 'pp'

str =  <<END
p.5e1367490fa5f06927cafe55msonormal {
  mso-style-name:5e1367490fa5f06927cafe55msonormal;
}
END

PP.pp Crass.parse str

Output: Can't paste on github since output is > 65KB. There are some giant numbers that result from attempting to convert "5e1367490fa5f06927cafe55msonormal" into a number.

Analysis: It appears that Crass is trying to parse these names (class name, style name) into a number. This is ending up in a giant number that results in large amounts of memory usage in the generated parse tree. Is this a bug?

rgrove commented 4 years ago

Fascinating!

Your analysis seems to be correct. It looks to me like Crass is actually parsing this CSS correctly according to the spec, which results in 5e1367490 being treated as a number token representing the number 5×10^1367490.

In this case the value of the number token shouldn't really matter since it ultimately won't be used as a number. But Crass doesn't impose an upper bound on numbers, and this is a ridiculously large one, which is what's causing the problem. Seems like the right thing to do here would probably be to clamp number values to a maximum value, but I (or someone) will need to do some research to see what, if anything, the available standards say about this in order to figure out what that number should be.

veesahni commented 4 years ago

Attempting to represent this in ruby:

2.6.5 :085 > 5e1367490
 => Infinity
2.6.5 :086 > 5e13674
 => Infinity
2.6.5 :087 > 5e1367
 => Infinity
2.6.5 :088 > 5e136
 => 5.0e+136
2.6.5 :089 > 5e136.class
 => Float
2.6.5 :090 > 5e136 == 5e136.to_i
 => true
2.6.5 :091 > 5e136.to_i
 => 50000000000000001642078124460246303949350628317980584775615671312937350344939399777200065781386370634197475239216121778932424531710574592

Note that 5e1367490.to_i raises an exception (since it's Infinity!). So ultimately this is a larger number than Ruby will easily let you set in the first place. It seems you're able to even generate this in Crass as a result of the formula that came from the CSS spec (tokenizer.rb/convert_string_to_number).

An exponent can be either a very large number or a very small number, so it's normally represented as a float (which does have limits). However, in this exact case, the giant number (5×10^1367490) is being handled as a Bignum internally and I can't find any reference to limits there except those imposed by system resources.

Suggestions for limits

Since Ruby would normally represent an exponent as a Float, we could use the MAX/MIN values provided by ruby-core / Float:

MAX
The largest possible integer in a double-precision floating point number.

Usually defaults to 1.7976931348623157e+308.

MAX_10_EXP
The largest positive exponent in a double-precision floating point where 10 raised to this power minus 1.

Usually defaults to 308.

MIN
The smallest positive normalized number in a double-precision floating point.

Usually defaults to 2.2250738585072014e-308.

If the platform supports denormalized numbers, there are numbers between zero and Float::MIN. 0.0.next_float returns the smallest positive floating point number including denormalized numbers.

MIN_10_EXP
The smallest negative exponent in a double-precision floating point where 10 raised to this power minus 1.

Usually defaults to -307.

Suggestion to convert to float

Currently, Crass represents 5e100000 as an Integer and 5e-100000 as a Rational.

If exponents are typically represents as floats, the principle of least surprise would suggest that they should probably be parsed into a ruby Float in the first place... It also appears to be the more memory efficient option:

2.6.5 :187 > ObjectSpace.memsize_of(1e305)
 => 40
2.6.5 :188 > ObjectSpace.memsize_of(1e305.to_i)
 => 168
rgrove commented 4 years ago

@veesahni Thanks for that excellent summary. I agree that using a Float sounds like the best option.

I may have time to work on this this weekend, but if you need this fixed sooner a PR would be welcome!

rgrove commented 4 years ago

This is fixed in PR #11. @veesahni, would you mind trying that change out and verifying that it solves your problem?

veesahni commented 4 years ago

In real world samples, we give Crass hundreds/thousands such exponent looking names in a single call to parse. The giant numbers were causing RSS (resident set size) to balloon. Sending similar samples through the new code keeps RSS in check - so yes, this solves the problem!

Appreciate the quick turn around :)