scala / bug

Scala 2 bug reports only. Please, no questions — proper bug reports only.
https://scala-lang.org
232 stars 21 forks source link

DoS vulnerability in Scala 2.12 HashMap #11203

Closed jroper closed 5 years ago

jroper commented 5 years ago

In 2011, a vulnerability was raised against Java application servers about a DoS possibility that exploited Java String's vulnerability to collisions. This vulnerability was so widespread and fundamental that it was fixed not in the application servers, but in the JDK's HashMap implementation.

Scala's HashMap has the same vulnerability. This has been brought to our attention through this issue raised against play-json:

https://github.com/playframework/play-json/issues/186

This vulnerability doesn't just affect play-json. It affects anything that uses Scala's default map implementation to store String keyed data where the keys are controlled remotely. So, HTTP headers, HTTP forms, JSON, any library that uses Scala's Map for any of these is vulnerable, so that includes Play, Akka HTTP, and many, many other Scala libraries.

The fix that the JDK did is quite simple, when buckets in the hash table got too big due to poor hashing (or malicious collisions), it reverted to essentially using a TreeMap in the bucket, and if, determined by reflection, the keys are Comparable, it uses the compareTo method to compare them.

Currently, we use ListMap in the case of collisions:

  // 32-bit hash collision (rare, but not impossible)
  new HashMapCollision1(hash, ListMap.empty.updated(this.key,this.value).updated(key,value))

Obviously that comment is wrong, if you're under attack, they won't be rare at all. I think a simple solution here would be to modify HashMapCollision1 such that it has both a ListMap and a TreeMap, anything that implements Comparable can be put in and queried from the TreeMap, and everything else in the ListMap.

I don't think there's much consequence to doing this, the biggest impact will be a potential change in iteration order of colliding elements when you merge two maps - if the keys implement Comparable, the ordering will change from keys from the first map followed by keys in the second map to lexical ordering, and if you're mixing Comparable and non Comparable keys, the ordering gets a bit weirder. But it's still stable. And, who really depends on ordering in hash maps? And it only affects ordering when there are collisions. There's also a slight increase in space used, but again, it's only for collisions, for 99.9999% of the world, it will have no impact.

sjrd commented 5 years ago

How do you suggest we test whether something is comparable. Even if it is an instance of Comparable, one cannot know what are valid arguments to the compareTo method. Invalid arguments will throw ClassCastExceptions, so you can't simply create a TreeMap.

Also, that would fix it for Strings, but not for any custom case class that contains a String, because those would suffer from the same poor hash, without being Comparable anyway.

smarter commented 5 years ago

Another option would be to use a random seed: https://doc.rust-lang.org/std/collections/struct.HashMap.html

smarter commented 5 years ago

(some background on SipHash: http://131002.net/siphash/siphash.pdf, see in particular section 7)

smarter commented 5 years ago

... but I guess any kind of randomness wouldn't play well with sending things across the wire through serialization though

szeiger commented 5 years ago

Here's on overview of the current situation (as of 2.13.0-M5): All our hash maps and hash sets are affected (and they all need to be fixed individually)

[info] Benchmark                              (comparable)  (size)  Mode  Cnt           Score           Error  Units
[info] HashCollisionBenchmark.champHashMap           false      10  avgt   10        3087.112 ±       196.344  ns/op
[info] HashCollisionBenchmark.champHashMap           false     100  avgt   10       82826.466 ±      1242.476  ns/op
[info] HashCollisionBenchmark.champHashMap           false    1000  avgt   10     6008195.291 ±     83021.908  ns/op
[info] HashCollisionBenchmark.champHashMap           false   10000  avgt   10   799354669.450 ±  42798033.928  ns/op
[info] HashCollisionBenchmark.champHashMap            true      10  avgt   10        3206.333 ±        66.366  ns/op
[info] HashCollisionBenchmark.champHashMap            true     100  avgt   10       83229.784 ±      1723.577  ns/op
[info] HashCollisionBenchmark.champHashMap            true    1000  avgt   10     6023586.625 ±     99315.284  ns/op
[info] HashCollisionBenchmark.champHashMap            true   10000  avgt   10   801143110.200 ±  37457371.571  ns/op
[info] HashCollisionBenchmark.champHashSet           false      10  avgt   10        2661.455 ±        34.165  ns/op
[info] HashCollisionBenchmark.champHashSet           false     100  avgt   10       47023.016 ±       502.364  ns/op
[info] HashCollisionBenchmark.champHashSet           false    1000  avgt   10     4734575.751 ±     49540.034  ns/op
[info] HashCollisionBenchmark.champHashSet           false   10000  avgt   10   451249571.900 ±   4053862.848  ns/op
[info] HashCollisionBenchmark.champHashSet            true      10  avgt   10        2682.528 ±        59.833  ns/op
[info] HashCollisionBenchmark.champHashSet            true     100  avgt   10       47751.985 ±       751.482  ns/op
[info] HashCollisionBenchmark.champHashSet            true    1000  avgt   10     4830274.370 ±     87407.950  ns/op
[info] HashCollisionBenchmark.champHashSet            true   10000  avgt   10   460022326.500 ±   5370622.228  ns/op
[info] HashCollisionBenchmark.javaHashMap            false      10  avgt   10         435.207 ±         4.483  ns/op
[info] HashCollisionBenchmark.javaHashMap            false     100  avgt   10       78052.174 ±      1002.036  ns/op
[info] HashCollisionBenchmark.javaHashMap            false    1000  avgt   10     6690602.312 ±    132158.499  ns/op
[info] HashCollisionBenchmark.javaHashMap            false   10000  avgt   10  1064348588.700 ±  21725697.542  ns/op
[info] HashCollisionBenchmark.javaHashMap             true      10  avgt   10         435.106 ±         5.557  ns/op
[info] HashCollisionBenchmark.javaHashMap             true     100  avgt   10       10339.026 ±       199.182  ns/op
[info] HashCollisionBenchmark.javaHashMap             true    1000  avgt   10      138461.502 ±      5488.686  ns/op
[info] HashCollisionBenchmark.javaHashMap             true   10000  avgt   10     1383198.954 ±     19537.305  ns/op
[info] HashCollisionBenchmark.javaHashSet            false      10  avgt   10         431.236 ±         5.965  ns/op
[info] HashCollisionBenchmark.javaHashSet            false     100  avgt   10       78141.561 ±      1103.317  ns/op
[info] HashCollisionBenchmark.javaHashSet            false    1000  avgt   10     6740530.754 ±     90071.057  ns/op
[info] HashCollisionBenchmark.javaHashSet            false   10000  avgt   10  1065812482.700 ±  21146822.799  ns/op
[info] HashCollisionBenchmark.javaHashSet             true      10  avgt   10         429.406 ±         8.252  ns/op
[info] HashCollisionBenchmark.javaHashSet             true     100  avgt   10       10467.322 ±       252.287  ns/op
[info] HashCollisionBenchmark.javaHashSet             true    1000  avgt   10      142079.811 ±      3196.623  ns/op
[info] HashCollisionBenchmark.javaHashSet             true   10000  avgt   10     1667129.061 ±    413304.829  ns/op
[info] HashCollisionBenchmark.mutableHashMap         false      10  avgt   10         322.683 ±         5.309  ns/op
[info] HashCollisionBenchmark.mutableHashMap         false     100  avgt   10       19328.574 ±       604.139  ns/op
[info] HashCollisionBenchmark.mutableHashMap         false    1000  avgt   10     1335647.124 ±     34504.419  ns/op
[info] HashCollisionBenchmark.mutableHashMap         false   10000  avgt   10   130574061.538 ±   2898111.233  ns/op
[info] HashCollisionBenchmark.mutableHashMap          true      10  avgt   10         334.757 ±        30.975  ns/op
[info] HashCollisionBenchmark.mutableHashMap          true     100  avgt   10       19717.810 ±       880.020  ns/op
[info] HashCollisionBenchmark.mutableHashMap          true    1000  avgt   10     1349523.554 ±     61111.205  ns/op
[info] HashCollisionBenchmark.mutableHashMap          true   10000  avgt   10   136440902.025 ±   8937760.447  ns/op
[info] HashCollisionBenchmark.mutableHashSet         false      10  avgt   10         210.637 ±        10.214  ns/op
[info] HashCollisionBenchmark.mutableHashSet         false     100  avgt   10       51278.552 ±       862.355  ns/op
[info] HashCollisionBenchmark.mutableHashSet         false    1000  avgt   10     7254599.344 ±     77418.016  ns/op
[info] HashCollisionBenchmark.mutableHashSet         false   10000  avgt   10   582626129.000 ±  22068501.128  ns/op
[info] HashCollisionBenchmark.mutableHashSet          true      10  avgt   10         207.018 ±         3.522  ns/op
[info] HashCollisionBenchmark.mutableHashSet          true     100  avgt   10       50332.322 ±       722.652  ns/op
[info] HashCollisionBenchmark.mutableHashSet          true    1000  avgt   10     7361789.516 ±    153332.591  ns/op
[info] HashCollisionBenchmark.mutableHashSet          true   10000  avgt   10   581922084.150 ±  10862902.903  ns/op
[info] HashCollisionBenchmark.oldHashMap             false      10  avgt   10        1509.231 ±        92.854  ns/op
[info] HashCollisionBenchmark.oldHashMap             false     100  avgt   10       59816.860 ±     29110.914  ns/op
[info] HashCollisionBenchmark.oldHashMap             false    1000  avgt   10     8008288.752 ±     82106.263  ns/op
[info] HashCollisionBenchmark.oldHashMap             false   10000  avgt   10  1180388228.050 ± 554803591.812  ns/op
[info] HashCollisionBenchmark.oldHashMap              true      10  avgt   10        1447.969 ±        33.201  ns/op
[info] HashCollisionBenchmark.oldHashMap              true     100  avgt   10       59973.289 ±     28878.087  ns/op
[info] HashCollisionBenchmark.oldHashMap              true    1000  avgt   10     8123624.173 ±    177455.498  ns/op
[info] HashCollisionBenchmark.oldHashMap              true   10000  avgt   10  1117844772.850 ± 476160439.425  ns/op
[info] HashCollisionBenchmark.oldHashSet             false      10  avgt   10         341.151 ±         6.199  ns/op
[info] HashCollisionBenchmark.oldHashSet             false     100  avgt   10       15054.079 ±       659.371  ns/op
[info] HashCollisionBenchmark.oldHashSet             false    1000  avgt   10     4705176.794 ±     92712.245  ns/op
[info] HashCollisionBenchmark.oldHashSet             false   10000  avgt   10   469873977.200 ±   8556272.917  ns/op
[info] HashCollisionBenchmark.oldHashSet              true      10  avgt   10         336.080 ±         5.549  ns/op
[info] HashCollisionBenchmark.oldHashSet              true     100  avgt   10       15147.252 ±      1076.673  ns/op
[info] HashCollisionBenchmark.oldHashSet              true    1000  avgt   10     4761012.752 ±     77525.665  ns/op
[info] HashCollisionBenchmark.oldHashSet              true   10000  avgt   10   463202453.300 ±  37754187.404  ns/op

Code in https://github.com/scala/scala/compare/2.13.x...szeiger:wip/hash-collision

szeiger commented 5 years ago

BTW, we use different collision resolution schemes in our implementation. I haven't looked at all the immutable maps and sets yet but I just started working on mutable HashMaps and HashSets (independently of this ticket). mutable.HashMap uses separate chaining with linked lists. This could be changed to tree-based chaining while keeping the general performance/memory characteristics. mutable.HashSet uses open addressing where the proposed fix doesn't apply. We'd have to change it to use separate chaining, too.

Ichoran commented 5 years ago

I think the Java solution was probably expedient for Java, but it isn't inherently the right way to fix the issue. If you want a HashMap which is robust to an attack on the hash values of the keys, you can't guarantee success if you put off until runtime the attempt to use compareTo.

Instead, we need a new type of map where the keys are orderable but are not necessarily ordered. (Likewise with sets.)

Then we have the compile-time guarantee (so long as the ordering is sensible) that DOS cannot occur. This is far better for people wishing to produce robust systems, I think.

Alternatively, if we want a quick hacky fix, I'd just intercept the hashing of String and use a better hash on it to reduce the risk. Changing our map datastructures to include a tree that won't even work, and doesn't play nice with things that can be ordered via a typeclass but don't have the trait baked in, is not a direction that I think we should go.

SethTisue commented 5 years ago

discussion at https://gitter.im/scala/contributors?at=5bc1dbe41e23486b93b70784 ("Do people expect hash maps to be secure against collisions of untrusted data?")

szeiger commented 5 years ago

Customizable hashing is another option. If hash collections alloweduser-defined hashing methods you could use a more secure seeded hash for security-critical applications.

szeiger commented 5 years ago

And no matter which way we go, 2.12.8 as a target is probably not possible (at least not for all affected collection types). Even with the magical behind-the-scenes hack that Java uses you need to change the internal data structures, which are not actually internal. They are exposed with package-private visibility and actively used by libraries like https://github.com/scala/scala-java8-compat/

jrudolph commented 5 years ago

If this is only (mostly) about the DOS attack vectors, could we just limit the number of allowed collisions and fail if there's a (configurable) suspicious ratio of collisions?

jrudolph commented 5 years ago

... but I guess any kind of randomness wouldn't play well with sending things across the wire through serialization though

You mean, under the expectation that you can compare hashCodes across JVM instances?

Otherwise, you don't really have to transfer hash codes with serialization and just recalculate them when deserializing.

smarter commented 5 years ago

I really don't want my hashmaps to have arbitrary implementation defined failures. if inserting certain elements starts failing in some way then my application is crippled, and this can still cause denial of services

smarter commented 5 years ago

You mean, under the expectation that you can compare hashCodes across JVM instances?

Yes, no clue what would break if this assumption is violated

jrudolph commented 5 years ago

I really don't want my hashmaps to have arbitrary implementation defined failures. if inserting certain elements starts failing in some way then my application is crippled, and this can still cause denial of services

It's not the same failure mode, though. It's like rejecting inputs that would lead to an OOM preventively with an exception. In this case, it would throw an exception if input was detected which has characteristics that don't fit the assumptions about your data when you chose a HashMap in the first place (i.e. that hashCodes are uniformly distributed). In which way would that cripple the application on innocent inputs?

smarter commented 5 years ago

Imagine a persistent hashmaps shared between multiple users, if I fill it with enough malicious data, then any attempt by other users to add something to it would throw an exception

jrudolph commented 5 years ago

Imagine a persistent hashmaps shared between multiple users, if I fill it with enough malicious data, then any attempt by other users to add something to it would throw an exception

Probably not, because while adding malicious data would push the ratio (of collisions per size) towards the limit adding any other data would in average move the ratio away from the limit.

SethTisue commented 5 years ago

spray-json switched to TreeMap: https://github.com/spray/spray-json/issues/277 ("for reasonably sized objects with a size < 100, a TreeMap is only ~6% slower than a HashMap")

lrytz commented 5 years ago

Summary of our discussions.

2.12.8

@retronym said (I hope not to twist his words) he thinks the issue is well-known, and programmers would select the right data strucutres in security-sensitive cases.

What would be our recommendations for 2.12?

jrudolph commented 5 years ago
  • Use the Java implementations (with Scala wrappers?) for String keys?

How would you do it in practice? Wrapping a mutable collection would mean copying the whole underlying map on each update?

Did you actually consider adding a limit to the number of collisions? I'd rather have guaranteed performance characteristics than degrading performance (and added complexity) just to cover adverse cases.

I guess it would need a bit of research finding a reasonable number of collisions to consider.

I didn't find a closed form formula that would give the probability that you would find at least one k-way-collision when sorting 2^31 elements (max size of collection) into 2^32 bins (hashCode gives 32bit values). I sorted 100M random numbers into 200M bins and found a few bins with 8 collisions. Obviously, that's not a proper experiment as Java's hashCodes are notoriously badly distributed. Java's HashMap chose length 8 before switching to a TreeMap, but probably more for the reason that linear scanning may just be cheaper before. If we would limit it e.g. to 100 (even hard-code that limit), we would have a bit of leeway for badly distributed hashCodes but would still avoid those really expensive cases.

You may find this limit arbitrary but I'd say it's not more arbitrary than the 2^31 size limit or other limits.

@retronym said (I hope not to twist his words) he thinks the issue is well-known, and programmers would select the right data strucutres in security-sensitive cases.

I think that has already been somewhat disproven with just this set of cases.

szeiger commented 5 years ago

How would you do it in practice? Wrapping a mutable collection would mean copying the whole underlying map on each update?

This is the recommendation for mutable maps. The best we have in 2.12 for immutable maps is TreeMap, which may or may not be an acceptable alternative performance-wise.

smarter commented 5 years ago

@SethTisue Shouldn't this be kept open until we at least have proper user documentation on which maps are "DoS-safe" and which aren't ?

SethTisue commented 5 years ago

I wouldn't be opposed to be re-opening it, but I don't think it's a blocker for 2.12.8. @lrytz?

He-Pin commented 5 years ago

Could reschedule to 2.12.9.

lrytz commented 5 years ago

It's not blocking 2.12.8. I think @szeiger is working on a ordering-based implementation for 2.13?

SethTisue commented 5 years ago

I think @szeiger is working on a ordering-based implementation for 2.13?

https://github.com/scala/scala/pull/7633 was merged and included in 2.13.0

perhaps a pull request backporting CollisionProofHashMap to 2.12+2.11 would be accepted as an addition to https://github.com/scala/scala-collection-compat

plokhotnyuk commented 5 years ago

@SethTisue I have created a ticket on behalf of you: https://github.com/scala/scala-collection-compat/issues/234

plokhotnyuk commented 5 years ago

Currently any Scala collection is vulnerable when affected maps or sets are used internally in methods like: toMap, toSet, keys, keySet, distinct, groupBy, --, etc.