twitter / algebird

Abstract Algebra for Scala
https://twitter.github.io/algebird
Apache License 2.0
2.29k stars 345 forks source link

IllegalArgumentException when computing HLL intersection size #509

Open jnievelt opened 8 years ago

jnievelt commented 8 years ago

I suspect this is a sort of edge case in the HLL approximation algo, but it seems there should be a better way to surface this than this exception.

I've reconstructed the case as reported, coming up with two sets with sizes 428 & 395 and intersection size 67:

val set1 = Set(645, 892, 1501, 809, 629, 18360, 760, 1633, 4899, 1729, 1083, 1982, 5205, 30675, 3566, 12507, 88, 411188, 212124, 481, 1158, 762, 63015, 2062, 6463, 1544, 13707, 6830, 4326, 655, 269, 11724, 5967, 385, 4418, 1310, 533, 142, 3044, 31359, 1164, 5146, 814, 2901, 5636, 153, 16456, 2985, 709, 7136, 7235, 1645, 257, 389, 2916, 3425, 1430, 9828, 1298, 2339, 10865, 7683, 7771, 9315, 6662, 817, 21394, 1379, 9480, 10516, 5127, 421, 1479, 2520, 5746, 2158, 3708, 4460, 57, 1362, 1847, 2178, 1489, 492, 164, 9283, 591, 443, 321, 1418, 3983, 5680, 514, 2291, 5985, 5360, 5448, 2833, 2523, 2371, 2137, 11490, 2002, 998, 9236, 18656, 5489, 24794, 1242, 14654, 16858, 507, 7170, 1564, 33823, 4604, 380, 439, 678, 6482, 11032, 108925, 714, 1868, 2061, 5281, 11698, 2264, 44824, 1676, 1487, 192, 3371, 5470, 275, 33, 10721, 1837, 625, 2433, 38539, 522, 3546, 2112, 456, 1740, 6709, 239586, 3105, 7329, 317, 650, 2841, 725, 403, 1878, 53, 5919, 356, 14840, 3746, 5081, 783, 471, 256, 225, 15523, 5496, 3407, 669, 27356, 2080, 33332, 547, 1825, 393, 6752, 647, 1150, 1539, 3059, 1192, 911, 3115, 41, 779, 2140, 2756, 3083, 503, 1333, 2749, 5544, 15121, 792, 1086, 266, 7721, 2208, 1695, 10069, 2057, 1286, 17250, 11769, 1026, 586, 718, 888, 750, 1764, 2176, 513, 166, 3079, 3289, 180, 2376, 42819, 58760, 3319, 402, 772, 24649, 1652, 4926, 17964, 958, 1403, 6019, 18863, 641, 391, 5185, 3051, 3077, 29029, 4099, 11423, 8059, 1090, 804, 10953, 187, 45307, 3996, 219, 2383, 1574, 1048, 4914, 362, 451, 76, 1439, 863, 3456, 10265, 16090, 2182, 758, 535, 21686, 91, 483, 1460, 3150, 2842, 462, 2710, 41638, 69300, 23062, 2752, 1435, 10063, 1090714, 80, 1694, 167, 658, 11887, 5548, 1948, 8965, 1304, 6225, 7650, 9069, 15797, 4859, 1472, 2597, 355, 98071, 707, 954, 1753, 1267, 3058, 4630, 310, 3114, 370, 840, 9358, 8391, 3623, 359, 1285, 9690, 643, 544, 1543, 1257, 613, 154, 1353, 5809, 3224, 1427, 3520, 270470, 1255208, 4217, 845, 231, 63084, 10241, 1084, 4606, 363, 931, 2696, 87, 6835, 4078, 4006, 6882, 1575, 1807, 771, 2591, 6230, 963, 2283, 319, 3484, 1639, 14079, 2532, 4250, 258, 632, 171, 786, 113855, 914, 246, 8649, 7399, 9421, 1513, 12805, 2494, 214, 287, 4125, 15974, 11447, 5728, 31880, 4254, 1097, 405, 183, 373, 603, 5700, 8790, 10064, 126, 341, 3347, 5990, 358, 19212, 195, 4427, 3716, 1561, 9795, 532, 1346, 9441, 860, 1582, 2265, 102960, 723, 5046, 4936, 163, 4330, 16412, 93129, 45095, 2754, 9838, 55878, 3580, 7776, 11286, 204276, 8245, 52371, 33358, 1463, 5183, 1565, 6767, 4423, 676, 1449, 496, 16046, 2655, 1417)

val set2 = Set(645, 3021, 4201, 5469, 138, 479, 30638, 1083, 1982, 6074, 3566, 5141, 192684, 276, 6463, 3377, 6830, 873, 1173, 4543, 6401, 1310, 3942, 23251, 4294, 5782, 538, 6877, 670, 404, 4371, 1645, 142252, 5903, 4198, 3140, 53919, 1803, 2654, 24192, 52, 1709, 4348, 1430, 21794, 1298, 3046, 110, 8022, 1132, 3855, 11760, 7835, 853, 189, 20, 93, 284, 2929, 325, 12864, 2158, 3602, 1232, 1847, 6002, 492, 1418, 1383, 2555, 1183, 6204, 2833, 2190, 6740, 11413, 660, 9236, 5489, 116, 6616, 10523, 6, 270, 85, 205787, 17363, 4312, 1868, 1387, 894, 12613, 18820, 302, 137962, 598, 12247, 70, 192, 407, 429, 1370, 3207, 33, 19980, 2433, 879, 1555, 97, 830, 10794, 1836, 8964, 4417, 3028, 1407, 1291, 974, 8238, 6413, 1234, 15761, 2840, 1762, 36729, 15456, 2841, 6375, 1328, 403, 28279, 1878, 3386, 356, 159807, 3237, 989, 857, 471, 256, 4049, 17890, 3407, 3210, 7111, 2830, 8743, 4545, 1214, 547, 3573, 212, 3242, 13081, 6752, 5273, 4384, 1539, 1456, 3059, 42620, 18044, 21146, 3115, 82280, 4902, 1246, 21775, 41, 2140, 3083, 503, 605, 1393, 2749, 128, 1086, 633, 7721, 805, 674, 6038, 4308, 1732, 2235, 750, 1251, 3130, 2292, 4354, 1103, 166, 21919, 64, 7953, 7382, 180, 689, 17, 7118, 149, 1631, 601, 819, 176, 423, 1884, 2657, 5067, 2393, 577, 291, 11729, 1403, 28059, 204, 295471, 641, 391, 3023, 5595, 8059, 3641, 87099, 10505, 1356, 540, 672, 764, 1194, 26171, 1209, 831, 2383, 2624, 1117, 7, 599, 1439, 3456, 494, 2605, 858, 367, 609, 621, 1325, 3298, 8372, 1177, 462, 1820, 78147, 1044, 20660, 223, 2752, 299, 3512, 2005, 1948, 1203, 162, 531, 83773, 2652, 28084, 112, 1145, 754, 1547, 640, 295, 7801, 95, 55485, 43943, 1252, 16579, 4630, 2850, 310, 30352, 5731, 1452, 1637, 1188, 3475, 127, 2219, 359, 656, 1702, 1749, 727, 978, 11349, 34347, 3256, 3156, 14518, 250, 2458, 3401, 99, 363, 21088, 104, 17772, 16575, 10072, 1385, 2174, 319, 1508, 351, 3336, 24365, 6023, 5368, 171, 139, 1523, 5828, 8, 290, 2041, 6118, 207, 2066, 272, 6569, 12567, 4098, 4619, 716, 7093, 1578, 1395, 49950, 31786, 1986, 183, 3337, 790, 14278, 11009, 59324, 242, 4, 294, 1989, 6169, 341, 2186, 4281, 29507, 437, 3716, 897, 767, 635, 955, 395, 839, 1236, 6190, 140568, 1582, 7071, 131, 6452, 200, 1597, 178, 1902, 442, 57287, 607, 3526, 1606, 3580, 1944, 5494, 309, 2059, 6767, 189885, 639, 4863, 2170, 2106, 496, 16046, 1912)

Using the default hasher and 9 bits:

val hll1 = m.sum(set1.map(m.toHLL(_)))
val hll2 = m.sum(set2.map(m.toHLL(_)))
m.intersectionSize(Seq(hll1, hll2))
java.lang.IllegalArgumentException: requirement failed
  at scala.Predef$.require(Predef.scala:207)
  at com.twitter.algebird.Approximate.withMin(Approximate.scala:131)
  at com.twitter.algebird.HyperLogLogMonoid$$anonfun$intersectionSize$2.apply(HyperLogLog.scala:611)
  at com.twitter.algebird.HyperLogLogMonoid$$anonfun$intersectionSize$2.apply(HyperLogLog.scala:611)
  at scala.Option.map(Option.scala:146)
  at com.twitter.algebird.HyperLogLogMonoid.intersectionSize(HyperLogLog.scala:611)
  ... 39 elided

The size approximations illustrate why our intersection algorithm doesn't work: the min value for the union is larger than the sum of max values for the individual HLLs:

hll1.approximateSize
com.twitter.algebird.Approximate[Long] = Approximate(797,925,1054,0.9972)

hll2.approximateSize
com.twitter.algebird.Approximate[Long] = Approximate(651,755,861,0.9972)

m.plus(hll1, hll2).approximateSize
com.twitter.algebird.Approximate[Long] = Approximate(1917,2223,2531,0.9972)
johnynek commented 8 years ago

I don't know why this require is there:

https://github.com/twitter/algebird/blame/develop/algebird-core/src/main/scala/com/twitter/algebird/Approximate.scala#L143

but I wrote it. :/

What is happening is that we have some range: [a, b] and with prob >= p we are inside that range. Now, we say we know it is at least as large as m but m > b, so clearly the event m in [a, b] didn't happen, so we only know something like: [m, m] is the new range with probability > 0.

The HLL intersection algorithm is correct, I think, it is just that the noise is so large we don't know what the true value is.

I think, we want to fix .withMin to just return (m, m, m, 0.0) in this case, which hopefully consumers see as very uninformed.

Any other ideas?