isarn / isarn-sketches

Sketching data structures for scala, including t-digest
Apache License 2.0

C (compression) and M (maxDiscrete) parameters return the same results #16

Open velwu opened 3 years ago

velwu commented 3 years ago

Greetings, Mr. Erlandson (@erikerlandson),

The team I represent recently started using this solution and we've been trying to tune accuracy and performance via different C (compression) and M (maxDiscrete) parameters.

In our case, we had a dataframe called "originDf" with 3 columns ("id": String, "fruit_type": String, "count": Integer):

+------+--------------------+-----------+
|id    |fruit_type          |count      |
+------+--------------------+-----------+
|001   |Apples              |2          |
|002   |Apricots            |79         |
|001   |Avocados            |4          |
|003   |Watermelon          |13         |
|007   |Blueberries         |5          |
|007   |Cherries            |6          |
|007   |Clementine          |41         |
|007   |Cucumbers           |5          |
|007   |Elderberry          |3          |
|007   |Eggfruit            |1          |
|008   |Eggfruit            |19         |
|012   |Clementine          |61         |
|013   |Blueberries         |21         |
|014   |Blueberries         |4          |
|...   |Lime                |4          |
|...   |Rambutan            |3          |
|...   |Strawberries        |6          |
|...   |Watermelon          |5          |
|...   |Tangerine           |3          |
|...   |Tangerine           |6          |
+------+--------------------+-----------+

This dataframe has roughly 0.2 billion rows in total.

We then did the following:

val t_digest_udf1 = TDigestAggregator.udf[Double](compression = C, maxDiscrete = M)
val groupDf1 = originDf.groupBy(col("fruit_type")).agg(t_digest_udf1(col("count")) as "t_digests")
val row1 = groupDf1.first()
val t1 = row1.getAs[TDigest]("t_digests")

where (C, M) was, respectively, (100, 100), (0.01, 100), and (100, 10000), with the resulting single TDigest from each parameter set named t1, t2, and t3:

t1: org.isarnproject.sketches.java.TDigest = TDigest(1.0 -> (22240.0, 22240.0), 2.0 -> (6509.0, 28749.0), 3.0 -> (2936.0, 31685.0), 4.0 -> (1594.0, 33279.0), 5.0 -> (1096.0, 34375.0), 6.0 -> (767.0, 35142.0), 7.0 -> (523.0, 35665.0), 8.0 -> (404.0, 36069.0), 9.0 -> (358.0, 36427.0), 10.0 -> (284.0, 36711.0), 11.0 -> (201.0, 36912.0), 12.0 -> (189.0, 37101.0), 13.0 -> (162.0, 37263.0), 14.0 -> (120.0, 37383.0), 15.0 -> (98.0, 37481.0), 16.0 -> (86.0, 37567.0), 17.0 -> (68.0, 37635.0), 18.0 -> (69.0, 37704.0), 19.0 -> (63.0, 37767.0), 20.0 -> (50.0, 37817.0), 21.0 -> (50.0, 37867.0), 22.0 -> (44.0, 37911.0), 23.0 -> (37.0, 37948.0), 24.0 -> (31.0, 37979.0), 25.0 -> (29.0, 38008.0), 26.0 -> (24.0, 38032.0) ...)

t2: org.isarnproject.sketches.java.TDigest = TDigest(1.0 -> (22240.0, 22240.0), 2.0 -> (6509.0, 28749.0), 3.0 -> (2936.0, 31685.0), 4.0 -> (1594.0, 33279.0), 5.0 -> (1096.0, 34375.0), 6.0 -> (767.0, 35142.0), 7.0 -> (523.0, 35665.0), 8.0 -> (404.0, 36069.0), 9.0 -> (358.0, 36427.0), 10.0 -> (284.0, 36711.0), 11.0 -> (201.0, 36912.0), 12.0 -> (189.0, 37101.0), 13.0 -> (162.0, 37263.0), 14.0 -> (120.0, 37383.0), 15.0 -> (98.0, 37481.0), 16.0 -> (86.0, 37567.0), 17.0 -> (68.0, 37635.0), 18.0 -> (69.0, 37704.0), 19.0 -> (63.0, 37767.0), 20.0 -> (50.0, 37817.0), 21.0 -> (50.0, 37867.0), 22.0 -> (44.0, 37911.0), 23.0 -> (37.0, 37948.0), 24.0 -> (31.0, 37979.0), 25.0 -> (29.0, 38008.0), 26.0 -> (24.0, 38032.0) ...)

t3: org.isarnproject.sketches.java.TDigest = TDigest(1.0 -> (22240.0, 22240.0), 2.0 -> (6509.0, 28749.0), 3.0 -> (2936.0, 31685.0), 4.0 -> (1594.0, 33279.0), 5.0 -> (1096.0, 34375.0), 6.0 -> (767.0, 35142.0), 7.0 -> (523.0, 35665.0), 8.0 -> (404.0, 36069.0), 9.0 -> (358.0, 36427.0), 10.0 -> (284.0, 36711.0), 11.0 -> (201.0, 36912.0), 12.0 -> (189.0, 37101.0), 13.0 -> (162.0, 37263.0), 14.0 -> (120.0, 37383.0), 15.0 -> (98.0, 37481.0), 16.0 -> (86.0, 37567.0), 17.0 -> (68.0, 37635.0), 18.0 -> (69.0, 37704.0), 19.0 -> (63.0, 37767.0), 20.0 -> (50.0, 37817.0), 21.0 -> (50.0, 37867.0), 22.0 -> (44.0, 37911.0), 23.0 -> (37.0, 37948.0), 24.0 -> (31.0, 37979.0), 25.0 -> (29.0, 38008.0), 26.0 -> (24.0, 38032.0) ...)

It turns out t1, t2, and t3 (all instances of the org.isarnproject.sketches.java.TDigest class) look exactly the same value-wise, though Boolean checks such as t1.equals(t2) all return false, which indicates they are nonetheless distinct TDigest objects.

Do you see anything off in this usage? Did we use the TDigest correctly, and more importantly, is our understanding of the algorithm correct?

Thank you and we look forward to your responses.

erikerlandson commented 3 years ago

The maxDiscrete parameter tells the t-digest algorithm to store up to that many unique values exactly, with exact frequencies (so, effectively, a histogram). I can't be sure without examining all the data, but it looks like your count data might not take on more than 100 unique values, and so with M >= 100 in all your parameter cases, I am not surprised it is giving you the same result each time.
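For example (a minimal synthetic sketch, not the reporter's data, assuming the Scala TDigest.sketch signature with delta and maxDiscrete parameters that appears later in this thread):

import scala.util.Random
import org.isarnproject.sketches.TDigest

// 1 million samples but only 50 distinct values -- fewer than maxDiscrete = 100,
// so both digests store every value exactly and compression never takes effect
val data = Vector.fill(1000000) { (1 + Random.nextInt(50)).toDouble }
val tdA = TDigest.sketch(data, delta = 100.0, maxDiscrete = 100)
val tdB = TDigest.sketch(data, delta = 0.01, maxDiscrete = 100)
// tdA and tdB hold identical (value -> mass) entries despite very different deltas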

Regarding the compression C: this corresponds to the delta parameter from the original t-digest paper. The default is 0.5, which is targeted to yield on the order of 100 cluster centers (plus or minus). Any value > 1 is likely too high, and 100 is definitely too high (you probably aren't noticing because max-discrete is preventing it from taking effect). I should probably mention this fact in the documentation.

https://isarn.github.io/isarn-sketches/scala/api/org/isarnproject/sketches/TDigest$.html#deltaDefault:Double

Let me know if you continue having issues. If you want to observe different results, try setting max-discrete to values like 25 or 50, or zero, and varying the compression values but keeping them <= 1.
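A sketch of that experiment (hypothetical parameter pairs, reusing originDf and the column names from the question above, and assuming the isarn-sketches-spark import paths shown here):

import org.apache.spark.sql.functions.col
import org.isarnproject.sketches.java.TDigest
import org.isarnproject.sketches.spark.tdigest.TDigestAggregator

// one digest per (compression, maxDiscrete) pair; with maxDiscrete below the
// number of unique counts, varying the compression should now change the result
val digests = for ((c, m) <- Seq((0.1, 0), (0.5, 25), (1.0, 50))) yield {
  val tdUdf = TDigestAggregator.udf[Double](compression = c, maxDiscrete = m)
  originDf.groupBy(col("fruit_type"))
    .agg(tdUdf(col("count")) as "t_digests")
    .first()
    .getAs[TDigest]("t_digests")
}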

velwu commented 3 years ago

Mr. Erlandson,

Thank you for the quick response and detailed explanation.

We've observed some tiny differences in our TDigests using M = 25 and M = 0, though changing C has not brought about any difference so far.

The aggregated results*, though, remain exactly the same even with extreme values like C = 0.000001 or M = 100000, and they under-estimate the true results. We counted the uniques via some Spark SQL functions: most fruit_type values have anywhere between 100 and 3,000 unique discrete counts. The distributions are extremely right-skewed, so most rows fall somewhere between 0 and 10, with outliers that run into many thousands.

Is there something in the package that particularly excels at dealing with distributions as skewed as this? I think I've heard you mention working on something similar in your Spark Summit talk, but I couldn't grasp the entirety of it.

*We use the cdfInverse() function in this package to obtain a "threshold number", which describes the floor value for the top 30% of consumers of any fruit_type, so the program can give each such consumer a label (if s/he is in the top 30%). For example, if the threshold number returned by cdfInverse() is 60 for Avocados, then whoever has bought more than 60 Avocados is given a label like "Avocado enjoyer". So far, except for 1 label (which was estimated perfectly, with 0.00% difference), these label counts have been consistently under-estimated when aggregated, with most fruit_type missing the mark by -40% to -50% (we have the true label counts to verify). Every combination of C and M parameters gives the exact same estimations.
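For concreteness, the labeling step looks roughly like this (a simplified sketch with hypothetical names; td is the aggregated TDigest for one fruit_type, and consumerCounts maps each consumer id to their count):

// top 30% means the 0.7 quantile: roughly 70% of the mass lies at or below it
val threshold = td.cdfInverse(0.7)
// strict ">" comparison, not ">=", per the definition above
val labeled = consumerCounts.filter { case (_, count) => count > threshold }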

As you may have guessed by now, this process involved a few other table joins and transformations that I have not disclosed here. I could see about revealing some of that, if you think it'll help the explanation.

Thank you in advance.

erikerlandson commented 3 years ago

@velwu can you attach some example data from your use case? A single column of the numeric values you are trying to sketch, which is giving you trouble, would be sufficient for me to get a feel for what the algorithm is actually doing, and for possible solutions.

Speaking in generalities: the t-digest is designed to allocate more resolution at the tails of a distribution (by making cluster sizes smaller at the tails). However, it is also still a sketch, and so if a distribution has an extremely long tail, I am unsure what its behavior would be.
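One quick way to probe that (a synthetic sketch, not this thread's data: a mostly-small sample with a long right tail, checked against exact quantiles from sorting):

import scala.util.Random
import org.isarnproject.sketches.TDigest

// mostly values in 1-10, with rare outliers running into the thousands
val data = Vector.fill(1000000) { math.max(1.0, math.exp(2.5 * Random.nextGaussian())).floor }
val td = TDigest.sketch(data)
val sorted = data.sorted
for (q <- Seq(0.5, 0.9, 0.99, 0.999)) {
  val exact = sorted((q * sorted.length).toInt)
  println(s"q=$q exact=$exact sketch=${td.cdfInverse(q)}")
}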

velwu commented 3 years ago

Mr. Erlandson (@erikerlandson),

Yes, thank you. Here is some of that example data. It was generated using C (compression) = 0.5 and M (maxDiscrete) = 25.

top30percent_num is a "threshold number", meaning that 30% of the population is strictly greater than (a ">" comparison, not ">=") the value 4.0

centroids is an array obtained using TDigest.getCentUnsafe.mkString("Array(", ", ", ")")

masses is an array obtained using TDigest.getMassUnsafe.mkString("Array(", ", ", ")")

===== DATA BELOW =====

fruit_type = Tayberry

top30percent_num = 4.0

centroids = Array(1.0, 2.0, 3.0, 4.0, 5.0, 6.000073393521917, 7.000145750045635, 8.001188117725917, 9.003333396428566, 10.008484086672457, 11.035982725595119, 12.436242825159047, 13.935279735232925, 15.098994579197218, 16.511259867781114, 18.260182142253313, 19.86941640294012, 21.362021349000482, 22.841753038589435, 24.317914199903715, 25.96390094634534, 27.983366298230738, 30.489735175798966, 33.51140603472986, 37.2055873257278, 41.44547946599917, 46.304810343084505, 51.57747302397325, 56.94515134503221, 62.41283791796801, 68.20791010944029, 74.6865420828331, 82.69414647439031, 92.770435998285, 104.70655644456826, 119.0220386505123, 136.4907987446255, 159.05861224356724, 186.68287985110416, 217.53985402457968, 247.63889721835227, 274.4762959357546, 299.3535887854102, 324.0647354917961, 350.8301681602125, 382.09159905518254, 411.7050695585377, 439.39454418428744, 467.6237175057026, 499.6498284103393, 537.3330704370983, 579.8583082086442, 632.3023366378322, 693.3623139333414, 751.1584889780127, 801.4966764041216, 845.7856180034878, 891.0909090909093, 939.8145365243353, 998.8138400744841, 1066.5521025124158, 1131.6498041132077, 1198.63848992374, 1254.5925924620633, 1303.8717434385364, 1360.8831857872362, 1416.3451150421936, 1480.1245331341975, 1533.5714285714287, 1566.9512172885884, 1599.0512769215163, 1661.2500000000002, 1744.473668569509, 1787.3076868071576, 1822.0, 1861.6666619188447, 1881.999986646775, 1928.0, 1974.0, 2064.0, 2070.4000021711554, 2178.666659149131, 2219.99998081348, 2282.0, 2293.4285750385197, 2304.0, 2421.0, 3396.0)

masses = Array(5398281.0, 3633088.0, 2460545.0, 1689881.0, 1179151.0, 842740.8517191977, 610634.4571238874, 453918.6911569149, 346840.2754764726, 271727.31541708275, 222724.34757794722, 290800.7335715829, 138476.42502237728, 119751.83242871228, 140158.8590997329, 117470.43421967147, 77971.99400099477, 71570.91652402612, 55204.863267924055, 53772.480326603574, 51903.72173838791, 58408.46021909827, 53086.57077887626, 54061.016527306994, 47315.581558571066, 40205.249888977414, 34011.498867577844, 25979.426857923092, 19550.889272686665, 16287.614592011409, 13231.863246870456, 12709.078315434372, 12420.159968650645, 11807.999980939347, 10246.373476650368, 9220.554967032673, 8640.146864860595, 7896.745173329689, 6409.272525296329, 4516.915520130643, 3173.5667872684385, 2161.224538285732, 1699.762085969193, 1497.8649416683236, 1483.7521192429406, 1347.542964248235, 987.6982021801848, 826.1687999458092, 778.5582398326427, 796.893712157504, 752.4167177490278, 717.6838723017004, 682.2497454108154, 528.4997272212621, 356.5002727787379, 186.75017624016272, 149.24982375983728, 132.0, 130.74996542265004, 126.25003457734996, 88.749984826661, 64.250015173339, 53.24999256593924, 33.75000743406076, 26.31250730264361, 21.93748591423703, 21.187496363307265, 17.562510419812092, 7.0, 10.249997731867541, 9.750002268132459, 8.0, 4.749997848168963, 3.250002151831037, 3.0, 2.9999973293525932, 3.0000026706474068, 2.0, 3.0, 0.7500004240536535, 1.2499995759463465, 2.999998990182134, 1.0000010098178662, 0.2500005527731277, 1.7499994472268723, 1.0, 1.0, 1.0)

erikerlandson commented 3 years ago

Thanks @velwu - can you attach a file with the raw count values? If it is too large to attach here, email me at eje@redhat.com.

erikerlandson commented 3 years ago

I ran a sketch of the "tayberry" data, with the following results, which I'd consider a good sketch of the data. cdfInverse assumes a linear approximation to the CDF, and so it will usually not map to any exact integer x-value. cdfDiscreteInverse agrees well with the value computed directly from the sorted data.

scala> val data = io.Source.fromFile("/home/eje/tayberry_raw.csv").getLines.toVector.map(_.toInt)
data: scala.collection.immutable.Vector[Int] = Vector(1, 5, 2, 2, ...

scala> import org.isarnproject.sketches.TDigest
import org.isarnproject.sketches.TDigest

scala> val td = TDigest.sketch(data)
td: org.isarnproject.sketches.TDigest = TDigest(0.5,0,63,TDigestMap(1.0 -> (5644016.0, 5644016.0), 2.0 -> (3410123.0, 9054139.0), 3.0 -> (2086517.0, 1.1140656E7), 4.0 -> (1314806.0, 1.2455462E7), ...

scala> val datasort = data.sortWith((a, b)=>(a < b))
datasort: scala.collection.immutable.Vector[Int] = Vector(1, 1, 1,...

scala> def quantiles(td:TDigest, sorted: IndexedSeq[Int], q: Double) = (sorted((q * sorted.length).toInt), td.cdfInverse(q), td.cdfDiscreteInverse(q))
quantiles: (td: org.isarnproject.sketches.TDigest, sorted: IndexedSeq[Int], q: Double)(Int, Double, Double)

scala> quantiles(td, datasort, 0.1)
res8: (Int, Double, Double) = (1, 1.2197759786857603, 1.0)

scala> quantiles(td, datasort, 0.3)
res9: (Int, Double, Double) = (1, 1.6593279360572806, 1.0)

scala> quantiles(td, datasort, 0.5)
res10: (Int, Double, Double) = (2, 2.0, 2.0)

scala> quantiles(td, datasort, 0.7)
res11: (Int, Double, Double) = (4, 3.710698395888893, 4.0)

scala> quantiles(td, datasort, 0.9)
res12: (Int, Double, Double) = (8, 8.0, 8.0)

velwu commented 3 years ago

Then it would seem this has been a case of identifying the right function for the job all along. The discrete variant cdfDiscreteInverse() returns exactly the same values as the ones computed from the sorted data, which is encouraging.
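In other words, the fix for the labeling use case above may be as simple as this hypothetical one-line change, reusing the digest and quantile from earlier in the thread:

val threshold = t1.cdfDiscreteInverse(0.7) // instead of t1.cdfInverse(0.7)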

Thank you. I will notify my team about this and see what improvements our use case gains from it. We will follow up here with another comment soon.