sramirez / spark-MDLP-discretization

Spark implementation of Fayyad's discretizer based on Minimum Description Length Principle (MDLP)
Apache License 2.0

Splits sometimes have excessive precision #13

Closed barrybecker4 closed 8 years ago

barrybecker4 commented 8 years ago

This may be minor, but I sometimes notice that the bin boundaries have many decimals of precision, like 5.550000190734863, when the data itself has very limited precision. For example, the above split was determined for "sepal length" in the traditional iris dataset. The values of sepal length in the data are things like 4.9 and 5.1. None of the values has more than one decimal digit, so why does the split need 15 decimal places?

sramirez commented 8 years ago

I don't think it affects the final result, and some space is saved by storing values in single-precision float rather than double format. Extra decimals could be important in other problems, for example problems where the integer part is always 0 and the decimal part has many relevant digits.
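A minimal sketch of where the long tail of digits can come from, assuming the split is held as a single-precision Float and later widened to a Double for display (the object and method names here are illustrative, not taken from the library):

```scala
object FloatWidening {
  // Widening a Float to a Double keeps the Float's binary value exactly,
  // so the Double's decimal rendering shows digits the Float rendering hid.
  def widen(f: Float): Double = f.toDouble

  def main(args: Array[String]): Unit = {
    val split = 5.55f                  // nearest Float to the decimal 5.55
    println(split)                     // Float rendering of the split
    println(widen(split))              // Double rendering of the same binary value
    println(widen(split) == 5.55)      // false: not the nearest Double to 5.55
    println(widen(split).toFloat == split) // true: widening lost nothing
  }
}
```

Under that assumption, the value reported above, 5.550000190734863, is simply the Float nearest to 5.55 printed with Double precision; no information is added or lost by the widening itself.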

barrybecker4 commented 8 years ago

This is a very minor issue, but since I show these numbers in a UI, it would be nicer for the user if they did not have excessive precision. I haven't looked into it much yet, but I feel that whenever a new split is added, its value should never need more than one significant digit beyond those of the endpoints of the parent bucket. The relevant code is

result = ((lastK, (x + lastX) / 2), accumFreqs.clone) +: result

So if x and lastX are 101.2 and 101.5, respectively, then the split should be 101.35. I will need to do some debugging to see where the extra precision is coming from and whether it is needed.
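One way to get a midpoint with at most one extra decimal place is to do the arithmetic in decimal rather than binary. The following is only a sketch of that idea: DecimalMidpoint and midpoint are hypothetical names, not part of the library, and it assumes the endpoints are Floats whose shortest decimal rendering (what toString prints) reflects the precision of the data.

```scala
object DecimalMidpoint {
  // Hypothetical helper: computes the midpoint in decimal arithmetic and
  // keeps one more decimal place than the more precise of the two endpoints.
  def midpoint(x: Float, lastX: Float): BigDecimal = {
    // Float.toString yields the shortest decimal that identifies the Float,
    // e.g. "101.2" rather than the full binary expansion.
    val a = BigDecimal(x.toString)
    val b = BigDecimal(lastX.toString)
    val scale = math.max(a.scale, b.scale) + 1
    ((a + b) / 2).setScale(scale, BigDecimal.RoundingMode.HALF_EVEN)
  }

  def main(args: Array[String]): Unit = {
    println(midpoint(101.2f, 101.5f))
    println(midpoint(29.9f, 29.8f))
  }
}
```

Halving a decimal sum adds at most one decimal digit, which is why a scale of one more than the endpoints is enough here. BigDecimal arithmetic is much slower than Float arithmetic, so this may only be reasonable when splits are prepared for display.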

barrybecker4 commented 8 years ago

Inspecting the midpoint calculation, I see cases where the midpoint has more decimals than it should, probably because of the difficulty of representing decimal fractions in binary. For example:

x = 29.9 lastX 29.8 mid = 29.849998
x = 31.9 lastX 31.8 mid = 31.849998
x = 28.8 lastX 28.4 mid = 28.599998
x = 6.975 lastX 6.95 mid = 6.9624996
x = 7.225 lastX 7.1417 mid = 7.1833496
x = 7.8542 lastX 7.8292 mid = 7.8416996
x = 7.925 lastX 7.8958 mid = 7.9104004
x = 7.925 lastX 7.8958 mid = 7.9104004
x = 0.75 lastX 0.67 mid = 0.71000004
x = 0.83 lastX 0.75 mid = 0.78999996
x = 1.0 lastX 0.92 mid = 0.96000004
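These values can be reproduced outside Spark. The sketch below assumes x and lastX are single-precision Floats, which the seven-to-eight-digit results suggest; floatMid is an illustrative name that mirrors the (x + lastX) / 2 expression, not a function from the library.

```scala
object FloatMidpointDemo {
  // Same arithmetic as the (x + lastX) / 2 expression, done in Float.
  def floatMid(x: Float, lastX: Float): Float = (x + lastX) / 2

  def main(args: Array[String]): Unit = {
    // A few of the pairs listed above.
    val pairs = Seq((29.9f, 29.8f), (0.75f, 0.67f), (1.0f, 0.92f))
    for ((x, lastX) <- pairs) {
      // Neither endpoint (except values like 0.75 and 1.0) is exactly
      // representable in binary, so the sum and the result are each
      // rounded to a nearby Float.
      println(s"x = $x lastX $lastX mid = ${floatMid(x, lastX)}")
    }
  }
}
```

The results are not wrong in the binary sense: each is a Float adjacent to the exact decimal midpoint. They only look odd because the shortest decimal that identifies that Float is not the short decimal a reader expects.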

barrybecker4 commented 8 years ago

I did some work on this in a branch, but did not open a PR because I could not get it to work the way I wanted. I still think there is a minor issue: splits sometimes come out as values like 1.4999999 when they should be 1.5, for example.
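For the UI case specifically, one possible workaround is to leave the stored splits alone and tidy them only when rendering. This is a sketch under assumptions: SplitDisplay and tidy are hypothetical names, and the default of six significant digits is an arbitrary choice that would hide real digits in data where more of them are relevant.

```scala
import java.math.{BigDecimal => JBigDecimal, MathContext}

object SplitDisplay {
  // Hypothetical display helper: rounds a split to a fixed number of
  // significant digits and drops trailing zeros. Display only; the
  // underlying split value is left unchanged.
  def tidy(split: Float, sigDigits: Int = 6): String =
    new JBigDecimal(split.toString)
      .round(new MathContext(sigDigits)) // HALF_UP rounding by default
      .stripTrailingZeros
      .toPlainString

  def main(args: Array[String]): Unit = {
    println(tidy(1.4999999f))
    println(tidy(29.849998f))
  }
}
```

A Float carries roughly seven significant decimal digits, so rounding to six removes the representation noise in the examples above while keeping most of what the type can express.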