Bad guess for numerical data with one very common value

probcomp / bayeslite

BayesDB on SQLite. A Bayesian database table for querying the probable implications of data as easily as SQL databases query the data itself.

http://probcomp.csail.mit.edu/software/bayesdb

Apache License 2.0

Bad guess for numerical data with one very common value #507

Open apuranik1 opened 7 years ago

apuranik1 commented 7 years ago

I have a dataset of about 18000 rows. For a particular column, just over half of the values are 0 (the data measures solar irradiance, and these are nighttime measurements). The remaining values are integers distributed between 8 and 154. The most commonly repeated nonzero values appear about 150 times each. The column's data has around three prominent modes and reasonable-looking tails. Bayeslite is guessing NOMINAL for this column instead of NUMERICAL.
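To make the report easier to reproduce without the original data, here is a minimal sketch that builds a synthetic column of roughly the shape described above and runs it through a simple distinct-value-ratio rule. Both `make_column` and `guess_by_distinct_ratio`, along with the cutoffs 20 and 0.02, are hypothetical names and values chosen for illustration; this is not confirmed to be the rule bayeslite applies. The nonzero values are drawn uniformly, so the three modes of the real data are not reproduced.

```python
import random
from collections import Counter


def make_column(n_rows=18000, zero_fraction=0.51, seed=0):
    """Build a synthetic column: just over half zeros, the rest integers in [8, 154].

    Simplification: nonzero values are uniform, unlike the multimodal real data.
    """
    rng = random.Random(seed)
    n_zero = int(n_rows * zero_fraction)
    nonzero = [rng.randint(8, 154) for _ in range(n_rows - n_zero)]
    return [0] * n_zero + nonzero


def guess_by_distinct_ratio(values, count_cutoff=20, ratio_cutoff=0.02):
    """Hypothetical heuristic, NOT taken from bayeslite.

    Treat a column as nominal when it has few distinct values, either in
    absolute terms or relative to the number of rows.
    """
    distinct = len(set(values))
    if distinct <= count_cutoff:
        return "nominal"
    if distinct / len(values) <= ratio_cutoff:
        return "nominal"
    return "numerical"


column = make_column()
counts = Counter(column)
guess = guess_by_distinct_ratio(column)

print("rows:", len(column))
print("zeros:", counts[0])
print("distinct values:", len(counts))
print("guess:", guess)
```

Under these assumed cutoffs, a column with at most 148 distinct values across 18000 rows has a distinct-value ratio below 1%, so a rule of this kind would label it nominal even though the values are clearly ordered integers. If the actual heuristic works along similar lines, the row count and the integer-valued measurements would matter more than the large share of zeros.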

curlette commented 7 years ago

Hi @apuranik1, could you send me this dataset when you get a chance? I'd like to look into which part of the current stattype guessing heuristics caused this.