probcomp / bayeslite

BayesDB on SQLite. A Bayesian database table for querying the probable implications of data as easily as SQL databases query the data itself.
http://probcomp.csail.mit.edu/software/bayesdb
Apache License 2.0

Stattype may be guessed to be 'key' incorrectly #475

Closed curlette closed 7 years ago

curlette commented 8 years ago

Stattype guessing in guess.py first searches for a key, where a key is defined as the first column that has no None or NaN values and for which len(col) == len(unique values in col). In tables that don't begin with a rowid column, this method seems very likely to misclassify a column of unique floats as a key. I think that could be improved by also checking whether the values have non-zero decimal places (e.g. 3.23, -0.34), which would probably indicate they aren't keys even if they're all unique.
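To make the proposal concrete, here is a minimal sketch of the heuristic as described above, with the extra fractional-part check behind a flag. It is not the code in guess.py: the names is_missing, has_fractional_part, looks_like_key and first_key_column are hypothetical, and the table is assumed to be a list of (name, values) pairs in column order.

```python
import math


def is_missing(value):
    """Return True for None or a float NaN."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def has_fractional_part(value):
    """Return True for floats such as 3.23 or -0.34, False for 3.0 or 7."""
    return isinstance(value, float) and not value.is_integer()


def looks_like_key(column, reject_fractional=True):
    """Hypothetical key test: non-empty, no missing values, all unique.

    With reject_fractional=True, a column containing any float with a
    non-zero fractional part is also rejected (the proposed improvement).
    """
    if not column:
        return False
    if any(is_missing(v) for v in column):
        return False
    if len(set(column)) != len(column):
        return False
    if reject_fractional and any(has_fractional_part(v) for v in column):
        return False
    return True


def first_key_column(columns, reject_fractional=True):
    """Return the name of the first column that passes looks_like_key.

    `columns` is assumed to be a list of (name, values) pairs in table order.
    """
    for name, values in columns:
        if looks_like_key(values, reject_fractional):
            return name
    return None


# A table with no leading rowid column: unique floats come first.
table = [("height", [1.72, 1.65, 1.80]), ("id", [1, 2, 3])]
without_check = first_key_column(table, reject_fractional=False)
with_check = first_key_column(table)
```

Without the extra check the float column is picked as the key; with it, the search moves on to the integer column.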

@leocasarsa

leocasarsa commented 8 years ago

This might not be solved for %mml GUESS SCHEMA FOR foo. See the example in the fourth cell of this notebook: https://probcomp-1.csail.mit.edu:9090/notebooks/casarsa_notebooks/krns_analysis_test_run.ipynb

fsaad commented 8 years ago

@leocasarsa this is a genuine bug, not an installation issue.

@curlette There are two issues causing the bug -- one introduced by Fix #475, which is being masked by a bug in guesser_wrapper. Can you please look into this? The test case is in this notebook:

http://probcomp-1.csail.mit.edu:8888/notebooks/krns_fmri_analysis_BUG_REPRODUCE.ipynb

You can copy the .csv file into your own development environment and explore further.

curlette commented 7 years ago

Fixed in b643c1d0422eb027eb41baa45118a45b60c627bb

fsaad commented 7 years ago

It seems that, after b643c1d, keyable_p([1, 2.13]) returns True, even though the presence of a float should ensure that the column is not keyable. Reopening for further exploration.
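The expected behavior can be written down as a small check. This is only a sketch: keyable_strict is a hypothetical stand-in, not bayeslite's keyable_p, and it assumes the rule stated above, that any float value at all disqualifies a column.

```python
def keyable_strict(column):
    """Hypothetical stricter key test: any float value rejects the column."""
    values = list(column)
    # An empty column or one with missing values cannot be a key.
    if not values or any(v is None for v in values):
        return False
    # Assumed rule: a single float (including NaN) makes the column non-keyable.
    if any(isinstance(v, float) for v in values):
        return False
    # Otherwise the column is keyable only if every value is unique.
    return len(set(values)) == len(values)


all_ints = keyable_strict([1, 2, 3])
mixed = keyable_strict([1, 2.13])
```

Under this rule the mixed column [1, 2.13] is rejected because of the single float, while [1, 2, 3] is still accepted.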