probcomp / BayesDB

A Bayesian database table for querying the probable implications of data as easily as SQL databases query the data itself. New implementation in http://github.com/probcomp/bayeslite
http://probcomp.csail.mit.edu/software/bayesdb/
Apache License 2.0

KeyError when running INFER command #11

Closed huroh closed 10 years ago

huroh commented 10 years ago

After importing samples to a table, and running the INFER function to predict a column of omitted values, I always get this inscrutable 'KeyError':

INFER completed FROM test WITH CONFIDENCE 0.9 LIMIT 20

Traceback (most recent call last):
  File "run_script.py", line 27, in <module>
    run_example()
  File "run_script.py", line 12, in run_example
    client(open(file_path, 'r'), wait=True)
  File "/home/bayesdb/bayesdb/bayesdb/client.py", line 59, in __call__
    return self.execute(call_input, pretty, timing, wait, plots)
  File "/home/bayesdb/bayesdb/bayesdb/client.py", line 82, in execute
    result = self.execute_line(line, pretty, timing)
  File "/home/bayesdb/bayesdb/bayesdb/client.py", line 104, in execute_line
    result = self.call_bayesdb_engine(method_name, args_dict)
  File "/home/bayesdb/bayesdb/bayesdb/client.py", line 55, in call_bayesdb_engine
    out = method(*args)
  File "/home/bayesdb/bayesdb/bayesdb/engine.py", line 259, in infer
    imputations_list = [(r, c, du.convert_code_to_value(M_c, c, code)) for r,c,code in ret]
  File "/home/bayesdb/crosscat/crosscat/utils/data_utils.py", line 281, in convert_code_to_value
    return M_c['column_metadata'][cidx]['value_to_code'][str(int(code))]
KeyError: '26'

Relevant files: https://www.dropbox.com/sh/ulxptvzwy3n0wtg/Z2nWz5TiEm/bayesdb

huroh commented 10 years ago

I get a similar KeyError

INFER * FROM training WITH CONFIDENCE 0.9 LIMIT 20

Traceback (most recent call last):
  File "run_script.py", line 27, in <module>
    run_example()
  File "run_script.py", line 12, in run_example
    client(open(file_path, 'r'), wait=True)
  File "/home/bayesdb/bayesdb/bayesdb/client.py", line 59, in __call__
    return self.execute(call_input, pretty, timing, wait, plots)
  File "/home/bayesdb/bayesdb/bayesdb/client.py", line 82, in execute
    result = self.execute_line(line, pretty, timing)
  File "/home/bayesdb/bayesdb/bayesdb/client.py", line 104, in execute_line
    result = self.call_bayesdb_engine(method_name, args_dict)
  File "/home/bayesdb/bayesdb/bayesdb/client.py", line 55, in call_bayesdb_engine
    out = method(*args)
  File "/home/bayesdb/bayesdb/bayesdb/engine.py", line 250, in infer
    out = self.backend.impute_and_confidence(M_c, X_L_list, X_D_list, Y, [q], numsamples)
  File "/home/bayesdb/crosscat/crosscat/LocalEngine.py", line 382, in impute_and_confidence
    e,confidence = su.impute_and_confidence(M_c, X_L, X_D, Y, Q, n, self.get_next_seed)
  File "/home/bayesdb/crosscat/crosscat/utils/sample_utils.py", line 677, in impute_and_confidence
    return_samples=True)
  File "/home/bayesdb/crosscat/crosscat/utils/sample_utils.py", line 626, in impute
    get_next_seed, n)
  File "/home/bayesdb/crosscat/crosscat/utils/sample_utils.py", line 264, in simple_predictive_sample_multistate
    get_next_seed, this_n)
  File "/home/bayesdb/crosscat/crosscat/utils/sample_utils.py", line 233, in simple_predictive_sample
    M_c, X_L, X_D, Y, query_row, query_columns, get_next_seed, n)
  File "/home/bayesdb/crosscat/crosscat/utils/sample_utils.py", line 283, in simple_predictive_sample_observed
    which_cluster)
  File "/home/bayesdb/crosscat/crosscat/utils/sample_utils.py", line 515, in create_cluster_model_from_X_L
    zipped_column_info, row_partition_model, cluster_idx
  File "/home/bayesdb/crosscat/crosscat/utils/sample_utils.py", line 374, in create_cluster_model
    column_metadata, column_hypers, cluster_component_suffstats)
  File "/home/bayesdb/crosscat/crosscat/utils/sample_utils.py", line 356, in create_component_model
    count = suffstats.pop('N')
KeyError: 'N'

when running:

DROP BTABLE training;
CREATE BTABLE training FROM data_selectedCols_trainingAndTest_outcome_subset.csv;
CREATE 10 MODELS FOR training;
INFER * FROM training WITH CONFIDENCE 0.9 LIMIT 20;

on this file https://dl.dropboxusercontent.com/u/68514/bayesdb/data_selectedCols_trainingAndTest_outcome_subset.csv

jbaxter commented 10 years ago

Thanks for the bug report, Hubert. We are aware of both of these issues and will have fixes in the next release. Sorry about that! In the meantime, a hacky fix for the first one is to remove the str(...) call on line 281 of data_utils.py, so that the line reads: return M_c['column_metadata'][cidx]['value_to_code'][int(code)]. For the second, you may have to wait until the next release (which should be relatively soon).
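For anyone wondering why dropping str(...) would help, here is a minimal sketch of the first failure. It assumes the lookup table is keyed by integer codes. That is only inferred from the workaround above, not checked against the crosscat source. The M_c dictionary, its contents, and the helper function names are hypothetical stand-ins for illustration.

```python
# Hypothetical, heavily simplified stand-in for CrossCat's M_c metadata.
# Assumption: the lookup table is keyed by integer codes, not strings.
M_c = {
    "column_metadata": [
        {"value_to_code": {0: "low", 26: "high"}},
    ]
}


def convert_with_str_key(M_c, cidx, code):
    # Mirrors the line shown in the traceback: the key is a string like '26'.
    return M_c["column_metadata"][cidx]["value_to_code"][str(int(code))]


def convert_with_int_key(M_c, cidx, code):
    # Mirrors the suggested workaround: the key is an integer like 26.
    return M_c["column_metadata"][cidx]["value_to_code"][int(code)]


def raises_key_error(func, M_c, cidx, code):
    # Helper for this sketch only: report whether the lookup raises KeyError.
    try:
        func(M_c, cidx, code)
    except KeyError:
        return True
    return False


if __name__ == "__main__":
    print(raises_key_error(convert_with_str_key, M_c, 0, 26.0))
    print(raises_key_error(convert_with_int_key, M_c, 0, 26.0))
    print(convert_with_int_key(M_c, 0, 26.0))
```

In Python, '26' and 26 are different dictionary keys, so a table built with one type cannot be read with the other. If the real table turns out to be keyed by strings for some columns, this workaround would not apply to them.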

huroh commented 10 years ago

Thanks - looking forward to the next release! What's the difference between the two errors? In the first example I am trying to INFER on a numerical variable after having imported samples. In the second, I am trying to INFER on a categorical variable, without any import.

jbaxter commented 10 years ago

So, without running ANALYZE or importing samples, the results for INFER won't make sense. But bugs like this shouldn't depend on whether you've run ANALYZE or imported samples. If you find a case where running ANALYZE or importing samples causes a bug like this to appear or disappear, please report it.

The difference that's causing the bug is just INFERing on categorical data, I believe.

huroh commented 10 years ago

Ah ha - thanks - I managed to miss that key point :-)

jbaxter commented 10 years ago

Fix is in the postgresdev branch -- will be merged to master when the release occurs.