probcomp / bayeslite

BayesDB on SQLite. A Bayesian database table for querying the probable implications of data as easily as SQL databases query the data itself.
http://probcomp.csail.mit.edu/software/bayesdb
Apache License 2.0

CGPM_Metamodel should casefold categorical conditions #508

Open fsaad opened 7 years ago

fsaad commented 7 years ago

Consider a query that uses lower-case 'meo'. The value is converted to NULL, so the query returns probability 1:

```
bdb = bayeslite.bayesdb_open('satellites.2048.bdb')
query(bdb, '''
    ESTIMATE PROBABILITY OF class_of_orbit = 'meo'
            GIVEN (period_minutes=850)
        WITHIN satellites_p;
''')
Out[1]: 
   bql_pdf_joint(1, NULL, 5, 'meo', NULL, 10, 850)
0                                              1.0
```

Compare with upper-case 'MEO', which is converted to the correct small integer code:

```
bdb = bayeslite.bayesdb_open('satellites.2048.bdb')
query(bdb, '''
    ESTIMATE PROBABILITY OF class_of_orbit = 'MEO'
            GIVEN (period_minutes=850)
        WITHIN satellites_p;
''')
Out[2]: 
   bql_pdf_joint(1, NULL, 5, 'MEO', NULL, 10, 850)
0                                         0.815711
```
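
The difference between the two queries can be illustrated with a minimal sketch of a category-to-code lookup. This is not the actual bayeslite code: the `CODES` mapping and both helper functions are hypothetical, and the real codes live in the `bayesdb_cgpm_category` table and may differ. It only shows how an exact-match lookup misses 'meo' while a casefolded lookup finds it.

```python
# Hypothetical category-to-code map for class_of_orbit (illustration only;
# the real integer codes are stored by bayeslite and may differ).
CODES = {'GEO': 0, 'LEO': 1, 'MEO': 2, 'Elliptical': 3}

# Same map, keyed by the casefolded category name.
FOLDED_CODES = {name.casefold(): code for name, code in CODES.items()}


def exact_code(value):
    """Case-sensitive lookup: returns None when the case does not match."""
    return CODES.get(value)


def casefold_code(value):
    """Case-insensitive lookup: casefold the query value before matching."""
    return FOLDED_CODES.get(value.casefold())


print(exact_code('meo'))     # None: the category is not recognized
print(casefold_code('meo'))  # 2: same code as 'MEO'
```
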
fsaad commented 7 years ago

Updating the schema of bayesdb_cgpm_category

https://github.com/probcomp/bayeslite/blob/master/src/metamodels/cgpm_metamodel.py#L58-L65

to use value TEXT COLLATE NOCASE NOT NULL will likely solve the issue.
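
As a rough check of the proposed schema change, the sketch below uses Python's standard `sqlite3` module with a toy table. The table name `category`, its columns, and the helper `lookup_code` are assumptions made for illustration and are not the real `bayesdb_cgpm_category` schema. It compares a lookup of 'meo' with and without `COLLATE NOCASE` on the `value` column.

```python
import sqlite3


def lookup_code(collation_clause, value):
    """Build a toy category table and return the code stored for `value`.

    `collation_clause` is inserted after the TEXT type of the `value`
    column, e.g. '' (default BINARY collation) or 'COLLATE NOCASE'.
    """
    conn = sqlite3.connect(':memory:')
    # Hypothetical, simplified schema (not the real bayeslite table).
    conn.execute('''
        CREATE TABLE category (
            colno INTEGER NOT NULL,
            value TEXT %s NOT NULL,
            code INTEGER NOT NULL,
            PRIMARY KEY (colno, value)
        )
    ''' % (collation_clause,))
    conn.executemany(
        'INSERT INTO category (colno, value, code) VALUES (?, ?, ?)',
        [(5, 'GEO', 0), (5, 'LEO', 1), (5, 'MEO', 2)])
    # The comparison uses the collation declared on the `value` column.
    row = conn.execute(
        'SELECT code FROM category WHERE colno = ? AND value = ?',
        (5, value)).fetchone()
    conn.close()
    return None if row is None else row[0]


print(lookup_code('', 'meo'))                # None
print(lookup_code('COLLATE NOCASE', 'meo'))  # 2
```

One side effect to keep in mind: with `NOCASE` on a column that is part of the primary key, values that differ only in ASCII case (such as 'MEO' and 'meo') can no longer coexist as distinct categories in the same column.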

git blame shows that @riastradh-probcomp is the author of the schema. Perhaps he can weigh in on why COLLATE NOCASE was not used, and similarly for the categorical code map for crosscat.py in

https://github.com/probcomp/bayeslite/blob/master/src/metamodels/crosscat.py#L78-L87.