probcomp / bayeslite

BayesDB on SQLite. A Bayesian database table for querying the probable implications of data as easily as SQL databases query the data itself.
http://probcomp.csail.mit.edu/software/bayesdb
Apache License 2.0

guess IGNORE on large categoricals #113

Closed. riastradh-probcomp closed this issue 8 years ago.

riastradh-probcomp commented 9 years ago

Detecting categoricals that have too many distinct values to be useful and warning or erroring on them?

tibbetts commented 9 years ago

@gregory-marton might say we should call those "free text" and model with trigrams rather than categoricals. Maybe teach the "guess" heuristic to automatically ignore a column with too high a cardinality?

fsaad commented 9 years ago

It is also worth figuring out why crosscat operates slowly on categorical data (the dirichlet-discrete component model). I will look into this.

riastradh-probcomp commented 9 years ago

The amount of space required by Crosscat for a categorical column of n values, with m models, having an average of c clusters in the view covering the category, is O(n m c).

We could possibly make it sparser, so that for a categorical column of n values, with m models, having an average of c clusters in the view covering the category /and/ an average of s values represented per cluster, the amount of space required is O(n + m c s).

In fact, I did that back in April: commit crosscat@2c4d8fc62663858516a76e9a6c0640830c3e5aad. But it turned out to slow things down too much, so I reverted it in commit crosscat@d25ad140ea303f7b732d471e05ce44c00c7cae8e. I don't remember the measurement that led to my reverting it -- probably @BaxterEaves just hollered across the room that my commit turned Crosscat into molasses.

So it may be worthwhile to reconsider the sparse representation, perhaps with a heuristic about the relative values of n and s.
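For concreteness, a minimal sketch of the two storage schemes (hypothetical Python, not the actual crosscat suffstats code): the dense scheme allocates a counter for every one of the n values in every cluster, while the sparse scheme only stores the s values a cluster has actually observed.

```python
# Hypothetical illustration of storage for one dirichlet-discrete
# cluster's sufficient statistics; values are assumed to be coded as
# integers in [0, n).

def dense_suffstats(n_values):
    # Dense: one counter per possible value, even if the cluster never
    # sees most of them.  Space is O(n) per cluster, hence O(n m c)
    # across m models with an average of c clusters each.
    return [0] * n_values

def sparse_suffstats():
    # Sparse: only observed values get an entry.  With an average of s
    # distinct values per cluster, space is O(m c s) for the clusters
    # plus O(n) for the value codebook.
    return {}

def observe(suffstats, value):
    # Increment the count for `value` in either representation.
    if isinstance(suffstats, dict):
        suffstats[value] = suffstats.get(value, 0) + 1
    else:
        suffstats[value] += 1
```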

This is independent of any heuristics about free text or about the sparseness of the categorical column overall.

gregory-marton commented 9 years ago

ngrams are a poor model of most anything textual; they're just a better model than anything else we seem to have. :-)

More seriously, I would try to cluster these empirically (using the values in other cells) with a Dirichlet prior on the number of clusters, substitute the cluster for the categorical item itself, and be ready to report the clustering as an interesting output of our process. This gets interesting when there are multiple such columns, because then the clustering has to be done simultaneously/jointly. I'd love to eventually add text-specific similarity measures to the model: semantic vector-space similarity a la word2vec, phonetic similarity, typographic similarity (for OCR), text-entry similarity (for mistyping and damn-you-autocorrect situations), and perhaps ngrams as a proxy for word order mattering, though something like a CRF-LDA strikes me as more sensible there. But I'd like to have a generic solution for high-cardinality categoricals before implementing text-specific solutions.
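One rough way to prototype that idea, assuming the data sit in a pandas DataFrame and using scikit-learn's Dirichlet-process mixture as a stand-in for the Dirichlet prior on the number of clusters (none of this is bayeslite code; the names and columns are illustrative):

```python
# Prototype: summarize each category by the other columns' values,
# cluster the summaries with a Dirichlet-process mixture, and map each
# raw category to its inferred cluster.
import pandas as pd
from sklearn.mixture import BayesianGaussianMixture

def cluster_categories(df, catcol, numeric_cols, max_clusters=20):
    # Profile each category by the means of the numeric columns.
    profiles = df.groupby(catcol)[numeric_cols].mean()
    dpmm = BayesianGaussianMixture(
        n_components=max_clusters,
        weight_concentration_prior_type='dirichlet_process')
    labels = dpmm.fit_predict(profiles.values)
    return dict(zip(profiles.index, labels))

# Usage (hypothetical columns): substitute the cluster for the category.
# mapping = cluster_categories(df, 'zipcode', ['income', 'age'])
# df['zipcode_cluster'] = df['zipcode'].map(mapping)
```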

tibbetts commented 9 years ago

Ignore the digression from Grem and me. :)

I think we should teach GUESS to measure cardinality and say "IGNORE" for cardinality above 1000.
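Something along these lines, say (a sketch only, not the actual src/guess.py code; the function name and the threshold of 1000 are just the suggestion above made concrete):

```python
# Sketch of the proposed heuristic: a column whose distinct-value count
# exceeds the limit should be guessed as IGNORE rather than categorical.

def guess_ignore_large_categorical(values, max_cardinality=1000):
    """Return True if the column has too many distinct values to be
    usefully modeled as a categorical."""
    distinct = set(v for v in values if v is not None)
    return len(distinct) > max_cardinality
```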

marcoct commented 9 years ago

This is probably a digression, but it might be informative to think about the transformation to indicator variables, one for each possible category. This of course weakens the model by removing constraints, which could possibly be reintroduced with some other enforcement mechanism, but I'm not sure how the transformation would scale on the current implementation. I'm not suggesting this, but it may be relevant to keep in mind when addressing this issue.
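For illustration only (not a proposal for the bayeslite implementation), the transformation itself is just one-hot expansion, e.g. with pandas:

```python
# Indicator-variable transformation: each possible category becomes its
# own 0/1 column, so a column with k distinct values becomes k columns.
import pandas as pd

df = pd.DataFrame({'color': ['red', 'green', 'red', 'blue']})
indicators = pd.get_dummies(df['color'], prefix='color')
# indicators has columns color_blue, color_green, color_red.  A model
# that treats these columns independently no longer enforces the
# "exactly one category per row" constraint, which is the weakening
# mentioned above.
```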

gregory-marton commented 8 years ago

Can we separate the design discussions here from the task of actually guessing IGNORE on large categoricals, implement that ASAP, and then come back later to what to do with specific interesting kinds of large categoricals (as described above)?

The relevant code is in https://github.com/probcomp/bayeslite/blob/master/src/guess.py

gregory-marton commented 8 years ago

@raxraxraxraxrax, can you separate some of the thoughts here into a new document, as a researchy question for after the analysis release?

riastradh-probcomp commented 8 years ago

I started a draft of this a couple months ago on the 20150919-riastradh-guesscatlimit branch:

https://github.com/probcomp/bayeslite/tree/20150919-riastradh-guesscatlimit

gregory-marton commented 8 years ago

Do you recall why it remains in an unresolved state? Did you get busy with other things, or was there a problem with the draft?

vkmvkmvkmvkm commented 8 years ago

We definitely want a "data transform" layer that automates standard (re)encodings like this, by modifying the statistical types. E.g. instead of "categorical" we might have a subtype "categorical via indicators" or similar.
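A hypothetical sketch of what such a layer could look like (none of these names exist in bayeslite; this only illustrates keying a "categorical via indicators" subtype to an encoder):

```python
# Hypothetical "data transform" layer: statistical subtypes map to
# encoding functions that rewrite a column before modeling.

def encode_identity(values):
    return list(values)

def encode_indicators(values):
    # "categorical via indicators": one 0/1 column per distinct value.
    levels = sorted(set(values))
    return [[1 if v == level else 0 for level in levels] for v in values]

TRANSFORMS = {
    'categorical': encode_identity,
    'categorical_via_indicators': encode_indicators,
}

def apply_stattype(stattype, values):
    return TRANSFORMS[stattype](values)
```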

riastradh-probcomp commented 8 years ago

@gregory-marton: I don't remember why it remains unresolved. I probably just got busy. It probably doesn't have tests and the number 1000 pulled out of my arse could probably stand some more thought than I gave it.

gregory-marton commented 8 years ago

https://github.com/probcomp/bayeslite/commit/8b87fd10cbb7bbb85badbfb15e88ab470757aa81

I chose "large" to mean that more than a given ratio of the values are distinct. If there are 2000 categories but also 10000 examples of each, then we may be able to glean quite a lot from the column, and not fall down this way, right?
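In sketch form (illustrative only; the real change is in the commit linked above, and the 0.5 threshold here is made up):

```python
# Ratio-based criterion: a column counts as a "large" categorical only
# when a high fraction of its observed values are distinct, so 2000
# categories with 10000 examples each would still be modeled.

def is_large_categorical(values, distinct_ratio=0.5):
    values = [v for v in values if v is not None]
    if not values:
        return False
    return len(set(values)) / float(len(values)) > distinct_ratio
```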