logv / snorkel

UI for interactive data analysis | https://snorkel.logv.org
https://fb.com/groups/snorkelsnorkelsnorkel

Adding new backend #39

Open smoreinis opened 6 years ago

smoreinis commented 6 years ago

The docs say "As long as a DB supports filtering and 3 main query types: raw samples, group by string and group by integer, snorkel can run queries against it (snorkel just needs to be shown how)."

This sounds pretty promising for e.g. trying to set snorkel up to query an AWS Athena backend but I couldn't find anything suggesting how to show snorkel how to run the 3 main query types.

Am I not seeing some docs or a backends folder? Does connecting Snorkel to Athena sound like a reasonable idea? Would love some pointers here and I'd be happy to help document the steps in case this is helpful going forward.

okayzed commented 6 years ago

yes, it should be reasonable to use aws athena. perhaps the easiest driver to start with is postgres_raw.js (if athena supports SQL). it uses squel query builder to build queries but is still about 800 lines of code :-(

to set the driver to use, set backend in config/config.js to the name of the driver you are working on.

there is a skeleton for the driver in the same folder (called driver.js). i have not documented the expected format of the data for each query type, but i can do so (or help to). the order of writing the driver is probably:

smoreinis commented 6 years ago

thanks for such a quick response! Athena is ~ Presto 0.172, so it sounds like postgres_raw should be a good starting point.

I'll start digging and report back if I get stuck.

smoreinis commented 6 years ago

First bit of confusion (albeit a non-blocker): what's up with the driver validate function?

I see https://github.com/logv/snorkel/blob/master/snorkel/app/server/backend.js#L22 trying to invoke it on the driver but it seems to only have been defined in https://github.com/logv/snorkel/blob/master/snorkel/app/server/backends/driver.js#L104 (and even then it seems like the expectation is for it to be defined with predict_column_types below).

How does this work for e.g. postgres_raw or the other existing drivers that don't have validate defined at all?

okayzed commented 6 years ago

i think the intent behind validate() is that it's a validation of the driver config run on startup - it can be used to check for connections to DB and verify config settings are set up properly.
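a minimal sketch of what a startup check like that could look like. the option names (host, port) and both helper names (makeDriver, validateIfPresent) are made up for illustration; they are not snorkel's real config keys or API, and the guard is only one possible way a caller could tolerate drivers that do not define validate.

```javascript
// Hypothetical driver factory: `options.host` and `options.port` are
// assumed names, not snorkel's actual config keys.
function makeDriver(options) {
  return {
    // Check the driver config once at startup and fail loudly if it is bad.
    validate: function () {
      const problems = [];
      if (typeof options.host !== "string" || options.host.length === 0) {
        problems.push("host must be a non-empty string");
      }
      if (!Number.isInteger(options.port) || options.port <= 0) {
        problems.push("port must be a positive integer");
      }
      if (problems.length > 0) {
        throw new Error("invalid driver config: " + problems.join("; "));
      }
      return true;
    }
  };
}

// Hypothetical caller-side guard: only call validate() when the driver
// defines it. Returns false when there was nothing to run.
function validateIfPresent(driver) {
  if (typeof driver.validate !== "function") {
    return false;
  }
  return driver.validate();
}
```

a real driver would probably also try to open a connection to the DB here, which this sketch leaves out.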

for predict_column_types - you don't have to implement it. it's a utility function to guess column types from samples and is shared between a couple of backend drivers. it should be moved to a separate file, thanks for noticing.
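to give a rough idea of what "guess column types from samples" can mean, here is a small sketch. it is not the real predict_column_types: the function name guessColumnTypes, the input shape (an array of plain objects) and the two type labels are all assumptions made for this example.

```javascript
// Hypothetical helper, not snorkel's predict_column_types.
// Input: an array of sample rows (plain objects).
// Output: a map of column name -> "integer" or "string".
function guessColumnTypes(samples) {
  const types = {};
  for (const sample of samples) {
    for (const [name, value] of Object.entries(sample)) {
      // Missing values carry no type information.
      if (value === null || value === undefined) {
        continue;
      }
      // Once a column has been seen holding a non-number, keep it a string.
      if (types[name] === "string") {
        continue;
      }
      types[name] = typeof value === "number" ? "integer" : "string";
    }
  }
  return types;
}
```

the choice to fall back to "string" as soon as one non-numeric value shows up is a design assumption; a real implementation might vote across samples instead.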

smoreinis commented 6 years ago

Sorry if my previous message was confusing. I followed the steps in https://github.com/logv/snorkel/wiki/Installation, and step 0 of getting snorkel running gave me the error (complaining about validate not being a valid function). I had this issue with both sybil and a copy of driver.js. Oddly enough, after cleaning my node_modules and running npm install again, it seems to have gone away: snorkel now starts successfully in both cases, as long as, in the driver.js case, I export only the driver (no predict_column_types) as the top-level class.

okayzed commented 6 years ago

it's possible that the validate() message was a red herring and there was a hiccup with npm or that some npm module fails on the first build and works on the second.

i have been capturing queries so you can see the data going in and out of a driver: https://logv.org/~okay/snorkel/queries/

in the query_spec.params, notice that the important fields are usually baseview (one of samples, time, table or dist), cols (the integer columns to aggregate), dims (the string group by columns), start_ms, end_ms (millisecond timestamp filters), time_bucket and a few more.
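as a rough illustration of how those params could map onto SQL for a table (group by) query, here is a sketch. the field names come from the list above, but the value shapes (arrays of column names, numeric timestamps), the table name, the time column name, the choice of AVG as the aggregate and the helper names are all assumptions. it builds the string by hand instead of using squel so that it runs without dependencies.

```javascript
// Hypothetical example of query_spec.params; value shapes are assumed.
const exampleParams = {
  baseview: "table",
  cols: ["latency"],
  dims: ["country"],
  start_ms: 1500000000000,
  end_ms: 1500003600000
};

// Quote an SQL identifier, doubling any embedded double quotes.
function quoteIdent(name) {
  return '"' + String(name).replace(/"/g, '""') + '"';
}

// Build a GROUP BY query from the params. `tableName` and `timeColumn`
// are assumptions: the time column is expected to hold millisecond timestamps.
function buildTableQuery(tableName, timeColumn, params) {
  if (!Number.isFinite(params.start_ms) || !Number.isFinite(params.end_ms)) {
    throw new Error("start_ms and end_ms must be numbers");
  }
  const dims = (params.dims || []).map(quoteIdent);
  const aggs = (params.cols || []).map(function (col) {
    return "AVG(" + quoteIdent(col) + ") AS " + quoteIdent("avg_" + col);
  });
  const select = dims.concat(aggs, ["COUNT(*) AS " + quoteIdent("count")]);

  let sql = "SELECT " + select.join(", ") + " FROM " + quoteIdent(tableName);
  sql += " WHERE " + quoteIdent(timeColumn) + " >= " + params.start_ms;
  sql += " AND " + quoteIdent(timeColumn) + " < " + params.end_ms;
  if (dims.length > 0) {
    sql += " GROUP BY " + dims.join(", ");
  }
  return sql;
}

console.log(buildTableQuery("events", "time_ms", exampleParams));
```

the other baseviews (samples, time, dist) would need their own query shapes, and time_bucket is ignored here.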

smoreinis commented 6 years ago

Looking into it more, I think the driver not being the top-level export in driver.js was probably the only reproducible issue, and I had some trouble getting my config to change in the way I expected. Thanks for the pointer to the queries! That will be extremely helpful.
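For anyone hitting the same error, here is a small sketch of why the export shape matters. It assumes the loader calls validate directly on whatever the driver module exports; the object and variable names are made up for illustration and this is not a copy of driver.js.

```javascript
// Hypothetical stand-ins for what a driver file might define.
const driver = {
  validate: function () {
    return true;
  }
};
function predict_column_types() {}

// Shape 1: the driver is wrapped in an object next to a helper.
const wrappedExport = { driver: driver, predict_column_types: predict_column_types };

// Shape 2: the driver itself is the top-level export.
const topLevelExport = driver;

// A loader that calls exported.validate() only works with shape 2.
console.log(typeof wrappedExport.validate);  // "undefined"
console.log(typeof topLevelExport.validate); // "function"
```

With the wrapped shape, calling validate on the export throws a "not a function" TypeError, which matches the symptom described earlier in this thread.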