logv / snorkel

UI for interactive data analysis | https://snorkel.logv.org
https://fb.com/groups/snorkelsnorkelsnorkel
161 stars 21 forks source link

[docker] add basic Dockerfile #28

Closed tmc closed 6 years ago

okayzed commented 6 years ago

i'm not familiar with docker - how do you test this and did it work? should the dockerfile also include the ability to bundle sybil too?

there's another repo that i've been using for building AppImage and Snap packages for snorkel+sybil: https://github.com/logv/snapcraft. snapcraft has a read-only package + a separate writeable dir, so i had to make changes to how snorkel works to accommodate that. not sure if docker has similar requirements.

prologic commented 6 years ago

No; sybil should be a separate image.

okayzed commented 6 years ago

i'm building it now - i also see where you got the npm audit log output.

despite not knowing docker (or its conventions), i have a Dockerfile I started for sybil, but didn't get far with: https://github.com/logv/sybil/tree/docker/docker, maybe it can be useful for you.

prologic commented 6 years ago

I've been following this project for a few weeks. I'll help with Dockerizing this :)

okayzed commented 6 years ago

@tmc: when i run the docker image locally (after doing a docker build), it errors with a missing htpasswd node module, but i think snorkel's npm-shrinkwrap lists htpasswd as a dep (maybe there is some misconfiguration in snorkel's npm shrinkwrap?)

@prologic perfect, thanks!

tmc commented 6 years ago

@okayzed sorry was missing the shrinkwrap file when I added separate cache utilization of package.json, this has been updated and tested.

okayzed commented 6 years ago

the image builds and i can run it. but there are no users, so i can't log in yet and test more stuff out. i have to look into where docker will store the app data and whether docker has the concept of read-only container + write area for user data. i'm also curious to see how sybil will get tied in

maybe some more stuff to do:

tmc commented 6 years ago

@okayzed I'll introduce a docker.js config. -- user management scripts can be reached via docker exec invocations. While sybil certainly should have it's own docker image(s) I think it'd be most appropriate for you do publish a binary release of sybil that can be pulled into the snorkel image.

tmc commented 6 years ago

@okayzed PTAL Assuming data_dir/config/users.htpasswd is populated correctly, a simple 'make run-image-tour' will start snorkel with a persistent sybil backend running at localhost:3000

prologic commented 6 years ago

I would really not be doing this at all. Don't expect to pull in sybil into the snorkel image. Keep things decoupled and keep "scalability" in mind. Don't back yourself into a corner like this; build the images seaprately and think about "scale".

okayzed commented 6 years ago

in terms of scaling, what i was picturing (based on scuba design) for distributed queries is something like:

i made some guesses around different type of machine setups here: http://logv.org/distributed-notes.html#example-cluster-setups

prologic commented 6 years ago

This sounds initially good to me. Is there support coming for multiple backends too at some point? Like Presto?

cheers James

James Mills / prologic

E: prologic@shortcircuit.net.au W: prologic.shortcircuit.net.au

On Tue, May 15, 2018 at 7:03 PM, okay notifications@github.com wrote:

in terms of scaling, what i was picturing (based on scuba design) for distributed queries is something like:

  • leaf nodes: many machines will have a sybil install, perhaps connected to the end of ingestion pipes (like scribe/kafka)
  • aggregator: one (or more) machine will have a sybil binary and will be used to coordinate queries (across the leaf machines) and stitch results back together
  • UI: one machine will have snorkel install with msybil binary. this machine connects to the master aggregator to run sybil queries. might be the same as the master aggregator machine at first. the UI can not be installed on multiple machines at the moment.

i made some guesses around different type of machine setups here: http://logv.org/distributed-notes.html#example-cluster-setups

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/logv/snorkel/pull/28#issuecomment-389371120, or mute the thread https://github.com/notifications/unsubscribe-auth/ABOv-ijb_ad-U_BJhaaRPO98y2y-YkYnks5ty4jogaJpZM4T8_Pz .

okayzed commented 6 years ago

to understand what you are asking: you want to be able to show multiple datasets from different sources in the snorkel UI? (f.e. we might have some datasets coming from presto and other datasets coming from sybil or mysql tables)

currently, i've only ever been able to support one backend at a time in snorkel but multiple simultaneous backends has been a requested use case in the past. i haven't thought too much about how to accomplish it, but it could be doable. (let's open an issue for it)

my main blocker with presto is that i don't have a presto setup to develop against (and i forget if presto is schemaless, maybe it has per block schemas), but i built a postgres and sqlite adapter that should be easy to copy and modify to work on top of presto, because those adapters build the queries as straight forward SQL. if i'm given a set of presto tables to run queries against (that has a time column in seconds), i can likely add a presto adapter.

prologic commented 6 years ago

Yes :) This is something I think is worth building (at least allowing for it).

Some backends to consider supporting:

cheers James

James Mills / prologic

E: prologic@shortcircuit.net.au W: prologic.shortcircuit.net.au

On Tue, May 15, 2018 at 7:13 PM, okay notifications@github.com wrote:

to understand what you are asking: you want to be able to show multiple datasets from different sources in the snorkel UI? (f.e. we might have some datasets coming from presto and other datasets coming from sybil or mysql tables)

currently, i've only ever been able to support one backend at a time in snorkel but multiple simultaneous backends has been a requested use case in the past. i haven't thought too much about how to accomplish it, but it could be doable. (let's open an issue for it)

my main blocker with presto is that i don't have a presto setup to develop against (and i forget if presto is schemaless, maybe it has per block schemas), but i built a postgres and sqlite adapter that should be easy to copy and modify to work on top of presto, because those adapters build the queries as straight forward SQL. if i'm given a set of presto tables to run queries against (that has a time column in seconds), i can likely add a presto adapter.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/logv/snorkel/pull/28#issuecomment-389372611, or mute the thread https://github.com/notifications/unsubscribe-auth/ABOv-tBWXaY_pFAdAP6igF6Mh6XSUJJZks5ty4tGgaJpZM4T8_Pz .

okayzed commented 6 years ago

@prologic: the mysbil binary has dependency on sybil binary, so i'm fine with including this dependency for now. the plan is to move it out, though - maybe we can all sync up on this offline (on fb chats or IRC)

@tmc: awesome! i tested it out - it works for me. i had to move to root of snorkel/ directory, then create data_dir/config/ and copy the docker.js and users.htpasswd into it, then i was able to log in and start browsing datasets it created. (maybe this can be automated) - anything else left?

tmc commented 6 years ago

@prologic given the execution model of sybil it's not trivial to run it in a separate container. I don't think the downsides outweigh the convenience this provides. Distribution and scale will come when sybil can be accessed over a network boundary (such as with msybil).

@okayzed you shouldn't need to copy the docker.js config into data_dir. In terms of additions perhaps having a make target that sets up a demo account or extending the add_user script to support writing in other locations would provide a nicer initial setup experience.

okayzed commented 6 years ago

if i don't copy docker.js into ${ROOT}/data_dir/config/, this is what i see at the beginning of the snorkel run:

WARNING: Running server without a $RELEASE - assets will be served in development mode!
CONFIG: LOADING CONFIG DIR FROM /snorkel
CONFIG: Using config in config/config
CONFIG: Adding overrides from config/override.js
CONFIG: Using ENV overrides in config/docker
CONFIG: LOADING CONFIG DIR FROM /var/data_dir
CONFIG: Main config error:Error: Cannot find module '/var/data_dir/config/config'
CONFIG: Couldn't load overrides in config/override.js:Error: Cannot find module '/var/data_dir/config/override'
CONFIG: Couldn't load env conf from config/docker:Error: Cannot find module '/var/data_dir/config/docker'

these are small things that can be taken care of later, though - up to you on how far you want to push this patch.

prologic commented 6 years ago

@tmc

@prologic given the execution model of sybil it's not trivial to run it in a separate container. I don't think the downsides outweigh the convenience this provides. Distribution and scale will come when sybil can be accessed over a network boundary (such as with msybil).

I don't understand. How is snorkel communicating with the sybil backend now? Surely it must be doing so over some kind of API (impliying a network boundary)

TBH though I haven't looked at the codebase too closely (yet); but I want to use sybil/snorkel in my infra.

okayzed commented 6 years ago

@prologic: sybil is a serverless binary that can ingest from stdin and query samples that are on disk. when the ingestion or query finishes, the binary exits. snorkel calls directly into this binary to ingest or execute queries on local data. when doing remote queries, snorkel uses msybil (a python script) to call out to sybil on other machines. the plan is to evolve into a more scalable form (as you were mentioning).

@tmc will likely be creating a grpc server to sit on top of sybil, s.t. sybil does appear like a regular backend service. then we will be able to arbitrarily add leaf nodes to the sybil swarm and execute queries from the main snorkel server that fan out.

prologic commented 6 years ago

@prologic: sybil is a serverless binary that can ingest from stdin and query samples that are on disk. when the ingestion or query finishes, the binary exits. snorkel calls directly into this binary to ingest or execute queries on local data. when doing remote queries, snorkel uses msybil (a python script) to call out to sybil on other machines. the plan is to evolve into a more scalable form (as you were mentioning).

Oh I see; Yeah we should extend this to support some kind of API. GRPC, HTTP, something.

tmc commented 6 years ago

@okayzed those warnings are displayed but the config is still read from config/ -- is it better to leave config_dir untouched? I was basing that off of the snapcraft config.

Maybe it's better to not use the label 'docker' since the important thing the docker config is doing is just pulling values from the environment. Is calling this config 'env' or 'environment' too generic/confusing?

okayzed commented 6 years ago

@tmc: config_dir/ is so that someone can configure the image externally, otherwise the config will be hardcoded to what was set at packaging time. up to you on what you want to use for the label - env is a tiny bit generic, but it's fine if you copy it over to data_dir/config/env.js or data_dir/config/config.js from snorkel/config/docker.js when packaging. that way, people actually modifying the config will not be confused by the name.

tmc commented 6 years ago

My intent was that data_dir is provided as a volume by the user which implies it's not something that could have files put into it at packaging time. Perhaps a good solution is to improve the error message to make it seem like less of an error but more like it's another path that is being checked for configuration?

As long as configuration is declarative in nature I think it should be provided via the environment vs a configuration file.

okayzed commented 6 years ago

@tmc: ah, i see - that makes sense re: the not being able to package the config into the volume at build time. how it is now is fine then. you can likely remove the config_dir setting from the config file and those lines will go away.

okayzed commented 6 years ago

i guess that's all that's left for accept? (removing config_dir line) and maybe we'll add some script to setup the data_dir/ volume later

tmc commented 6 years ago

@okayzed ok, pushed with config_dir removal and I pushed a branch here with the docker->env change -- you pick to include it or now: https://github.com/tmc/snorkel/commit/09d6c8775c1aeecef5159e78dac7c95aa362fe3a

okayzed commented 6 years ago

merged in 3c8550d19d71056802a616f27432d5fd232a97b5, i made some slight changes (added config_dir back in via process.env.CONFIG_DIR and npm install sqlite3 --build-from-source to fix #31)