pangeo-data / pangeo-stacks

Curated Docker images for use with Jupyter and Pangeo
https://pangeo-data.github.io/pangeo-stacks/
BSD 3-Clause "New" or "Revised" License

Can I run one of these docker images in a standalone cloud VM? #82

Open rabernat opened 5 years ago

rabernat commented 5 years ago

I just tried and failed to run the pangeo-notebook image in a standalone, hand-made google compute instance. I tried following the Google Cloud docs, but whenever my VM booted up, I was "ryan_abernathey" instead of "jovyan" and couldn't find any of the familiar environment or commands.

Is this possible? If so, could we provide some instructions for how to manually boot up an image and connect to the notebook server?

jhamman commented 5 years ago

@rabernat - can you share the gcloud command you ran? I think this would be a cool feature to have, so it's worth looking into.

I'm guessing gcloud is doing some fancy stuff to get your user name in there, and this is where things go wrong. I wonder if there is any way to run the image as jovyan?

yuvipanda commented 5 years ago

I too am interested in making this happen! Do share the commands you used, @rabernat.

rabernat commented 5 years ago

I did everything through the console like a n00b.

[image: screenshot from the Google Cloud console]

In that box I tried various things, like

https://hub.docker.com/r/pangeo/base-notebook

or

pangeo/base-notebook:2019.09.21

The containers all launched, but when I ssh'd in, I did not find my familiar environment. Instead it was just a vanilla VM.

rabernat commented 5 years ago

I figured out the command-line version of what I was doing based on the docs

gcloud compute instances create-with-container pangeo-etl-1 --container-image pangeo/base-notebook:2019.09.21

When I ssh to this container, I don't see anything I expect from the pangeo environment.

rabernat commented 5 years ago

After more digging, it appears that the container is indeed running:

$ docker ps
CONTAINER ID        IMAGE                                                                COMMAND                  CREATED             STATUS              PORTS               NAMES
602a9c2c3970        pangeo/base-notebook:2019.09.21                                      "/usr/local/bin/repo…"   2 minutes ago       Up 2 minutes                            klt-pangeo-etl-1-bddu
49be272b0b7f        gcr.io/stackdriver-agents/stackdriver-logging-agent:0.2-1.5.33-1-1   "/entrypoint.sh /usr…"   3 minutes ago       Up 3 minutes                            stackdriver-logging-agent

but it doesn't seem to have any open ports. I don't know how to connect to it.

yuvipanda commented 5 years ago

Yeah, you need to do something along the lines of https://cloud.google.com/compute/docs/containers/configuring-options-to-run-containers?hl=en_US&_ga=2.220013401.-1348765089.1507054149#publishing_container_ports to make the ports open to the internet. We'll also need to figure out some kinda token security.
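
A rough sketch of what opening the port could look like, written as a small Python helper that only builds the gcloud command. As I read the linked docs, containers started this way share the VM's network stack, so the remaining step is a firewall rule. The port (8888, the usual Jupyter default), the rule name, the network tag, and the example address range are all assumptions, and the VM would need the same tag (for example via --tags at creation time).

```python
import shlex


def firewall_rule_command(rule_name, port=8888, target_tag="jupyter",
                          source_range=None):
    """Build a `gcloud compute firewall-rules create` command as an argv list.

    rule_name, target_tag and the default port are placeholders chosen for
    this sketch; they do not come from the thread.
    """
    cmd = [
        "gcloud", "compute", "firewall-rules", "create", rule_name,
        "--allow", f"tcp:{port}",
        "--target-tags", target_tag,
    ]
    if source_range:
        # Limit access to one address range instead of the whole internet.
        cmd += ["--source-ranges", source_range]
    return cmd


# Print the command for review rather than running it.
print(shlex.join(firewall_rule_command("allow-jupyter",
                                       source_range="203.0.113.7/32")))
```

Restricting the source range to your own address is probably wise here, since the notebook would otherwise be reachable by anyone holding the token over plain HTTP.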

yuvipanda commented 5 years ago

If you do docker exec -it 602a9c2c3970 /bin/bash it should put you inside the container. You can also do docker logs 602a9c2c3970 to get the logs, which will include the notebook authentication token.

I got the hash from under 'CONTAINER ID' in your output
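
If you want to script the token lookup, something like the following might work. It is only a sketch: it assumes the notebook server logs a URL containing token=<hex>, as Jupyter normally does, and the function names are made up for illustration.

```python
import re
import subprocess

# Jupyter normally logs its access URL with a "token=<hex>" query parameter.
TOKEN_RE = re.compile(r"[?&]token=([0-9a-f]+)")


def extract_token(log_text):
    """Return the first notebook token found in log_text, or None."""
    match = TOKEN_RE.search(log_text)
    return match.group(1) if match else None


def container_logs(container_id):
    """Return the combined stdout and stderr of `docker logs <container_id>`."""
    result = subprocess.run(
        ["docker", "logs", container_id],
        capture_output=True, text=True, check=True,
    )
    # The server may log to stderr, so both streams are searched.
    return result.stdout + result.stderr
```

On the VM, extract_token(container_logs("602a9c2c3970")) would then return the token string, or None if no URL has been logged yet.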

rabernat commented 5 years ago

Great! That stuff works!

What I really want is to connect to the notebook server, ideally in a secure way.

Just for some context, what I am trying to do here is build an ETL pipeline for moving datasets to the cloud. Currently we are staging datasets on local servers, converting them to zarr, and uploading to cloud storage. That seems silly. We should just do it all in the cloud. For that, I need a big VM with a big persistent disk attached to it!

yuvipanda commented 5 years ago

@rabernat I've spent some time writing a small script that chains together the commands to do that.

https://gist.github.com/yuvipanda/cb918977ba8db42c93f3db726e5cbca4

You can run it as:

python3 start-vm.py <vm-name>

And it'll spin around for a bit, and give you a URL (with a secret token) you can hit to have access to your notebook server! Even though it isn't completely insecure (because we use an access token), there's no HTTPS, so I wouldn't consider it 'secure'. We can probably use an SSH tunnel to make it more 'secure'. But this is an evening hack...
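
The SSH tunnel idea could look roughly like the sketch below. This is not what the gist does. It only builds the command and the local URL, assuming the notebook listens on port 8888 inside the VM; the helper names are hypothetical.

```python
def tunnel_command(vm_name, port=8888, zone=None):
    """Build a `gcloud compute ssh` command that forwards a local port to the VM."""
    cmd = ["gcloud", "compute", "ssh", vm_name]
    if zone:
        cmd += ["--zone", zone]
    # Arguments after "--" are handed to the underlying ssh client:
    # -N opens no remote shell, -L forwards localhost:<port> to the VM.
    cmd += ["--", "-N", "-L", f"{port}:localhost:{port}"]
    return cmd


def local_url(token, port=8888):
    """Return the notebook URL to open locally while the tunnel is running."""
    return f"http://localhost:{port}/?token={token}"


print(" ".join(tunnel_command("pangeo-etl-1")))
```

With the tunnel up, the notebook traffic travels over SSH and the port never has to be opened in the firewall.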

It's simple-ish, but could use a lot of docs + comments. It can also grow a lot of options - persistent disk access, automatic scopes to access GCS, VM size, etc. We can also make this into a 'real' software project, that can then be used to work on AWS, Azure, etc.

I probably won't have time to push on this, but would be happy to review code & answer questions if someone else does :)

rabernat commented 5 years ago

Thanks so much @yuvipanda!