googlecolab / colabtools

Python libraries for Google Colaboratory
Apache License 2.0

Problem connecting Colab notebook to GCE VM instance #3741

Closed: patrickmetzner closed this issue 1 year ago

patrickmetzner commented 1 year ago

Problem connecting Colab notebook to Google Compute Engine VM instance using the "Connect to a custom GCE VM" option

Current behavior: After creating a VM instance with GCE, I am unable to connect a Colab notebook to it. According to https://console.cloud.google.com/compute/instances the VM starts without problems, and I can also access it via SSH, but when I try to connect a Colab notebook to it I immediately receive the following message: Unable to connect to the runtime.

I have been experiencing this problem since May 26th, 2023.

Note: I noticed the same problem on November 4th, 2022, and after 2-3 days everything went back to normal without any action on my part.

Expected behavior: I have been using the same machine image to create VM instances for over 6 months, and I have always been able to connect Colab Notebooks to them by using the Connect to a custom GCE VM option. This is the expected behavior.

Browser being used: Google Chrome

Additional context: Running docker ps via SSH, I get the following output:

$ docker ps
CONTAINER ID   IMAGE                                             COMMAND                  CREATED         STATUS          PORTS     NAMES
68fa1efa4e9c   gcr.io/colab-datalab/datalab:baked                "/bin/sh -c /datalab…"   13 months ago   Up 51 minutes             k_default
ad366de98a6b   gcr.io/colab-datalab/tunnelbackend_binary:baked   "/tunnelbackend_bina…"   17 months ago   Up 51 minutes             tunnelbevm
86e65928c184   gcr.io/colab-datalab/kernel_manager_proxy:baked   "/kernel_manager_pro…"   17 months ago   Up 51 minutes             kmp_default
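To check the same three containers from a script rather than by eye, one option is to ask Docker for a tab-separated listing with `docker ps --format '{{.Names}}\t{{.Status}}'` and parse that. This is a minimal sketch: it assumes the three names in the output above (`k_default`, `tunnelbevm`, `kmp_default`) are the ones worth checking, and the helper names are hypothetical.

```python
import subprocess
from typing import Dict, List, Sequence

# Container names taken from the `docker ps` output above (assumed to be the relevant ones).
EXPECTED = ("k_default", "tunnelbevm", "kmp_default")


def parse_ps(text: str) -> Dict[str, str]:
    """Parse `docker ps --format '{{.Names}}\t{{.Status}}'` output into {name: status}."""
    statuses = {}
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        name, _, status = line.partition("\t")
        statuses[name.strip()] = status.strip()
    return statuses


def missing_or_down(statuses: Dict[str, str], expected: Sequence[str] = EXPECTED) -> List[str]:
    """Return expected containers that are absent or whose status does not start with 'Up'."""
    return [name for name in expected if not statuses.get(name, "").startswith("Up")]


def docker_ps() -> str:
    """Run `docker ps` with a machine-readable format (needs the docker CLI, e.g. on the VM)."""
    return subprocess.run(
        ["docker", "ps", "--format", "{{.Names}}\t{{.Status}}"],
        capture_output=True, text=True, check=True,
    ).stdout
```

On the VM, `missing_or_down(parse_ps(docker_ps()))` returns an empty list when all three are up. Plain `docker ps` (without `-a`) lists only running containers, so a stopped `k_default` shows up as missing rather than as "Exited".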

Running sudo systemctl status k_default via SSH, I get the following output:

$ sudo systemctl status k_default
● k_default.service - kernel default docker container
   Loaded: loaded (/etc/systemd/system/k_default.service; static; vendor preset: disabled)
   Active: inactive (dead)

If I run docker rm k_default -f followed by sudo systemctl start k_default, the container starts and I can successfully run nvidia-smi inside it, but that is still not enough to let Colab notebooks connect.
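The `Active:` line is the part of the `systemctl status` output above that shows the `k_default` unit is not running. A minimal sketch for pulling that state out of captured output; `unit_state` is a hypothetical helper, and on the VM `systemctl is-active k_default` is a shorter way to get the same active state.

```python
import re
from typing import Optional, Tuple

# Matches e.g. "   Active: inactive (dead)" or "   Active: active (running) since ...",
# capturing the active state and the sub-state in parentheses.
ACTIVE_RE = re.compile(r"^\s*Active:\s+(\S+)\s+\(([^)]*)\)", re.MULTILINE)


def unit_state(status_text: str) -> Optional[Tuple[str, str]]:
    """Return (active_state, sub_state) from `systemctl status` output, or None if absent."""
    match = ACTIVE_RE.search(status_text)
    return (match.group(1), match.group(2)) if match else None
```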

sjgosai commented 1 year ago

It's possible that this is related to another issue regarding reconnecting to Custom VMs that have been stopped and started. There are similar observations there regarding the status of k_default.

cperry-goog commented 1 year ago

Thanks - we're looking into this. Duplicate of #3714

In the meantime you can try running our Docker container in a VM of your choosing (option 1 here): https://research.google.com/colaboratory/local-runtimes.html

sjgosai commented 1 year ago

@cperry-goog thank you for the update. Appreciate the team looking into these issues! Any guidance for recovering data from machines that are experiencing these issues?

Also, I tried to follow the directions you pointed to here but was unsuccessful in connecting Colab to my problem VM. I also repeated the instructions with a fresh Colab VM that I confirmed to work using the "Connect to a custom GCE VM" option, but couldn't get the "Connect to a local runtime" to work with that one either.

A possible typo in the instructions surrounding local VMs. The command in the "Connecting to a runtime on another machine" section uses port 8888 on the remote, while I think the Docker image listens on port 8080. Unfortunately, I tried switching the suggested ports around but still couldn't get things to work.
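Much of the 8888 / 8080 / 9000 confusion comes down to which side of Docker's `-p` option is which: in `-p 127.0.0.1:9000:8080`, `127.0.0.1:9000` is the address and port on the VM, and `8080` is the port inside the container. A small, hypothetical parser that makes the split explicit, handling only this three-part form:

```python
from typing import NamedTuple


class Publish(NamedTuple):
    host_ip: str         # address on the VM the published port is bound to
    host_port: int       # port on the VM (what an SSH tunnel has to reach)
    container_port: int  # port the server listens on inside the container


def parse_publish(spec: str) -> Publish:
    """Parse a Docker `-p` value of the form host_ip:host_port:container_port.

    Two-part forms, IPv6 addresses, port ranges and /tcp suffixes are not handled.
    """
    host_ip, host_port, container_port = spec.rsplit(":", 2)
    return Publish(host_ip, int(host_port), int(container_port))
```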

cperry-goog commented 1 year ago

Connecting via Docker should be done with the Connect to a local runtime flow. What error do you encounter when you try that?

sjgosai commented 1 year ago

In a terminal window I've tried running:

Step 1. (local) gcloud compute ssh --zone "us-west1-b" "colab-5-vm" --project "my-project" -- -L 8888:localhost:8888

Step 2. (remote) docker run -p 127.0.0.1:9000:8888 us-docker.pkg.dev/colab-images/public/runtime which generates the following link, http://127.0.0.1:9000/?token=some0token0here.

Step 3. (local, Chrome browser) On Colab, in the "backend URL" field, I type: http://localhost:8888/?token=some0token0here. I've also tried replacing localhost:8888 with every combination of {127.0.0.1, localhost}:{8888, 9000}.
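Step 3 amounts to taking the URL printed in Step 2 and pointing it at the local port the SSH tunnel listens on, while keeping the `?token=...` query intact. A sketch of that rewrite with the standard library; `backend_url` is a hypothetical helper name. Rewriting the URL does not by itself fix a tunnel whose remote end points at the wrong VM port.

```python
from urllib.parse import urlsplit, urlunsplit


def backend_url(printed_url: str, local_port: int, local_host: str = "localhost") -> str:
    """Swap host:port in the URL printed by the runtime for the local tunnel endpoint.

    Scheme, path and query string (including the token) are left unchanged.
    """
    parts = urlsplit(printed_url)
    return urlunsplit(parts._replace(netloc=f"{local_host}:{local_port}"))
```

For example, `backend_url("http://127.0.0.1:9000/?token=some0token0here", 9000)` gives `http://localhost:9000/?token=some0token0here`.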

On terminal I see channel 4: open failed: connect failed: Connection refused and in Colab I see "Unable to connect to the runtime". Apologies if I'm misunderstanding how to organize the ports, and thank you for your help!
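`channel 4: open failed: connect failed: Connection refused` is reported by the SSH client when the VM side of the tunnel could not connect to the forward's target, here `localhost:8888` on the VM. A quick way to see which ports on the VM accept TCP connections is a plain socket probe; this is a minimal sketch with a hypothetical helper name.

```python
import socket


def is_listening(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # connection refused, timeout, unreachable host, ...
        return False
```

Run on the VM, e.g. `for p in (8888, 9000): print(p, is_listening("127.0.0.1", p))`. With the container from Step 2 running, 8888 refusing connections would match the error above. A successful connection only shows that something accepts connections on that port (with Docker, possibly just its port proxy), not that the runtime behind it is answering.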

sebbov commented 1 year ago

A possible typo in the instructions surrounding local VMs. [...] gcloud compute ssh --zone "us-west1-b" "colab-5-vm" --project "my-project" -- -L 8888:localhost:8888

Port 8888 is given as an example in the doc ("For example, to forward port 8888 on your local machine to port 8888 on your Google Compute Engine instance").

When using the docker command (corrected from what you used; see below), the port on your VM will be 9000, so use that. Using it also as the local listening port (i.e., ending up with -L 9000:localhost:9000) should let you use the URL obtained in your Step 2 in Step 3. We can make all of this a bit clearer, perhaps by using port 9000 on the VM (and on the local end when port forwarding) for both the Docker and Jupyter options.
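The advice above can be stated as a check: the remote side of the `-L` forward must be the VM port Docker published, and the Step 2 URL works unchanged only when the local listening port equals that VM port. A sketch of that check, assuming the simple `local:host:remote` form of `-L` and the `ip:vm_port:container_port` form of `-p`; `check_ports` is a hypothetical name.

```python
from typing import List


def check_ports(ssh_forward: str, docker_publish: str) -> List[str]:
    """Compare `ssh -L local:host:remote` with `docker run -p ip:vm_port:container_port`.

    Returns a list of notes; an empty list means the tunnel reaches the published
    VM port and the printed URL can be pasted into Colab as-is.
    """
    local_s, _, remote_s = ssh_forward.split(":")
    _, vm_s, _ = docker_publish.rsplit(":", 2)
    local_port, remote_port, vm_port = int(local_s), int(remote_s), int(vm_s)

    notes = []
    if remote_port != vm_port:
        notes.append(f"tunnel targets VM port {remote_port}, but Docker published VM port {vm_port}")
    if local_port != vm_port:
        notes.append(f"local port {local_port} differs from VM port {vm_port}; "
                     "change the port in the printed URL before pasting it")
    return notes
```

With the settings from the earlier attempt, `check_ports("8888:localhost:8888", "127.0.0.1:9000:8888")` flags both mismatches, while `check_ports("9000:localhost:9000", "127.0.0.1:9000:8080")` returns `[]`.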

(remote) docker run -p 127.0.0.1:9000:8888 us-docker.pkg.dev/colab-images/public/runtime which generates the following link, http://127.0.0.1:9000/?token=some0token0here.

The doc tells you to use container port 8080, not 8888:

docker run -p 127.0.0.1:9000:8080 us-docker.pkg.dev/colab-images/public/runtime
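Putting the corrections together: container port 8080, VM port 9000, and a tunnel that uses 9000 on both ends. The sketch below only assembles the two commands as argument lists so the port numbers come from one place; the zone, instance, and project values are the placeholders from this thread, and `corrected_commands` is a hypothetical helper.

```python
from typing import List, Tuple

RUNTIME_IMAGE = "us-docker.pkg.dev/colab-images/public/runtime"
CONTAINER_PORT = 8080  # port inside the container, per the doc quoted above


def corrected_commands(zone: str, instance: str, project: str,
                       vm_port: int = 9000, local_port: int = 9000) -> Tuple[List[str], List[str]]:
    """Return (local `gcloud compute ssh` command, remote `docker run` command) with consistent ports."""
    ssh_cmd = [
        "gcloud", "compute", "ssh", "--zone", zone, instance, "--project", project,
        # Everything after "--" is passed to ssh: forward local_port to VM port vm_port.
        "--", "-L", f"{local_port}:localhost:{vm_port}",
    ]
    docker_cmd = [
        # Publish the container's port on the VM's loopback interface only.
        "docker", "run", "-p", f"127.0.0.1:{vm_port}:{CONTAINER_PORT}", RUNTIME_IMAGE,
    ]
    return ssh_cmd, docker_cmd
```

`shlex.join(cmd)` turns either list into a copy-pasteable shell command.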