googledatalab / datalab

Interactive tools and developer experiences for Big Data on Google Cloud Platform.
Apache License 2.0

Datalab won't connect to VM instance after long time waiting for it to be reachable at port 8081 #2124

Open miguel2488 opened 5 years ago

miguel2488 commented 5 years ago

Hi,

I've been working on this for days and have read a lot about this issue on Google, but I couldn't find anything that helps me solve it.

I created a Datalab instance from the gcloud shell like this:

datalab create --image-name c2-deeplearning-tf-1-13-cu100-20190227 --disk-size-gb 100 --machine-type n1-standard-8 my-instance --network-name my-net-01 --zone europe-west1-b

It all works fine: I'm asked to create a passphrase, the RSA keys are propagated, and then I get this message of death:

Waiting for Datalab to be reachable at http://localhost:8081/

I can SSH to the VM instance using the button to the right, or using gcloud compute ssh instance. No problems with that.

Running the datalab connect command with --ssh-log-level=debug, I get thousands of messages like this one:

[Screenshot: SSH debug output]

It walks through all the ports trying to connect to port 8081 but never succeeds, so after a long wait I finally get this message:

connection closed attempting to reconnect

and the whole process starts again from the beginning.
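
To see whether the tunnel ever starts answering, the wait step can be approximated with a small polling loop run on the local machine while datalab connect is active. This is only a sketch and not the datalab CLI's own logic: the function name wait_for_datalab and the PROBE override are hypothetical, and it assumes curl is installed.

```shell
# wait_for_datalab URL MAX_TRIES DELAY_SECONDS
# Polls URL until something answers or MAX_TRIES attempts have failed.
# PROBE is a hypothetical override hook (not part of datalab); it defaults
# to curl, which exits 0 as soon as any HTTP response comes back.
wait_for_datalab() {
  url=$1
  tries=$2
  delay=$3
  i=0
  while [ "$i" -lt "$tries" ]; do
    # Unquoted on purpose so the default expands into curl plus its flags.
    if ${PROBE:-curl -s -o /dev/null --max-time 2} "$url"; then
      echo "reachable after $((i + 1)) attempt(s)"
      return 0
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  echo "not reachable after $tries attempt(s)" >&2
  return 1
}
```

For example, wait_for_datalab http://localhost:8081/ 60 5 would poll every 5 seconds for up to 5 minutes. If it never succeeds while SSH itself works, the problem is more likely on the VM side than in the firewall rules.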

This is a screenshot of my firewall rules:

[Screenshot: firewall rules]

I think everything is OK here. What am I missing? Where is the problem? Can someone help, please? I've been stuck here for over a week now, so any help will be much appreciated.

Thank you very much in advance.

antellgc commented 5 years ago

Having the same problems here. @miguel2488 have you had any luck with a fix?

miguel2488 commented 5 years ago

Nope, nothing new here. I wasn't able to fix it since I don't have a clue where the problem is coming from. Instead of using Datalab, I resigned myself to running Jupyter notebooks on the machine. I'm totally blind with this and, from what I've seen so far, no one seems to care about this thread. I wish you better luck.

hacktuarial commented 5 years ago

I had the same problem, and observed that the container running jupyter on the VM took ~5 minutes to start up. My workaround was:

MchlUh commented 4 years ago

Hello hacktuarial, I have the same issue and tried your solution, but the datalab container never appears for me. Did you simply run gcloud compute ssh ...(name of instance)? Thanks for your help!

hacktuarial commented 4 years ago

Yes, that's what I ran. Can you post a sample of your ssh logs? It sounds like the problem may be with the datalab create command.

MchlUh commented 4 years ago

I was using a datalab connect ... command until now, and have now tried with datalab create .... It actually works exactly as you said: the logger and datalab containers appeared!

It may have something to do with the way I created my instance in the beginning. At the time I used: datalab beta create-gpu datalab-instance-name

Anyway, I am now able to use Datalab! Thanks :)

MchlUh commented 4 years ago

It seems that when creating an instance with a GPU, the same problem appears but this solution does not apply. It has now been an hour since I created it, and docker ps shows only the logger container, not the datalab container.

chanyou0311 commented 4 years ago

I have a problem similar to @MichaelTheBrute's.

I tried to launch an instance of Datalab with the command below.

$ datalab beta create-gpu --machine-type n1-standard-4 --zone us-west1-b --accelerator-type nvidia-tesla-k80 --accelerator-count 1 datalab-instance
By accepting below, you will download and install the
following third-party software onto your managed GCE instances:
    NVidia GPU Driver: NVIDIA-Linux-x86_64-390.46
Do you accept (y/N)?: y
Creating the disk datalab-instance-pd
Creating the instance datalab-instance

Due to GPU Driver installation, please note that Datalab GPU instances take significantly longer to startup compared to non-GPU instances.
Created [https://www.googleapis.com/compute/beta/projects/xxxxxxxx/zones/us-west1-b/instances/datalab-instance].
Connecting to datalab-instance.
This will create an SSH tunnel and may prompt you to create an rsa key pair. To manage these keys, see https://cloud.google.com/compute/docs/instances/adding-removing-ssh-keys
Waiting for Datalab to be reachable at http://localhost:8081/

However, there was still no response after more than 30 minutes. I had seen information that startup takes about 15 minutes, but this seemed far too long.

I made an ssh connection to the instance and started investigating. As discussed before, I also ran the docker ps command.

datalab@datalab-instance ~ $ sudo docker ps -a
CONTAINER ID        IMAGE                                         COMMAND                  CREATED             STATUS              PORTS               NAMES
4994361cf048        gcr.io/google-containers/fluentd-gcp:2.0.17   "/bin/sh -c '/run.sh…"   19 minutes ago      Up 19 minutes       80/tcp              logger

The datalab container was not running. However, when I ran the same command a few minutes later, I saw the gcr.io/cloud-datalab/datalab-gpu:latest image, but only that one time. (I forgot to take notes.) Since then, I have not been able to see the container again.
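
Because the container seems to appear and disappear, it may help to script the check instead of reading docker ps by eye. The sketch below is an assumption-laden helper, not something from the Datalab tooling: the function name has_datalab_container is hypothetical, and it simply looks for an image under the cloud-datalab registry path seen above.

```shell
# has_datalab_container: reads "IMAGE NAME" lines on stdin and exits 0
# if any image path contains "cloud-datalab/", otherwise exits 1.
# Intended input (run on the VM):
#   sudo docker ps --format '{{.Image}} {{.Names}}'
has_datalab_container() {
  awk '$1 ~ /cloud-datalab\// { found = 1 } END { exit !found }'
}
```

Piping the docker ps command from the comment into has_datalab_container inside a loop with a sleep, and printing a timestamp on each pass, would show when the container starts and whether it exits again.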

Since the CPU (non-GPU) instance worked correctly, I thought the problem might be that the GPU was not set up correctly. The GPU setup seems to be done in the startup script, so I checked that the script finished successfully.

datalab@datalab-instance ~ $ systemctl status google-startup-scripts.service
● google-startup-scripts.service - Google Compute Engine Startup Scripts
   Loaded: loaded (/usr/lib/systemd/system/google-startup-scripts.service; disabled; vendor preset: disabled)
   Active: inactive (dead) since Tue 2020-02-11 07:27:30 UTC; 34min ago
 Main PID: 421 (code=exited, status=0/SUCCESS)
      CPU: 881ms

I checked the log with the journalctl command, and the script seemed to have finished successfully.
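
The same check can be made without reading the status output by eye, using the Result and ExecMainStatus properties that systemctl show reports for a unit. This is a minimal sketch; the function name startup_script_ok is hypothetical.

```shell
# startup_script_ok: reads KEY=VALUE lines on stdin and exits 0 only if
# Result=success and ExecMainStatus=0 are both present.
# Intended input (run on the VM):
#   systemctl show google-startup-scripts.service -p Result -p ExecMainStatus
startup_script_ok() {
  awk -F= '
    $1 == "Result"         { result = $2 }
    $1 == "ExecMainStatus" { status = $2 }
    END {
      if (result == "success" && status == "0") exit 0
      exit 1
    }
  '
}
```

A zero exit status here only says that the startup script itself exited cleanly. It says nothing about units that depend on it.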

In the process, I noticed that wait-for-startup-script.service did not finish properly.

datalab@datalab-instance ~ $ systemctl --failed
  UNIT                            LOAD   ACTIVE SUB    DESCRIPTION
● wait-for-startup-script.service loaded failed failed Wait for the startup script to setup required directories
datalab@datalab-instance ~ $ sudo journalctl -u wait-for-startup-script.service
-- Logs begin at Tue 2020-02-11 06:59:19 UTC, end at Tue 2020-02-11 08:05:27 UTC. --
Feb 11 06:59:34 datalab-instance systemd[1]: Starting Wait for the startup script to setup required directories...
Feb 11 06:59:34 datalab-instance docker-credential-gcr[768]: ERROR: Unable to save docker config: mkdir /root/.docker: read-only file system
Feb 11 06:59:34 datalab-instance systemd[1]: wait-for-startup-script.service: Control process exited, code=exited status=1
Feb 11 06:59:34 datalab-instance systemd[1]: wait-for-startup-script.service: Failed with result 'exit-code'.
Feb 11 06:59:34 datalab-instance systemd[1]: Failed to start Wait for the startup script to setup required directories.
Feb 11 06:59:34 datalab-instance systemd[1]: wait-for-startup-script.service: Consumed 82ms CPU time
Feb 11 06:59:34 datalab-instance systemd[1]: Starting Wait for the startup script to setup required directories...
Feb 11 06:59:34 datalab-instance docker-credential-gcr[792]: ERROR: Unable to save docker config: mkdir /root/.docker: read-only file system
Feb 11 06:59:34 datalab-instance systemd[1]: wait-for-startup-script.service: Control process exited, code=exited status=1
Feb 11 06:59:34 datalab-instance systemd[1]: wait-for-startup-script.service: Failed with result 'exit-code'.
Feb 11 06:59:34 datalab-instance systemd[1]: Failed to start Wait for the startup script to setup required directories.
Feb 11 06:59:34 datalab-instance systemd[1]: wait-for-startup-script.service: Consumed 94ms CPU time

The log confirms that an error occurred in docker-credential-gcr. I don't understand what it means for the startup script, but I hope it helps.
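
To make the repeated failure easier to spot in a long journal, the error lines can be collapsed into distinct messages with a count. This is a rough sketch; the function name journal_errors is hypothetical, and it assumes the messages of interest contain the literal text "ERROR: " as in the log above.

```shell
# journal_errors: reads journal text on stdin and prints each distinct
# "ERROR: ..." message once, prefixed with the number of occurrences.
# Intended input (run on the VM):
#   sudo journalctl -u wait-for-startup-script.service --no-pager
journal_errors() {
  sed -n 's/.*\(ERROR: .*\)$/\1/p' | sort | uniq -c | sed 's/^ *//'
}
```

Applied to the excerpt above, this reduces the log to a single message reported twice: docker-credential-gcr could not create /root/.docker because the file system is read-only.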

I will continue to investigate.

chanyou0311 commented 4 years ago

This may be related to this pull request: https://github.com/googledatalab/datalab/pull/2147