lithops-cloud / lithops

A multi-cloud framework for big data analytics and embarrassingly parallel jobs that provides a universal API for building parallel applications in the cloud ☁️🚀
http://lithops.cloud
Apache License 2.0

Standalone stopped working (single VM) #516

Closed JosepSampe closed 3 years ago

JosepSampe commented 3 years ago

Standalone mode stopped working when using a single VM. The first time you use a VM (or after running lithops clean), Lithops tries to extract the runtime metadata, but the VM is never started, so execution hangs forever at that point:

2021-01-08 16:29:07,860 [INFO] lithops.config -- Lithops v2.2.15.dev0
2021-01-08 16:29:07,860 [DEBUG] lithops.config -- Loading configuration
2021-01-08 16:29:07,881 [DEBUG] lithops.config -- Loading Standalone backend module: ibm_vpc
2021-01-08 16:29:07,981 [DEBUG] lithops.config -- Loading Storage backend module: ibm_cos
2021-01-08 16:29:07,982 [DEBUG] lithops.storage.backends.ibm_cos.ibm_cos -- Creating IBM COS client
2021-01-08 16:29:07,982 [DEBUG] lithops.storage.backends.ibm_cos.ibm_cos -- Set IBM COS Endpoint to https://s3.us-east.cloud-object-storage.appdomain.cloud
2021-01-08 16:29:07,982 [DEBUG] lithops.storage.backends.ibm_cos.ibm_cos -- Using access_key and secret_key
2021-01-08 16:29:08,005 [INFO] lithops.storage.backends.ibm_cos.ibm_cos -- IBM COS Storage client created - Region: us-east
2021-01-08 16:29:08,005 [DEBUG] lithops.standalone.backends.ibm_vpc.ibm_vpc -- Creating IBM VPC client
2021-01-08 16:29:08,052 [INFO] lithops.standalone.backends.ibm_vpc.ibm_vpc -- IBM VPC client created - Region: us-east - Host: 52.117.125.137
2021-01-08 16:29:08,052 [DEBUG] lithops.standalone.standalone -- Standalone handler created successfully
2021-01-08 16:29:08,052 [DEBUG] lithops.invokers -- ExecutorID 330909-0 - Total available workers: None
2021-01-08 16:29:08,052 [INFO] lithops.executors -- Standalone Executor created with ID: 330909-0
2021-01-08 16:29:08,052 [INFO] lithops.invokers -- ExecutorID 330909-0 | JobID M000 - Selected Runtime: python3.8 
2021-01-08 16:29:08,052 [DEBUG] lithops.storage.storage -- Runtime metadata not found in local cache. Retrieving it from storage
2021-01-08 16:29:08,052 [DEBUG] lithops.storage.storage -- Trying to download runtime metadata from: ibm_cos://lithops-data-us-east/lithops.runtimes/2.2.15.dev0/python3.8.meta.json
2021-01-08 16:29:08,829 [DEBUG] lithops.storage.storage -- Runtime metadata not found in storage
2021-01-08 16:29:08,829 [INFO] lithops.invokers -- Runtime python3.8 is not yet installed
2021-01-08 16:29:08,830 [DEBUG] lithops.standalone.standalone -- Extracting runtime metadata information

gilv commented 3 years ago

@JosepSampe hm...checking

gilv commented 3 years ago

@JosepSampe it works for me... Were you using exec_mode: create, with Lithops creating the VM, or did you try it with an existing VM you had before?

JosepSampe commented 3 years ago

I have an already created VM with a public IP address. I wanted to use it the way I did 1 month ago, automatically starting/stopping the VM when needed. It seems standalone for IBM VPC now has 2 different modes. In my case, I followed these instructions: https://github.com/lithops-cloud/lithops/blob/master/config/compute/ibm_vpc.md#lithops-in-a-standalone-mode , and the VM is not started.

gilv commented 3 years ago

@JosepSampe let me check this

gilv commented 3 years ago

@JosepSampe done

JosepSampe commented 3 years ago

@gilv It seems it now starts the VM, but the proxy is not installed. I think a call to the setup_proxy method is missing after starting the VM.

gilv commented 3 years ago

@JosepSampe right! sorry...i wonder how it then worked for me... fixing it

gilv commented 3 years ago

@JosepSampe I just pushed a commit with the fix

JosepSampe commented 3 years ago

@gilv thanks, testing it

JosepSampe commented 3 years ago

I'm now getting this exception:

  File "/home/josep/data/dev-workspace/lithops/lithops/lithops/standalone/standalone.py", line 263, in create_runtime
    self._setup_proxy(backend)
  File "/home/josep/data/dev-workspace/lithops/lithops/lithops/standalone/standalone.py", line 356, in _setup_proxy
    ssh_client.upload_data_to_file(ip_address, PROXY_SERVICE_FILE, service_file)
  File "/home/josep/data/dev-workspace/lithops/lithops/lithops/util/ssh_client.py", line 57, in upload_data_to_file
    ftp_client = self.ssh_client.open_sftp()
  File "/usr/lib/python3/dist-packages/paramiko/client.py", line 556, in open_sftp
    return self._transport.open_sftp_client()
AttributeError: 'NoneType' object has no attribute 'open_sftp_client'

gilv commented 3 years ago

@JosepSampe checking..

gilv commented 3 years ago

@JosepSampe I'm actually not sure what happened..

    def upload_data_to_file(self, ip_address, data, remote_dst, timeout=None):
        if self.ssh_client is None:
            self.ssh_client = self.create_client(ip_address, timeout)

        ftp_client = self.ssh_client.open_sftp()

        with ftp_client.open(remote_dst, 'w') as f:
            f.write(data)

        ftp_client.close()

and self.ssh_client is initialized to an SSHClient (<paramiko.client.SSHClient object at 0x10b6879a0>), but then I get this exception:

  ssh_client.upload_data_to_file(ip_address, PROXY_SERVICE_FILE, service_file)
  File "/Users/gilv/.pyenv/versions/3.8.5/lib/python3.8/site-packages/lithops/util/ssh_client.py", line 57, in upload_data_to_file
    ftp_client = self.ssh_client.open_sftp()
  File "/Users/gilv/.pyenv/versions/3.8.5/lib/python3.8/site-packages/paramiko/client.py", line 556, in open_sftp
    return self._transport.open_sftp_client()
AttributeError: 'NoneType' object has no attribute 'open_sftp_client'

JosepSampe commented 3 years ago

I now see that you call backend.start here: https://github.com/lithops-cloud/lithops/blob/master/lithops/standalone/standalone.py#L262 and I think it should be self._start_backend(backend) instead.

gilv commented 3 years ago

@JosepSampe ohhh... it's because of that strange delay: the VM is started, but it's not possible to ssh into it right away, so we need to check that it has actually started, etc. Fixing it.. thanks :)
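
For context on the traceback above: a paramiko SSHClient holds no transport until connect() succeeds, so calling open_sftp() on an unconnected client fails with exactly this AttributeError. A minimal sketch of a guard that turns it into a clearer error follows. The helper names has_live_transport and open_sftp_checked are hypothetical and not part of Lithops or paramiko; only get_transport(), is_active() and open_sftp() are paramiko calls.

```python
def has_live_transport(ssh_client):
    """Return True if the client currently holds an active transport.

    Accepts any object exposing paramiko's SSHClient.get_transport().
    """
    transport = ssh_client.get_transport()
    return transport is not None and transport.is_active()


def open_sftp_checked(ssh_client):
    """Open an SFTP session, raising a readable error if not connected.

    Hypothetical helper for illustration only.
    """
    if not has_live_transport(ssh_client):
        raise ConnectionError(
            "SSH client is not connected; the VM may still be booting"
        )
    return ssh_client.open_sftp()
```

A guard like this does not fix the underlying timing problem; it only makes the failure easier to read when the VM is not yet reachable.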

gilv commented 3 years ago

@JosepSampe check it now :) I just pushed the fix

JosepSampe commented 3 years ago

@gilv Thanks. I will test it

JosepSampe commented 3 years ago

Seems I still have the same issue.

I finally dug into the code and found that the current _wait_backend_ready() method is not reliable. It relies on backend.is_ready() to detect whether the VM instance is running. However, the IBM VPC lib reports the VM as running 20, 30 or even 40 seconds before you can actually run commands on it. A "running" status from the ibm-vpc lib does not mean that the operating system is loaded and ready to receive commands. I can see that you put a hardcoded sleep of 20 seconds here to overcome this issue, but most of the time this sleep is not enough, causing all ssh commands to fail during the proxy installation process.

I updated the method to detect the exact time the VM instance is ready to receive commands, so we no longer need hardcoded sleeps.

With this PR everything is working now for me. Check if it works for you too. #523
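
To illustrate the general idea of polling for readiness instead of sleeping for a fixed time, here is a minimal sketch using only the standard library. It is not the code from the PR: the function name wait_ssh_ready, the default timeout and the polling interval are all assumptions made for this example. It simply retries a TCP connection to the SSH port until it succeeds or a deadline passes.

```python
import socket
import time


def wait_ssh_ready(ip_address, port=22, timeout=300, interval=2):
    """Poll until a TCP connection to the SSH port succeeds.

    Returns True once the port accepts a connection, or False if the
    timeout (in seconds) expires first. Hypothetical helper, not the
    Lithops implementation.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            # A successful connect means something is listening on the port.
            with socket.create_connection((ip_address, port), timeout=interval):
                return True
        except OSError:
            # Refused or timed out: the OS is probably still booting.
            time.sleep(interval)
    return False
```

An open port only shows that something is listening, so a stricter check could go one step further and run a trivial command over SSH before treating the VM as ready.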