sergii-mamedov closed this issue 2 years ago
I ran some tests with Lithops from scratch and for now I'm unable to replicate the issue.
TimeoutError: Readiness probe expired on VM instance metaspace-staging-vpc-128 (161.156.169.205)
Initially, lithops tries to create an ssh connection to the VM, and then it proceeds to install everything it needs on it. This message means lithops is unable to ssh into the VM; I think the code raises the exception on this line, so it is normal that the service is never installed there.
Could you verify that the VM is started and that the ssh key is correct?
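For context, the readiness wait described above is essentially a polling loop with a deadline. A minimal sketch of the idea (illustrative names, not the actual lithops code):

```python
import time

def wait_ready(is_ssh_reachable, timeout=300, interval=5):
    """Poll a callable until the VM answers SSH, or raise TimeoutError,
    mirroring the 'Readiness probe expired' behaviour described above.
    Sketch only; not the actual lithops implementation."""
    start = time.monotonic()
    while time.monotonic() - start < timeout:
        if is_ssh_reachable():
            return
        time.sleep(interval)
    raise TimeoutError("Readiness probe expired on VM instance")
```

So the "Readiness probe expired" message simply means the probe never succeeded within the timeout, which is consistent with SSH never becoming reachable.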
We recently added new logic that verifies the ssh keys in IBM VPC are correct: https://github.com/lithops-cloud/lithops/blob/master/lithops/standalone/backends/ibm_vpc/ibm_vpc.py#L559-L577. Might it be failing there?
Could you run a simple test by changing this line and adding `self.backend.master.wait_ready(verbose=True)`?
Also, in your logs lithops seems to be set to INFO logging; could you set it to DEBUG?
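For readers following along: a keypair check like the one linked above can be done by deriving the public key from the private key with `ssh-keygen -y` and comparing only the base64 body (the second field), since the trailing comment may differ. A hedged sketch of that approach (not the lithops source; names are illustrative):

```python
import subprocess

def key_body(public_key: str) -> str:
    """Return the base64 body (second field) of an OpenSSH public key,
    ignoring the 'ssh-rsa' type prefix and any trailing comment/email."""
    return public_key.split()[1]

def keys_are_a_pair(private_key_path: str, vpc_public_key: str) -> bool:
    """Derive the public key from the private key file and compare its
    body against the public key registered in IBM VPC. Illustrative
    sketch of the approach discussed in this thread, not lithops code."""
    derived = subprocess.getoutput(f"ssh-keygen -y -f {private_key_path}")
    return key_body(derived) == key_body(vpc_public_key)
```

Comparing only the body is what makes the check tolerant of a differing comment after the key material, a detail that comes up later in this thread.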
We did not change the ssh key, and I verified that I can connect manually via ssh. I enabled `verbose=True` and DEBUG-level logging. How else can I figure out what the problem is?
2021-11-25 11:17:15,216 - INFO - lithops.config[Thread-1] - config.py:131 - Lithops v2.5.7
2021-11-25 11:17:15,216 - DEBUG - lithops.config[Thread-1] - config.py:212 - Loading Standalone backend module: ibm_vpc
2021-11-25 11:17:15,216 - DEBUG - lithops.config[Thread-1] - config.py:254 - Loading Storage backend module: ibm_cos
2021-11-25 11:17:15,216 - DEBUG - lithops.storage.backends.ibm_cos.ibm_cos[Thread-1] - ibm_cos.py:37 - Creating IBM COS client
2021-11-25 11:17:15,217 - DEBUG - lithops.storage.backends.ibm_cos.ibm_cos[Thread-1] - ibm_cos.py:57 - Set IBM COS Endpoint to https://s3.eu-de.cloud-object-storage.appdomain.cloud
2021-11-25 11:17:15,217 - DEBUG - lithops.storage.backends.ibm_cos.ibm_cos[Thread-1] - ibm_cos.py:60 - Using access_key and secret_key
2021-11-25 11:17:15,222 - INFO - lithops.storage.backends.ibm_cos.ibm_cos[Thread-1] - ibm_cos.py:97 - IBM COS client created - Region: eu-de
2021-11-25 11:17:15,222 - DEBUG - lithops.standalone.backends.ibm_vpc.ibm_vpc[Thread-1] - ibm_vpc.py:46 - Creating IBM VPC client
2021-11-25 11:17:15,222 - DEBUG - lithops.standalone.backends.ibm_vpc.ibm_vpc[Thread-1] - ibm_vpc.py:56 - Setting VPC endpoint to: https://eu-de.iaas.cloud.ibm.com
2021-11-25 11:17:15,223 - INFO - lithops.standalone.backends.ibm_vpc.ibm_vpc[Thread-1] - ibm_vpc.py:75 - IBM VPC client created - Region: eu-de
2021-11-25 11:17:15,224 - DEBUG - lithops.standalone.standalone[Thread-1] - standalone.py:60 - Standalone handler created successfully
2021-11-25 11:17:15,224 - DEBUG - lithops.invokers[Thread-1] - invokers.py:92 - ExecutorID aecfec-3 - Invoker initialized. Max workers: 1
2021-11-25 11:17:15,225 - DEBUG - lithops.standalone.backends.ibm_vpc.ibm_vpc[Thread-1] - ibm_vpc.py:250 - Initializing IBM VPC backend (consume mode)
2021-11-25 11:17:15,225 - DEBUG - lithops.executors[Thread-1] - executors.py:162 - Function executor for ibm_vpc created with ID: aecfec-3
2021-11-25 11:17:15,225 - DEBUG - lithops.config[Thread-1] - config.py:254 - Loading Storage backend module: ibm_cos
2021-11-25 11:17:15,225 - DEBUG - lithops.storage.backends.ibm_cos.ibm_cos[Thread-1] - ibm_cos.py:37 - Creating IBM COS client
2021-11-25 11:17:15,225 - DEBUG - lithops.storage.backends.ibm_cos.ibm_cos[Thread-1] - ibm_cos.py:57 - Set IBM COS Endpoint to https://s3.eu-de.cloud-object-storage.appdomain.cloud
2021-11-25 11:17:15,225 - DEBUG - lithops.storage.backends.ibm_cos.ibm_cos[Thread-1] - ibm_cos.py:60 - Using access_key and secret_key
2021-11-25 11:17:15,229 - INFO - lithops.storage.backends.ibm_cos.ibm_cos[Thread-1] - ibm_cos.py:97 - IBM COS client created - Region: eu-de
2021-11-25 11:17:15,230 - DEBUG - lithops.config[Thread-1] - config.py:254 - Loading Storage backend module: ibm_cos
2021-11-25 11:17:15,230 - DEBUG - lithops.storage.backends.ibm_cos.ibm_cos[Thread-1] - ibm_cos.py:37 - Creating IBM COS client
2021-11-25 11:17:15,230 - DEBUG - lithops.storage.backends.ibm_cos.ibm_cos[Thread-1] - ibm_cos.py:57 - Set IBM COS Endpoint to https://s3.eu-de.cloud-object-storage.appdomain.cloud
2021-11-25 11:17:15,230 - DEBUG - lithops.storage.backends.ibm_cos.ibm_cos[Thread-1] - ibm_cos.py:60 - Using access_key and secret_key
2021-11-25 11:17:15,233 - INFO - lithops.storage.backends.ibm_cos.ibm_cos[Thread-1] - ibm_cos.py:97 - IBM COS client created - Region: eu-de
2021-11-25 11:17:15,968 - INFO - engine[Thread-1] - job.py:67 - Storing job metadata
2021-11-25 11:17:16,583 - INFO - engine.lithops-wrapper[Thread-1] - executor.py:205 - executor.map(_load_ds, 1 items, 32768MB, attempt 1)
2021-11-25 11:17:16,584 - DEBUG - engine.lithops-wrapper[Thread-1] - executor.py:331 - Selected executor ibm_vpc
2021-11-25 11:17:16,590 - INFO - lithops.invokers[Thread-1-ex] - invokers.py:112 - ExecutorID aecfec-3 | JobID M000 - Selected Runtime: metaspace2020/metaspace-lithops:1.9.3
2021-11-25 11:17:16,591 - DEBUG - lithops.storage.storage[Thread-1-ex] - storage.py:397 - Runtime metadata not found in local cache. Retrieving it from storage
2021-11-25 11:17:16,591 - DEBUG - lithops.storage.storage[Thread-1-ex] - storage.py:400 - Trying to download runtime metadata from: ibm_cos://metaspace-env-staging-temp/lithops.runtimes/2.5.7/ibm_vpc/02c7_ad2243d5-20c6-4d41-bc24-cf0e316dea32/metaspace2020-metaspace-lithops-1.9.3.meta.json
2021-11-25 11:17:16,809 - DEBUG - lithops.storage.storage[Thread-1-ex] - storage.py:419 - Runtime metadata not found in storage
2021-11-25 11:17:16,809 - INFO - lithops.invokers[Thread-1-ex] - invokers.py:120 - Runtime metaspace2020/metaspace-lithops:1.9.3 is not yet installed
2021-11-25 11:17:16,809 - DEBUG - lithops.standalone.standalone[Thread-1-ex] - standalone.py:337 - Checking if VM instance metaspace-staging-vpc-128 (161.156.169.205) is ready
2021-11-25 11:17:18,209 - DEBUG - lithops.standalone.backends.ibm_vpc.ibm_vpc[Thread-1-ex] - ibm_vpc.py:773 - Starting VM instance metaspace-staging-vpc-128
2021-11-25 11:17:19,440 - DEBUG - lithops.standalone.backends.ibm_vpc.ibm_vpc[Thread-1-ex] - ibm_vpc.py:783 - VM instance metaspace-staging-vpc-128 started successfully
2021-11-25 11:17:19,440 - DEBUG - lithops.standalone.backends.ibm_vpc.ibm_vpc[Thread-1-ex] - ibm_vpc.py:614 - Waiting VM instance metaspace-staging-vpc-128 (161.156.169.205) to become ready
Does any other line appear after this last one?
Try removing this whole block: https://github.com/lithops-cloud/lithops/blob/master/lithops/standalone/backends/ibm_vpc/ibm_vpc.py#L559-L577
Yes, this is the last line from lithops. Next comes an exception from our code, as in the example above.
> Try removing this whole block: https://github.com/lithops-cloud/lithops/blob/master/lithops/standalone/backends/ibm_vpc/ibm_vpc.py#L559-L577
In this case everything is fine:
2021-11-25 11:34:29,889 - INFO - engine[Thread-1] - job.py:67 - Storing job metadata
2021-11-25 11:34:30,574 - INFO - engine.lithops-wrapper[Thread-1] - executor.py:205 - executor.map(_load_ds, 1 items, 32768MB, attempt 1)
2021-11-25 11:34:30,574 - DEBUG - engine.lithops-wrapper[Thread-1] - executor.py:331 - Selected executor ibm_vpc
2021-11-25 11:34:30,581 - INFO - lithops.invokers[Thread-1-ex] - invokers.py:112 - ExecutorID aad558-1 | JobID M000 - Selected Runtime: metaspace2020/metaspace-lithops:1.9.3
2021-11-25 11:34:30,581 - DEBUG - lithops.storage.storage[Thread-1-ex] - storage.py:397 - Runtime metadata not found in local cache. Retrieving it from storage
2021-11-25 11:34:30,581 - DEBUG - lithops.storage.storage[Thread-1-ex] - storage.py:400 - Trying to download runtime metadata from: ibm_cos://metaspace-env-staging-temp/lithops.runtimes/2.5.7/ibm_vpc/02c7_ad2243d5-20c6-4d41-bc24-cf0e316dea32/metaspace2020-metaspace-lithops-1.9.3.meta.json
2021-11-25 11:34:30,876 - DEBUG - lithops.storage.storage[Thread-1-ex] - storage.py:419 - Runtime metadata not found in storage
2021-11-25 11:34:30,876 - INFO - lithops.invokers[Thread-1-ex] - invokers.py:120 - Runtime metaspace2020/metaspace-lithops:1.9.3 is not yet installed
2021-11-25 11:34:30,876 - DEBUG - lithops.standalone.standalone[Thread-1-ex] - standalone.py:337 - Checking if VM instance metaspace-staging-vpc-128 (161.156.169.205) is ready
2021-11-25 11:34:31,163 - DEBUG - lithops.util.ssh_client[Thread-1-ex] - ssh_client.py:37 - 161.156.169.205 ssh client created
2021-11-25 11:34:31,632 - INFO - lithops.standalone.standalone[Thread-1-ex] - standalone.py:414 - Installing Lithops in VM instance metaspace-staging-vpc-128 (161.156.169.205)
2021-11-25 11:34:31,633 - DEBUG - lithops.utils[Thread-1-ex] - utils.py:227 - Creating function handler zip in /opt/dev/metaspace/metaspace/engine/lithops_standalone.zip
2021-11-25 11:34:31,682 - DEBUG - lithops.standalone.standalone[Thread-1-ex] - standalone.py:421 - Uploading lithops files to VM instance metaspace-staging-vpc-128 (161.156.169.205)
2021-11-25 11:34:32,075 - DEBUG - lithops.standalone.standalone[Thread-1-ex] - standalone.py:431 - Executing lithops installation process on VM instance metaspace-staging-vpc-128 (161.156.169.205)
2021-11-25 11:34:32,075 - DEBUG - lithops.standalone.standalone[Thread-1-ex] - standalone.py:432 - Be patient, initial installation process may take up to 3 minutes
2021-11-25 11:34:34,530 - DEBUG - lithops.standalone.standalone[Thread-1-ex] - standalone.py:101 - Validating lithops version installed on master matches 2.5.7
2021-11-25 11:34:34,643 - DEBUG - lithops.standalone.standalone[Thread-1-ex] - standalone.py:123 - Validating lithops lithops master service is running on VM instance metaspace-staging-vpc-128 (161.156.169.205)
2021-11-25 11:34:34,807 - INFO - lithops.standalone.standalone[Thread-1-ex] - standalone.py:140 - Waiting Lithops service to become ready on VM instance metaspace-staging-vpc-128 (161.156.169.205)
2021-11-25 11:34:34,981 - DEBUG - lithops.standalone.standalone[Thread-1-ex] - standalone.py:146 - VM instance metaspace-staging-vpc-128 (161.156.169.205) ready in 0.17 seconds
2021-11-25 11:34:34,981 - DEBUG - lithops.standalone.standalone[Thread-1-ex] - standalone.py:345 - Extracting runtime metadata information
2021-11-25 11:34:50,538 - DEBUG - lithops.storage.storage[Thread-1-ex] - storage.py:430 - Uploading runtime metadata to: ibm_cos://metaspace-env-staging-temp/lithops.runtimes/2.5.7/ibm_vpc/02c7_ad2243d5-20c6-4d41-bc24-cf0e316dea32/metaspace2020-metaspace-lithops-1.9.3.meta.json
2021-11-25 11:34:50,903 - DEBUG - lithops.storage.backends.ibm_cos.ibm_cos[Thread-1-ex] - ibm_cos.py:121 - PUT Object lithops.runtimes/2.5.7/ibm_vpc/02c7_ad2243d5-20c6-4d41-bc24-cf0e316dea32/metaspace2020-metaspace-lithops-1.9.3.meta.json - Size: 6.6KiB - OK
2021-11-25 11:34:50,904 - DEBUG - lithops.storage.storage[Thread-1-ex] - storage.py:436 - Storing runtime metadata into local cache: /home/ubuntu/.lithops/cache/lithops.runtimes/2.5.7/ibm_vpc/02c7_ad2243d5-20c6-4d41-bc24-cf0e316dea32/metaspace2020-metaspace-lithops-1.9.3.meta.json
@kpavel @JosepSampe do you know what is happening?
@gilv This logic is wrong and raises an exception even when the ssh keys are correct. It was not added by me, so I'm not sure why it fails. For now I have disabled it in #860 until we find an alternative.
@sergii-mamedov What operating system do you have? Can you please check if you have ssh-keygen available there?
@kpavel
```shell
# Ubuntu 16.04.6 LTS
$ uname -a
Linux 4.4.0-1128-aws #142-Ubuntu SMP Fri Apr 16 12:42:33 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
$ which ssh-keygen
/usr/bin/ssh-keygen
```
FYI: we are using a non-default ssh key: https://github.com/metaspace2020/metaspace/blob/master/metaspace/engine/conf/config.json.template#L86-L90
@sergii-mamedov The code @JosepSampe removed is very important. It was added to solve the issue where the user has a wrong ssh keypair configuration, validating that the private and public keys are a real pair and, if not, raising an error for the user with a self-explanatory message. If it is failing in that section I would really prefer to solve it, but I may need your help, as I can't reproduce it in my environment.
If you run `ssh-keygen -y -f /home/ubuntu/.ssh/ibm_cloud_vpc`, do you get output identical to your public key?
> If you run `ssh-keygen -y -f /home/ubuntu/.ssh/ibm_cloud_vpc`, do you get output identical to your public key?
Yep, except for the comment after the `==` at the end of the line (in your case).
@sergii-mamedov Ok... I think I may have found why the original error wasn't raised. Can I ask you to modify the original release code, which has the failing validation section removed by Josep? Just add those 2 lines back so the original exception is raised with its original error message:
Just as a suggestion: Lithops already shows this message when the private key is not valid:
2021-11-25 23:59:35,670 [DEBUG] lithops.standalone.backends.aws_ec2.aws_ec2 -- SSH to 172.31.25.190 failed (publickey): Authentication failed.
I wonder: instead of relying on platform-specific logic to verify the ssh key, which has much more overhead, wouldn't this be enough, and compatible with all backends?
@kpavel
lithops.standalone.standalone.LithopsValidationError: Private ssh key /home/ubuntu/.ssh/ibm_cloud_vpc and public key first-id-rsa on master VM instance metaspace-staging-vpc-128 (161.156.169.205) are not a pair
We have a few public keys on VPC.
UPD: I looked at the VPC overview on the website; we have the following order of ssh keys:
Provisioned SSH keys
* first-id-rsa
* second-id-rsa
* ibm-cloud-vpc-staging
I think this is the problem.
@sergii-mamedov Is it possible you have multiple private ssh keys in the /home/ubuntu/.ssh dir? If so, maybe this one (/home/ubuntu/.ssh/ibm_cloud_vpc) is not valid, and another one in that dir is the valid one.
UPD: If you try to log in to the VM with `ssh -i /home/ubuntu/.ssh/ibm_cloud_vpc -o IdentitiesOnly=yes -F /dev/null root@161.156.169.205`, does it work?
> ssh -i /home/ubuntu/.ssh/ibm_cloud_vpc -o IdentitiesOnly=yes -F /dev/null root@161.156.169.205

Yes, it works.
On VPC we have 3 public keys. On our machine, where lithops starts, we have two private keys: ibm_cloud_vpc and another one.
@sergii-mamedov
If you run `ssh-keygen -y -f /home/ubuntu/.ssh/ibm_cloud_vpc | cut -d' ' -f 2`, do you get your ssh public key as output?
@kpavel yep
@sergii-mamedov Can I ask you to modify one line of code so that the thrown exception also contains the contents of the compared public keys?
```python
raise LithopsValidationError(
    f"Private ssh key {key_filename} and public key "
    f"{key_name} on master {self} are not a pair because {public_res} not equal to {private_res}. "
    f"the public is from {self.ibm_vpc_client.get_key(key_id).get_result()['public_key']}")
```
I truncated the ssh keys:
lithops.standalone.standalone.LithopsValidationError: Private ssh key /home/ubuntu/.ssh/ibm_cloud_vpc and public key first-id-rsa on master VM instance metaspace-staging-vpc-128 (161.156.169.205) are not a pair because AAAAB3Nza......Ni37PW8HrtIw== not equal to /bin/sh: 1: ssh-keygen: not found
/bin/sh: 1: cut: not found. the public is from ssh-rsa AAAAB3Nz...Ni37PW8HrtIw== email@site.com
It looks like lithops tries the private key with the first public key on the master VM and immediately gets an exception.
Something strange is going on with ssh-keygen, because:
```shell
$ which ssh-keygen
/usr/bin/ssh-keygen
```
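For reference, validating against every provisioned key rather than only the first one (the direction later taken in #867) could be sketched as follows. This is a hypothetical sketch with illustrative names, not the actual lithops change:

```python
def matches_any_provisioned_key(derived_public_key, provisioned_keys):
    """Compare the public key derived from the local private key against
    EVERY public key provisioned on the VM, not only the first one.
    Only the base64 body (second field) is compared, since comments
    and type prefixes may differ. Hypothetical sketch."""
    derived_body = derived_public_key.split()[1]
    return any(pub.split()[1] == derived_body for pub in provisioned_keys)
```

With multiple provisioned keys, a first-key-only comparison fails whenever the matching key is not listed first, which is consistent with the key order shown above.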
@sergii-mamedov
I don't think VPC allows you to have more than a single public key per VM instance. But the issue I see is that both ssh-keygen and cut are missing from the /bin/sh shell.
Can you run sh to get an sh shell, and then try to run cut or ssh-keygen from that shell?
```shell
ubuntu@ip-172-31-35-172:~$ which sh
/bin/sh
ubuntu@ip-172-31-35-172:~$ /bin/sh
$ ssh-keygen
Generating public/private rsa key pair.
Enter file in which to save the key (/home/ubuntu/.ssh/id_rsa): ^C
$ cut
cut: you must specify a list of bytes, characters, or fields
Try 'cut --help' for more information.
$
```
I tried it on the machine where we launch lithops:
```
$ ipython
Python 3.8.6 | packaged by conda-forge | (default, Nov 27 2020, 19:31:52)
Type 'copyright', 'credits' or 'license' for more information
IPython 7.19.0 -- An enhanced Interactive Python. Type '?' for help.

In [1]: import subprocess

In [3]: subprocess.getoutput([f'ssh-keygen -y -f /home/ubuntu/.ssh/ibm_cloud_vpc | cut -d " " -f 2'])
Out[3]: 'AAAAB3NzaC1yc2EAAAADAQABAAACAQChj/kuF457SkwpU6ZE/cRInuS0LheGPctRt4Esa5PZ9PXHfdpVO0eYrtS3r9s875Bg6xAn94xqfBmyynCmAOjdjzFmz7p04kOhqMHAwn58812VLJyOnG7RB1on5GnahUqNhHs1RaE8EhXvfJXFqfrhc7a2QbftXxgfXz/Yk3Wrr1tw/d5vhC0RYCl8qKyjABJKgkmX82K/Fr9Uer1v/6XUnOTaAo2WN/ElQRRp916ej3fGCuFXYfKeUNrBl0C1bsQ1BAtkLRvbeA6MpYeUeksW/fdfOKY7RPJ0OGeqpaFYUWz+qWQvOk33Hqhu6L5TRVm1lSV9+2iZAhM3eEDLUvbK+yfNCMhJ2o9K/0gX3yVwnmP4rmXjN15YudZCego/ZuBcwznaXRTSbSFKzD/fah65+YETxxNn99mf2m1USX4frsGO2hWrW4OtLlh5DlXjWnBikB0Stwe2bRjlEVyYyRyzBlOrmu3UWyNsWOjDs5mMyOgmS++Q/9LAdXO/XBuJRZ5K/5MIFaBgn4gmdTmV1GLL4BSfX9HC/fByQiwx0RIU5CD3igmaYcJwm88g0wYmghqTLUJbDOdARvS2EhTgKE3tnHCYpSFXZ3iXBQGgsuI1EU21bSp0BwkZG8E+VILpKzp0igC4996+Zbln9rykUjVmCt7oa/zl1I6tDIR3/Zv73Q=='
```
@sergii-mamedov I have another idea. I suspect that something could be overriding the PATH at some point. Can you please try specifying full paths to the binaries?
Please replace the line below with: `private_res = subprocess.getoutput([f"/usr/bin/ssh-keygen -y -f {key_filename} | /usr/bin/cut -d' ' -f 2"])`
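Another way to sidestep the PATH problem is to avoid the shell pipeline entirely: run ssh-keygen without a shell, pass an explicit environment, and extract the key body in Python instead of piping through cut. A sketch of that alternative (not the actual lithops change):

```python
import subprocess

def derive_public_key(key_filename: str) -> str:
    """Run ssh-keygen without a shell and with an explicit PATH, so the
    binary is found even if the parent process runs with a stripped-down
    environment. Extracting the second field in Python also removes the
    dependency on `cut`. Sketch only; the real lithops fix may differ."""
    result = subprocess.run(
        ["ssh-keygen", "-y", "-f", key_filename],
        capture_output=True, text=True,
        env={"PATH": "/usr/bin:/bin:/usr/local/bin"},
    )
    if result.returncode != 0:
        raise RuntimeError(result.stderr.strip())
    # "ssh-rsa AAAA... comment" -> return only the base64 body
    return result.stdout.split()[1]
```

Because `subprocess.run` with a list argument does not invoke `/bin/sh` at all, the "ssh-keygen: not found" failure mode seen above cannot occur this way.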
@sergii-mamedov The issue should be resolved by PR #865
@kpavel I tried the master branch. I see the same problem. I tried deleting all public keys (~/.ssh/authorized_keys) except the one used by lithops; same error. Apparently, lithops takes the public keys not from this file, but directly from the VM metadata.
I will try to recreate the VM with one key, then add the other two to the ~/.ssh/authorized_key file and test this hypothesis. Will that help you?
> I will try to recreate the VM with one key, then add the other two to the ~/.ssh/authorized_key file and test this hypothesis.
In this case, everything works without problems. Apparently the problem is on the side of IBM. @kpavel
@sergii-mamedov Thanks for the update. Working on a fix.
@sergii-mamedov Added support for multiple public keys validation in #867
@sergii-mamedov Should be fixed in master now
@kpavel It is stuck here:
2021-11-30 13:16:20,646 - DEBUG - lithops.standalone.backends.ibm_vpc.ibm_vpc[Thread-1-ex] - ibm_vpc.py:619 - SSH to None failed (publickey): 'key_id'
2021-11-30 13:16:25,907 - DEBUG - lithops.standalone.backends.ibm_vpc.ibm_vpc[Thread-1-ex] - ibm_vpc.py:619 - SSH to None failed (publickey): 'key_id'
2021-11-30 13:16:31,076 - DEBUG - lithops.standalone.backends.ibm_vpc.ibm_vpc[Thread-1-ex] - ibm_vpc.py:619 - SSH to None failed (publickey): 'key_id'
I think I will recreate the VM with one key, since we have already lost a lot of time on debugging.
@kpavel Consume mode does not have key_id in the config, since the VM is manually created through the dashboard.
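For reference, the `'key_id'` in the debug message above is a plain KeyError from the config lookup. Tolerating consume mode could be sketched roughly like this (hypothetical names, not the actual fix):

```python
def get_vpc_public_keys(config: dict, vpc_client) -> list:
    """Treat 'key_id' as optional: in consume mode the VM was created
    manually through the dashboard, so the config carries no key_id and
    an unguarded config['key_id'] raises KeyError, surfacing as the
    misleading "failed (publickey): 'key_id'" message above.
    Hypothetical sketch; names are illustrative."""
    key_id = config.get("key_id")  # None in consume mode
    if key_id is None:
        return []  # nothing to validate against; skip the check
    # mirrors the get_key(...) call shown earlier in this thread
    return [vpc_client.get_key(key_id).get_result()["public_key"]]
```

The point is simply that the validation path should not assume a key_id exists when the VM was provisioned outside of lithops.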
@sergii-mamedov Getting that message means it passed the ssh keypair validation. @JosepSampe I will fix it asap.
Submitted new fix https://github.com/lithops-cloud/lithops/pull/868
@sergii-mamedov Revalidated with my instance running in consume mode with multiple ssh keys; it works. @JosepSampe I fixed the non-mandatory ssh key id during validation; the validation actually doesn't need to know the key id.
@kpavel This time everything works without problems. Thanks.
Excellent. Sorry it took so long.
After upgrading to version 2.5.7, I found that the latest version of lithops is not uploaded to the VPC (the old version remains). I tried deleting the VPC to create a new one; in this case, the lithops process does not start on the VPC at all. I tried the previous version (2.5.6) with the same result. The last version we used that didn't have this problem was 2.5.0. It seems that the reason is the recent changes to this function. We are using consume mode for VPC. Stacktrace on our side: @JosepSampe can you look at it?