lithops-cloud / lithops

A multi-cloud framework for big data analytics and embarrassingly parallel jobs that provides a universal API for building parallel applications in the cloud ☁️🚀
http://lithops.cloud
Apache License 2.0
319 stars 105 forks

Workers not able to be created message error - but machines exist #1410

Open RichardScottOZ opened 1 day ago

RichardScottOZ commented 1 day ago

Traceback (most recent call last):
  File "lithopstest.py", line 44, in <module>
    fexec.map(my_map_function, args)
  File "/home/richard/miniconda3/envs/lithops/lib/python3.10/site-packages/lithops/executors.py", line 279, in map
    futures = self.invoker.run_job(job)
  File "/home/richard/miniconda3/envs/lithops/lib/python3.10/site-packages/lithops/invokers.py", line 280, in run_job
    futures = self._run_job(job)
  File "/home/richard/miniconda3/envs/lithops/lib/python3.10/site-packages/lithops/invokers.py", line 222, in _run_job
    raise e
  File "/home/richard/miniconda3/envs/lithops/lib/python3.10/site-packages/lithops/invokers.py", line 219, in _run_job
    self._invoke_job(job)
  File "/home/richard/miniconda3/envs/lithops/lib/python3.10/site-packages/lithops/invokers.py", line 267, in _invoke_job
    activation_id = self.compute_handler.invoke(payload)
  File "/home/richard/miniconda3/envs/lithops/lib/python3.10/site-packages/lithops/standalone/standalone.py", line 286, in invoke
    raise Exception('It was not possible to create any worker')
Exception: It was not possible to create any worker

image

Possibly a public/private check issue?

RichardScottOZ commented 1 day ago

This is just running one of the example files in the repo doing some simple calcs.

RichardScottOZ commented 1 day ago
timed out
2024-11-23 06:49:50,337 [DEBUG] standalone.py:277 -- Found 0 free workers connected to VM instance lithops-master-dadab3
2024-11-23 06:49:50,337 [DEBUG] standalone.py:281 -- Going to create 2 new workers
{'username': 'ubuntu', 'password': 'c0d72593-5e1f-458b-aaad-74ab52be154d', 'key_filename': 'lithops-key-b597095a.aws_ec2.id_rsa'}
{'username': 'ubuntu', 'password': 'c0d72593-5e1f-458b-aaad-74ab52be154d', 'key_filename': 'lithops-key-b597095a.aws_ec2.id_rsa'}
2024-11-23 06:49:50,339 [DEBUG] aws_ec2.py:1223 -- Creating new VM instance lithops-worker-10c28285 (Spot)
2024-11-23 06:49:50,341 [DEBUG] aws_ec2.py:1223 -- Creating new VM instance lithops-worker-74d89509 (Spot)
2024-11-23 06:49:57,315 [DEBUG] standalone.py:256 -- Total worker VM instances created: 0/2
Traceback (most recent call last):
  File "lithopstest.py", line 44, in <module>
    fexec.map(my_map_function, args)
  File "/home/richard/miniconda3/envs/lithops/lib/python3.10/site-packages/lithops/executors.py", line 279, in map
    futures = self.invoker.run_job(job)
  File "/home/richard/miniconda3/envs/lithops/lib/python3.10/site-packages/lithops/invokers.py", line 280, in run_job
    futures = self._run_job(job)
  File "/home/richard/miniconda3/envs/lithops/lib/python3.10/site-packages/lithops/invokers.py", line 222, in _run_job
    raise e
  File "/home/richard/miniconda3/envs/lithops/lib/python3.10/site-packages/lithops/invokers.py", line 219, in _run_job
    self._invoke_job(job)
  File "/home/richard/miniconda3/envs/lithops/lib/python3.10/site-packages/lithops/invokers.py", line 267, in _invoke_job
    activation_id = self.compute_handler.invoke(payload)
  File "/home/richard/miniconda3/envs/lithops/lib/python3.10/site-packages/lithops/standalone/standalone.py", line 286, in invoke
    raise Exception('It was not possible to create any worker')
Exception: It was not possible to create any worker

I checked the SSH connection error and it is 'timed out', so the problem is possibly there, or in an interaction between the Spot API and something else. Two machines were created again. I also have several still alive from last night's tests, so this error is presumably preventing auto shutdown.

RichardScottOZ commented 1 day ago

During some testing yesterday I made this change, which enabled the master VM to start:

def get_ssh_client(self):
    """
    Creates an ssh client against the VM
    """
    # Original condition was: if self.public:
    # Disabled with "1 == 2" so the private IP branch is always taken (Richard 20241122)
    if self.public and 1 == 2:
        if not self.ssh_client or self.ssh_client.ip_address != self.public_ip:
            self.ssh_client = SSHClient(self.public_ip, self.ssh_credentials)
    else:
        if not self.ssh_client or self.ssh_client.ip_address != self.private_ip:
            self.ssh_client = SSHClient(self.private_ip, self.ssh_credentials)

    return self.ssh_client
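
A less hacky way to express the same workaround might be an explicit switch instead of `1 == 2`. This is only a sketch under assumptions: `VMAddress` and `prefer_private` are hypothetical names made up for illustration, not part of Lithops, and the `0.0.0.0` check is just a guard against an address that has not been assigned yet.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class VMAddress:
    """Hypothetical stand-in for the addressing fields of a VM instance object."""
    public: bool                  # whether the instance is meant to have a public IP
    public_ip: Optional[str]
    private_ip: Optional[str]


def select_ssh_host(vm: VMAddress, prefer_private: bool = False) -> str:
    """Return the address an SSH client should connect to.

    prefer_private is a hypothetical switch for clients that sit inside the
    VPC (for example behind a VPN) and can only route to private addresses.
    """
    use_public = vm.public and not prefer_private
    host = vm.public_ip if use_public else vm.private_ip
    # Treat a missing or placeholder address as "not usable yet".
    if not host or host == "0.0.0.0":
        raise ValueError("no usable address for this VM yet")
    return host
```

With `prefer_private=True` the function always returns the private IP, which is what the `1 == 2` edit forces, but the original behaviour stays available by default.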
RichardScottOZ commented 1 day ago

Here it is looking at the public IP, I think?

2024-11-23 07:10:32,657 [DEBUG] ssh_client.py:59 -- 10.1.111.11 ssh client created
2024-11-23 07:10:34,094 [DEBUG] standalone.py:277 -- Found 0 free workers connected to VM instance lithops-master-dadab3 (53.92.198.72)
RichardScottOZ commented 1 day ago

When I changed it from the default (Spot), it got further:

2024-11-23 07:22:02,997 [DEBUG] aws_ec2.py:1275 -- VM instance lithops-worker-f12ca4e5 created successfully
2024-11-23 07:22:03,973 [DEBUG] aws_ec2.py:1275 -- VM instance lithops-worker-d10f6430 created successfully
2024-11-23 07:22:03,973 [DEBUG] standalone.py:259 -- Total worker VM instances created: 2/2
2024-11-23 07:22:03,973 [DEBUG] standalone.py:291 -- ExecutorID 661f55-0 | JobID M000 - Going to run 3 activations in 2 workers
2024-11-23 07:22:09,590 [DEBUG] invokers.py:271 -- ExecutorID 661f55-0 | JobID M000 - Job invoked (11.244s) - Activation ID: 661f55-0-M000
2024-11-23 07:22:09,590 [INFO] invokers.py:225 -- ExecutorID 661f55-0 | JobID M000 - View execution logs at /tmp/lithops-richard/logs/661f55-0-M000.log
2024-11-23 07:22:09,590 [DEBUG] monitor.py:429 -- ExecutorID 661f55-0 - Starting Storage job monitor
2024-11-23 07:22:09,590 [INFO] executors.py:494 -- ExecutorID 661f55-0 - Getting results from 3 function activations
2024-11-23 07:22:09,590 [INFO] wait.py:101 -- ExecutorID 661f55-0 - Waiting for 3 function activations to complete
2024-11-23 07:22:13,440 [DEBUG] monitor.py:144 -- ExecutorID 661f55-0 - Pending: 3 - Running: 0 - Done: 0
2024-11-23 07:22:53,439 [DEBUG] monitor.py:144 -- ExecutorID 661f55-0 - Pending: 3 - Running: 0 - Done: 0
2024-11-23 07:23:34,012 [DEBUG] monitor.py:144 -- ExecutorID 661f55-0 - Pending: 3 - Running: 0 - Done: 0
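
For reference, turning Spot off would look roughly like the config below. This is a sketch from memory of the Lithops docs, not a verified config: the `request_spot_instances` key name is an assumption to check against the installed version, the AMI ID is a placeholder, and the other required `aws_ec2` settings (region, credentials, and so on) are left out.

```python
# Sketch of a Lithops config dict for the aws_ec2 standalone backend.
# Only the keys relevant to this thread are shown; a real config needs more.
config = {
    "lithops": {
        "backend": "aws_ec2",
        "storage": "aws_s3",
        "log_level": "DEBUG",
    },
    "aws_ec2": {
        # Assumed key name: disables Spot requests so workers are on-demand.
        "request_spot_instances": False,
        # Placeholder value: ID of a pre-built image, if one is used.
        "target_ami": "ami-xxxxxxxxxxxxxxxxx",
    },
}

# With Lithops installed, the dict would be passed to the executor like this:
#   import lithops
#   fexec = lithops.FunctionExecutor(config=config)
```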
RichardScottOZ commented 1 day ago

I wouldn't think that example takes very long to run though

2024-11-23 07:22:09,590 [INFO] wait.py:101 -- ExecutorID 661f55-0 - Waiting for 3 function activations to complete
2024-11-23 07:22:13,440 [DEBUG] monitor.py:144 -- ExecutorID 661f55-0 - Pending: 3 - Running: 0 - Done: 0
2024-11-23 07:22:53,439 [DEBUG] monitor.py:144 -- ExecutorID 661f55-0 - Pending: 3 - Running: 0 - Done: 0
2024-11-23 07:23:34,012 [DEBUG] monitor.py:144 -- ExecutorID 661f55-0 - Pending: 3 - Running: 0 - Done: 0
2024-11-23 07:24:14,118 [DEBUG] monitor.py:144 -- ExecutorID 661f55-0 - Pending: 3 - Running: 0 - Done: 0
2024-11-23 07:24:54,269 [DEBUG] monitor.py:144 -- ExecutorID 661f55-0 - Pending: 3 - Running: 0 - Done: 0
2024-11-23 07:25:36,625 [DEBUG] monitor.py:144 -- ExecutorID 661f55-0 - Pending: 3 - Running: 0 - Done: 0
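
To put a number on "stuck", the monitor lines can be parsed to see how long the counters have stayed unchanged. A small sketch, assuming only the log format shown above; `stalled_for` is a made-up helper name, not something from Lithops.

```python
import re
from datetime import datetime, timedelta

# Matches monitor lines like:
# 2024-11-23 07:22:13,440 [DEBUG] monitor.py:144 -- ... Pending: 3 - Running: 0 - Done: 0
MONITOR_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}),\d+ .*"
    r"Pending: (?P<pending>\d+) - Running: (?P<running>\d+) - Done: (?P<done>\d+)"
)


def stalled_for(log_lines):
    """Return how long the trailing run of identical counters has lasted."""
    samples = []
    for line in log_lines:
        m = MONITOR_RE.search(line)
        if m:
            ts = datetime.strptime(m["ts"], "%Y-%m-%d %H:%M:%S")
            counts = (int(m["pending"]), int(m["running"]), int(m["done"]))
            samples.append((ts, counts))
    if not samples:
        return timedelta(0)
    last_ts, last_counts = samples[-1]
    first_ts = last_ts
    # Walk backwards until the counters differ from the latest sample.
    for ts, counts in reversed(samples):
        if counts != last_counts:
            break
        first_ts = ts
    return last_ts - first_ts
```

Fed the lines above, this reports the span between the first and last identical `Pending: 3 - Running: 0 - Done: 0` sample, which could drive a client-side "give up after N minutes" check.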
RichardScottOZ commented 1 day ago

and with Spot disabled

lithops worker list tells me

Worker Name              Created                  Instance Type      Processes  Runtime    Mode    Status      TTD
-----------------------  -----------------------  ---------------  -----------  ---------  ------  ----------  -------
lithops-worker-f12ca4e5  2024-11-22 20:52:09 UTC  t3.medium                  2  python3    reuse   installing  Unknown
lithops-worker-d10f6430  2024-11-22 20:52:09 UTC  t3.medium                  2  python3    reuse   installing  Unknown

lithops job list has

Job ID         Function Name      Submitted                Worker Type    Runtime    Tasks Done    Job Status
-------------  -----------------  -----------------------  -------------  ---------  ------------  ------------
661f55-0-M000  my_map_function()  2024-11-22 20:51:58 UTC  t3.medium      python3    0/3           submitted
RichardScottOZ commented 1 day ago

I did not change any defaults, and the default runtime is python3. Does the 'installing' status mean it is expecting a container called python3?

I definitely do not have a runtime set in the config.

In invokers

        if self.backend not in STANDALONE_BACKENDS:
            logger.debug(
                f'ExecutorID {job.executor_id} | JobID {job.job_id} - Worker processes: '
                f'{job.worker_processes} - Chunksize: {job.chunksize}'
            )

        try:
            print("TRYING RUNTIME:",self.runtime_name)
            job.runtime_name = self.runtime_name
            self._invoke_job(job)
        except (KeyboardInterrupt, Exception) as e:
            self.stop()
            raise e

gives

TRYING RUNTIME: python3

So should that actually be anything, for this default case of 'just do some calcs in the OS python3'?

For example, I left this going for some time. The docs say:

Option 1: By default, Lithops uses an Ubuntu 22.04 image. In this case, no further action is required and you can continue to the next step. Lithops will install all required dependencies in the VM by itself. Notice this can consume about 3 min to complete all installations.

It ran for probably 15 minutes or more. I could leave it going for an extended period, but it doesn't seem like it was going to work.

RichardScottOZ commented 1 day ago

The next test for the timeout problems, then, I guess, is to make a simple image or something along those lines.

RichardScottOZ commented 1 day ago

try redoing the default and see what happens?

RichardScottOZ commented 1 day ago

When I try that

2024-11-23 08:39:18,586 [DEBUG] aws_ec2.py:1276 -- VM instance building-image-lithops-ubuntu-jammy-22.04-amd64-server created successfully
2024-11-23 08:39:18,586 [DEBUG] aws_ec2.py:1132 -- Waiting VM instance building-image-lithops-ubuntu-jammy-22.04-amd64-server to become ready
but creating the SSH connection fails with a 'timed out' error

and

2024-11-23 08:39:27,666 [DEBUG] aws_ec2.py:1123 -- SSH to 0.0.0.0 failed (publickey): timed out
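
A plain TCP probe can help separate "port 22 is not reachable from here" from "SSH is up but rejects the login". This is a generic standard-library sketch, not Lithops code; `probe_tcp` is a made-up helper name.

```python
import socket


def probe_tcp(host: str, port: int = 22, timeout: float = 5.0) -> str:
    """Classify TCP reachability as 'open', 'timeout', 'refused' or 'error: ...'."""
    try:
        # create_connection raises on failure; on success we just close again.
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except socket.timeout:
        # No answer at all: usually routing, security group or VPN issues.
        return "timeout"
    except ConnectionRefusedError:
        # Host reachable, but nothing is listening on the port yet.
        return "refused"
    except OSError as err:
        return f"error: {err}"
```

If this returns `timeout` for the address in the log while a manual `ssh` to the private IP works, the code is probably connecting to the wrong address rather than hitting an SSH problem.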

RichardScottOZ commented 1 day ago

so how to get this

def is_ready(self):
    """
    Checks if the VM instance is ready to receive ssh connections
    """
    login_type = 'password' if 'password' in self.ssh_credentials and \
        not self.public else 'publickey'
    try:
        self.get_ssh_client().run_remote_command('id')
    except LithopsValidationError as err:
        raise err
    except Exception as err:
        logger.debug(f'SSH to {self.public_ip if self.public else self.private_ip} failed ({login_type}): {err}')
        self.del_ssh_client()
        return False
    return True

in aws_ec2 to handle a private case?
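
Since `is_ready` only returns True or False, whatever calls it has to poll with a deadline. A generic sketch of such a loop, useful for testing a private-IP probe outside Lithops; `wait_until_ready` and its default timings are assumptions, not the library's actual readiness logic.

```python
import time
from typing import Callable


def wait_until_ready(probe: Callable[[], bool],
                     timeout: float = 300.0,
                     interval: float = 5.0,
                     sleep: Callable[[float], None] = time.sleep,
                     clock: Callable[[], float] = time.monotonic) -> bool:
    """Call probe() until it returns True or the deadline passes.

    probe is any zero-argument callable, for example a function that tries
    an SSH command against the private IP and returns False on failure.
    """
    deadline = clock() + timeout
    while True:
        if probe():
            return True
        # Stop if another wait would run past the deadline.
        if clock() + interval > deadline:
            return False
        sleep(interval)
```

Passing `sleep` and `clock` as parameters keeps the loop easy to test without real waiting.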

RichardScottOZ commented 1 day ago

again though it looks like it has made it - in the console

image

RichardScottOZ commented 1 day ago

and if I disable the error handling in the above

timed out
Traceback (most recent call last):
  File "/home/richard/miniconda3/envs/lithops/bin/lithops", line 8, in <module>
    sys.exit(lithops_cli())
  File "/home/richard/miniconda3/envs/lithops/lib/python3.10/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
  File "/home/richard/miniconda3/envs/lithops/lib/python3.10/site-packages/click/core.py", line 1078, in main
    rv = self.invoke(ctx)
  File "/home/richard/miniconda3/envs/lithops/lib/python3.10/site-packages/click/core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/richard/miniconda3/envs/lithops/lib/python3.10/site-packages/click/core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/richard/miniconda3/envs/lithops/lib/python3.10/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/richard/miniconda3/envs/lithops/lib/python3.10/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
  File "/home/richard/miniconda3/envs/lithops/lib/python3.10/site-packages/click/decorators.py", line 33, in new_func
    return f(get_current_context(), *args, **kwargs)
  File "/home/richard/miniconda3/envs/lithops/lib/python3.10/site-packages/lithops/scripts/cli.py", line 851, in build_image
    compute_handler.build_image(name, file, overwrite, include, ctx.args)
  File "/home/richard/miniconda3/envs/lithops/lib/python3.10/site-packages/lithops/standalone/standalone.py", line 88, in build_image
    self.backend.build_image(image_name, script_file, overwrite, include, extra_args)
  File "/home/richard/miniconda3/envs/lithops/lib/python3.10/site-packages/lithops/standalone/backends/aws_ec2/aws_ec2.py", line 659, in build_image
    build_vm.get_ssh_client().upload_data_to_file(script, remote_script)
  File "/home/richard/miniconda3/envs/lithops/lib/python3.10/site-packages/lithops/util/ssh_client.py", line 145, in upload_data_to_file
    self.ssh_client = self.create_client()
  File "/home/richard/miniconda3/envs/lithops/lib/python3.10/site-packages/lithops/util/ssh_client.py", line 62, in create_client
    raise e
  File "/home/richard/miniconda3/envs/lithops/lib/python3.10/site-packages/lithops/util/ssh_client.py", line 52, in create_client
    self.ssh_client.connect(
  File "/home/richard/miniconda3/envs/lithops/lib/python3.10/site-packages/paramiko/client.py", line 386, in connect
    sock.connect(addr)
TimeoutError: timed out
RichardScottOZ commented 1 day ago

which could be an unintended consequence of my tinkering to get some other private_ip things to work?

RichardScottOZ commented 1 day ago

So it is failing to create an ssh_client in the run_remote_command part above.

I am not sure why it fails to connect to these build machines when SSH via the private IP works fine for the others.

RichardScottOZ commented 1 day ago

I can definitely connect manually to a build-* machine via the private IP, so I am not sure why the code is failing here.

RichardScottOZ commented 1 day ago

On this run it has got as far as building an image. It is taking a while, but that is a bit further. I don't know what the difference was; possibly a badly flaky VPN and internet connection earlier, I guess.

RichardScottOZ commented 1 day ago
2024-11-23 12:01:09,849 [DEBUG] aws_ec2.py:696 -- VM Image is being created. Current status: pending
2024-11-23 12:01:31,280 [DEBUG] aws_ec2.py:696 -- VM Image is being created. Current status: available
2024-11-23 12:01:31,280 [DEBUG] aws_ec2.py:1405 -- Deleting VM instance building-image-lithops-ubuntu-jammy-22.04-amd64-server (i-003042a7a1602ea80)
2024-11-23 12:01:31,753 [INFO] aws_ec2.py:707 -- VM Image created. Image ID: ami-06ea522e39999999
2024-11-23 12:01:31,753 [INFO] cli.py:853 -- VM Image built
RichardScottOZ commented 1 day ago

But when I try that ami- ID in target_ami, I am back to solving this problem:

2024-11-23 12:28:29,116 [DEBUG] monitor.py:429 -- ExecutorID 017dd6-0 - Starting Storage job monitor
2024-11-23 12:28:29,116 [INFO] executors.py:494 -- ExecutorID 017dd6-0 - Getting results from 3 function activations
2024-11-23 12:28:29,116 [INFO] wait.py:101 -- ExecutorID 017dd6-0 - Waiting for 3 function activations to complete
2024-11-23 12:28:32,845 [DEBUG] monitor.py:144 -- ExecutorID 017dd6-0 - Pending: 3 - Running: 0 - Done: 0
2024-11-23 12:29:14,093 [DEBUG] monitor.py:144 -- ExecutorID 017dd6-0 - Pending: 3 - Running: 0 - Done: 0
2024-11-23 12:29:55,072 [DEBUG] monitor.py:144 -- ExecutorID 017dd6-0 - Pending: 3 - Running: 0 - Done: 0
2024-11-23 12:30:36,118 [DEBUG] monitor.py:144 -- ExecutorID 017dd6-0 - Pending: 3 - Running: 0 - Done: 0
2024-11-23 12:31:16,790 [DEBUG] monitor.py:144 -- ExecutorID 017dd6-0 - Pending: 3 - Running: 0 - Done: 0
2024-11-23 12:31:56,778 [DEBUG] monitor.py:144 -- ExecutorID 017dd6-0 - Pending: 3 - Running: 0 - Done: 0
2024-11-23 12:32:38,720 [DEBUG] monitor.py:144 -- ExecutorID 017dd6-0 - Pending: 3 - Running: 0 - Done: 0
2024-11-23 12:33:19,958 [DEBUG] monitor.py:144 -- ExecutorID 017dd6-0 - Pending: 3 - Running: 0 - Done: 0
2024-11-23 12:34:01,740 [DEBUG] monitor.py:144 -- ExecutorID 017dd6-0 - Pending: 3 - Running: 0 - Done: 0
2024-11-23 12:34:43,345 [DEBUG] monitor.py:144 -- ExecutorID 017dd6-0 - Pending: 3 - Running: 0 - Done: 0
2024-11-23 12:35:25,381 [DEBUG] monitor.py:144 -- ExecutorID 017dd6-0 - Pending: 3 - Running: 0 - Done: 0

i.e. this happily runs in an infinite loop

RichardScottOZ commented 1 day ago

I let it run for an hour then ctrl-C'ed it.

2024-11-23 13:30:40,473 [DEBUG] ssh_client.py:65 -- 10.0.0.0  ssh client created
2024-11-23 13:30:40,683 [DEBUG] monitor.py:457 -- ExecutorID 017dd6-0 - Storage job monitor finished
2024-11-23 13:30:41,571 [INFO] executors.py:618 -- ExecutorID 017dd6-0 - Cleaning temporary data
RichardScottOZ commented 1 day ago

job payload has this at the start

JOB PAYLOAD: {'config': {'lithops': {'backend': 'aws_ec2', 'log_level': 'DEBUG', 'mode': 'standalone', 'chunksize': 0, 'storage': 'aws_s3', 'monitoring': 'storage', 'monitoring_interval': 2, 'execution_timeout': 1800, 'backend_type': 'batch'},
RichardScottOZ commented 18 hours ago

Experiments are here: https://github.com/RichardScottOZ/lithops/tree/private_ip

RichardScottOZ commented 17 hours ago

These exist - and are using BatchInvoker

FUTURES: [<lithops.future.ResponseFuture object at 0x7fc5b3446fb0>, <lithops.future.ResponseFuture object at 0x7fc5b00cc7f0>, <lithops.future.ResponseFuture object at 0x7fc5b00cc850>]
RichardScottOZ commented 17 hours ago

Will simplify very slightly and just use this https://github.com/lithops-cloud/lithops/blob/master/examples/call_async.py - so one activation, not three.

RichardScottOZ commented 16 hours ago

From the latest test of that, the worker-service.log:

cat /tmp/lithops-root/worker-service.log
2024-11-23 23:19:26,575 [ERROR] lithops.standalone.worker:107 -- Timeout connecting to server
2024-11-23 23:19:27,361 [DEBUG] lithops.standalone.backends.aws_ec2.aws_ec2:60 -- Creating AWS EC2 client
2024-11-23 23:19:29,680 [INFO] lithops.standalone.backends.aws_ec2.aws_ec2:103 -- AWS EC2 client created - Region: us-east-1
2024-11-23 23:19:29,681 [DEBUG] lithops.standalone.standalone:69 -- Standalone handler created successfully
2024-11-23 23:19:29,681 [DEBUG] lithops.standalone.keeper:53 -- Starting BudgetKeeper for lithops-worker-260b9387 (10.6.131.122), instance ID: i-075b34f8d528f726e
2024-11-23 23:19:29,681 [DEBUG] lithops.standalone.keeper:55 -- Delete lithops-worker-260b9387 on dismantle: True
2024-11-23 23:19:29,682 [DEBUG] lithops.standalone.keeper:72 -- BudgetKeeper started
2024-11-23 23:19:29,682 [DEBUG] lithops.standalone.keeper:75 -- Auto dismantle activated - Soft timeout: 300s, Hard Timeout: 3600s
2024-11-23 23:19:29,682 [DEBUG] lithops.standalone.keeper:108 -- Time to dismantle: 3596 seconds
2024-11-23 23:19:29,683 [INFO] lithops.standalone.worker:263 -- Starting Worker - Instace type: t3.medium - Runtime name: python3 - Worker processes: 2
2024-11-23 23:19:29,683 [INFO] lithops.standalone.worker:154 -- Redis consumer process 0 started
2024-11-23 23:19:29,684 [INFO] lithops.standalone.worker:154 -- Redis consumer process 1 started
2024-11-23 23:20:29,743 [DEBUG] lithops.standalone.keeper:108 -- Time to dismantle: 3536 seconds
RichardScottOZ commented 16 hours ago

So does that mean that is the problem, workers can't get to master, or is that just first try and then they do?

/tmp/lithops-root/jobs and /logs are empty anyway

RichardScottOZ commented 16 hours ago

on the worker

curl -X GET http://lithops-master:8080/ping
{"response":"3.5.1"}

and 8080 and 8081 need to be reachable for master and workers
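
The same ping check can be scripted so it is easy to run from each worker. A sketch using only the standard library; the `/ping` path, port 8080 and the JSON shape are taken from the curl output above, while `ping_master` itself is a made-up helper. Proxies are bypassed on purpose, on the assumption that the check runs inside the VPC.

```python
import json
import urllib.request


def ping_master(base_url: str = "http://lithops-master:8080",
                timeout: float = 5.0) -> str:
    """GET <base_url>/ping and return the 'response' field (the version string)."""
    # Empty ProxyHandler: ignore any http_proxy settings in the environment.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    with opener.open(f"{base_url}/ping", timeout=timeout) as resp:
        payload = json.loads(resp.read().decode("utf-8"))
    return payload["response"]
```

On a worker, `ping_master()` should return the same version string as the curl call; an exception points at name resolution or the security group rather than at the job itself.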

RichardScottOZ commented 14 hours ago

When I logged into a worker while it was starting, I did see a python3 task appear in top, so something is running but the results are not being collected?

RichardScottOZ commented 14 hours ago

If I wanted to put additional logging on the workers, would I have to make that change and then create a new master so that it has the updated worker.py?

RichardScottOZ commented 14 hours ago

Current test from a fresh start, master log:

ubuntu@ip-10-6-131-104:~$ cat /tmp/lithops-root/master-service.log
2024-11-24 01:27:54,597 [DEBUG] lithops.standalone.backends.aws_ec2.aws_ec2:60 -- Creating AWS EC2 client
2024-11-24 01:27:55,650 [INFO] lithops.standalone.backends.aws_ec2.aws_ec2:103 -- AWS EC2 client created - Region: us-east-1
2024-11-24 01:27:55,651 [DEBUG] lithops.standalone.standalone:69 -- Standalone handler created successfully
2024-11-24 01:27:55,652 [DEBUG] lithops.standalone.keeper:53 -- Starting BudgetKeeper for lithops-master-dadab3 (10.6.131.104), instance ID: i-021f144b3714bb841
2024-11-24 01:27:55,652 [DEBUG] lithops.standalone.keeper:55 -- Delete lithops-master-dadab3 on dismantle: False
2024-11-24 01:27:55,652 [DEBUG] lithops.standalone.keeper:72 -- BudgetKeeper started
2024-11-24 01:27:55,652 [DEBUG] lithops.standalone.keeper:75 -- Auto dismantle activated - Soft timeout: 300s, Hard Timeout: 3600s
2024-11-24 01:27:55,652 [DEBUG] lithops.standalone.keeper:108 -- Time to dismantle: 3597 seconds
2024-11-24 01:27:55,652 [INFO] lithops.standalone.master:573 -- Starting job monitoring thread
2024-11-24 01:28:08,208 [DEBUG] lithops.standalone.master:545 -- Received job 86bf30-0-A000
2024-11-24 01:28:08,209 [DEBUG] lithops.standalone.master:375 -- Going to setup 1 workers
2024-11-24 01:28:08,209 [DEBUG] lithops.standalone.backends.aws_ec2.aws_ec2:60 -- Creating AWS EC2 client
2024-11-24 01:28:08,224 [DEBUG] lithops.standalone.master:526 -- Job 86bf30-0-A000 correctly submitted to work queue 'wq:t3.medium-2-python3'
2024-11-24 01:28:08,611 [INFO] lithops.standalone.backends.aws_ec2.aws_ec2:103 -- AWS EC2 client created - Region: us-east-1
2024-11-24 01:28:08,616 [DEBUG] lithops.standalone.standalone:69 -- Standalone handler created successfully
2024-11-24 01:28:08,617 [DEBUG] lithops.standalone.backends.aws_ec2.aws_ec2:1143 -- Waiting VM instance lithops-worker-21337e83 to become ready
2024-11-24 01:28:09,330 [DEBUG] lithops.standalone.backends.aws_ec2.aws_ec2:1134 -- SSH to 100.1.0.1 failed (publickey): Authentication failed.
2024-11-24 01:28:14,643 [DEBUG] lithops.standalone.backends.aws_ec2.aws_ec2:1134 -- SSH to 100.1.0.1 failed (publickey): Authentication failed.
2024-11-24 01:28:20,004 [DEBUG] lithops.standalone.backends.aws_ec2.aws_ec2:1134 -- SSH to 100.1.0.1 failed (publickey): Authentication failed.
2024-11-24 01:28:25,273 [DEBUG] lithops.standalone.backends.aws_ec2.aws_ec2:1134 -- SSH to 100.1.0.1 failed (publickey): Authentication failed.
2024-11-24 01:28:30,625 [DEBUG] lithops.standalone.backends.aws_ec2.aws_ec2:1134 -- SSH to 100.1.0.1 failed (publickey): Authentication failed.
2024-11-24 01:28:35,898 [DEBUG] lithops.standalone.backends.aws_ec2.aws_ec2:1134 -- SSH to 100.1.0.1 failed (publickey): Authentication failed.
2024-11-24 01:28:41,215 [DEBUG] lithops.standalone.backends.aws_ec2.aws_ec2:1134 -- SSH to 100.1.0.1 failed (publickey): Authentication failed.
2024-11-24 01:28:46,583 [DEBUG] lithops.standalone.backends.aws_ec2.aws_ec2:1134 -- SSH to 100.1.0.1 failed (publickey): Authentication failed.
2024-11-24 01:28:51,926 [DEBUG] lithops.standalone.backends.aws_ec2.aws_ec2:1134 -- SSH to 100.1.0.1 failed (publickey): Authentication failed.
2024-11-24 01:28:55,684 [DEBUG] lithops.standalone.keeper:108 -- Time to dismantle: 3552 seconds
2024-11-24 01:28:57,202 [DEBUG] lithops.standalone.backends.aws_ec2.aws_ec2:1134 -- SSH to 100.1.0.1 failed (publickey): Authentication failed.
2024-11-24 01:29:02,486 [DEBUG] lithops.standalone.backends.aws_ec2.aws_ec2:1134 -- SSH to 100.1.0.1 failed (publickey): Authentication failed.
2024-11-24 01:29:07,782 [DEBUG] lithops.standalone.backends.aws_ec2.aws_ec2:1134 -- SSH to 100.1.0.1 failed (publickey): Authentication failed.
2024-11-24 01:29:13,156 [DEBUG] lithops.standalone.backends.aws_ec2.aws_ec2:1134 -- SSH to 100.1.0.1 failed (publickey): Authentication failed.
2024-11-24 01:29:18,572 [DEBUG] lithops.standalone.backends.aws_ec2.aws_ec2:1134 -- SSH to 100.1.0.1 failed (publickey): Authentication failed.
2024-11-24 01:29:23,855 [DEBUG] lithops.standalone.backends.aws_ec2.aws_ec2:1134 -- SSH to 100.1.0.1 failed (publickey): Authentication failed.
2024-11-24 01:29:29,257 [DEBUG] lithops.standalone.backends.aws_ec2.aws_ec2:1134 -- SSH to 100.1.0.1 failed (publickey): Authentication failed.
2024-11-24 01:29:34,674 [DEBUG] lithops.standalone.backends.aws_ec2.aws_ec2:1134 -- SSH to 100.1.0.1 failed (publickey): Authentication failed.
2024-11-24 01:29:40,083 [DEBUG] lithops.standalone.backends.aws_ec2.aws_ec2:1134 -- SSH to 100.1.0.1 failed (publickey): Authentication failed.
2024-11-24 01:29:45,422 [DEBUG] lithops.standalone.backends.aws_ec2.aws_ec2:1134 -- SSH to 100.1.0.1 failed (publickey): Authentication failed.
2024-11-24 01:29:50,803 [DEBUG] lithops.standalone.backends.aws_ec2.aws_ec2:1134 -- SSH to 100.1.0.1 failed (publickey): Authentication failed.
2024-11-24 01:29:55,729 [DEBUG] lithops.standalone.keeper:108 -- Time to dismantle: 3492 seconds
2024-11-24 01:29:56,106 [DEBUG] lithops.standalone.backends.aws_ec2.aws_ec2:1134 -- SSH to 100.1.0.1 failed (publickey): Authentication failed.
2024-11-24 01:30:01,602 [DEBUG] lithops.standalone.backends.aws_ec2.aws_ec2:1134 -- SSH to 100.1.0.1 failed (publickey): Authentication failed.
2024-11-24 01:30:06,922 [DEBUG] lithops.standalone.backends.aws_ec2.aws_ec2:1134 -- SSH to 100.1.0.1 failed (publickey): Authentication failed.
2024-11-24 01:30:12,332 [DEBUG] lithops.standalone.backends.aws_ec2.aws_ec2:1134 -- SSH to 100.1.0.1 failed (publickey): Authentication failed.
2024-11-24 01:30:17,618 [DEBUG] lithops.standalone.backends.aws_ec2.aws_ec2:1134 -- SSH to 100.1.0.1 failed (publickey): Authentication failed.
2024-11-24 01:30:22,981 [DEBUG] lithops.standalone.backends.aws_ec2.aws_ec2:1134 -- SSH to 100.1.0.1 failed (publickey): Authentication failed.
2024-11-24 01:30:28,258 [DEBUG] lithops.standalone.backends.aws_ec2.aws_ec2:1134 -- SSH to 100.1.0.1 failed (publickey): Authentication failed.
2024-11-24 01:30:33,584 [DEBUG] lithops.standalone.backends.aws_ec2.aws_ec2:1134 -- SSH to 100.1.0.1 failed (publickey): Authentication failed.
2024-11-24 01:30:38,878 [DEBUG] lithops.standalone.backends.aws_ec2.aws_ec2:1134 -- SSH to 100.1.0.1 failed (publickey): Authentication failed.
2024-11-24 01:30:44,247 [DEBUG] lithops.standalone.backends.aws_ec2.aws_ec2:1134 -- SSH to 100.1.0.1 failed (publickey): Authentication failed.
2024-11-24 01:30:49,612 [DEBUG] lithops.standalone.backends.aws_ec2.aws_ec2:1134 -- SSH to 100.1.0.1 failed (publickey): Authentication failed.
2024-11-24 01:30:54,881 [DEBUG] lithops.standalone.backends.aws_ec2.aws_ec2:1134 -- SSH to 100.1.0.1 failed (publickey): Authentication failed.
2024-11-24 01:30:55,789 [DEBUG] lithops.standalone.keeper:108 -- Time to dismantle: 3432 seconds
2024-11-24 01:31:00,344 [DEBUG] lithops.standalone.backends.aws_ec2.aws_ec2:1134 -- SSH to 100.1.0.1 failed (publickey): Authentication failed.
2024-11-24 01:31:05,689 [DEBUG] lithops.standalone.backends.aws_ec2.aws_ec2:1134 -- SSH to 100.1.0.1 failed (publickey): Authentication failed.
2024-11-24 01:31:10,695 [WARNING] lithops.standalone.master:263 -- Timeout Error. Recreating VM instance lithops-worker-21337e83 (100.1.0.1)
2024-11-24 01:31:10,695 [DEBUG] lithops.standalone.backends.aws_ec2.aws_ec2:1407 -- Deleting VM instance lithops-worker-21337e83 (i-057bfea7678c06832)
2024-11-24 01:31:11,159 [DEBUG] lithops.standalone.backends.aws_ec2.aws_ec2:1273 -- Creating new VM instance lithops-worker-21337e83
2024-11-24 01:31:12,244 [DEBUG] lithops.standalone.backends.aws_ec2.aws_ec2:1288 -- VM instance lithops-worker-21337e83 created successfully
2024-11-24 01:31:12,244 [DEBUG] lithops.standalone.backends.aws_ec2.aws_ec2:1143 -- Waiting VM instance lithops-worker-21337e83 to become ready
2024-11-24 01:31:14,436 [DEBUG] lithops.standalone.backends.aws_ec2.aws_ec2:1134 -- SSH to 100.1.0.2 failed (publickey): timed out
2024-11-24 01:31:21,576 [DEBUG] lithops.standalone.backends.aws_ec2.aws_ec2:1134 -- SSH to 100.1.0.2 failed (publickey): timed out
2024-11-24 01:31:28,741 [DEBUG] lithops.standalone.backends.aws_ec2.aws_ec2:1134 -- SSH to 100.1.0.2 failed (publickey): timed out
2024-11-24 01:31:35,927 [DEBUG] lithops.standalone.backends.aws_ec2.aws_ec2:1134 -- SSH to 100.1.0.2 failed (publickey): timed out
2024-11-24 01:31:43,078 [DEBUG] lithops.standalone.backends.aws_ec2.aws_ec2:1134 -- SSH to 100.1.0.2 failed (publickey): timed out
2024-11-24 01:31:50,271 [DEBUG] lithops.standalone.backends.aws_ec2.aws_ec2:1134 -- SSH to 100.1.0.2 failed (publickey): timed out
2024-11-24 01:31:55,445 [DEBUG] lithops.standalone.backends.aws_ec2.aws_ec2:1134 -- SSH to 100.1.0.2 failed (publickey): [Errno None] Unable to connect to port 22 on 100.1.0.2
2024-11-24 01:31:55,823 [DEBUG] lithops.standalone.keeper:108 -- Time to dismantle: 3372 seconds
2024-11-24 01:32:00,680 [DEBUG] lithops.standalone.backends.aws_ec2.aws_ec2:1134 -- SSH to 100.1.0.2 failed (publickey): [Errno None] Unable to connect to port 22 on 100.1.0.2
2024-11-24 01:32:06,350 [DEBUG] lithops.standalone.backends.aws_ec2.aws_ec2:1134 -- SSH to 100.1.0.2 failed (publickey): Authentication failed.
2024-11-24 01:32:11,750 [DEBUG] lithops.standalone.backends.aws_ec2.aws_ec2:1134 -- SSH to 100.1.0.2 failed (publickey): Authentication failed.
2024-11-24 01:32:17,202 [DEBUG] lithops.standalone.backends.aws_ec2.aws_ec2:1134 -- SSH to 100.1.0.2 failed (publickey): Authentication failed.
2024-11-24 01:32:22,584 [DEBUG] lithops.standalone.backends.aws_ec2.aws_ec2:1134 -- SSH to 100.1.0.2 failed (publickey): Authentication failed.
2024-11-24 01:32:27,972 [DEBUG] lithops.standalone.backends.aws_ec2.aws_ec2:1134 -- SSH to 100.1.0.2 failed (publickey): Authentication failed.
2024-11-24 01:32:33,419 [DEBUG] lithops.standalone.backends.aws_ec2.aws_ec2:1134 -- SSH to 100.1.0.2 failed (publickey): Authentication failed.
2024-11-24 01:32:38,912 [DEBUG] lithops.standalone.backends.aws_ec2.aws_ec2:1134 -- SSH to 100.1.0.2 failed (publickey): Authentication failed.
2024-11-24 01:32:44,293 [DEBUG] lithops.standalone.backends.aws_ec2.aws_ec2:1134 -- SSH to 100.1.0.2 failed (publickey): Authentication failed.
2024-11-24 01:32:49,733 [DEBUG] lithops.standalone.backends.aws_ec2.aws_ec2:1134 -- SSH to 100.1.0.2 failed (publickey): Authentication failed.
2024-11-24 01:32:55,320 [DEBUG] lithops.standalone.backends.aws_ec2.aws_ec2:1134 -- SSH to 100.1.0.2 failed (publickey): Authentication failed.
2024-11-24 01:32:55,883 [DEBUG] lithops.standalone.keeper:108 -- Time to dismantle: 3312 seconds
2024-11-24 01:33:00,696 [DEBUG] lithops.standalone.backends.aws_ec2.aws_ec2:1134 -- SSH to 100.1.0.2 failed (publickey): Authentication failed.
2024-11-24 01:33:06,052 [DEBUG] lithops.standalone.backends.aws_ec2.aws_ec2:1134 -- SSH to 100.1.0.2 failed (publickey): Authentication failed.
2024-11-24 01:33:11,494 [DEBUG] lithops.standalone.backends.aws_ec2.aws_ec2:1134 -- SSH to 100.1.0.2 failed (publickey): Authentication failed.
2024-11-24 01:33:16,841 [DEBUG] lithops.standalone.backends.aws_ec2.aws_ec2:1134 -- SSH to 100.1.0.2 failed (publickey): Authentication failed.
2024-11-24 01:33:22,272 [DEBUG] lithops.standalone.backends.aws_ec2.aws_ec2:1134 -- SSH to 100.1.0.2 failed (publickey): Authentication failed.
2024-11-24 01:33:27,743 [DEBUG] lithops.standalone.backends.aws_ec2.aws_ec2:1134 -- SSH to 100.1.0.2 failed (publickey): Authentication failed.
2024-11-24 01:33:33,146 [DEBUG] lithops.standalone.backends.aws_ec2.aws_ec2:1134 -- SSH to 100.1.0.2 failed (publickey): Authentication failed.
2024-11-24 01:33:38,532 [DEBUG] lithops.standalone.backends.aws_ec2.aws_ec2:1134 -- SSH to 100.1.0.2 failed (publickey): Authentication failed.
2024-11-24 01:33:43,933 [DEBUG] lithops.standalone.backends.aws_ec2.aws_ec2:1134 -- SSH to 100.1.0.2 failed (publickey): Authentication failed.
2024-11-24 01:33:49,417 [DEBUG] lithops.standalone.backends.aws_ec2.aws_ec2:1134 -- SSH to 100.1.0.2 failed (publickey): Authentication failed.
2024-11-24 01:33:54,839 [DEBUG] lithops.standalone.backends.aws_ec2.aws_ec2:1134 -- SSH to 100.1.0.2 failed (publickey): Authentication failed.
2024-11-24 01:33:55,895 [DEBUG] lithops.standalone.keeper:108 -- Time to dismantle: 3252 seconds
2024-11-24 01:34:00,247 [DEBUG] lithops.standalone.backends.aws_ec2.aws_ec2:1134 -- SSH to 100.1.0.2 failed (publickey): Authentication failed.
2024-11-24 01:34:05,598 [DEBUG] lithops.standalone.backends.aws_ec2.aws_ec2:1134 -- SSH to 100.1.0.2 failed (publickey): Authentication failed.
2024-11-24 01:34:10,962 [DEBUG] lithops.standalone.backends.aws_ec2.aws_ec2:1134 -- SSH to 100.1.0.2 failed (publickey): Authentication failed.
2024-11-24 01:34:15,972 [DEBUG] lithops.standalone.master:261 -- Readiness probe expired for VM instance lithops-worker-21337e83 (100.1.0.2)
2024-11-24 01:34:15,973 [ERROR] lithops.standalone.master:411 -- Readiness probe expired on VM instance lithops-worker-21337e83 (100.1.0.2)
2024-11-24 01:34:15,973 [DEBUG] lithops.standalone.master:413 -- 0 of 1 workers started for work queue: wq:t3.medium-2-python3
2024-11-24 01:34:55,955 [DEBUG] lithops.standalone.keeper:108 -- Time to dismantle: 3192 seconds
2024-11-24 01:35:56,015 [DEBUG] lithops.standalone.keeper:108 -- Time to dismantle: 3132 seconds
2024-11-24 01:36:56,075 [DEBUG] lithops.standalone.keeper:108 -- Time to dismantle: 3072 seconds
2024-11-24 01:37:56,135 [DEBUG] lithops.standalone.keeper:108 -- Time to dismantle: 3012 seconds
2024-11-24 01:38:56,196 [DEBUG] lithops.standalone.keeper:108 -- Time to dismantle: 2952 seconds
2024-11-24 01:39:56,206 [DEBUG] lithops.standalone.keeper:108 -- Time to dismantle: 2892 seconds
2024-11-24 01:40:56,263 [DEBUG] lithops.standalone.keeper:108 -- Time to dismantle: 2832 seconds
2024-11-24 01:41:56,323 [DEBUG] lithops.standalone.keeper:108 -- Time to dismantle: 2772 seconds
2024-11-24 01:42:56,355 [DEBUG] lithops.standalone.keeper:108 -- Time to dismantle: 2712 seconds
2024-11-24 01:43:56,415 [DEBUG] lithops.standalone.keeper:108 -- Time to dismantle: 2652 seconds
2024-11-24 01:44:56,475 [DEBUG] lithops.standalone.keeper:108 -- Time to dismantle: 2592 seconds
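
A log this repetitive is easier to read once condensed. The following is a minimal sketch, not part of Lithops, that counts SSH authentication failures per host and lists the workers whose readiness probe expired. The regular expressions are assumptions modelled only on the line formats quoted above, and `summarize` is a hypothetical helper name.

```python
import re
from collections import Counter

# Patterns modelled on the log lines quoted in this issue (an assumption,
# not an official Lithops log format specification).
SSH_FAIL = re.compile(r"SSH to (\S+) failed \((\w+)\)")
PROBE_EXPIRED = re.compile(
    r"\[ERROR\].*Readiness probe expired on VM instance (\S+) \(([^)]+)\)"
)


def summarize(lines):
    """Return (failures, expired) for an iterable of log lines.

    failures: Counter keyed by (host, auth_method)
    expired:  list of (instance_name, host) in order of appearance
    """
    failures = Counter()
    expired = []
    for line in lines:
        match = SSH_FAIL.search(line)
        if match:
            failures[(match.group(1), match.group(2))] += 1
            continue
        match = PROBE_EXPIRED.search(line)
        if match:
            expired.append((match.group(1), match.group(2)))
    return failures, expired
```

Feeding it the lines of the master service log (for example, by opening the file and passing the file object) gives one count per host and auth method instead of dozens of identical DEBUG lines.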
RichardScottOZ commented 13 hours ago

And when I test manually on the master:

2024-11-24 02:00:53,336 [DEBUG] lithops.standalone.backends.aws_ec2.aws_ec2:1134 -- SSH to 100.1.0.2 failed (publickey): Authentication failed.
2024-11-24 02:00:58,667 [DEBUG] lithops.standalone.backends.aws_ec2.aws_ec2:1134 -- SSH to 100.1.0.2 failed (publickey): Authentication failed.
2024-11-24 02:01:04,092 [DEBUG] lithops.standalone.backends.aws_ec2.aws_ec2:1134 -- SSH to 100.1.0.2 failed (publickey): Authentication failed.
2024-11-24 02:01:07,876 [DEBUG] lithops.standalone.keeper:108 -- Time to dismantle: 3247 seconds
2024-11-24 02:01:09,489 [DEBUG] lithops.standalone.backends.aws_ec2.aws_ec2:1134 -- SSH to 100.1.0.2 failed (publickey): Authentication failed.
2024-11-24 02:01:14,877 [DEBUG] lithops.standalone.backends.aws_ec2.aws_ec2:1134 -- SSH to 100.1.0.2 failed (publickey): Authentication failed.
2024-11-24 02:01:20,299 [DEBUG] lithops.standalone.backends.aws_ec2.aws_ec2:1134 -- SSH to 100.1.0.2 failed (publickey): Authentication failed.
2024-11-24 02:01:25,310 [DEBUG] lithops.standalone.master:261 -- Readiness probe expired for VM instance lithops-worker-957f510e (100.1.0.2)
2024-11-24 02:01:25,415 [ERROR] lithops.standalone.master:411 -- Readiness probe expired on VM instance lithops-worker-957f510e (100.1.0.2)
2024-11-24 02:01:25,415 [DEBUG] lithops.standalone.master:413 -- 0 of 1 workers started for work queue: wq:t3.medium-2-0.1.0.23
2024-11-24 02:02:07,906 [DEBUG] lithops.standalone.keeper:108 -- Time to dismantle: 3187 seconds
q^C
ubuntu@ip-:~$ tail -f -n 100 /tmp/lithops-*/master-service.log^C
ubuntu@ip-:~$ cd .ssh
ubuntu@ip-:~/.ssh$ ls
authorized_keys  id_rsa  id_rsa.pub  lithops_id_rsa  lithops_id_rsa.pub
ubuntu@ip-:~/.ssh$ ssh -i lithops_id_rsa ubuntu@
Warning: Permanently added '' (ED25519) to the list of known hosts.
ubuntu@: Permission denied (publickey).
ubuntu@ip-:~/.ssh$ ssh -i id_rsa ubuntu@
Warning: Permanently added '' (ED25519) to the list of known hosts.
ubuntu@: Permission denied (publickey).
ubuntu@i:~/.ssh$
RichardScottOZ commented 13 hours ago

So I can SSH fine to the master and the workers with the key used in the config, which means this is either an ssh_client problem from my adaptation or an actual key problem.

RichardScottOZ commented 13 hours ago

I uploaded the key used in the config to a test directory on the master, and I could SSH to the worker from the master using it.
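
One way to narrow this down is to check whether the public key the master is offering is actually listed in the worker's `authorized_keys`. The sketch below computes the same SHA256 fingerprint that `ssh-keygen -lf` prints and compares it against every entry in an `authorized_keys` file. It is a standalone illustration, not Lithops code: `fingerprint` and `is_authorized` are hypothetical helper names, and it assumes any options in front of a key do not contain space-separated tokens that look like a key type.

```python
import base64
import hashlib

# Prefixes of OpenSSH key type names (ssh-rsa, ssh-ed25519, ecdsa-sha2-*, sk-*).
KEY_TYPE_PREFIXES = ("ssh-", "ecdsa-sha2-", "sk-")


def _key_blob(line):
    """Return the decoded key blob from a public key or authorized_keys line."""
    fields = line.split()
    for i, field in enumerate(fields[:-1]):
        if field.startswith(KEY_TYPE_PREFIXES):
            return base64.b64decode(fields[i + 1])
    raise ValueError("no public key found in line")


def fingerprint(line):
    """SHA256 fingerprint in the 'SHA256:<base64 without padding>' form."""
    digest = hashlib.sha256(_key_blob(line)).digest()
    return "SHA256:" + base64.b64encode(digest).decode("ascii").rstrip("=")


def is_authorized(pubkey_line, authorized_keys_text):
    """True if pubkey_line's fingerprint matches any authorized_keys entry."""
    wanted = fingerprint(pubkey_line)
    for line in authorized_keys_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        try:
            if fingerprint(line) == wanted:
                return True
        except ValueError:
            continue  # skip malformed or unrecognised entries
    return False
```

Calling `is_authorized` with the contents of `lithops_id_rsa.pub` from the master and the text of the worker's `~/.ssh/authorized_keys` would show whether the worker was provisioned with a different key than the one the master is using.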

RichardScottOZ commented 8 hours ago

As an aside, I thought I would do another test: a completely vanilla install where I let the new Lithops (3.51; it was 3.3 when I did my books processing earlier in the year) and the default code set everything up, with a public VPC etc. The hello world ran fine.