Open sergii-mamedov opened 3 weeks ago
Hi @sergii-mamedov, thanks for the detailed description of the issue. I reviewed the code and identified a potential problem. I’ll work on fixing the logic, test it, and then submit a PR.
@sergii-mamedov I added a fix for this
@JosepSampe We may have fewer of these cases, but I still see this problem. Here is an example of logs:
$ tail /tmp/lithops-root/worker-service.log
2024-11-19 09:38:02,729 [INFO] lithops.standalone.worker:263 -- Starting Worker - Instace type: r6a.xlarge - Runtime name: metaspace2020/metaspace-aws-ec2:3.5.1 - Worker processes: 4
2024-11-19 09:38:02,729 [INFO] lithops.standalone.worker:154 -- Redis consumer process 0 started
2024-11-19 09:38:02,730 [INFO] lithops.standalone.worker:154 -- Redis consumer process 1 started
2024-11-19 09:38:02,730 [INFO] lithops.standalone.worker:154 -- Redis consumer process 2 started
2024-11-19 09:38:02,731 [INFO] lithops.standalone.worker:154 -- Redis consumer process 3 started
2024-11-19 09:38:03,731 [INFO] lithops.standalone.worker:219 -- Redis consumer process 0 finished
2024-11-19 09:38:03,731 [INFO] lithops.standalone.worker:219 -- Redis consumer process 1 finished
2024-11-19 09:38:03,732 [INFO] lithops.standalone.worker:219 -- Redis consumer process 2 finished
2024-11-19 09:38:03,732 [INFO] lithops.standalone.worker:219 -- Redis consumer process 3 finished
2024-11-19 09:38:03,732 [DEBUG] lithops.standalone.worker:289 -- Worker service finished
And from the master:
$ more /tmp/lithops-root/master-service.log
2024-11-19 09:38:00,517 [DEBUG] lithops.standalone.backends.aws_ec2.aws_ec2:58 -- Creating AWS EC2 client
2024-11-19 09:38:01,431 [INFO] lithops.standalone.backends.aws_ec2.aws_ec2:101 -- AWS EC2 client created - Region: eu-west-1
2024-11-19 09:38:01,433 [DEBUG] lithops.standalone.standalone:69 -- Standalone handler created successfully
2024-11-19 09:38:01,433 [DEBUG] lithops.standalone.keeper:53 -- Starting BudgetKeeper for sm-lithops-prod-032-new (172.31.40.50), instance ID: i-0dab07dbaf9476b7d
2024-11-19 09:38:01,433 [DEBUG] lithops.standalone.keeper:55 -- Delete sm-lithops-prod-032-new on dismantle: False
2024-11-19 09:38:01,433 [DEBUG] lithops.standalone.keeper:72 -- BudgetKeeper started
2024-11-19 09:38:01,434 [DEBUG] lithops.standalone.keeper:75 -- Auto dismantle activated - Soft timeout: 30s, Hard Timeout: 7200s
2024-11-19 09:38:01,434 [DEBUG] lithops.standalone.keeper:108 -- Time to dismantle: 7198 seconds
2024-11-19 09:38:01,434 [INFO] lithops.standalone.master:569 -- Starting job monitoring thread
2024-11-19 09:38:02,436 [DEBUG] lithops.standalone.master:592 -- ExecutorID: e8c616-1 | JObID: M000 - Tasks done: 4/4 - Completed!
2024-11-19 09:38:02,436 [DEBUG] lithops.standalone.master:592 -- ExecutorID: 19b2d8-1 | JObID: M000 - Tasks done: 1/1 - Completed!
2024-11-19 09:38:02,437 [DEBUG] lithops.standalone.master:592 -- ExecutorID: c4030c-1 | JObID: M000 - Tasks done: 1/1 - Completed!
2024-11-19 09:38:03,097 [DEBUG] lithops.standalone.master:541 -- Received job f7800b-6-M000
2024-11-19 09:38:03,097 [DEBUG] lithops.standalone.master:371 -- Going to setup 1 workers
2024-11-19 09:38:03,097 [DEBUG] lithops.standalone.backends.aws_ec2.aws_ec2:58 -- Creating AWS EC2 client
2024-11-19 09:38:03,105 [DEBUG] lithops.standalone.master:522 -- Job f7800b-6-M000 correctly submitted to work queue 'wq:localhost:metaspace2020-metaspace-aws-ec2:3.5.1'
2024-11-19 09:38:03,546 [INFO] lithops.standalone.backends.aws_ec2.aws_ec2:101 -- AWS EC2 client created - Region: eu-west-1
2024-11-19 09:38:03,547 [DEBUG] lithops.standalone.standalone:69 -- Standalone handler created successfully
2024-11-19 09:38:03,547 [DEBUG] lithops.standalone.master:409 -- 1 of 1 workers started for work queue: wq:localhost:metaspace2020-metaspace-aws-ec2:3.5.1
2024-11-19 09:39:02,724 [DEBUG] lithops.standalone.keeper:108 -- Time to dismantle: 7140 seconds
P.S. The code that you changed, I replaced only on the EC2 instance. Is this enough, or should I change it on the host machine as well?
P.P.S. @lmacielvieira JFYI
Based on the worker log, it seems the Redis consumers stop immediately, which means the changes from my last PR are not applied properly.
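To check this from the log rather than by eye, here is a minimal sketch (not part of Lithops) that measures how long each Redis consumer process lived. It assumes the worker log keeps the exact "Redis consumer process N started/finished" wording shown above. The names `consumer_lifetimes` and `short_lived`, and the 5-second threshold, are hypothetical and only for illustration.

```python
import re
from datetime import datetime

# Matches worker log lines such as:
# 2024-11-19 09:38:02,729 [INFO] lithops.standalone.worker:154 -- Redis consumer process 0 started
LINE_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}) .*"
    r"Redis consumer process (?P<idx>\d+) (?P<event>started|finished)\s*$"
)


def consumer_lifetimes(log_text):
    """Return {consumer index: seconds between its 'started' and 'finished' lines}."""
    started = {}
    lifetimes = {}
    for line in log_text.splitlines():
        match = LINE_RE.match(line.strip())
        if not match:
            continue  # not a consumer start/finish line
        ts = datetime.strptime(match["ts"], "%Y-%m-%d %H:%M:%S,%f")
        idx = int(match["idx"])
        if match["event"] == "started":
            started[idx] = ts
        elif idx in started:
            lifetimes[idx] = (ts - started.pop(idx)).total_seconds()
    return lifetimes


def short_lived(lifetimes, threshold=5.0):
    """Return the consumer indexes that exited in under `threshold` seconds.

    The threshold is an arbitrary illustrative value, not a Lithops setting.
    """
    return sorted(idx for idx, secs in lifetimes.items() if secs < threshold)


# Consumer lines copied from the worker log posted in this thread.
SAMPLE_LOG = """\
2024-11-19 09:38:02,729 [INFO] lithops.standalone.worker:154 -- Redis consumer process 0 started
2024-11-19 09:38:02,730 [INFO] lithops.standalone.worker:154 -- Redis consumer process 1 started
2024-11-19 09:38:02,730 [INFO] lithops.standalone.worker:154 -- Redis consumer process 2 started
2024-11-19 09:38:02,731 [INFO] lithops.standalone.worker:154 -- Redis consumer process 3 started
2024-11-19 09:38:03,731 [INFO] lithops.standalone.worker:219 -- Redis consumer process 0 finished
2024-11-19 09:38:03,731 [INFO] lithops.standalone.worker:219 -- Redis consumer process 1 finished
2024-11-19 09:38:03,732 [INFO] lithops.standalone.worker:219 -- Redis consumer process 2 finished
2024-11-19 09:38:03,732 [INFO] lithops.standalone.worker:219 -- Redis consumer process 3 finished
"""

if __name__ == "__main__":
    lifetimes = consumer_lifetimes(SAMPLE_LOG)
    for idx, secs in sorted(lifetimes.items()):
        print(f"consumer {idx}: alive for {secs:.3f}s")
    print("short-lived consumers:", short_lived(lifetimes))
```

To run it against a real worker, pass the contents of /tmp/lithops-root/worker-service.log to `consumer_lifetimes` instead of `SAMPLE_LOG`. In the log above, every consumer exits about one second after starting.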
> The code that you changed, I replaced only with ec2 instance. Is this enough or should I change it on the host machine as well?
The Lithops code in the VM instance can be updated (synced) from the host. This means you may lose the changes you made in the VM, which is likely what happened here. Therefore, I recommend updating the code on the host machine and running lithops clean ... This way, the next time you run a job, the Lithops code from your host will be uploaded to the VM.
Hello!
From time to time (in 10-15% of cases) I see a situation where the EC2 instance starts in consume mode and the job is submitted to the work queue, but the worker does not see the job. Here is an example of logs of this situation for master:
for worker:
information from Redis:
If I restart the worker (sudo systemctl restart lithops-worker.service), I immediately see the job start being processed:
I don't know how to reproduce this problem, but I see it from time to time, so I can add additional logging if needed.