lithops-cloud / lithops

A multi-cloud framework for big data analytics and embarrassingly parallel jobs, that provides an universal API for building parallel applications in the cloud ☁️🚀
http://lithops.cloud
Apache License 2.0
319 stars 105 forks source link

Periodic problems with a job that does not start #1401

Open sergii-mamedov opened 3 weeks ago

sergii-mamedov commented 3 weeks ago

Hello!

From time to time (10-15% of cases) I see a situation when EC2 starts in consume mode, job submitted on the work queue, but the worker does not see this job. Here is an example of logs of this situation for master:

$ tail -f -n 100 /tmp/lithops-*/master-service.log
2024-10-31 11:04:42,186 [DEBUG] lithops.standalone.backends.aws_ec2.aws_ec2:58 -- Creating AWS EC2 client
2024-10-31 11:04:42,974 [INFO] lithops.standalone.backends.aws_ec2.aws_ec2:101 -- AWS EC2 client created - Region: eu-west-1
2024-10-31 11:04:42,975 [DEBUG] lithops.standalone.standalone:69 -- Standalone handler created successfully
2024-10-31 11:04:42,976 [DEBUG] lithops.standalone.keeper:53 -- Starting BudgetKeeper for sm-lithops-prod-032-new (172.31.40.50), instance ID: i-0dab07dbaf9476b7d
2024-10-31 11:04:42,976 [DEBUG] lithops.standalone.keeper:55 -- Delete sm-lithops-prod-032-new on dismantle: False
2024-10-31 11:04:42,976 [DEBUG] lithops.standalone.keeper:72 -- BudgetKeeper started
2024-10-31 11:04:42,976 [DEBUG] lithops.standalone.keeper:75 -- Auto dismantle activated - Soft timeout: 30s, Hard Timeout: 7200s
2024-10-31 11:04:42,976 [DEBUG] lithops.standalone.keeper:108 -- Time to dismantle: 7198 seconds
2024-10-31 11:04:42,976 [INFO] lithops.standalone.master:569 -- Starting job monitoring thread
2024-10-31 11:04:43,979 [DEBUG] lithops.standalone.master:592 -- ExecutorID: 8f3b98-11 | JObID: M000 - Tasks done: 1/1 - Completed!
2024-10-31 11:04:43,979 [DEBUG] lithops.standalone.master:592 -- ExecutorID: 4538f6-6 | JObID: M000 - Tasks done: 1/1 - Completed!
2024-10-31 11:04:44,635 [DEBUG] lithops.standalone.master:541 -- Received job 4538f6-11-M000
2024-10-31 11:04:44,636 [DEBUG] lithops.standalone.master:371 -- Going to setup 1 workers
2024-10-31 11:04:44,636 [DEBUG] lithops.standalone.backends.aws_ec2.aws_ec2:58 -- Creating AWS EC2 client
2024-10-31 11:04:44,643 [DEBUG] lithops.standalone.master:522 -- Job 4538f6-11-M000 correctly submitted to work queue 'wq:localhost:metaspace2020-metaspace-aws-ec2:3.5.1'
2024-10-31 11:04:45,077 [INFO] lithops.standalone.backends.aws_ec2.aws_ec2:101 -- AWS EC2 client created - Region: eu-west-1
2024-10-31 11:04:45,078 [DEBUG] lithops.standalone.standalone:69 -- Standalone handler created successfully
2024-10-31 11:04:45,079 [DEBUG] lithops.standalone.master:409 -- 1 of 1 workers started for work queue: wq:localhost:metaspace2020-metaspace-aws-ec2:3.5.1
2024-10-31 11:05:44,260 [DEBUG] lithops.standalone.keeper:108 -- Time to dismantle: 7140 seconds

for worker:

$ tail -f /tmp/lithops-root/worker-service.log 
2024-10-31 11:04:44,095 [INFO] lithops.standalone.worker:263 -- Starting Worker - Instace type: r6a.xlarge - Runtime name: metaspace2020/metaspace-aws-ec2:3.5.1 - Worker processes: 4
2024-10-31 11:04:44,096 [INFO] lithops.standalone.worker:154 -- Redis consumer process 0 started
2024-10-31 11:04:44,096 [INFO] lithops.standalone.worker:154 -- Redis consumer process 1 started
2024-10-31 11:04:44,096 [INFO] lithops.standalone.worker:154 -- Redis consumer process 2 started
2024-10-31 11:04:44,097 [INFO] lithops.standalone.worker:154 -- Redis consumer process 3 started
2024-10-31 11:04:45,097 [INFO] lithops.standalone.worker:219 -- Redis consumer process 0 finished
2024-10-31 11:04:45,097 [INFO] lithops.standalone.worker:219 -- Redis consumer process 1 finished
2024-10-31 11:04:45,097 [INFO] lithops.standalone.worker:219 -- Redis consumer process 2 finished
2024-10-31 11:04:45,098 [INFO] lithops.standalone.worker:219 -- Redis consumer process 3 finished
2024-10-31 11:04:45,099 [DEBUG] lithops.standalone.worker:289 -- Worker service finished

information from Redis:

>>> import redis
>>> client = redis.Redis()
>>> client.keys()
[b'job:8f3b98-11-M000', b'tasksdone:8f3b98-11-M000', b'tasksdone:4538f6-6-M000', b'worker:sm-lithops-prod-032-new', b'job:4538f6-11-M000', b'job:4538f6-6-M000', b'wq:localhost:metaspace2020-metaspace-aws-ec2:3.5.1']
>>> client.hget('job:4538f6-11-M000', 'status')
b'submitted'

If I restarted the worker sudo systemctl restart lithops-worker.service, I immediately see the beginning of processing the job:

$ tail -f /tmp/lithops-root/worker-service.log 
2024-10-31 11:04:45,097 [INFO] lithops.standalone.worker:219 -- Redis consumer process 1 finished
2024-10-31 11:04:45,097 [INFO] lithops.standalone.worker:219 -- Redis consumer process 2 finished
2024-10-31 11:04:45,098 [INFO] lithops.standalone.worker:219 -- Redis consumer process 3 finished
2024-10-31 11:04:45,099 [DEBUG] lithops.standalone.worker:289 -- Worker service finished
2024-10-31 11:17:54,146 [INFO] lithops.standalone.worker:263 -- Starting Worker - Instace type: r6a.xlarge - Runtime name: metaspace2020/metaspace-aws-ec2:3.5.1 - Worker processes: 4
2024-10-31 11:17:54,146 [INFO] lithops.standalone.worker:154 -- Redis consumer process 0 started
2024-10-31 11:17:54,146 [INFO] lithops.standalone.worker:154 -- Redis consumer process 1 started
2024-10-31 11:17:54,147 [INFO] lithops.standalone.worker:154 -- Redis consumer process 2 started
2024-10-31 11:17:54,147 [DEBUG] lithops.standalone.worker:183 -- ExecutorID 4538f6-11 | JobID M000 - Running CallID 00000 in the local worker (consumer 0)
2024-10-31 11:17:54,147 [INFO] lithops.standalone.worker:154 -- Redis consumer process 3 started
2024-10-31 11:19:03,524 [DEBUG] lithops.standalone.worker:213 -- ExecutorID 4538f6-11 | JobID M000 - CallID 00000 execution finished

I don't know how to reproduce this problem, but I see it from time to time, so I can add additional logging if needed.

JosepSampe commented 3 weeks ago

Hi @sergii-mamedov, thanks for the detailed description of the issue. I reviewed the code and identified a potential problem. I’ll work on fixing the logic, test it, and then submit a PR.

JosepSampe commented 2 weeks ago

@sergii-mamedov I added a fix for this

sergii-mamedov commented 1 week ago

@JosepSampe We may have fewer of these cases, but I still see this problem. Here is an example of logs:

$ tail /tmp/lithops-root/worker-service.log 
2024-11-19 09:38:02,729 [INFO] lithops.standalone.worker:263 -- Starting Worker - Instace type: r6a.xlarge - Runtime name: metaspace2020/metaspace-aws-ec2:3.5.1 - Worker processes: 4
2024-11-19 09:38:02,729 [INFO] lithops.standalone.worker:154 -- Redis consumer process 0 started
2024-11-19 09:38:02,730 [INFO] lithops.standalone.worker:154 -- Redis consumer process 1 started
2024-11-19 09:38:02,730 [INFO] lithops.standalone.worker:154 -- Redis consumer process 2 started
2024-11-19 09:38:02,731 [INFO] lithops.standalone.worker:154 -- Redis consumer process 3 started
2024-11-19 09:38:03,731 [INFO] lithops.standalone.worker:219 -- Redis consumer process 0 finished
2024-11-19 09:38:03,731 [INFO] lithops.standalone.worker:219 -- Redis consumer process 1 finished
2024-11-19 09:38:03,732 [INFO] lithops.standalone.worker:219 -- Redis consumer process 2 finished
2024-11-19 09:38:03,732 [INFO] lithops.standalone.worker:219 -- Redis consumer process 3 finished
2024-11-19 09:38:03,732 [DEBUG] lithops.standalone.worker:289 -- Worker service finished

and master

$ more /tmp/lithops-root/master-service.log 
2024-11-19 09:38:00,517 [DEBUG] lithops.standalone.backends.aws_ec2.aws_ec2:58 -- Creating AWS EC2 client
2024-11-19 09:38:01,431 [INFO] lithops.standalone.backends.aws_ec2.aws_ec2:101 -- AWS EC2 client created - Region: eu-west-1
2024-11-19 09:38:01,433 [DEBUG] lithops.standalone.standalone:69 -- Standalone handler created successfully
2024-11-19 09:38:01,433 [DEBUG] lithops.standalone.keeper:53 -- Starting BudgetKeeper for sm-lithops-prod-032-new (172.31.40.50), instance ID: i-0dab07dbaf9476b7d
2024-11-19 09:38:01,433 [DEBUG] lithops.standalone.keeper:55 -- Delete sm-lithops-prod-032-new on dismantle: False
2024-11-19 09:38:01,433 [DEBUG] lithops.standalone.keeper:72 -- BudgetKeeper started
2024-11-19 09:38:01,434 [DEBUG] lithops.standalone.keeper:75 -- Auto dismantle activated - Soft timeout: 30s, Hard Timeout: 7200s
2024-11-19 09:38:01,434 [DEBUG] lithops.standalone.keeper:108 -- Time to dismantle: 7198 seconds
2024-11-19 09:38:01,434 [INFO] lithops.standalone.master:569 -- Starting job monitoring thread
2024-11-19 09:38:02,436 [DEBUG] lithops.standalone.master:592 -- ExecutorID: e8c616-1 | JObID: M000 - Tasks done: 4/4 - Completed!
2024-11-19 09:38:02,436 [DEBUG] lithops.standalone.master:592 -- ExecutorID: 19b2d8-1 | JObID: M000 - Tasks done: 1/1 - Completed!
2024-11-19 09:38:02,437 [DEBUG] lithops.standalone.master:592 -- ExecutorID: c4030c-1 | JObID: M000 - Tasks done: 1/1 - Completed!
2024-11-19 09:38:03,097 [DEBUG] lithops.standalone.master:541 -- Received job f7800b-6-M000
2024-11-19 09:38:03,097 [DEBUG] lithops.standalone.master:371 -- Going to setup 1 workers
2024-11-19 09:38:03,097 [DEBUG] lithops.standalone.backends.aws_ec2.aws_ec2:58 -- Creating AWS EC2 client
2024-11-19 09:38:03,105 [DEBUG] lithops.standalone.master:522 -- Job f7800b-6-M000 correctly submitted to work queue 'wq:localhost:metaspace2020-metaspace-aws-ec2:3.5.1'
2024-11-19 09:38:03,546 [INFO] lithops.standalone.backends.aws_ec2.aws_ec2:101 -- AWS EC2 client created - Region: eu-west-1
2024-11-19 09:38:03,547 [DEBUG] lithops.standalone.standalone:69 -- Standalone handler created successfully
2024-11-19 09:38:03,547 [DEBUG] lithops.standalone.master:409 -- 1 of 1 workers started for work queue: wq:localhost:metaspace2020-metaspace-aws-ec2:3.5.1
2024-11-19 09:39:02,724 [DEBUG] lithops.standalone.keeper:108 -- Time to dismantle: 7140 seconds

P.S> The code that you changed, I replaced only with ec2 instance. Is this enough or should I change it on the host machine as well? P.S.S> @lmacielvieira JFYI

JosepSampe commented 4 days ago

Based on the worker log, it seems the Redis consumers stop immediately, which means the changes from my last PR are not applied properly.

The code that you changed, I replaced only with ec2 instance. Is this enough or should I change it on the host machine as well?

The Lithops code in the VM instance can be updated (synced) from the host. This means you may lose the changes you made in the VM, which likely happened here. Therefore, I recommend updating the code on the host machine, and run lithops clean .... this way, the next time you run a job, the lithops code from your host will be uploaded to the VM