Closed: kadisi closed this issue 1 month ago
@architkulkarni can you please triage?
@kadisi Does this happen every time, or only sometimes? I'm not sure if it's possible to get more information about why the actor died, since the error "The actor died unexpectedly before finishing this task." is very general (@jjyao do you know?), but if you could zip up and attach the logs, that would potentially be helpful.
The error "AttributeError: 'NoneType' object has no attribute 'status'" is a separate issue which I believe shouldn't actually cause the job to fail. The error message is the same as in https://github.com/ray-project/ray/issues/29632, so the root cause might be similar.
It happened every time.
@kadisi Thanks for the info. It would be helpful if you could share a zip of the logs from each node. By default they're in /tmp/ray/session_latest/logs on each node.
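In case it helps with collecting the logs, here is a minimal sketch that zips a node's log directory using only the Python standard library. The function name archive_ray_logs and the output name ray-logs are made up for illustration; the default path is the one mentioned above. It would need to be run once on each node.

```python
import shutil
from pathlib import Path


def archive_ray_logs(log_dir="/tmp/ray/session_latest/logs", out_base="ray-logs"):
    """Zip a Ray log directory and return the path of the created archive.

    `archive_ray_logs` and `out_base` are illustrative names, not part of Ray.
    """
    log_path = Path(log_dir)
    if not log_path.is_dir():
        raise FileNotFoundError(f"log directory not found: {log_path}")
    # make_archive appends ".zip" to out_base and returns the final file path.
    return shutil.make_archive(str(out_base), "zip", root_dir=str(log_path))


# Example usage (run on each node, then attach the resulting file):
#   print(archive_ray_logs(out_base="ray-logs-head"))
```

Keeping out_base outside the log directory avoids the archive trying to include itself.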
What happened + What you expected to happen
When I run:
RAY_ADDRESS='http://127.0.0.1:8265' ray job submit --working-dir . -- python demo.py
the output reports that the job failed.
Versions / Dependencies
ray: 2.7.1
ubuntu: 22.04 5.15.0-83-generi
python: 3.10.12
Driver Version: 545.23.06
CUDA Version: 12.3
aiohttp: 3.8.6
Reproduction script
I have 3 Ubuntu VMs:
head node (172.20.159.147):
ray start --head --num-cpus=0 --num-gpus=0 --node-ip-address=0.0.0.0 --dashboard-host=0.0.0.0 -v
worker node 1 (172.20.159.145):
ray start --address='172.20.159.147:6379'
worker node 2 (172.20.159.146):
ray start --address='172.20.159.147:6379'
On the head node, I ran these commands:
The contents of demo.py are:
There are some errors in /tmp/ray/session_latest/logs/dashboard.log:
But when I run demo.py directly on the head node, it works:
Issue Severity
None