Adria777 opened this issue 3 years ago
@Adria777 please update with the solution to this issue
The temp-dir and the NFS mount should not point to the same directory; that is why the problem occurs. The temp-dir is only used to collect worker logs, so we deleted "extra_params = {"temp-dir": "/zoo"}" in init_orca_context.
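For concreteness, a minimal sketch of the fix described above: simply drop the "temp-dir" entry so Ray falls back to its default local temp directory instead of the shared /zoo NFS mount. The cluster_mode value, master URL, image name and resource sizes below are illustrative placeholders, and the exact keyword arguments may differ across Analytics Zoo versions.

```python
from zoo.orca import init_orca_context

sc = init_orca_context(
    cluster_mode="k8s-client",                          # placeholder mode
    master="k8s://https://<k8s-apiserver>:<port>",      # placeholder master URL
    container_image="intelanalytics/hyper-zoo:latest",  # placeholder image
    num_nodes=4,
    cores=2,
    memory="10g",
    # extra_params={"temp-dir": "/zoo"}  # removed: /zoo is an NFS-backed mount
)
```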
A couple of comments:
1) When using init_orca_context, we expect the user to use python ... instead of spark-submit ... to launch the job for client mode? @hkvision
2) What do you mean by temp-dir and NFS not pointing to the same directory? It's not very clear to me why you would get a JSONDecodeError when that happens.
Yes. Except for yarn/k8s cluster mode, users are all recommended to run python directly.
Why is temp-dir pointing to the NFS rather than something like /tmp?
if we use "temp-dir": "/zoo" and /zoo is mounted to a nfs storage, multi-executors will write to the same physical folder, and this will cause conflicts, which is thrown as JSONDecodeError by raylet. to avoid this, option 1, don't mount temp-dir to shared storage. But if users need to debug ray logs on k8s, they may need to output logs to a shared storage since executor pod will be cleared very quickly and logs in it will be lost. option 2, mount temp-dir to shared storage, if we start ray with temp-dir and append some random info after temp-dir like /zoo/raytempXXX may help to resolve the conflicts. option 3, use k8s local storage volume, which will use emptyDir of k8s and is using ephemeral storage feature of Kubernetes(and do not persist beyond the life of the pod), we need to verify if this works.
option 1 is already added in docmument.
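A small sketch of the idea behind option 2, assuming we keep the shared /zoo mount but give every Ray node its own sub-directory; the helper below is purely illustrative and is not how RayOnSpark currently builds the path.

```python
import os
import socket
import uuid

def unique_ray_temp_dir(base="/zoo"):
    # Combine hostname, pid and a random token so two executors never end up
    # sharing the same temp-dir, even when /zoo is the same physical NFS folder.
    suffix = "raytemp-%s-%d-%s" % (socket.gethostname(), os.getpid(), uuid.uuid4().hex[:6])
    return os.path.join(base, suffix)

print(unique_ray_temp_dir())  # e.g. /zoo/raytemp-executor-1-4242-a1b2c3
```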
if we use "temp-dir": "/zoo" and /zoo is mounted to a nfs storage, multi-executors will write to the same physical folder, and this will cause conflicts, which is thrown as JSONDecodeError by raylet. to avoid this, option 1, don't mount temp-dir to shared storage. But if users need to debug ray logs on k8s, they may need to output logs to a shared storage since executor pod will be cleared very quickly and logs in it will be lost. option 2, mount temp-dir to shared storage, if we start ray with temp-dir and append some random info after temp-dir like /zoo/raytempXXX may help to resolve the conflicts. option 3, use k8s local storage volume, which will use emptyDir of k8s and is using ephemeral storage feature of Kubernetes(and do not persist beyond the life of the pod), we need to verify if this works.
option 1 is already added in docmument.
Seems option 2 is a better solution? @hkvision
And option 3 should be ruled out, for a strong reason: if multiple executors are scheduled on the same node, the same temp-dir will still cause conflicts even with k8s local storage.
I tried "/pyzoo/zoo/examples/orca/learn/horovod/pytorch_estimator.py" and changed the init_orca_context call to:
In this case num_nodes=4 and the spark-submit command is:
There is a JSON error:
If I change num-nodes from 4 to 2, it works.
The spark-submit command is:
There is no error if I do not use NFS and set num-nodes to 4.
It's really strange. Does NFS only support 2 nodes?