intel-analytics / analytics-zoo

Distributed Tensorflow, Keras and PyTorch on Apache Spark/Flink & Ray
https://analytics-zoo.readthedocs.io/
Apache License 2.0

json.decoder.JSONDecodeError when init orca with num-nodes=4 #281

Open Adria777 opened 3 years ago

Adria777 commented 3 years ago

I tried "/pyzoo/zoo/examples/orca/learn/horovod/pytorch_estimator.py" and changed the init_orca_context call to:

init_orca_context(cluster_mode="k8s", master="...",
                  container_image="...",
                  num_nodes=4, memory="50g", extra_executor_memory_for_ray="100g", cores=8,
                  conf={"spark.driver.host": "172.16.0.200",
                        "spark.driver.port": "54324",
                        "spark.kubernetes.executor.volumes.persistentVolumeClaim.nfsvolumeclaim.options.claimName": "nfsvolumeclaim",
                        "spark.kubernetes.executor.volumes.persistentVolumeClaim.nfsvolumeclaim.mount.path": "/zoo"},
                  extra_params={"temp-dir": "/zoo"})

In this case, with num_nodes=4, the spark-submit is:

Initializing orca context
Current pyspark location is : /opt/spark/python/lib/pyspark.zip/pyspark/__init__.py
Initializing SparkContext for k8s-client mode
pyspark_submit_args is: --master k8s://https://127.0.0.1:8443 --deploy-mode client --driver-cores 4 --driver-memory 1g --num-executors 4 --executor-cores 8 --executor-memory 50g --driver-class-path /opt/analytics-zoo-0.10.0-SNAPSHOT/lib/analytics-zoo-bigdl_0.12.1-spark_2.4.3-0.10.0-SNAPSHOT-jar-with-dependencies.jar pyspark-shell

There is a JSON error:

Start to launch ray on cluster
Traceback (most recent call last):
  File "lingqi_pytorch_estimator_4.py", line 122, in <module>
    train_example(workers_per_node=2)
  File "lingqi_pytorch_estimator_4.py", line 91, in train_example
    }, backend="horovod")
  File "/opt/analytics-zoo-0.10.0-SNAPSHOT/lib/analytics-zoo-bigdl_0.12.1-spark_2.4.3-0.10.0-SNAPSHOT-python-api.zip/zoo/orca/learn/pytorch/estimator.py", line 94, in from_torch
  File "/opt/analytics-zoo-0.10.0-SNAPSHOT/lib/analytics-zoo-bigdl_0.12.1-spark_2.4.3-0.10.0-SNAPSHOT-python-api.zip/zoo/orca/learn/pytorch/estimator.py", line 139, in __init__
  File "/opt/analytics-zoo-0.10.0-SNAPSHOT/lib/analytics-zoo-bigdl_0.12.1-spark_2.4.3-0.10.0-SNAPSHOT-python-api.zip/zoo/orca/learn/pytorch/pytorch_ray_estimator.py", line 108, in __init__
  File "/opt/analytics-zoo-0.10.0-SNAPSHOT/lib/analytics-zoo-bigdl_0.12.1-spark_2.4.3-0.10.0-SNAPSHOT-python-api.zip/zoo/ray/raycontext.py", line 390, in get
  File "/opt/analytics-zoo-0.10.0-SNAPSHOT/lib/analytics-zoo-bigdl_0.12.1-spark_2.4.3-0.10.0-SNAPSHOT-python-api.zip/zoo/ray/raycontext.py", line 479, in init
  File "/opt/analytics-zoo-0.10.0-SNAPSHOT/lib/analytics-zoo-bigdl_0.12.1-spark_2.4.3-0.10.0-SNAPSHOT-python-api.zip/zoo/ray/raycontext.py", line 508, in _start_cluster
  File "/opt/analytics-zoo-0.10.0-SNAPSHOT/lib/analytics-zoo-bigdl_0.12.1-spark_2.4.3-0.10.0-SNAPSHOT-python-api.zip/zoo/ray/process.py", line 113, in __init__
  File "/opt/analytics-zoo-0.10.0-SNAPSHOT/lib/analytics-zoo-bigdl_0.12.1-spark_2.4.3-0.10.0-SNAPSHOT-python-api.zip/zoo/ray/process.py", line 120, in print_ray_remote_err_out
Exception: node_ip: 172.30.91.5 tag: raylet, pgid: 101, pids: [], returncode: 1,                 master_addr: None,
 2021-05-14 06:40:27,868        INFO scripts.py:643 -- Local node IP: 172.30.91.5 Traceback (most recent call last):
  File "/root/miniconda3/bin/ray", line 8, in <module>
    sys.exit(main())
  File "/root/miniconda3/lib/python3.7/site-packages/ray/scripts/scripts.py", line 1519, in main
    return cli()
  File "/root/miniconda3/lib/python3.7/site-packages/click/core.py", line 829, in __call__
    return self.main(*args, **kwargs)
  File "/root/miniconda3/lib/python3.7/site-packages/click/core.py", line 782, in main
    rv = self.invoke(ctx)
  File "/root/miniconda3/lib/python3.7/site-packages/click/core.py", line 1259, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/root/miniconda3/lib/python3.7/site-packages/click/core.py", line 1066, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/root/miniconda3/lib/python3.7/site-packages/click/core.py", line 610, in invoke
    return callback(*args, **kwargs)
  File "/root/miniconda3/lib/python3.7/site-packages/ray/scripts/scripts.py", line 652, in start
    ray_params, head=False, shutdown_at_exit=block, spawn_reaper=block)
  File "/root/miniconda3/lib/python3.7/site-packages/ray/node.py", line 189, in __init__
    "metrics_agent_port", default_port=ray_params.metrics_agent_port)
  File "/root/miniconda3/lib/python3.7/site-packages/ray/node.py", line 592, in _get_cached_port
    ports_by_node.update(json.load(f))
  File "/root/miniconda3/lib/python3.7/json/__init__.py", line 296, in load
    parse_constant=parse_constant, object_pairs_hook=object_pairs_hook, **kw)
  File "/root/miniconda3/lib/python3.7/json/__init__.py", line 348, in loads
    return _default_decoder.decode(s)
  File "/root/miniconda3/lib/python3.7/json/decoder.py", line 337, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/root/miniconda3/lib/python3.7/json/decoder.py", line 355, in raw_decode
    raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)
Stopping orca context

If I change num_nodes from 4 to 2, it works:

init_orca_context(cluster_mode="k8s", master="...",
                  container_image="...",
                  num_nodes=2, memory="50g", extra_executor_memory_for_ray="100g", cores=8,
                  conf={"spark.driver.host": "172.16.0.200",
                        "spark.driver.port": "54324",
                        "spark.kubernetes.executor.volumes.persistentVolumeClaim.nfsvolumeclaim.options.claimName": "nfsvolumeclaim",
                        "spark.kubernetes.executor.volumes.persistentVolumeClaim.nfsvolumeclaim.mount.path": "/zoo"},
                  extra_params={"temp-dir": "/zoo"})

The spark-submit is:

Initializing orca context
Current pyspark location is : /opt/spark/python/lib/pyspark.zip/pyspark/__init__.py
Initializing SparkContext for k8s-client mode
pyspark_submit_args is: --master k8s://https://127.0.0.1:8443 --deploy-mode client --driver-cores 4 --driver-memory 1g --num-executors 2 --executor-cores 8 --executor-memory 50g --driver-class-path /opt/analytics-zoo-0.10.0-SNAPSHOT/lib/analytics-zoo-bigdl_0.12.1-spark_2.4.3-0.10.0-SNAPSHOT-jar-with-dependencies.jar pyspark-shell

There is no error if I do not use NFS and set num_nodes to 4:

init_orca_context(cluster_mode="k8s", master="...",
                  container_image="...",
                  num_nodes=4, memory="50g", extra_executor_memory_for_ray="100g", cores=8,
                  conf={"spark.driver.host": "172.16.0.200",
                        "spark.driver.port": "54324"},
                  extra_params={"temp-dir": "/zoo"})

It's really strange. Does NFS only support 2 nodes?

jason-dai commented 3 years ago

@Adria777 please update with the solution to this issue

Adria777 commented 3 years ago

The temp-dir and the NFS mount should not point to the same directory; that is why the problem occurred. The temp-dir is only used to fetch the worker logs, so we removed extra_params={"temp-dir": "/zoo"} from init_orca_context.
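
For reference, the working call keeps everything else the same and only drops the temp-dir override (a sketch based on the configuration above; master and container_image are still masked):

from zoo.orca import init_orca_context

# Same settings as before, but without extra_params={"temp-dir": "/zoo"},
# so Ray falls back to its default temp directory on each executor instead
# of writing into the shared NFS mount.
init_orca_context(cluster_mode="k8s", master="...",
                  container_image="...",
                  num_nodes=4, memory="50g", extra_executor_memory_for_ray="100g", cores=8,
                  conf={"spark.driver.host": "172.16.0.200",
                        "spark.driver.port": "54324",
                        "spark.kubernetes.executor.volumes.persistentVolumeClaim.nfsvolumeclaim.options.claimName": "nfsvolumeclaim",
                        "spark.kubernetes.executor.volumes.persistentVolumeClaim.nfsvolumeclaim.mount.path": "/zoo"})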

jason-dai commented 3 years ago

A couple of comments: 1) When using init_orca_context, we expect the user to use python ... instead of spark-submit ... to launch the job for client mode? @hkvision

2) What do you mean by temp-dir and NFS not pointing to the same directory? It's not very clear to me why you would get a JSONDecodeError when that happens.

hkvision commented 3 years ago

Yes. Except for yarn/k8s cluster mode, users are recommended to launch the job with python directly.

hkvision commented 3 years ago

Why is temp-dir pointing to the NFS instead of something like /tmp?

glorysdj commented 3 years ago

if we use "temp-dir": "/zoo" and /zoo is mounted to a nfs storage, multi-executors will write to the same physical folder, and this will cause conflicts, which is thrown as JSONDecodeError by raylet. to avoid this, option 1, don't mount temp-dir to shared storage. But if users need to debug ray logs on k8s, they may need to output logs to a shared storage since executor pod will be cleared very quickly and logs in it will be lost. option 2, mount temp-dir to shared storage, if we start ray with temp-dir and append some random info after temp-dir like /zoo/raytempXXX may help to resolve the conflicts. option 3, use k8s local storage volume, which will use emptyDir of k8s and is using ephemeral storage feature of Kubernetes(and do not persist beyond the life of the pod), we need to verify if this works.

option 1 is already added in docmument.
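
As a rough illustration of option 2 (a hypothetical sketch, not the actual RayContext code; the helper name, arguments, and exact ray start flags are assumptions), each executor could append a random suffix to the user-supplied temp-dir before launching its raylet, so concurrent raylets never share Ray's metadata files (such as the ports file whose json.load fails in the traceback above):

import subprocess
import uuid

def start_raylet_with_unique_temp_dir(base_temp_dir, redis_address, redis_password):
    # Hypothetical helper: give every raylet its own random sub-directory
    # under the shared mount, e.g. /zoo/raytemp-1a2b3c4d, so executors
    # sharing the NFS no longer overwrite each other's files.
    temp_dir = "{}/raytemp-{}".format(base_temp_dir, uuid.uuid4().hex[:8])
    command = ["ray", "start",
               "--address={}".format(redis_address),
               "--redis-password={}".format(redis_password),
               "--temp-dir={}".format(temp_dir)]
    return subprocess.Popen(command), temp_dir

The per-executor logs would then land in distinct raytemp-* folders on the NFS mount, which also keeps them available for debugging after the executor pods are gone.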

jason-dai commented 3 years ago

if we use "temp-dir": "/zoo" and /zoo is mounted to a nfs storage, multi-executors will write to the same physical folder, and this will cause conflicts, which is thrown as JSONDecodeError by raylet. to avoid this, option 1, don't mount temp-dir to shared storage. But if users need to debug ray logs on k8s, they may need to output logs to a shared storage since executor pod will be cleared very quickly and logs in it will be lost. option 2, mount temp-dir to shared storage, if we start ray with temp-dir and append some random info after temp-dir like /zoo/raytempXXX may help to resolve the conflicts. option 3, use k8s local storage volume, which will use emptyDir of k8s and is using ephemeral storage feature of Kubernetes(and do not persist beyond the life of the pod), we need to verify if this works.

option 1 is already added in docmument.

Seems option 2 is a better solution? @hkvision

glorysdj commented 3 years ago

And option 3 should be ruled out for a strong reason: if multiple executors are scheduled on the same node, even with k8s local storage, the same temp-dir will still cause conflicts.