hkvision opened this issue 2 years ago
Using spark-submit also gives the same error:
bin/spark-submit \
--master k8s://https://A.B.C.D:X \
--deploy-mode client \
--conf spark.driver.host=172.16.0.170 \
--conf spark.driver.port=54321 \
--conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \
--conf spark.kubernetes.container.image=10.239.45.10/arda/intelanalytics/bigdl-k8s-spark-3.1.2:0.14.0-SNAPSHOT \
--conf spark.executor.instances=1 \
--conf spark.kubernetes.driver.volumes.persistentVolumeClaim.nfsvolumeclaim.options.claimName=nfsvolumeclaim \
--conf spark.kubernetes.driver.volumes.persistentVolumeClaim.nfsvolumeclaim.mount.path=/bigdl2.0/data \
--conf spark.kubernetes.executor.volumes.persistentVolumeClaim.nfsvolumeclaim.options.claimName=nfsvolumeclaim \
--conf spark.kubernetes.executor.volumes.persistentVolumeClaim.nfsvolumeclaim.mount.path=/bigdl2.0/data \
--conf spark.kubernetes.executor.deleteOnTermination=false \
--conf spark.kubernetes.driverEnv.http_proxy=http://A.B.C.D:X \
--conf spark.kubernetes.driverEnv.https_proxy=http://A.B.C.D:X \
--conf spark.kubernetes.executorEnv.http_proxy=http://A.B.C.D:X \
--conf spark.kubernetes.executorEnv.https_proxy=http://A.B.C.D:X \
--conf spark.kubernetes.container.image.pullPolicy=Always \
--conf spark.pyspark.driver.python=/root/anaconda3/envs/orca-demo/bin/python \
--conf spark.pyspark.python=/usr/local/envs/pytf1/bin/python \
--conf spark.executorEnv.PYTHONHOME=/usr/local/envs/pytf1 \
--executor-cores 4 \
--executor-memory 50g \
--total-executor-cores 16 \
--driver-cores 4 \
--driver-memory 50g \
--properties-file /opt/bigdl-0.14.0-SNAPSHOT/conf/spark-bigdl.conf \
--conf spark.driver.extraJavaOptions=-Dderby.stream.error.file=/tmp \
--conf spark.sql.catalogImplementation='in-memory' \
--conf spark.driver.extraClassPath=local:///opt/bigdl-0.14.0-SNAPSHOT/jars/* \
--conf spark.executor.extraClassPath=local:///opt/bigdl-0.14.0-SNAPSHOT/jars/* \
/root/kai/ncf_orca.py
Rolling back the docker image to 1122 also fails. Tried on 188 and 170, and both get the same error.
basic_text_classification can run successfully. After further testing, the issue occurs when the embedding is (relatively) large: for the basic text classification example the vocab size is 10000, and if I change it to 100000 it fails... Enlarging the memory doesn't resolve this issue. The error is raised from https://github.com/intel-analytics/BigDL/blob/0495db13f1ab6ab38e2cc650ea4f2617bae8b688/scala/dllib/src/main/scala/com/intel/analytics/bigdl/dllib/keras/models/Topology.scala#L1471 when calling getModel after training:
(0 until extraParamLength).foreach(i =>
  extraState(i) = models.map(_.localModels.head.getExtraParameter()(i)).first()
)
Seems the block manager gets killed when trying to fetch the extra parameters?
The only task error I can find in the web UI is TaskResultLost (result lost from block manager).
Any ideas? @qiuxin2012 @yangw1234 @jason-dai
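For context, a minimal standalone Spark sketch (not BigDL code; the RDD contents and the 100000 x 16 size are made up for illustration) of the path the quoted getModel line exercises: first() runs a job on one partition and ships that task's entire result back to the driver, so the result size grows with the embedding size.

import org.apache.spark.{SparkConf, SparkContext}

object LargeTaskResultSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("large-task-result-sketch")
      .setIfMissing("spark.master", "local[2]")
    val sc = new SparkContext(conf)

    // One "model" per partition, standing in for the cached local models.
    val models = sc.parallelize(0 until 4, 4).cache()

    // first() runs a job on a single partition and returns that task's whole
    // result to the driver. If the serialized result is larger than
    // spark.task.maxDirectResultSize, the executor writes it to its
    // BlockManager and the driver has to fetch it over the network; a failed
    // fetch shows up as TaskResultLost.
    val extraParam: Array[Float] = models
      .map(_ => new Array[Float](100000 * 16)) // stand-in for a large extra parameter
      .first()
    println(s"fetched ${extraParam.length} floats")

    sc.stop()
  }
}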
Running in cluster mode from remote is successful:
bin/spark-submit \
--master k8s://https://172.16.0.200:6443 \
--deploy-mode cluster \
--conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \
--name bigdl2-basic_text_classification \
--conf spark.kubernetes.container.image=10.239.45.10/arda/intelanalytics/bigdl-k8s-spark-3.1.2:0.14.0-SNAPSHOT \
--conf spark.executor.instances=1 \
--conf spark.kubernetes.driver.volumes.persistentVolumeClaim.nfsvolumeclaim.options.claimName=nfsvolumeclaim \
--conf spark.kubernetes.driver.volumes.persistentVolumeClaim.nfsvolumeclaim.mount.path=/bigdl2.0/data \
--conf spark.kubernetes.executor.volumes.persistentVolumeClaim.nfsvolumeclaim.options.claimName=nfsvolumeclaim \
--conf spark.kubernetes.executor.volumes.persistentVolumeClaim.nfsvolumeclaim.mount.path=/bigdl2.0/data \
--conf spark.kubernetes.driver.label.az=true \
--conf spark.kubernetes.executor.label.az=true \
--conf spark.kubernetes.node.selector.spark=true \
--conf spark.kubernetes.driverEnv.http_proxy=http://child-prc.intel.com:913 \
--conf spark.kubernetes.driverEnv.https_proxy=http://child-prc.intel.com:913 \
--conf spark.kubernetes.executorEnv.http_proxy=http://child-prc.intel.com:913 \
--conf spark.kubernetes.executorEnv.https_proxy=http://child-prc.intel.com:913 \
--conf spark.kubernetes.container.image.pullPolicy=Always \
--conf spark.pyspark.driver.python=/usr/local/envs/pytf1/bin/python \
--conf spark.pyspark.python=/usr/local/envs/pytf1/bin/python \
--conf spark.executorEnv.PYTHONHOME=/usr/local/envs/pytf1 \
--executor-cores 4 \
--executor-memory 50g \
--total-executor-cores 16 \
--driver-cores 4 \
--driver-memory 50g \
--properties-file /opt/bigdl-0.14.0-SNAPSHOT/conf/spark-bigdl.conf \
--py-files local:///opt/bigdl-0.14.0-SNAPSHOT/python/bigdl-spark_3.1.2-0.14.0-SNAPSHOT-python-api.zip,local:///bigdl2.0/data/kai/ncf_orca.py \
--conf spark.driver.extraJavaOptions=-Dderby.stream.error.file=/tmp \
--conf spark.sql.catalogImplementation='in-memory' \
--conf spark.driver.extraClassPath=local:///opt/bigdl-0.14.0-SNAPSHOT/jars/* \
--conf spark.executor.extraClassPath=local:///opt/bigdl-0.14.0-SNAPSHOT/jars/* \
local:///bigdl2.0/data/kai/ncf_orca.py
1) Looks to me like a memory issue; which memory did you increase?
2) Can you try running client mode on the master node?
3) Is there a way to monitor the status of each executor node and the driver node?
4) We probably shouldn't use first:
* @note This method should only be used if the resulting array is expected to be small, as
* all the data is loaded into the driver's memory.
- I increased both the driver memory and the executor memory to 50G and 120G, but it still doesn't work. I doubt it is related to memory... the memory issue is probably not the root cause, since an embedding of this size isn't that large...
Is it related to K8s memory config for the containers?
- I tried on Almaren-200 and it works. It also works in cluster mode from remote.
In both cases, the driver is running in the K8s cluster, yes?
- Need some investigation.
- I will check the code. At the same time, @pinggao187 is setting up a new cluster for testing.
Is it related to K8s memory config for the containers?
There is no memory constraint from the cluster, according to @pinggao187.
In both cases, the driver is running in the K8s cluster, yes?
Yes... So I suspect the issue is related to the internal IP (same as RayOnSpark), but I don't know why basic text classification is successful and why I succeeded previously...
For 4, I checked the code: optimizer.optimize will return the trained model, so not only is first called for the extra parameters, but reduce is also called for the weights and gradients. Trying dllib Keras also gives this error.
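A similar sketch for the reduce path (reusing the hypothetical sc and models from the sketch above): each task's partial result, here a full weight-sized array, travels through the same task-result path, so it is subject to the same spark.task.maxDirectResultSize threshold before the final merge happens on the driver.

// Continuing the sketch above: reduce returns one partial result per task to
// the driver, where the final combine happens.
val summedWeights: Array[Float] = models
  .map(_ => new Array[Float](100000 * 16)) // stand-in for a weight/gradient tensor
  .reduce { (a, b) =>
    var i = 0
    while (i < a.length) { a(i) += b(i); i += 1 }
    a
  }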
Yes... So I suspect the issue is related to the internal IP (same as RayOnSpark), but I don't know why basic text classification is successful and why I succeeded previously...
If the result of a task is small, it will be sent directly back to the driver; otherwise, it will be stored in the BlockManager and the driver needs to fetch it from there. Therefore the behavior differs once the embedding size becomes larger.
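To make that concrete, a rough back-of-the-envelope check, assuming 32-bit floats, an illustrative embedding dimension of 16, and Spark's default spark.task.maxDirectResultSize of 1 MiB (real serialized results carry extra overhead on top of this):

// Illustrative sizes only.
val bytesPerFloat = 4L
val dim = 16L
val smallVocab = 10000L  // the basic text classification default
val largeVocab = 100000L // the failing configuration

val smallResult = smallVocab * dim * bytesPerFloat //   640,000 bytes
val largeResult = largeVocab * dim * bytesPerFloat // 6,400,000 bytes

val defaultMaxDirectResultSize = 1L << 20          // 1,048,576 bytes
println(smallResult <= defaultMaxDirectResultSize) // true  -> sent back directly
println(largeResult <= defaultMaxDirectResultSize) // false -> stored in the BlockManager

Under these assumptions the 10000-word vocab stays on the direct path, while 100000 has to go through the BlockManager, which matches the behavior described above.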
After setting "spark.task.maxDirectResultSize": "100000000"
in conf, can run on the small dataset with ~20w items. But I think this is just a workaround? The main issue is why the data in the BlockManager gets lost unexpectedly?
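For reference, a sketch of how that workaround can also be applied programmatically before the SparkContext is created (equivalent to passing --conf spark.task.maxDirectResultSize=100000000 to spark-submit, which is what was done here):

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.task.maxDirectResultSize", "100000000") // allow ~100 MB task results to be sent back directly
// spark.driver.maxResultSize (default 1g) still caps the total result size the
// driver will accept for a single job.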
@jason-dai You are right. Now the root cause is that the driver can't connect to the block manager. Previously I succeeded on the ml-1m dataset, which only has 3000+ items... while the ml-latest-small dataset has far fewer records but actually far more items... Sorry that I overlooked this... I tried collecting some data to the driver and it gives the same error as well. Looking into how to solve this issue.
This issue should be the same as #3605
Run on Almaren-Node-188:
cd kai
conda activate orca-demo
export PYSPARK_PYTHON=/usr/local/envs/pytf1/bin/python
python ncf_orca.py
Get the error below when the training is finished:
Seems 172.30.96.12 is an internal IP. Previously there was no error; not sure what has happened to the docker image and the k8s cluster during the past week. cc @pinggao18
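For anyone hitting the same symptom: in client mode two directions have to be reachable, and the fetch that fails here is the driver reaching the executor pods' block managers on their internal pod IPs (e.g. 172.30.96.12). A sketch of the knobs involved, using the driver host/port from the client-mode command above; the two block-manager ports are hypothetical picks, not values from this issue:

import org.apache.spark.SparkConf

// executors -> driver: spark.driver.host / spark.driver.port / spark.driver.blockManager.port
// driver -> executors: the executor pod IPs on spark.blockManager.port, which is
//                      where an IndirectTaskResult is fetched from.
val conf = new SparkConf()
  .set("spark.driver.host", "172.16.0.170")
  .set("spark.driver.port", "54321")
  .set("spark.driver.blockManager.port", "54322") // hypothetical fixed port
  .set("spark.blockManager.port", "7777")         // hypothetical fixed port on the executors

If the driver sits outside the pod network, those pod IPs are typically not routable from it, which would explain the TaskResultLost seen here and why cluster mode (driver inside the cluster) works.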