intel-analytics / analytics-zoo

Distributed Tensorflow, Keras and PyTorch on Apache Spark/Flink & Ray
https://analytics-zoo.readthedocs.io/
Apache License 2.0

tf2 estimator failed with horovod backend if data_creator is tf.data.Dataset from generator #168

Open jenniew opened 3 years ago

jenniew commented 3 years ago

TF2Estimator fails with the horovod backend if the data_creator returns a tf.data.Dataset built from a generator. The error is as below:

2021-07-21 04:17:10,169 WARNING worker.py:1107 -- A worker died or was killed while executing task ffffffffffffffff63964fa4841d4a2ecb45751801000000.
Traceback (most recent call last):
  File "wnd_train_tf2_generator_horovod.py", line 449, in <module>
    validation_steps=test_steps)
  File "/root/anaconda3/envs/py37-horovod-tf/lib/python3.7/site-packages/zoo/orca/learn/tf2/estimator.py", line 257, in fit
    for i in range(self.num_workers)])
  File "/root/anaconda3/envs/py37-horovod-tf/lib/python3.7/site-packages/ray/_private/client_mode_hook.py", line 47, in wrapper
    return func(*args, **kwargs)
  File "/root/anaconda3/envs/py37-horovod-tf/lib/python3.7/site-packages/ray/worker.py", line 1458, in get
    raise value
ray.exceptions.RayActorError: The actor died unexpectedly before finishing this task. Check python-core-worker-*.log files for more information.
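For reference, the failing pattern is roughly the following (a minimal sketch: the 10-feature binary-classification generator and the (config, batch_size) creator signature are assumptions for illustration, not taken from the actual training script):

import numpy as np
import tensorflow as tf

def train_data_creator(config, batch_size):
    # hypothetical generator yielding (features, label) pairs
    def gen():
        for _ in range(1000):
            yield np.random.rand(10).astype("float32"), np.random.randint(0, 2)

    # TF 2.3 uses output_types/output_shapes (output_signature arrived in 2.4)
    dataset = tf.data.Dataset.from_generator(
        gen,
        output_types=(tf.float32, tf.int32),
        output_shapes=((10,), ()),
    )
    return dataset.batch(batch_size)

Passing such a creator to TF2Estimator.fit with the horovod backend is what kills the worker actors above.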

yangw1234 commented 3 years ago

Would you mind providing more information?

  1. What is the code to reproduce the problem?
  2. What software versions are you using, including analytics-zoo, horovod, and tensorflow?
  3. What cluster configuration are you using, e.g. memory, number of cores, number of nodes?
jenniew commented 3 years ago

1. The code is https://github.com/jenniew/friesian/blob/wnd_train_twitter/Training/WideDeep/twitter/wnd_train_tf2_generator_horovod.py
2. tensorflow 2.3.0, latest zoo, horovod 0.19.2, ray 1.2.0
3. driver_cores: 10, driver_memory: 30g, num_executor: 8, executor_cores: 10, executor_memory: 30g
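For context, that cluster configuration corresponds to an orca context along these lines (a hedged sketch; the cluster_mode value and exact keyword names are assumptions about the analytics-zoo API, not taken from the script):

from zoo.orca import init_orca_context

# hypothetical YARN setup mirroring the reported resources:
# 8 executors with 10 cores / 30g each, plus a 10-core / 30g driver
sc = init_orca_context(cluster_mode="yarn-client",
                       num_nodes=8, cores=10, memory="30g",
                       driver_cores=10, driver_memory="30g")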
yangw1234 commented 3 years ago

> 1. The code is https://github.com/jenniew/friesian/blob/wnd_train_twitter/Training/WideDeep/twitter/wnd_train_tf2_generator_horovod.py
> 2. tensorflow 2.3.0, latest zoo, horovod 0.19.2, ray 1.2.0
> 3. driver_cores: 10, driver_memory: 30g, num_executor: 8, executor_cores: 10, executor_memory: 30g

Thanks, we will try to reproduce.

yangw1234 commented 3 years ago

@jenniew where can we find the data?

yangw1234 commented 3 years ago

@leonardozcm could you help me take a look at this issue?

jenniew commented 3 years ago

You may use the data: hdfs://172.16.0.105:8020/user/root/jwang/wnd_twitter_2

leonardozcm commented 3 years ago

> You may use the data: hdfs://172.16.0.105:8020/user/root/jwang/wnd_twitter_2

I will take a look. Since it's a private repo, would you mind giving me permission?

jenniew commented 3 years ago

> Since it's a private repo, would you mind giving me permission?

Yes, I have already sent you an invitation.

leonardozcm commented 3 years ago

> Yes, I have already sent you an invitation.

OK, thanks a lot.

leonardozcm commented 3 years ago

tensorflow.python.framework.errors_impl.InvalidArgumentError: Received a label value of 1 which is outside the valid range of [0, 1). Label values: 1 1 0 1 1 1

Am I missing something with the softmax?

leonardozcm commented 3 years ago

After changing the loss function to binary_crossentropy, I was not able to reproduce this issue:

9/13632 [..............................] - ETA: 5:47:25 - loss: 0.6880 - accuracy: 0.5741
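The range error above is consistent with pairing sparse_categorical_crossentropy with a single-unit output: the loss then treats the output as a one-class softmax, so only label 0 is in range and any label of 1 fails. A minimal sketch of the corrected compile step (the 10-dim input and single sigmoid unit are placeholders, not the actual Wide & Deep model):

import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Dense(1, activation="sigmoid", input_shape=(10,)),
])
# binary_crossentropy matches a single sigmoid output with 0/1 labels
model.compile(optimizer="adam",
              loss="binary_crossentropy",
              metrics=["accuracy"])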

leonardozcm commented 3 years ago
(pid=157408, ip=172.16.0.146) Global rank:  6
(pid=157408, ip=172.16.0.146) Total workers:  8
(pid=157408, ip=172.16.0.146) Number of files for worker:  8
(pid=157408, ip=172.16.0.146) Data size for worker:  671325
(pid=157408, ip=172.16.0.146) Loading hdfs://172.16.0.105:8020/user/root/jwang/wnd_twitter_2/train_parquet/part-00042-cbd17f77-8da4-45c7-9031-919a6d619098-c000.snappy.parquet of size 90332
(pid=193705, ip=172.16.0.148) Global rank:  1
(pid=193705, ip=172.16.0.148) Total workers:  8
(pid=193705, ip=172.16.0.148) Number of files for worker:  8
(pid=193705, ip=172.16.0.148) Data size for worker:  751492
(pid=193705, ip=172.16.0.148) Loading hdfs://172.16.0.105:8020/user/root/jwang/wnd_twitter_2/train_parquet/part-00007-cbd17f77-8da4-45c7-9031-919a6d619098-c000.snappy.parquet of size 95630
(pid=57625, ip=172.16.0.159) Global rank:  2
(pid=57625, ip=172.16.0.159) Total workers:  8
(pid=57625, ip=172.16.0.159) Number of files for worker:  8
(pid=57625, ip=172.16.0.159) Data size for worker:  741991
(pid=57625, ip=172.16.0.159) Loading hdfs://172.16.0.105:8020/user/root/jwang/wnd_twitter_2/train_parquet/part-00014-cbd17f77-8da4-45c7-9031-919a6d619098-c000.snappy.parquet of size 94079
(pid=195572, ip=172.16.0.141) Global rank:  0
(pid=195572, ip=172.16.0.141) Total workers:  8
(pid=195572, ip=172.16.0.141) Number of files for worker:  8
(pid=195572, ip=172.16.0.141) Data size for worker:  769372
(pid=195572, ip=172.16.0.141) Loading hdfs://172.16.0.105:8020/user/root/jwang/wnd_twitter_2/train_parquet/part-00000-cbd17f77-8da4-45c7-9031-919a6d619098-c000.snappy.parquet of size 99895
(pid=150683, ip=172.16.0.129) Global rank:  3
(pid=150683, ip=172.16.0.129) Total workers:  8
(pid=150683, ip=172.16.0.129) Number of files for worker:  8
(pid=150683, ip=172.16.0.129) Data size for worker:  734900
(pid=150683, ip=172.16.0.129) Loading hdfs://172.16.0.105:8020/user/root/jwang/wnd_twitter_2/train_parquet/part-00021-cbd17f77-8da4-45c7-9031-919a6d619098-c000.snappy.parquet of size 92860
(pid=57625, ip=172.16.0.159) wnd_train_tf2_generator_horovod.py:156: FutureWarning: pyarrow.hdfs.connect is deprecated as of 2.0.0, please use pyarrow.fs.HadoopFileSystem instead.
(pid=193705, ip=172.16.0.148) wnd_train_tf2_generator_horovod.py:156: FutureWarning: pyarrow.hdfs.connect is deprecated as of 2.0.0, please use pyarrow.fs.HadoopFileSystem instead.
(pid=57625, ip=172.16.0.159) Global rank:  2
(pid=57625, ip=172.16.0.159) Total workers:  8
(pid=57625, ip=172.16.0.159) Number of files for worker:  8
(pid=57625, ip=172.16.0.159) Data size for worker:  39095
(pid=57625, ip=172.16.0.159) Loading hdfs://172.16.0.105:8020/user/root/jwang/wnd_twitter_2/test_parquet/part-00014-a19e97e6-903f-4c9c-90f2-ba2327ba4171-c000.snappy.parquet of size 4952
(pid=157408, ip=172.16.0.146) wnd_train_tf2_generator_horovod.py:156: FutureWarning: pyarrow.hdfs.connect is deprecated as of 2.0.0, please use pyarrow.fs.HadoopFileSystem instead.
(pid=175321, ip=172.16.0.109) Global rank:  5
(pid=175321, ip=172.16.0.109) Total workers:  8
(pid=175321, ip=172.16.0.109) Number of files for worker:  8
(pid=175321, ip=172.16.0.109) Data size for worker:  721226
(pid=175321, ip=172.16.0.109) Loading hdfs://172.16.0.105:8020/user/root/jwang/wnd_twitter_2/train_parquet/part-00035-cbd17f77-8da4-45c7-9031-919a6d619098-c000.snappy.parquet of size 91049
(pid=193705, ip=172.16.0.148) Global rank:  1
(pid=193705, ip=172.16.0.148) Total workers:  8
(pid=193705, ip=172.16.0.148) Number of files for worker:  8
(pid=193705, ip=172.16.0.148) Data size for worker:  39777
(pid=193705, ip=172.16.0.148) Loading hdfs://172.16.0.105:8020/user/root/jwang/wnd_twitter_2/test_parquet/part-00007-a19e97e6-903f-4c9c-90f2-ba2327ba4171-c000.snappy.parquet of size 4986
(pid=157408, ip=172.16.0.146) Global rank:  6
(pid=157408, ip=172.16.0.146) Total workers:  8
(pid=157408, ip=172.16.0.146) Number of files for worker:  8
(pid=157408, ip=172.16.0.146) Data size for worker:  35231
(pid=157408, ip=172.16.0.146) Loading hdfs://172.16.0.105:8020/user/root/jwang/wnd_twitter_2/test_parquet/part-00042-a19e97e6-903f-4c9c-90f2-ba2327ba4171-c000.snappy.parquet of size 4724
(pid=150683, ip=172.16.0.129) wnd_train_tf2_generator_horovod.py:156: FutureWarning: pyarrow.hdfs.connect is deprecated as of 2.0.0, please use pyarrow.fs.HadoopFileSystem instead.
(pid=150683, ip=172.16.0.129) Global rank:  3
(pid=150683, ip=172.16.0.129) Total workers:  8
(pid=150683, ip=172.16.0.129) Number of files for worker:  8
(pid=150683, ip=172.16.0.129) Data size for worker:  38422
(pid=150683, ip=172.16.0.129) Loading hdfs://172.16.0.105:8020/user/root/jwang/wnd_twitter_2/test_parquet/part-00021-a19e97e6-903f-4c9c-90f2-ba2327ba4171-c000.snappy.parquet of size 5037
(pid=195572, ip=172.16.0.141) wnd_train_tf2_generator_horovod.py:156: FutureWarning: pyarrow.hdfs.connect is deprecated as of 2.0.0, please use pyarrow.fs.HadoopFileSystem instead.
(pid=195572, ip=172.16.0.141) Global rank:  0
(pid=195572, ip=172.16.0.141) Total workers:  8
(pid=195572, ip=172.16.0.141) Number of files for worker:  8
(pid=195572, ip=172.16.0.141) Data size for worker:  40863
(pid=195572, ip=172.16.0.141) Loading hdfs://172.16.0.105:8020/user/root/jwang/wnd_twitter_2/test_parquet/part-00000-a19e97e6-903f-4c9c-90f2-ba2327ba4171-c000.snappy.parquet of size 5307
(pid=195572, ip=172.16.0.141) Epoch 1/2
(pid=175321, ip=172.16.0.109) wnd_train_tf2_generator_horovod.py:156: FutureWarning: pyarrow.hdfs.connect is deprecated as of 2.0.0, please use pyarrow.fs.HadoopFileSystem instead.
(pid=175321, ip=172.16.0.109) Global rank:  5
(pid=175321, ip=172.16.0.109) Total workers:  8
(pid=175321, ip=172.16.0.109) Number of files for worker:  8
(pid=175321, ip=172.16.0.109) Data size for worker:  37783
(pid=175321, ip=172.16.0.109) Loading hdfs://172.16.0.105:8020/user/root/jwang/wnd_twitter_2/test_parquet/part-00035-a19e97e6-903f-4c9c-90f2-ba2327ba4171-c000.snappy.parquet of size 4817
(pid=135898, ip=172.16.0.117) Global rank:  7
(pid=135898, ip=172.16.0.117) Total workers:  8
(pid=135898, ip=172.16.0.117) Number of files for worker:  7
(pid=135898, ip=172.16.0.117) Data size for worker:  621284
(pid=135898, ip=172.16.0.117) Loading hdfs://172.16.0.105:8020/user/root/jwang/wnd_twitter_2/train_parquet/part-00049-cbd17f77-8da4-45c7-9031-919a6d619098-c000.snappy.parquet of size 89354
(pid=19077, ip=172.16.0.130) Global rank:  4
(pid=19077, ip=172.16.0.130) Total workers:  8
(pid=19077, ip=172.16.0.130) Number of files for worker:  8
(pid=19077, ip=172.16.0.130) Data size for worker:  727713
(pid=19077, ip=172.16.0.130) Loading hdfs://172.16.0.105:8020/user/root/jwang/wnd_twitter_2/train_parquet/part-00028-cbd17f77-8da4-45c7-9031-919a6d619098-c000.snappy.parquet of size 92033
(pid=19077, ip=172.16.0.130) wnd_train_tf2_generator_horovod.py:156: FutureWarning: pyarrow.hdfs.connect is deprecated as of 2.0.0, please use pyarrow.fs.HadoopFileSystem instead.
(pid=19077, ip=172.16.0.130) Global rank:  4
(pid=19077, ip=172.16.0.130) Total workers:  8
(pid=19077, ip=172.16.0.130) Number of files for worker:  8
(pid=19077, ip=172.16.0.130) Data size for worker:  38585
(pid=19077, ip=172.16.0.130) Loading hdfs://172.16.0.105:8020/user/root/jwang/wnd_twitter_2/test_parquet/part-00028-a19e97e6-903f-4c9c-90f2-ba2327ba4171-c000.snappy.parquet of size 4877
(pid=135898, ip=172.16.0.117) wnd_train_tf2_generator_horovod.py:156: FutureWarning: pyarrow.hdfs.connect is deprecated as of 2.0.0, please use pyarrow.fs.HadoopFileSystem instead.
(pid=135898, ip=172.16.0.117) Global rank:  7
(pid=135898, ip=172.16.0.117) Total workers:  8
(pid=135898, ip=172.16.0.117) Number of files for worker:  7
(pid=135898, ip=172.16.0.117) Data size for worker:  32902
(pid=135898, ip=172.16.0.117) Loading hdfs://172.16.0.105:8020/user/root/jwang/wnd_twitter_2/test_parquet/part-00049-a19e97e6-903f-4c9c-90f2-ba2327ba4171-c000.snappy.parquet of size 4688
    1/13632 [..............................] - ETA: 3s - loss: 0.7875 - accuracy: 0.5000
    2/13632 [..............................] - ETA: 5:42:44 - loss: 1.4955 - accuracy: 0.3333
    3/13632 [..............................] - ETA: 5:50:14 - loss: 1.3102 - accuracy: 0.3333
    4/13632 [..............................] - ETA: 5:41:48 - loss: 1.2190 - accuracy: 0.3333
    5/13632 [..............................] - ETA: 5:51:37 - loss: 1.2121 - accuracy: 0.4000
    6/13632 [..............................] - ETA: 5:59:01 - loss: 1.1963 - accuracy: 0.4167
    7/13632 [..............................] - ETA: 5:57:08 - loss: 1.2288 - accuracy: 0.3810
    8/13632 [..............................] - ETA: 6:28:54 - loss: 1.1886 - accuracy: 0.3750
    9/13632 [..............................] - ETA: 6:35:37 - loss: 1.1644 - accuracy: 0.4074
   10/13632 [..............................] - ETA: 6:32:36 - loss: 1.1647 - accuracy: 0.4167
   11/13632 [..............................] - ETA: 6:34:59 - loss: 1.1463 - accuracy: 0.4091
   12/13632 [..............................] - ETA: 6:32:12 - loss: 1.1479 - accuracy: 0.4167
   13/13632 [..............................] - ETA: 6:35:59 - loss: 1.1143 - accuracy: 0.4487
   14/13632 [..............................] - ETA: 6:30:26 - loss: 1.0713 - accuracy: 0.4524
   15/13632 [..............................] - ETA: 6:32:27 - loss: 1.0508 - accuracy: 0.4444
   16/13632 [..............................] - ETA: 6:33:04 - loss: 1.0256 - accuracy: 0.4583
   17/13632 [..............................] - ETA: 6:34:02 - loss: 1.0170 - accuracy: 0.4510
   18/13632 [..............................] - ETA: 6:32:58 - loss: 1.0300 - accuracy: 0.4537
   19/13632 [..............................] - ETA: 6:30:27 - loss: 1.0135 - accuracy: 0.4737
   20/13632 [..............................] - ETA: 6:27:51 - loss: 1.0064 - accuracy: 0.4750
   21/13632 [..............................] - ETA: 6:25:17 - loss: 0.9877 - accuracy: 0.4841
   22/13632 [..............................] - ETA: 6:22:12 - loss: 0.9812 - accuracy: 0.4924
   23/13632 [..............................] - ETA: 6:21:11 - loss: 0.9632 - accuracy: 0.4928
   24/13632 [..............................] - ETA: 6:18:43 - loss: 0.9786 - accuracy: 0.4792
   25/13632 [..............................] - ETA: 6:18:20 - loss: 0.9679 - accuracy: 0.4867
   26/13632 [..............................] - ETA: 6:17:25 - loss: 0.9692 - accuracy: 0.4744
   27/13632 [..............................] - ETA: 6:16:45 - loss: 0.9514 - accuracy: 0.4877
jenniew commented 3 years ago

I changed the loss function and still get this error, so maybe it is an environment problem. Can you try the "py37-horovod-tf" environment on almaren-node-107?

jenniew commented 3 years ago

Get this error:

terminate called after throwing an instance of 'gloo::IoException'
(pid=11655, ip=172.16.0.121)   what():  [/tmp/pip-install-x2psu8_w/horovod_8e87f6e8dcad47a6a27653365dfc240d/third_party/gloo/gloo/transport/tcp/unbound_buffer.cc:69] Timed out waiting 30000ms for recv operation to complete

jenniew commented 3 years ago

@leonardozcm I can run with your conda environment, so yes, it is an environment issue. Can you write up your installation and configuration steps and add them to the documentation? @helenlly maybe we need to add the TF2 Horovod environment setup steps to the docsite so users can avoid environment issues like mine.

hkvision commented 3 years ago

@leonardozcm Can you write the installation steps here, and @jenniew can follow these steps to further verify on a new environment?

leonardozcm commented 3 years ago

Sorry for the late reply; I will reproduce the installation process.

leonardozcm commented 3 years ago
# Python 3.7 is required for pyarrow

conda install -y cmake==3.16.0 -c conda-forge
conda install cxx-compiler==1.0 -c conda-forge
conda install openmpi
conda install tensorflow==2.3.0
HOROVOD_WITH_TENSORFLOW=1 HOROVOD_WITH_GLOO=1 pip install --no-cache-dir horovod
pip install analytics-zoo[ray]

# work around the conda-pack issue with aiohttp
pip uninstall -y aiohttp
conda install aiohttp==3.7.4

conda install pyarrow==4.0.0 -c conda-forge

# tensorflow-estimator==2.5.0 raises "ImportError: cannot import name
# 'parameter_server_strategy_v2' from 'tensorflow.python.distribute'",
# so pin it to 2.3.0
conda install tensorflow-estimator==2.3.0

# then run the training script

@jenniew
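To double-check that the Horovod build flags above actually took effect (assuming a Horovod release recent enough to ship the build inspector), print which frameworks and controllers it was compiled with:

# TensorFlow should appear under frameworks and Gloo under controllers
horovodrun --check-build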

yangw1234 commented 3 years ago
> conda install -y pytorch torchvision cpuonly -c pytorch

Why install pytorch in this case?

leonardozcm commented 3 years ago
> Why install pytorch in this case?

It's a typo, removed.