Open jenniew opened 3 years ago
Would you mind providing more information?
1. The code is https://github.com/jenniew/friesian/blob/wnd_train_twitter/Training/WideDeep/twitter/wnd_train_tf2_generator_horovod.py
- tensorflow 2.3.0, latest zoo, horovod 0.19.2, ray 1.2.0
- driver_cores: 10, driver_memory: 30g, num_executor: 8, executor_cores: 10, executor_memory: 30g
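For reference, in the analytics-zoo Orca API this resource configuration would typically be passed to init_orca_context roughly as follows (a sketch only; cluster_mode="yarn-client" is an assumption, not taken from the actual script):

from zoo.orca import init_orca_context

# Sketch: maps the resources listed above onto init_orca_context arguments.
sc = init_orca_context(cluster_mode="yarn-client",
                       num_nodes=8,        # num_executor
                       cores=10,           # executor_cores
                       memory="30g",       # executor_memory
                       driver_cores=10,
                       driver_memory="30g")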
Thanks, we will try to reproduce it.
@jenniew where can we find the data?
@leonardozcm could you help me take a look at this issue?
You may use the data: hdfs://172.16.0.105:8020/user/root/jwang/wnd_twitter_2
I will take a look. Since it's a private repo, would you mind giving me permission?
Yes, I have already sent you an invitation.
OK, thanks a lot.
tensorflow.python.framework.errors_impl.InvalidArgumentError: Received a label value of 1 which is outside the valid range of [0, 1). Label values: 1 1 0 1 1 1
Am I missing something related to softmax?
After changing the loss function to binary_crossentropy, I was not able to reproduce this issue.
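For reference, the original error usually means a single-unit output was paired with sparse_categorical_crossentropy, whose valid label range for one output unit is [0, 1). A minimal sketch of the binary setup being described (layer sizes and feature shape are assumptions, not the actual wide-and-deep model):

import tensorflow as tf

# Hypothetical single-logit head for 0/1 labels; the real model is a
# wide-and-deep network, this only illustrates the loss change.
model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation="relu", input_shape=(13,)),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])

# With a 1-unit output, sparse_categorical_crossentropy only accepts labels in
# [0, 1), so a label of 1 triggers the InvalidArgumentError above; for a single
# sigmoid output, binary_crossentropy is the appropriate loss.
model.compile(optimizer="adam",
              loss="binary_crossentropy",
              metrics=["accuracy"])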
9/13632 [..............................] - ETA: 5:47:25 - loss: 0.6880 - accuracy: 0.5741
(pid=157408, ip=172.16.0.146) Global rank: 6
(pid=157408, ip=172.16.0.146) Total workers: 8
(pid=157408, ip=172.16.0.146) Number of files for worker: 8
(pid=157408, ip=172.16.0.146) Data size for worker: 671325
(pid=157408, ip=172.16.0.146) Loading hdfs://172.16.0.105:8020/user/root/jwang/wnd_twitter_2/train_parquet/part-00042-cbd17f77-8da4-45c7-9031-919a6d619098-c000.snappy.parquet of size 90332
(pid=193705, ip=172.16.0.148) Global rank: 1
(pid=193705, ip=172.16.0.148) Total workers: 8
(pid=193705, ip=172.16.0.148) Number of files for worker: 8
(pid=193705, ip=172.16.0.148) Data size for worker: 751492
(pid=193705, ip=172.16.0.148) Loading hdfs://172.16.0.105:8020/user/root/jwang/wnd_twitter_2/train_parquet/part-00007-cbd17f77-8da4-45c7-9031-919a6d619098-c000.snappy.parquet of size 95630
(pid=57625, ip=172.16.0.159) Global rank: 2
(pid=57625, ip=172.16.0.159) Total workers: 8
(pid=57625, ip=172.16.0.159) Number of files for worker: 8
(pid=57625, ip=172.16.0.159) Data size for worker: 741991
(pid=57625, ip=172.16.0.159) Loading hdfs://172.16.0.105:8020/user/root/jwang/wnd_twitter_2/train_parquet/part-00014-cbd17f77-8da4-45c7-9031-919a6d619098-c000.snappy.parquet of size 94079
(pid=195572, ip=172.16.0.141) Global rank: 0
(pid=195572, ip=172.16.0.141) Total workers: 8
(pid=195572, ip=172.16.0.141) Number of files for worker: 8
(pid=195572, ip=172.16.0.141) Data size for worker: 769372
(pid=195572, ip=172.16.0.141) Loading hdfs://172.16.0.105:8020/user/root/jwang/wnd_twitter_2/train_parquet/part-00000-cbd17f77-8da4-45c7-9031-919a6d619098-c000.snappy.parquet of size 99895
(pid=150683, ip=172.16.0.129) Global rank: 3
(pid=150683, ip=172.16.0.129) Total workers: 8
(pid=150683, ip=172.16.0.129) Number of files for worker: 8
(pid=150683, ip=172.16.0.129) Data size for worker: 734900
(pid=150683, ip=172.16.0.129) Loading hdfs://172.16.0.105:8020/user/root/jwang/wnd_twitter_2/train_parquet/part-00021-cbd17f77-8da4-45c7-9031-919a6d619098-c000.snappy.parquet of size 92860
(pid=57625, ip=172.16.0.159) wnd_train_tf2_generator_horovod.py:156: FutureWarning: pyarrow.hdfs.connect is deprecated as of 2.0.0, please use pyarrow.fs.HadoopFileSystem instead.
(pid=193705, ip=172.16.0.148) wnd_train_tf2_generator_horovod.py:156: FutureWarning: pyarrow.hdfs.connect is deprecated as of 2.0.0, please use pyarrow.fs.HadoopFileSystem instead.
(pid=57625, ip=172.16.0.159) Global rank: 2
(pid=57625, ip=172.16.0.159) Total workers: 8
(pid=57625, ip=172.16.0.159) Number of files for worker: 8
(pid=57625, ip=172.16.0.159) Data size for worker: 39095
(pid=57625, ip=172.16.0.159) Loading hdfs://172.16.0.105:8020/user/root/jwang/wnd_twitter_2/test_parquet/part-00014-a19e97e6-903f-4c9c-90f2-ba2327ba4171-c000.snappy.parquet of size 4952
(pid=157408, ip=172.16.0.146) wnd_train_tf2_generator_horovod.py:156: FutureWarning: pyarrow.hdfs.connect is deprecated as of 2.0.0, please use pyarrow.fs.HadoopFileSystem instead.
(pid=175321, ip=172.16.0.109) Global rank: 5
(pid=175321, ip=172.16.0.109) Total workers: 8
(pid=175321, ip=172.16.0.109) Number of files for worker: 8
(pid=175321, ip=172.16.0.109) Data size for worker: 721226
(pid=175321, ip=172.16.0.109) Loading hdfs://172.16.0.105:8020/user/root/jwang/wnd_twitter_2/train_parquet/part-00035-cbd17f77-8da4-45c7-9031-919a6d619098-c000.snappy.parquet of size 91049
(pid=193705, ip=172.16.0.148) Global rank: 1
(pid=193705, ip=172.16.0.148) Total workers: 8
(pid=193705, ip=172.16.0.148) Number of files for worker: 8
(pid=193705, ip=172.16.0.148) Data size for worker: 39777
(pid=193705, ip=172.16.0.148) Loading hdfs://172.16.0.105:8020/user/root/jwang/wnd_twitter_2/test_parquet/part-00007-a19e97e6-903f-4c9c-90f2-ba2327ba4171-c000.snappy.parquet of size 4986
(pid=157408, ip=172.16.0.146) Global rank: 6
(pid=157408, ip=172.16.0.146) Total workers: 8
(pid=157408, ip=172.16.0.146) Number of files for worker: 8
(pid=157408, ip=172.16.0.146) Data size for worker: 35231
(pid=157408, ip=172.16.0.146) Loading hdfs://172.16.0.105:8020/user/root/jwang/wnd_twitter_2/test_parquet/part-00042-a19e97e6-903f-4c9c-90f2-ba2327ba4171-c000.snappy.parquet of size 4724
(pid=150683, ip=172.16.0.129) wnd_train_tf2_generator_horovod.py:156: FutureWarning: pyarrow.hdfs.connect is deprecated as of 2.0.0, please use pyarrow.fs.HadoopFileSystem instead.
(pid=150683, ip=172.16.0.129) Global rank: 3
(pid=150683, ip=172.16.0.129) Total workers: 8
(pid=150683, ip=172.16.0.129) Number of files for worker: 8
(pid=150683, ip=172.16.0.129) Data size for worker: 38422
(pid=150683, ip=172.16.0.129) Loading hdfs://172.16.0.105:8020/user/root/jwang/wnd_twitter_2/test_parquet/part-00021-a19e97e6-903f-4c9c-90f2-ba2327ba4171-c000.snappy.parquet of size 5037
(pid=195572, ip=172.16.0.141) wnd_train_tf2_generator_horovod.py:156: FutureWarning: pyarrow.hdfs.connect is deprecated as of 2.0.0, please use pyarrow.fs.HadoopFileSystem instead.
(pid=195572, ip=172.16.0.141) Global rank: 0
(pid=195572, ip=172.16.0.141) Total workers: 8
(pid=195572, ip=172.16.0.141) Number of files for worker: 8
(pid=195572, ip=172.16.0.141) Data size for worker: 40863
(pid=195572, ip=172.16.0.141) Loading hdfs://172.16.0.105:8020/user/root/jwang/wnd_twitter_2/test_parquet/part-00000-a19e97e6-903f-4c9c-90f2-ba2327ba4171-c000.snappy.parquet of size 5307
(pid=195572, ip=172.16.0.141) Epoch 1/2
(pid=175321, ip=172.16.0.109) wnd_train_tf2_generator_horovod.py:156: FutureWarning: pyarrow.hdfs.connect is deprecated as of 2.0.0, please use pyarrow.fs.HadoopFileSystem instead.
(pid=175321, ip=172.16.0.109) Global rank: 5
(pid=175321, ip=172.16.0.109) Total workers: 8
(pid=175321, ip=172.16.0.109) Number of files for worker: 8
(pid=175321, ip=172.16.0.109) Data size for worker: 37783
(pid=175321, ip=172.16.0.109) Loading hdfs://172.16.0.105:8020/user/root/jwang/wnd_twitter_2/test_parquet/part-00035-a19e97e6-903f-4c9c-90f2-ba2327ba4171-c000.snappy.parquet of size 4817
(pid=135898, ip=172.16.0.117) Global rank: 7
(pid=135898, ip=172.16.0.117) Total workers: 8
(pid=135898, ip=172.16.0.117) Number of files for worker: 7
(pid=135898, ip=172.16.0.117) Data size for worker: 621284
(pid=135898, ip=172.16.0.117) Loading hdfs://172.16.0.105:8020/user/root/jwang/wnd_twitter_2/train_parquet/part-00049-cbd17f77-8da4-45c7-9031-919a6d619098-c000.snappy.parquet of size 89354
(pid=19077, ip=172.16.0.130) Global rank: 4
(pid=19077, ip=172.16.0.130) Total workers: 8
(pid=19077, ip=172.16.0.130) Number of files for worker: 8
(pid=19077, ip=172.16.0.130) Data size for worker: 727713
(pid=19077, ip=172.16.0.130) Loading hdfs://172.16.0.105:8020/user/root/jwang/wnd_twitter_2/train_parquet/part-00028-cbd17f77-8da4-45c7-9031-919a6d619098-c000.snappy.parquet of size 92033
(pid=19077, ip=172.16.0.130) wnd_train_tf2_generator_horovod.py:156: FutureWarning: pyarrow.hdfs.connect is deprecated as of 2.0.0, please use pyarrow.fs.HadoopFileSystem instead.
(pid=19077, ip=172.16.0.130) Global rank: 4
(pid=19077, ip=172.16.0.130) Total workers: 8
(pid=19077, ip=172.16.0.130) Number of files for worker: 8
(pid=19077, ip=172.16.0.130) Data size for worker: 38585
(pid=19077, ip=172.16.0.130) Loading hdfs://172.16.0.105:8020/user/root/jwang/wnd_twitter_2/test_parquet/part-00028-a19e97e6-903f-4c9c-90f2-ba2327ba4171-c000.snappy.parquet of size 4877
(pid=135898, ip=172.16.0.117) wnd_train_tf2_generator_horovod.py:156: FutureWarning: pyarrow.hdfs.connect is deprecated as of 2.0.0, please use pyarrow.fs.HadoopFileSystem instead.
(pid=135898, ip=172.16.0.117) Global rank: 7
(pid=135898, ip=172.16.0.117) Total workers: 8
(pid=135898, ip=172.16.0.117) Number of files for worker: 7
(pid=135898, ip=172.16.0.117) Data size for worker: 32902
(pid=135898, ip=172.16.0.117) Loading hdfs://172.16.0.105:8020/user/root/jwang/wnd_twitter_2/test_parquet/part-00049-a19e97e6-903f-4c9c-90f2-ba2327ba4171-c000.snappy.parquet of size 4688
1/13632 [..............................] - ETA: 3s - loss: 0.7875 - accuracy: 0.5000
2/13632 [..............................] - ETA: 5:42:44 - loss: 1.4955 - accuracy: 0.3333
3/13632 [..............................] - ETA: 5:50:14 - loss: 1.3102 - accuracy: 0.3333
4/13632 [..............................] - ETA: 5:41:48 - loss: 1.2190 - accuracy: 0.3333
5/13632 [..............................] - ETA: 5:51:37 - loss: 1.2121 - accuracy: 0.4000
6/13632 [..............................] - ETA: 5:59:01 - loss: 1.1963 - accuracy: 0.4167
7/13632 [..............................] - ETA: 5:57:08 - loss: 1.2288 - accuracy: 0.3810
8/13632 [..............................] - ETA: 6:28:54 - loss: 1.1886 - accuracy: 0.3750
9/13632 [..............................] - ETA: 6:35:37 - loss: 1.1644 - accuracy: 0.4074
10/13632 [..............................] - ETA: 6:32:36 - loss: 1.1647 - accuracy: 0.4167
11/13632 [..............................] - ETA: 6:34:59 - loss: 1.1463 - accuracy: 0.4091
12/13632 [..............................] - ETA: 6:32:12 - loss: 1.1479 - accuracy: 0.4167
13/13632 [..............................] - ETA: 6:35:59 - loss: 1.1143 - accuracy: 0.4487
14/13632 [..............................] - ETA: 6:30:26 - loss: 1.0713 - accuracy: 0.4524
15/13632 [..............................] - ETA: 6:32:27 - loss: 1.0508 - accuracy: 0.4444
16/13632 [..............................] - ETA: 6:33:04 - loss: 1.0256 - accuracy: 0.4583
17/13632 [..............................] - ETA: 6:34:02 - loss: 1.0170 - accuracy: 0.4510
18/13632 [..............................] - ETA: 6:32:58 - loss: 1.0300 - accuracy: 0.4537
19/13632 [..............................] - ETA: 6:30:27 - loss: 1.0135 - accuracy: 0.4737
20/13632 [..............................] - ETA: 6:27:51 - loss: 1.0064 - accuracy: 0.4750
21/13632 [..............................] - ETA: 6:25:17 - loss: 0.9877 - accuracy: 0.4841
22/13632 [..............................] - ETA: 6:22:12 - loss: 0.9812 - accuracy: 0.4924
23/13632 [..............................] - ETA: 6:21:11 - loss: 0.9632 - accuracy: 0.4928
24/13632 [..............................] - ETA: 6:18:43 - loss: 0.9786 - accuracy: 0.4792
25/13632 [..............................] - ETA: 6:18:20 - loss: 0.9679 - accuracy: 0.4867
26/13632 [..............................] - ETA: 6:17:25 - loss: 0.9692 - accuracy: 0.4744
27/13632 [..............................] - ETA: 6:16:45 - loss: 0.9514 - accuracy: 0.4877
I changed the loss function and still get this error. Maybe it is an environment problem. Can you try the "py37-horovod-tf" environment on almaren-node-107?
I get this error:
terminate called after throwing an instance of 'gloo::IoException'
(pid=11655, ip=172.16.0.121) what(): [/tmp/pip-install-x2psu8_w/horovod_8e87f6e8dcad47a6a27653365dfc240d/third_party/gloo/gloo/transport/tcp/unbound_buffer.cc:69] Timed out waiting 30000ms for recv operation to complete
@leonardozcm I can run on your conda environment, so yes, it is an environment issue. Can you wrap up your installation and configuration steps and add them to the documentation? @helenlly maybe we need to add TF2 Horovod environment setup steps to the docsite so users can avoid environment issues like mine.
@leonardozcm can you write the installation steps here, and @jenniew can follow these steps to further verify on a new environment?
Sorry for the delay, I will reproduce the installation process.
# python 3.7 required for pyarrow
conda install -y cmake==3.16.0 -c conda-forge
conda install cxx-compiler==1.0 -c conda-forge
conda install openmpi
conda install tensorflow==2.3.0
HOROVOD_WITH_TENSORFLOW=1 HOROVOD_WITH_GLOO=1 pip install --no-cache-dir horovod
pip install analytics-zoo[ray]
# solve conda pack issue on aiohttp
pip uninstall aiohttp
conda install aiohttp=3.7.4
conda install pyarrow==4.0.0 -c conda-forge
# ImportError: cannot import name 'parameter_server_strategy_v2' from 'tensorflow.python.distribute' on tensorflow-estimator==2.5.0
conda install tensorflow-estimator==2.3.0
# run
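After these steps, a quick sanity check (a sketch, not part of the original instructions) is to confirm the environment imports cleanly and Horovod initializes:

import tensorflow as tf
import horovod.tensorflow as hvd

print("tensorflow:", tf.__version__)   # expect 2.3.0
hvd.init()                              # should succeed if horovod was built against TF with gloo as above
print("horovod rank/size:", hvd.rank(), hvd.size())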
@jenniew
Why install pytorch in this case? (The steps as originally posted included conda install -y pytorch torchvision cpuonly -c pytorch.)
It's a typo, removed.
TF2Estimator fails with the horovod backend if the data_creator returns a tf.data.Dataset built from a generator; the error is as below:
2021-07-21 04:17:10,169 WARNING worker.py:1107 -- A worker died or was killed while executing task ffffffffffffffff63964fa4841d4a2ecb45751801000000.
Traceback (most recent call last):
  File "wnd_train_tf2_generator_horovod.py", line 449, in <module>
    validation_steps=test_steps)
  File "/root/anaconda3/envs/py37-horovod-tf/lib/python3.7/site-packages/zoo/orca/learn/tf2/estimator.py", line 257, in fit
    for i in range(self.num_workers)])
  File "/root/anaconda3/envs/py37-horovod-tf/lib/python3.7/site-packages/ray/_private/client_mode_hook.py", line 47, in wrapper
    return func(*args, **kwargs)
  File "/root/anaconda3/envs/py37-horovod-tf/lib/python3.7/site-packages/ray/worker.py", line 1458, in get
    raise value
ray.exceptions.RayActorError: The actor died unexpectedly before finishing this task. Check python-core-worker-*.log files for more information.
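For context, a data_creator of the kind described above might look roughly like this (a hypothetical sketch; feature names, shapes, and sizes are assumptions, not the actual wnd_train_tf2_generator_horovod.py code):

import tensorflow as tf

def train_data_creator(config, batch_size):
    # Hypothetical generator-backed dataset; the real script streams parquet
    # files from HDFS, this only illustrates the from_generator pattern the
    # issue refers to.
    def gen():
        for _ in range(1000):
            yield {"dense": [0.0] * 13}, 0

    dataset = tf.data.Dataset.from_generator(
        gen,
        output_types=({"dense": tf.float32}, tf.int32),
        output_shapes=({"dense": tf.TensorShape([13])}, tf.TensorShape([])),
    )
    return dataset.batch(batch_size)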