swsnu / bd2018


[HW2 Notice] Bug fix for using the cpu_enable branch on a machine with a GPU #25

Open gyeongin opened 5 years ago

gyeongin commented 5 years ago

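On a machine equipped with a GPU, there was a bug when using the cpu_enable branch to run two CPU workers with a resource info such as

localhost
localhost

This is now fixed. If you want to train using only the CPU on a machine that has a GPU, please pull the cpu_enable branch again.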
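As a rough illustration of how the resource info above maps to workers, the sketch below counts one worker entry per non-empty line. The `count_workers` helper and its parsing rules are assumptions made for this example only; it is not Parallax's own parser.

```python
from collections import Counter


def count_workers(resource_info_text):
    """Count worker entries per host in a resource info text.

    Simplified illustration: one entry per non-empty line, with the host
    name taken as the text before an optional ':'. Hypothetical helper,
    not Parallax's parser.
    """
    hosts = []
    for line in resource_info_text.splitlines():
        line = line.strip()
        if not line:
            continue
        hosts.append(line.split(":", 1)[0].strip())
    return Counter(hosts)


# The resource info from the notice: localhost listed twice.
resource_info = "localhost\nlocalhost\n"
workers = count_workers(resource_info)
print(dict(workers))
```

Listing the same host twice is what requests two workers on one machine in this reading.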

jeeyung commented 5 years ago

I pulled the cpu_enable branch, but the same error still occurs.

gyeongin commented 5 years ago

Which resource info did you use, and what error message did you get?

jeeyung commented 5 years ago

My resource info is

localhost localhost

and the error message is as follows.

WARNING:tensorflow:From /home/jeeyung/parallax_venv/local/lib/python2.7/site-packages/tensorflow/contrib/learn/python/learn/datasets/mnist.py:290: __init__ (from tensorflow.contrib.learn.python.learn.datasets.mnist) is deprecated and will be removed in a future version.
Instructions for updating:
Please use alternatives such as official/mnist/dataset.py from tensorflow/models.
INFO:140116569343744:PARALLAX:parallel_run(PARALLAX_RUN_MPI)
INFO:140116569343744:PARALLAX:resource ps_localhost:46849:+localhost:37463:^worker_localhost:43741:0+localhost:38105:0
INFO:139728530999040:PARALLAX:parallel_run(PARALLAX_RUN_MPI)
INFO:139728530999040:PARALLAX:resource ps_localhost:46849:+localhost:37463:^worker_localhost:43741:0+localhost:38105:0
Traceback (most recent call last):
  File "/home/jeeyung/Dropbox/school_materials/large_scale_data/hw2_cp/run_parallax.py", line 58, in <module>
    parallax_config=parallax_config)
Traceback (most recent call last):
  File "/home/jeeyung/Dropbox/school_materials/large_scale_data/hw2_cp/run_parallax.py", line 58, in <module>
  File "/home/jeeyung/parallax_venv/local/lib/python2.7/site-packages/parallax/core/python/common/runner.py", line 154, in parallel_run
    parallax_config=parallax_config)
  File "/home/jeeyung/parallax_venv/local/lib/python2.7/site-packages/parallax/core/python/common/runner.py", line 154, in parallel_run
    return parallax_run_mpi(kwargs)
  File "/home/jeeyung/parallax_venv/local/lib/python2.7/site-packages/parallax/core/python/mpi/runner.py", line 137, in parallax_run_mpi
    return parallax_run_mpi(kwargs)
  File "/home/jeeyung/parallax_venv/local/lib/python2.7/site-packages/parallax/core/python/mpi/runner.py", line 137, in parallax_run_mpi
    graph_transform_mpi(single_gpu_meta_graph_def, config)
  File "/home/jeeyung/parallax_venv/local/lib/python2.7/site-packages/parallax/core/python/mpi/graph_transform.py", line 101, in graph_transform_mpi
    graph_transform_mpi(single_gpu_meta_graph_def, config)
  File "/home/jeeyung/parallax_venv/local/lib/python2.7/site-packages/parallax/core/python/mpi/graph_transform.py", line 101, in graph_transform_mpi
    _add_aggregation_ops(gradients_info, op_to_control_consumer_ops, config)
  File "/home/jeeyung/parallax_venv/local/lib/python2.7/site-packages/parallax/core/python/mpi/graph_transform.py", line 43, in _add_aggregation_ops
    _add_aggregation_ops(gradients_info, op_to_control_consumer_ops, config)
  File "/home/jeeyung/parallax_venv/local/lib/python2.7/site-packages/parallax/core/python/mpi/graph_transform.py", line 43, in _add_aggregation_ops
    use_allgatherv=config.communication_config.mpi_config.use_allgatherv)
    use_allgatherv=config.communication_config.mpi_config.use_allgatherv)
TypeError: allreduce() got an unexpected keyword argument 'average_dense'
TypeError: allreduce() got an unexpected keyword argument 'average_dense'

Primary job terminated normally, but 1 process returned a non-zero exit code. Per user-direction, the job has been aborted.

mpirun detected that one or more processes exited with non-zero status, thus causing the job to be terminated. The first process to do so was:

Process name: [[10466,1],1] Exit code: 1

gyeongin commented 5 years ago

How did you install Horovod?

jeeyung commented 5 years ago

To use two CPU workers with Horovod as well, I installed it with

pip install horovod

When I instead installed Horovod like this

HOROVOD_GPU_ALLREDUCE=NCCL HOROVOD_GPU_ALLGATHER=NCCL HOROVOD_WITHOUT_PYTORCH=True pip install --no-cache-dir dist/horovod-*.tar.gz

and then ran Parallax, I got this error message:

tensorflow.python.framework.errors_impl.InvalidArgumentError: 'visible_device_list' listed an invalid GPU id '1' but visible device count is 1

gyeongin commented 5 years ago

The TypeError: allreduce() got an unexpected keyword argument 'average_dense' that you mentioned is caused by installing with pip install horovod. Please install Horovod the original way instead:

python setup.py sdist
HOROVOD_GPU_ALLREDUCE=NCCL HOROVOD_GPU_ALLGATHER=NCCL HOROVOD_WITHOUT_PYTORCH=True pip install --no-cache-dir dist/horovod-*.tar.gz
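The TypeError suggests that the installed `allreduce` does not accept the `average_dense` keyword that Parallax passes. One way to check a function for a given keyword before calling it is sketched below. The two `allreduce` stand-ins are hypothetical and do not reproduce Horovod's real signatures, and `inspect.signature` requires Python 3 (on the Python 2.7 environment in this thread, `inspect.getargspec` would be the counterpart).

```python
import inspect


def accepts_keyword(func, name):
    """Return True if `func` can be called with keyword argument `name`."""
    params = inspect.signature(func).parameters
    param = params.get(name)
    if param is not None and param.kind in (
        inspect.Parameter.POSITIONAL_OR_KEYWORD,
        inspect.Parameter.KEYWORD_ONLY,
    ):
        return True
    # A **kwargs parameter accepts any keyword.
    return any(p.kind is inspect.Parameter.VAR_KEYWORD for p in params.values())


# Hypothetical stand-ins, NOT Horovod's actual functions.
def stock_allreduce(tensor, average=True):
    return tensor


def patched_allreduce(tensor, average_dense=True):
    return tensor


stock_ok = accepts_keyword(stock_allreduce, "average_dense")
patched_ok = accepts_keyword(patched_allreduce, "average_dense")
print(stock_ok, patched_ok)
```

Pointing the same helper at the `allreduce` function of the installed Horovod would show whether that build accepts the keyword.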

Separately, the error tensorflow.python.framework.errors_impl.InvalidArgumentError: 'visible_device_list' listed an invalid GPU id '1' but visible device count is 1 is a bug that occurs "when trying to use the CPU even though a GPU is available" :cry: Even though I created a separate branch for CPU support, it seems this corner case was not handled properly... I have just pushed a hot fix, so please pull again and test.
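The error message is consistent with a second worker (local rank 1) asking for GPU id 1 on a machine that exposes only one GPU. A minimal sketch of the kind of guard that avoids this is shown below. The function name and its parameters are assumptions for illustration and this is not the actual hot fix; in TensorFlow 1.x the returned string would typically be assigned to `tf.ConfigProto().gpu_options.visible_device_list`.

```python
def visible_device_list_for(local_rank, num_visible_gpus, use_gpu):
    """Pick a value for TensorFlow's `visible_device_list` option.

    Hypothetical helper. For a CPU worker the list is left empty so that
    no GPU id is requested at all. Note that an empty list does not hide
    GPUs by itself; hiding them is a separate step (for example setting
    CUDA_VISIBLE_DEVICES to an empty string before TensorFlow starts).
    """
    if not use_gpu:
        return ""
    if not 0 <= local_rank < num_visible_gpus:
        raise ValueError(
            "local rank %d has no matching GPU (visible device count is %d)"
            % (local_rank, num_visible_gpus)
        )
    return str(local_rank)


# Two CPU workers on a machine with a single GPU: no GPU id is requested.
cpu_lists = [visible_device_list_for(rank, 1, use_gpu=False) for rank in (0, 1)]
print(cpu_lists)

# Treating the second worker as a GPU worker reproduces the mismatch.
try:
    visible_device_list_for(1, 1, use_gpu=True)
    mismatch = None
except ValueError as err:
    mismatch = str(err)
print(mismatch)
```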

jeeyung commented 5 years ago

I pulled again, but I still get the same error... :(

gyeongin commented 5 years ago

I cannot reproduce that error. Could you confirm that you rebuilt Parallax and ran pip install --upgrade? To check, verify that line 192 of /home/jeeyung/parallax_venv/local/lib/python2.7/site-packages/parallax/core/python/hybrid/runner.py matches the link.
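A quick way to look at one specific line of an installed file is the standard-library `linecache` module. The sketch below wraps it in a hypothetical `read_source_line` helper and demonstrates it on a temporary file; for the check above, the path would be the runner.py path and the line number 192.

```python
import linecache
import os
import tempfile


def read_source_line(path, lineno):
    """Return line `lineno` (1-indexed) of the file at `path`.

    Returns an empty string if the file or the line does not exist.
    """
    linecache.checkcache(path)  # drop stale cache entries for this file
    return linecache.getline(path, lineno).rstrip("\n")


# Demonstration on a temporary file instead of the installed package.
with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as tmp:
    tmp.write("first line\nsecond line\n")
    demo_path = tmp.name

demo_line = read_source_line(demo_path, 2)
missing_line = read_source_line(demo_path, 99)
os.remove(demo_path)

print(repr(demo_line), repr(missing_line))
```

If the printed line differs from the one in the repository, the installed package was most likely not rebuilt or upgraded.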

jeeyung commented 5 years ago

It's solved! Thank you!!

bgchun commented 5 years ago

@gyeongin Thanks!