gpauloski / kfac-pytorch

Distributed K-FAC Preconditioner for PyTorch
MIT License
75 stars 25 forks

Running problems #57

Closed Elec-coder closed 2 years ago

Elec-coder commented 2 years ago

How did you install K-FAC and PyTorch?

$ git clone https://github.com/gpauloski/kfac-pytorch.git
$ cd kfac-pytorch
$ pip install -e .

What version of commit are you using?

v0.4.1

Describe the problem.

Hi gpauloski, after #54, another problem occurred that I can't fix. Could you help me? Thank you.

$ torchrun --standalone --nnodes 1 --nproc_per_node=4 torch_cifar10_resnet.py
WARNING:torch.distributed.run:


Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.


/home/qzy/miniconda3/envs/env_fac/lib/python3.7/site-packages/kfac/distributed.py:18: UserWarning: NVIDIA Apex is not installed or was not installed with --cpp_ext. Falling back to PyTorch flatten and unflatten.
  'NVIDIA Apex is not installed or was not installed with --cpp_ext. '
/home/qzy/miniconda3/envs/env_fac/lib/python3.7/site-packages/kfac/distributed.py:18: UserWarning: NVIDIA Apex is not installed or was not installed with --cpp_ext. Falling back to PyTorch flatten and unflatten.
  'NVIDIA Apex is not installed or was not installed with --cpp_ext. '
/home/qzy/miniconda3/envs/env_fac/lib/python3.7/site-packages/kfac/distributed.py:18: UserWarning: NVIDIA Apex is not installed or was not installed with --cpp_ext. Falling back to PyTorch flatten and unflatten.
  'NVIDIA Apex is not installed or was not installed with --cpp_ext. '
/home/qzy/miniconda3/envs/env_fac/lib/python3.7/site-packages/kfac/distributed.py:18: UserWarning: NVIDIA Apex is not installed or was not installed with --cpp_ext. Falling back to PyTorch flatten and unflatten.
  'NVIDIA Apex is not installed or was not installed with --cpp_ext. '
[W socket.cpp:558] [c10d] The client socket has failed to connect to [txjgsv10]:55789 (errno: 22 - Invalid argument).
[W socket.cpp:558] [c10d] The client socket has failed to connect to [txjgsv10]:55789 (errno: 22 - Invalid argument).
Collecting env info...
PyTorch version: 1.11.0+cu102
Is debug build: False
CUDA used to build PyTorch: 10.2
ROCM used to build PyTorch: N/A

OS: CentOS Linux release 7.8.2003 (Core) (x86_64)
GCC version: (GCC) 7.3.1 20180303 (Red Hat 7.3.1-5)
Clang version: 3.9.0 (tags/RELEASE_390/final)
CMake version: version 3.14.0
Libc version: glibc-2.17

Python version: 3.7.11 (default, Jul 27 2021, 14:32:16) [GCC 7.5.0] (64-bit runtime)
Python platform: Linux-3.10.0-1127.13.1.el7.x86_64-x86_64-with-centos-7.8.2003-Core
Is CUDA available: True
CUDA runtime version: 10.1.243
GPU models and configuration:
GPU 0: Tesla V100-PCIE-16GB
GPU 1: Tesla V100S-PCIE-32GB
GPU 2: Tesla V100-PCIE-16GB
GPU 3: Tesla V100S-PCIE-32GB

Nvidia driver version: 495.29.05
cuDNN version: Probably one of the following:
/usr/lib64/libcudnn.so.7.6.5
/usr/local/cuda-10.2/targets/x86_64-linux/lib/libcudnn.so.8.3.0
/usr/local/cuda-10.2/targets/x86_64-linux/lib/libcudnn_adv_infer.so.8.3.0
/usr/local/cuda-10.2/targets/x86_64-linux/lib/libcudnn_adv_train.so.8.3.0
/usr/local/cuda-10.2/targets/x86_64-linux/lib/libcudnn_cnn_infer.so.8.3.0
/usr/local/cuda-10.2/targets/x86_64-linux/lib/libcudnn_cnn_train.so.8.3.0
/usr/local/cuda-10.2/targets/x86_64-linux/lib/libcudnn_ops_infer.so.8.3.0
/usr/local/cuda-10.2/targets/x86_64-linux/lib/libcudnn_ops_train.so.8.3.0
/usr/local/cuda-9.0/lib64/libcudnn.so.7.6.5
HIP runtime version: N/A
MIOpen runtime version: N/A

Versions of relevant libraries:
[pip3] kfac-pytorch==0.4.1
[pip3] numpy==1.21.6
[pip3] torch==1.11.0
[pip3] torchinfo==1.5.2
[pip3] torchvision==0.12.0
[conda] kfac-pytorch 0.4.1 pypi_0 pypi
[conda] numpy 1.21.6 pypi_0 pypi
[conda] torch 1.11.0 pypi_0 pypi
[conda] torchinfo 1.5.2 pypi_0 pypi
[conda] torchvision 0.12.0 pypi_0 pypi

Global rank 0 initialized: local_rank = 0, world_size = 4
Global rank 1 initialized: local_rank = 1, world_size = 4
Global rank 2 initialized: local_rank = 2, world_size = 4
Global rank 3 initialized: local_rank = 3, world_size = 4
Traceback (most recent call last):
  File "/home/qzy/miniconda3/envs/env_fac/lib/python3.7/urllib/request.py", line 1350, in do_open
    encode_chunked=req.has_header('Transfer-encoding'))
  File "/home/qzy/miniconda3/envs/env_fac/lib/python3.7/http/client.py", line 1281, in request
    self._send_request(method, url, body, headers, encode_chunked)
  File "/home/qzy/miniconda3/envs/env_fac/lib/python3.7/http/client.py", line 1327, in _send_request
    self.endheaders(body, encode_chunked=encode_chunked)
  File "/home/qzy/miniconda3/envs/env_fac/lib/python3.7/http/client.py", line 1276, in endheaders
    self._send_output(message_body, encode_chunked=encode_chunked)
  File "/home/qzy/miniconda3/envs/env_fac/lib/python3.7/http/client.py", line 1036, in _send_output
    self.send(msg)
  File "/home/qzy/miniconda3/envs/env_fac/lib/python3.7/http/client.py", line 976, in send
    self.connect()
  File "/home/qzy/miniconda3/envs/env_fac/lib/python3.7/http/client.py", line 1443, in connect
    super().connect()
  File "/home/qzy/miniconda3/envs/env_fac/lib/python3.7/http/client.py", line 948, in connect
    (self.host,self.port), self.timeout, self.source_address)
  File "/home/qzy/miniconda3/envs/env_fac/lib/python3.7/socket.py", line 707, in create_connection
    for res in getaddrinfo(host, port, 0, SOCK_STREAM):
  File "/home/qzy/miniconda3/envs/env_fac/lib/python3.7/socket.py", line 752, in getaddrinfo
    for res in _socket.getaddrinfo(host, port, family, type, proto, flags):
socket.gaierror: [Errno -2] Name or service not known

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "torch_cifar10_resnet.py", line 395, in <module>
    main()
  File "torch_cifar10_resnet.py", line 293, in main
    train_sampler, train_loader, _, val_loader = datasets.get_cifar(args)
  File "/home/qzy/NGD/kfac-pytorch/examples/cnn_utils/datasets.py", line 52, in get_cifar
    transform=transform_train,
  File "/home/qzy/miniconda3/envs/env_fac/lib/python3.7/site-packages/torchvision/datasets/cifar.py", line 65, in __init__
    self.download()
  File "/home/qzy/miniconda3/envs/env_fac/lib/python3.7/site-packages/torchvision/datasets/cifar.py", line 141, in download
    download_and_extract_archive(self.url, self.root, filename=self.filename, md5=self.tgz_md5)
  File "/home/qzy/miniconda3/envs/env_fac/lib/python3.7/site-packages/torchvision/datasets/utils.py", line 430, in download_and_extract_archive
    download_url(url, download_root, filename, md5)
  File "/home/qzy/miniconda3/envs/env_fac/lib/python3.7/site-packages/torchvision/datasets/utils.py", line 131, in download_url
    url = _get_redirect_url(url, max_hops=max_redirect_hops)
  File "/home/qzy/miniconda3/envs/env_fac/lib/python3.7/site-packages/torchvision/datasets/utils.py", line 79, in _get_redirect_url
    with urllib.request.urlopen(urllib.request.Request(url, headers=headers)) as response:
  File "/home/qzy/miniconda3/envs/env_fac/lib/python3.7/urllib/request.py", line 222, in urlopen
    return opener.open(url, data, timeout)
  File "/home/qzy/miniconda3/envs/env_fac/lib/python3.7/urllib/request.py", line 525, in open
    response = self._open(req, data)
  File "/home/qzy/miniconda3/envs/env_fac/lib/python3.7/urllib/request.py", line 543, in _open
    '_open', req)
  File "/home/qzy/miniconda3/envs/env_fac/lib/python3.7/urllib/request.py", line 503, in _call_chain
    result = func(*args)
  File "/home/qzy/miniconda3/envs/env_fac/lib/python3.7/urllib/request.py", line 1393, in https_open
    context=self._context, check_hostname=self._check_hostname)
  File "/home/qzy/miniconda3/envs/env_fac/lib/python3.7/urllib/request.py", line 1352, in do_open
    raise URLError(err)
urllib.error.URLError: <urlopen error [Errno -2] Name or service not known>
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 60502 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 60503 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 60504 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 60501) of binary: /home/qzy/miniconda3/envs/env_fac/bin/python
Traceback (most recent call last):
  File "/home/qzy/miniconda3/envs/env_fac/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/home/qzy/miniconda3/envs/env_fac/lib/python3.7/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
    return f(*args, **kwargs)
  File "/home/qzy/miniconda3/envs/env_fac/lib/python3.7/site-packages/torch/distributed/run.py", line 724, in main
    run(args)
  File "/home/qzy/miniconda3/envs/env_fac/lib/python3.7/site-packages/torch/distributed/run.py", line 718, in run
    )(*cmd_args)
  File "/home/qzy/miniconda3/envs/env_fac/lib/python3.7/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/qzy/miniconda3/envs/env_fac/lib/python3.7/site-packages/torch/distributed/launcher/api.py", line 247, in launch_agent
    failures=result.failures,
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

torch_cifar10_resnet.py FAILED

Failures:

------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time : 2022-06-13_11:09:10
  host : txjgsv10
  rank : 0 (local_rank: 0)
  exitcode : 1 (pid: 60501)
  error_file:

gpauloski commented 2 years ago

Hi @Elec-coder, the problem is that your machine cannot reach the URL used to download the CIFAR-10 dataset.

To point you in the right direction, I would look at possible causes of the error on this line:

"/home/qzy/miniconda3/envs/env_fac/lib/python3.7/urllib/request.py", line 1352, in do_open
raise URLError(err)
urllib.error.URLError: <urlopen error [Errno -2] Name or service not known>

I would guess it is your firewall, DNS, HTTP proxy, or something of the like, but I cannot help you further since it is specific to your network configuration and not to K-FAC.
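For anyone hitting the same traceback: `[Errno -2] Name or service not known` is raised by `getaddrinfo`, so the failure happens at hostname lookup, before any HTTP request is sent. A minimal sketch for reproducing the lookup outside of training is below. The helper names `can_resolve` and `diagnose` are made up for illustration, and `CIFAR10_URL` is assumed to be the URL torchvision uses for CIFAR-10, which may differ between torchvision versions.

```python
import socket
from urllib.parse import urlparse

# Assumed download URL for CIFAR-10 in torchvision; verify against your installed version.
CIFAR10_URL = "https://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz"


def can_resolve(host, port=443):
    """Return True if the hostname resolves, using the same call that failed in the traceback."""
    try:
        socket.getaddrinfo(host, port, 0, socket.SOCK_STREAM)
    except socket.gaierror:
        # Same exception type as "socket.gaierror: [Errno -2] Name or service not known".
        return False
    return True


def diagnose(url):
    """Describe whether the host part of a URL can be resolved from this machine."""
    host = urlparse(url).hostname
    if can_resolve(host):
        return f"{host} resolves; look at firewall or proxy settings instead"
    return f"{host} does not resolve; check DNS configuration on this node"
```

Running `print(diagnose(CIFAR10_URL))` on the training node shows whether the name lookup itself is the problem. A successful lookup does not prove the download will work, since a firewall or proxy can still block the connection.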

As a quick fix, you can download the dataset on another machine, copy it over, and point the script at it with the --data-dir command line argument.
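Before relaunching, it can help to confirm that the copied dataset is where torchvision will look for it. The sketch below assumes the directory passed to `--data-dir` is used as the dataset root, and that the archive was extracted so the root contains the `cifar-10-batches-py` folder with the standard batch files. The helper name `missing_cifar10_files` is hypothetical, and the check only tests that files exist, not their checksums.

```python
from pathlib import Path

# Folder and file names of the extracted CIFAR-10 Python archive.
BASE_FOLDER = "cifar-10-batches-py"
EXPECTED_FILES = ["batches.meta", "test_batch"] + [
    f"data_batch_{i}" for i in range(1, 6)
]


def missing_cifar10_files(data_dir):
    """Return the expected CIFAR-10 files that are absent under data_dir."""
    base = Path(data_dir) / BASE_FOLDER
    return [name for name in EXPECTED_FILES if not (base / name).is_file()]
```

An empty list from `missing_cifar10_files("/path/to/data")` suggests the copy is complete. The path is a placeholder for whatever is passed to `--data-dir`.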

Elec-coder commented 2 years ago

Thank you!