pytorch / xla

Enabling PyTorch on XLA Devices (e.g. Google TPU)
https://pytorch.org/xla

r1.13 dlrm test crashes on 2vm. #4115

Closed vanbasten23 closed 2 years ago

vanbasten23 commented 2 years ago

🐛 Bug

On the r1.13 release 2vm image, the dlrm test crashes with the following error:

vm:~$ pip install onnx
Collecting onnx
  Downloading onnx-1.12.0-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (13.1 MB)
     |████████████████████████████████| 13.1 MB 4.7 MB/s
Requirement already satisfied: numpy>=1.16.6 in /anaconda3/envs/torch-xla-1.13/lib/python3.7/site-packages (from onnx) (1.21.5)
Requirement already satisfied: typing-extensions>=3.6.2.1 in /anaconda3/envs/torch-xla-1.13/lib/python3.7/site-packages (from onnx) (4.3.0)
Collecting protobuf<=3.20.1,>=3.12.2
  Downloading protobuf-3.20.1-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.whl (1.0 MB)
     |████████████████████████████████| 1.0 MB 43.6 MB/s
Installing collected packages: protobuf, onnx
  Attempting uninstall: protobuf
    Found existing installation: protobuf 3.20.3
    Uninstalling protobuf-3.20.3:
      Successfully uninstalled protobuf-3.20.3
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
google-api-core 1.33.2 requires protobuf!=3.20.0,!=3.20.1,!=4.21.0,!=4.21.1,!=4.21.2,!=4.21.3,!=4.21.4,!=4.21.5,<4.0.0dev,>=3.19.5, but you have protobuf 3.20.1 which is incompatible.
Successfully installed onnx-1.12.0 protobuf-3.20.1
(torch-xla-1.13) xiowei@xiowei-2vm-red-vm:~$ python /usr/share/torch-xla-1.12/tpu-examples/deps/dlrm/dlrm_tpu_runner.py
Using CPU...
time/loss/accuracy (if enabled):  2022-10-21 18:56:37.965793
Finished training it 1/1 of epoch 0, -1.00 ms/it, loss 0.083850, accuracy 0.000 %, 1 samples, @ 2022-10-21 18:56:38.079562
Exception in device=TPU:5: tensorflow/compiler/xla/xla_client/mesh_service.cc:329 : Check failed: impl_->channel->WaitForConnected( std::chrono::system_clock::now() + std::chrono::seconds(connect_wait_seconds)) 
*** Begin stack trace ***
        tsl::CurrentStackTrace[abi:cxx11]()
        xla::service::MeshClient::MeshClient(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)
        xla::service::MeshClient::Get()
        xla::ComputationClient::Create()

        xla::ComputationClient::Get()

        _PyMethodDef_RawFastCallKeywords
        _PyCFunction_FastCallKeywords
        _PyEval_EvalFrameDefault
        _PyFunction_FastCallKeywords
        _PyEval_EvalFrameDefault
        _PyFunction_FastCallDict

        _PyObject_GenericGetAttrWithDict
        _PyEval_EvalFrameDefault
        _PyEval_EvalCodeWithName
        _PyFunction_FastCallKeywords
        _PyEval_EvalFrameDefault

        _PyEval_EvalFrameDefault
        _PyFunction_FastCallKeywords
        _PyEval_EvalFrameDefault
        _PyFunction_FastCallKeywords
        _PyEval_EvalFrameDefault
        _PyFunction_FastCallDict
        _PyEval_EvalFrameDefault
        _PyFunction_FastCallDict
        _PyEval_EvalFrameDefault
        _PyFunction_FastCallKeywords
        _PyEval_EvalFrameDefault
        _PyFunction_FastCallKeywords
        _PyEval_EvalFrameDefault
        _PyFunction_FastCallKeywords
        _PyEval_EvalFrameDefault
        _PyEval_EvalCodeWithName
        _PyFunction_FastCallKeywords
        _PyEval_EvalFrameDefault
        _PyEval_EvalCodeWithName
        PyEval_EvalCodeEx
        PyEval_EvalCode

        PyRun_StringFlags
        PyRun_SimpleStringFlags

        _Py_UnixMain
        __libc_start_main

*** End stack trace ***
Failed to connect to client mesh master: xiowei-2vm-red-vm.c.tpu-pytorch.internal:44145
Exception in device=TPU:1: tensorflow/compiler/xla/xla_client/mesh_service.cc:329 : Check failed: impl_->channel->WaitForConnected( std::chrono::system_clock::now() + std::chrono::seconds(connect_wait_seconds)) 
*** Begin stack trace ***
        tsl::CurrentStackTrace[abi:cxx11]()
        xla::service::MeshClient::MeshClient(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)
        xla::service::MeshClient::Get()
        xla::ComputationClient::Create()

        xla::ComputationClient::Get()

        _PyMethodDef_RawFastCallKeywords
        _PyCFunction_FastCallKeywords
        _PyEval_EvalFrameDefault
        _PyFunction_FastCallKeywords
        _PyEval_EvalFrameDefault
        _PyFunction_FastCallDict

        _PyObject_GenericGetAttrWithDict
        _PyEval_EvalFrameDefault
        _PyEval_EvalCodeWithName
        _PyFunction_FastCallKeywords
        _PyEval_EvalFrameDefault

        _PyEval_EvalFrameDefault
        _PyFunction_FastCallKeywords
        _PyEval_EvalFrameDefault
        _PyFunction_FastCallKeywords
        _PyEval_EvalFrameDefault
        _PyFunction_FastCallDict
        _PyEval_EvalFrameDefault
        _PyFunction_FastCallDict
        _PyEval_EvalFrameDefault
        _PyFunction_FastCallKeywords
        _PyEval_EvalFrameDefault
        _PyFunction_FastCallKeywords
        _PyEval_EvalFrameDefault
        _PyFunction_FastCallKeywords
        _PyEval_EvalFrameDefault
        _PyEval_EvalCodeWithName
        _PyFunction_FastCallKeywords
        _PyEval_EvalFrameDefault
        _PyEval_EvalCodeWithName
        PyEval_EvalCodeEx
        PyEval_EvalCode

        PyRun_StringFlags
        PyRun_SimpleStringFlags

        _Py_UnixMain
        __libc_start_main

*** End stack trace ***
Failed to connect to client mesh master: xiowei-2vm-red-vm.c.tpu-pytorch.internal:44145
Exception in device=TPU:3: tensorflow/compiler/xla/xla_client/mesh_service.cc:329 : Check failed: impl_->channel->WaitForConnected( std::chrono::system_clock::now() + std::chrono::seconds(connect_wait_seconds)) 
*** Begin stack trace ***
        tsl::CurrentStackTrace[abi:cxx11]()
        xla::service::MeshClient::MeshClient(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)
        xla::service::MeshClient::Get()
        xla::ComputationClient::Create()

        xla::ComputationClient::Get()

        _PyMethodDef_RawFastCallKeywords
        _PyCFunction_FastCallKeywords
        _PyEval_EvalFrameDefault
        _PyFunction_FastCallKeywords
        _PyEval_EvalFrameDefault
        _PyFunction_FastCallDict

        _PyObject_GenericGetAttrWithDict
        _PyEval_EvalFrameDefault
        _PyEval_EvalCodeWithName
        _PyFunction_FastCallKeywords
        _PyEval_EvalFrameDefault

        _PyEval_EvalFrameDefault
        _PyFunction_FastCallKeywords
        _PyEval_EvalFrameDefault
        _PyFunction_FastCallKeywords
        _PyEval_EvalFrameDefault
        _PyFunction_FastCallDict
        _PyEval_EvalFrameDefault
        _PyFunction_FastCallDict
        _PyEval_EvalFrameDefault
        _PyFunction_FastCallKeywords
        _PyEval_EvalFrameDefault
        _PyFunction_FastCallKeywords
        _PyEval_EvalFrameDefault
        _PyFunction_FastCallKeywords
        _PyEval_EvalFrameDefault
        _PyEval_EvalCodeWithName
        _PyFunction_FastCallKeywords
        _PyEval_EvalFrameDefault
        _PyEval_EvalCodeWithName
        PyEval_EvalCodeEx
        PyEval_EvalCode

        PyRun_StringFlags
        PyRun_SimpleStringFlags

        _Py_UnixMain
        __libc_start_main

*** End stack trace ***
Failed to connect to client mesh master: xiowei-2vm-red-vm.c.tpu-pytorch.internal:44145
Exception in device=TPU:7: tensorflow/compiler/xla/xla_client/mesh_service.cc:329 : Check failed: impl_->channel->WaitForConnected( std::chrono::system_clock::now() + std::chrono::seconds(connect_wait_seconds)) 
*** Begin stack trace ***
        tsl::CurrentStackTrace[abi:cxx11]()
        xla::service::MeshClient::MeshClient(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)
        xla::service::MeshClient::Get()
        xla::ComputationClient::Create()

        xla::ComputationClient::Get()

        _PyMethodDef_RawFastCallKeywords
        _PyCFunction_FastCallKeywords
        _PyEval_EvalFrameDefault
        _PyFunction_FastCallKeywords
        _PyEval_EvalFrameDefault
        _PyFunction_FastCallDict

        _PyObject_GenericGetAttrWithDict
        _PyEval_EvalFrameDefault
        _PyEval_EvalCodeWithName
        _PyFunction_FastCallKeywords
        _PyEval_EvalFrameDefault

        _PyEval_EvalFrameDefault
        _PyFunction_FastCallKeywords
        _PyEval_EvalFrameDefault
        _PyFunction_FastCallKeywords
        _PyEval_EvalFrameDefault
        _PyFunction_FastCallDict
        _PyEval_EvalFrameDefault
        _PyFunction_FastCallDict
        _PyEval_EvalFrameDefault
        _PyFunction_FastCallKeywords
        _PyEval_EvalFrameDefault
        _PyFunction_FastCallKeywords
        _PyEval_EvalFrameDefault
        _PyFunction_FastCallKeywords
        _PyEval_EvalFrameDefault
        _PyEval_EvalCodeWithName
        _PyFunction_FastCallKeywords
        _PyEval_EvalFrameDefault
        _PyEval_EvalCodeWithName
        PyEval_EvalCodeEx
        PyEval_EvalCode

        PyRun_StringFlags
        PyRun_SimpleStringFlags

        _Py_UnixMain
        __libc_start_main

*** End stack trace ***
Failed to connect to client mesh master: xiowei-2vm-red-vm.c.tpu-pytorch.internal:44145
Traceback (most recent call last):
  File "/anaconda3/envs/torch-xla-1.13/lib/python3.7/site-packages/torch_xla/distributed/xla_multiprocessing.py", line 331, in _mp_start_fn
    _start_fn(index, pf_cfg, fn, args)
  File "/anaconda3/envs/torch-xla-1.13/lib/python3.7/site-packages/torch_xla/distributed/xla_multiprocessing.py", line 324, in _start_fn
    _setup_replication()
  File "/anaconda3/envs/torch-xla-1.13/lib/python3.7/site-packages/torch_xla/distributed/xla_multiprocessing.py", line 316, in _setup_replication
    device = xm.xla_device()
  File "/anaconda3/envs/torch-xla-1.13/lib/python3.7/site-packages/torch_xla/core/xla_model.py", line 245, in xla_device
    devices = get_xla_supported_devices(devkind=devkind)
  File "/anaconda3/envs/torch-xla-1.13/lib/python3.7/site-packages/torch_xla/core/xla_model.py", line 139, in get_xla_supported_devices
    xla_devices = _DEVICES.value
  File "/anaconda3/envs/torch-xla-1.13/lib/python3.7/site-packages/torch_xla/utils/utils.py", line 32, in value
    self._value = self._gen_fn()
  File "/anaconda3/envs/torch-xla-1.13/lib/python3.7/site-packages/torch_xla/core/xla_model.py", line 21, in <lambda>
    _DEVICES = xu.LazyProperty(lambda: torch_xla._XLAC._xla_get_devices())
RuntimeError: tensorflow/compiler/xla/xla_client/mesh_service.cc:329 : Check failed: impl_->channel->WaitForConnected( std::chrono::system_clock::now() + std::chrono::seconds(connect_wait_seconds)) 
*** Begin stack trace ***
        tsl::CurrentStackTrace[abi:cxx11]()
        xla::service::MeshClient::MeshClient(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)
        xla::service::MeshClient::Get()
        xla::ComputationClient::Create()

        xla::ComputationClient::Get()

        _PyMethodDef_RawFastCallKeywords
        _PyCFunction_FastCallKeywords
        _PyEval_EvalFrameDefault
        _PyFunction_FastCallKeywords
        _PyEval_EvalFrameDefault
        _PyFunction_FastCallDict

        _PyObject_GenericGetAttrWithDict
        _PyEval_EvalFrameDefault
        _PyEval_EvalCodeWithName
        _PyFunction_FastCallKeywords
        _PyEval_EvalFrameDefault

        _PyEval_EvalFrameDefault
        _PyFunction_FastCallKeywords
        _PyEval_EvalFrameDefault
        _PyFunction_FastCallKeywords
        _PyEval_EvalFrameDefault
        _PyFunction_FastCallDict
        _PyEval_EvalFrameDefault
        _PyFunction_FastCallDict
        _PyEval_EvalFrameDefault
        _PyFunction_FastCallKeywords
        _PyEval_EvalFrameDefault
        _PyFunction_FastCallKeywords
        _PyEval_EvalFrameDefault
        _PyFunction_FastCallKeywords
        _PyEval_EvalFrameDefault
        _PyEval_EvalCodeWithName
        _PyFunction_FastCallKeywords
        _PyEval_EvalFrameDefault
        _PyEval_EvalCodeWithName
        PyEval_EvalCodeEx
        PyEval_EvalCode

        PyRun_StringFlags
        PyRun_SimpleStringFlags

        _Py_UnixMain
        __libc_start_main

*** End stack trace ***
Failed to connect to client mesh master: xiowei-2vm-red-vm.c.tpu-pytorch.internal:44145

Traceback (most recent call last):
  File "/anaconda3/envs/torch-xla-1.13/lib/python3.7/site-packages/torch_xla/distributed/xla_multiprocessing.py", line 331, in _mp_start_fn
    _start_fn(index, pf_cfg, fn, args)
  File "/anaconda3/envs/torch-xla-1.13/lib/python3.7/site-packages/torch_xla/distributed/xla_multiprocessing.py", line 324, in _start_fn
    _setup_replication()
  File "/anaconda3/envs/torch-xla-1.13/lib/python3.7/site-packages/torch_xla/distributed/xla_multiprocessing.py", line 316, in _setup_replication
    device = xm.xla_device()
  File "/anaconda3/envs/torch-xla-1.13/lib/python3.7/site-packages/torch_xla/core/xla_model.py", line 245, in xla_device
    devices = get_xla_supported_devices(devkind=devkind)
  File "/anaconda3/envs/torch-xla-1.13/lib/python3.7/site-packages/torch_xla/core/xla_model.py", line 139, in get_xla_supported_devices
    xla_devices = _DEVICES.value
  File "/anaconda3/envs/torch-xla-1.13/lib/python3.7/site-packages/torch_xla/utils/utils.py", line 32, in value
    self._value = self._gen_fn()
  File "/anaconda3/envs/torch-xla-1.13/lib/python3.7/site-packages/torch_xla/core/xla_model.py", line 21, in <lambda>
    _DEVICES = xu.LazyProperty(lambda: torch_xla._XLAC._xla_get_devices())
RuntimeError: tensorflow/compiler/xla/xla_client/mesh_service.cc:329 : Check failed: impl_->channel->WaitForConnected( std::chrono::system_clock::now() + std::chrono::seconds(connect_wait_seconds)) 
*** Begin stack trace ***
        tsl::CurrentStackTrace[abi:cxx11]()
        xla::service::MeshClient::MeshClient(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)
        xla::service::MeshClient::Get()
        xla::ComputationClient::Create()

        xla::ComputationClient::Get()

        _PyMethodDef_RawFastCallKeywords
        _PyCFunction_FastCallKeywords
        _PyEval_EvalFrameDefault
        _PyFunction_FastCallKeywords
        _PyEval_EvalFrameDefault
        _PyFunction_FastCallDict

        _PyObject_GenericGetAttrWithDict
        _PyEval_EvalFrameDefault
        _PyEval_EvalCodeWithName
        _PyFunction_FastCallKeywords
        _PyEval_EvalFrameDefault

        _PyEval_EvalFrameDefault
        _PyFunction_FastCallKeywords
        _PyEval_EvalFrameDefault
        _PyFunction_FastCallKeywords
        _PyEval_EvalFrameDefault
        _PyFunction_FastCallDict
        _PyEval_EvalFrameDefault
        _PyFunction_FastCallDict
        _PyEval_EvalFrameDefault
        _PyFunction_FastCallKeywords
        _PyEval_EvalFrameDefault
        _PyFunction_FastCallKeywords
        _PyEval_EvalFrameDefault
        _PyFunction_FastCallKeywords
        _PyEval_EvalFrameDefault
        _PyEval_EvalCodeWithName
        _PyFunction_FastCallKeywords
        _PyEval_EvalFrameDefault
        _PyEval_EvalCodeWithName
        PyEval_EvalCodeEx
        PyEval_EvalCode

        PyRun_StringFlags
        PyRun_SimpleStringFlags

        _Py_UnixMain
        __libc_start_main

*** End stack trace ***
Failed to connect to client mesh master: xiowei-2vm-red-vm.c.tpu-pytorch.internal:44145
Traceback (most recent call last):
Traceback (most recent call last):
  File "/anaconda3/envs/torch-xla-1.13/lib/python3.7/site-packages/torch_xla/distributed/xla_multiprocessing.py", line 331, in _mp_start_fn
    _start_fn(index, pf_cfg, fn, args)
  File "/anaconda3/envs/torch-xla-1.13/lib/python3.7/site-packages/torch_xla/distributed/xla_multiprocessing.py", line 331, in _mp_start_fn
    _start_fn(index, pf_cfg, fn, args)
  File "/anaconda3/envs/torch-xla-1.13/lib/python3.7/site-packages/torch_xla/distributed/xla_multiprocessing.py", line 324, in _start_fn
    _setup_replication()
  File "/anaconda3/envs/torch-xla-1.13/lib/python3.7/site-packages/torch_xla/distributed/xla_multiprocessing.py", line 316, in _setup_replication
    device = xm.xla_device()
  File "/anaconda3/envs/torch-xla-1.13/lib/python3.7/site-packages/torch_xla/core/xla_model.py", line 245, in xla_device
    devices = get_xla_supported_devices(devkind=devkind)
  File "/anaconda3/envs/torch-xla-1.13/lib/python3.7/site-packages/torch_xla/core/xla_model.py", line 139, in get_xla_supported_devices
    xla_devices = _DEVICES.value
  File "/anaconda3/envs/torch-xla-1.13/lib/python3.7/site-packages/torch_xla/utils/utils.py", line 32, in value
    self._value = self._gen_fn()
  File "/anaconda3/envs/torch-xla-1.13/lib/python3.7/site-packages/torch_xla/core/xla_model.py", line 21, in <lambda>
    _DEVICES = xu.LazyProperty(lambda: torch_xla._XLAC._xla_get_devices())
RuntimeError: tensorflow/compiler/xla/xla_client/mesh_service.cc:329 : Check failed: impl_->channel->WaitForConnected( std::chrono::system_clock::now() + std::chrono::seconds(connect_wait_seconds)) 
*** Begin stack trace ***
        tsl::CurrentStackTrace[abi:cxx11]()
        xla::service::MeshClient::MeshClient(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)
        xla::service::MeshClient::Get()
        xla::ComputationClient::Create()

        xla::ComputationClient::Get()

        _PyMethodDef_RawFastCallKeywords
        _PyCFunction_FastCallKeywords
        _PyEval_EvalFrameDefault
        _PyFunction_FastCallKeywords
        _PyEval_EvalFrameDefault
        _PyFunction_FastCallDict

        _PyObject_GenericGetAttrWithDict
        _PyEval_EvalFrameDefault
        _PyEval_EvalCodeWithName
        _PyFunction_FastCallKeywords
        _PyEval_EvalFrameDefault

        _PyEval_EvalFrameDefault
        _PyFunction_FastCallKeywords
        _PyEval_EvalFrameDefault
        _PyFunction_FastCallKeywords
        _PyEval_EvalFrameDefault
        _PyFunction_FastCallDict
        _PyEval_EvalFrameDefault
        _PyFunction_FastCallDict
        _PyEval_EvalFrameDefault
        _PyFunction_FastCallKeywords
        _PyEval_EvalFrameDefault
        _PyFunction_FastCallKeywords
        _PyEval_EvalFrameDefault
        _PyFunction_FastCallKeywords
        _PyEval_EvalFrameDefault
        _PyEval_EvalCodeWithName
        _PyFunction_FastCallKeywords
        _PyEval_EvalFrameDefault
        _PyEval_EvalCodeWithName
        PyEval_EvalCodeEx
        PyEval_EvalCode

        PyRun_StringFlags
        PyRun_SimpleStringFlags

        _Py_UnixMain
        __libc_start_main

*** End stack trace ***
Failed to connect to client mesh master: xiowei-2vm-red-vm.c.tpu-pytorch.internal:44145
  File "/anaconda3/envs/torch-xla-1.13/lib/python3.7/site-packages/torch_xla/distributed/xla_multiprocessing.py", line 324, in _start_fn
    _setup_replication()
  File "/anaconda3/envs/torch-xla-1.13/lib/python3.7/site-packages/torch_xla/distributed/xla_multiprocessing.py", line 316, in _setup_replication
    device = xm.xla_device()
  File "/anaconda3/envs/torch-xla-1.13/lib/python3.7/site-packages/torch_xla/core/xla_model.py", line 245, in xla_device
    devices = get_xla_supported_devices(devkind=devkind)
  File "/anaconda3/envs/torch-xla-1.13/lib/python3.7/site-packages/torch_xla/core/xla_model.py", line 139, in get_xla_supported_devices
    xla_devices = _DEVICES.value
  File "/anaconda3/envs/torch-xla-1.13/lib/python3.7/site-packages/torch_xla/utils/utils.py", line 32, in value
    self._value = self._gen_fn()
  File "/anaconda3/envs/torch-xla-1.13/lib/python3.7/site-packages/torch_xla/core/xla_model.py", line 21, in <lambda>
    _DEVICES = xu.LazyProperty(lambda: torch_xla._XLAC._xla_get_devices())
RuntimeError: tensorflow/compiler/xla/xla_client/mesh_service.cc:329 : Check failed: impl_->channel->WaitForConnected( std::chrono::system_clock::now() + std::chrono::seconds(connect_wait_seconds)) 
*** Begin stack trace ***
        tsl::CurrentStackTrace[abi:cxx11]()
        xla::service::MeshClient::MeshClient(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)
        xla::service::MeshClient::Get()
        xla::ComputationClient::Create()

        xla::ComputationClient::Get()

        _PyMethodDef_RawFastCallKeywords
        _PyCFunction_FastCallKeywords
        _PyEval_EvalFrameDefault
        _PyFunction_FastCallKeywords
        _PyEval_EvalFrameDefault
        _PyFunction_FastCallDict

        _PyObject_GenericGetAttrWithDict
        _PyEval_EvalFrameDefault
        _PyEval_EvalCodeWithName
        _PyFunction_FastCallKeywords
        _PyEval_EvalFrameDefault

        _PyEval_EvalFrameDefault
        _PyFunction_FastCallKeywords
        _PyEval_EvalFrameDefault
        _PyFunction_FastCallKeywords
        _PyEval_EvalFrameDefault
        _PyFunction_FastCallDict
        _PyEval_EvalFrameDefault
        _PyFunction_FastCallDict
        _PyEval_EvalFrameDefault
        _PyFunction_FastCallKeywords
        _PyEval_EvalFrameDefault
        _PyFunction_FastCallKeywords
        _PyEval_EvalFrameDefault
        _PyFunction_FastCallKeywords
        _PyEval_EvalFrameDefault
        _PyEval_EvalCodeWithName
        _PyFunction_FastCallKeywords
        _PyEval_EvalFrameDefault
        _PyEval_EvalCodeWithName
        PyEval_EvalCodeEx
        PyEval_EvalCode

        PyRun_StringFlags
        PyRun_SimpleStringFlags

        _Py_UnixMain
        __libc_start_main

*** End stack trace ***
Failed to connect to client mesh master: xiowei-2vm-red-vm.c.tpu-pytorch.internal:44145
Exception in device=TPU:2: tensorflow/compiler/xla/xla_client/mesh_service.cc:329 : Check failed: impl_->channel->WaitForConnected( std::chrono::system_clock::now() + std::chrono::seconds(connect_wait_seconds)) 
*** Begin stack trace ***
        tsl::CurrentStackTrace[abi:cxx11]()
        xla::service::MeshClient::MeshClient(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)
        xla::service::MeshClient::Get()
        xla::ComputationClient::Create()

        xla::ComputationClient::Get()

        _PyMethodDef_RawFastCallKeywords
        _PyCFunction_FastCallKeywords
        _PyEval_EvalFrameDefault
        _PyFunction_FastCallKeywords
        _PyEval_EvalFrameDefault
        _PyFunction_FastCallDict

        _PyObject_GenericGetAttrWithDict
        _PyEval_EvalFrameDefault
        _PyEval_EvalCodeWithName
        _PyFunction_FastCallKeywords
        _PyEval_EvalFrameDefault

        _PyEval_EvalFrameDefault
        _PyFunction_FastCallKeywords
        _PyEval_EvalFrameDefault
        _PyFunction_FastCallKeywords
        _PyEval_EvalFrameDefault
        _PyFunction_FastCallDict
        _PyEval_EvalFrameDefault
        _PyFunction_FastCallDict
        _PyEval_EvalFrameDefault
        _PyFunction_FastCallKeywords
        _PyEval_EvalFrameDefault
        _PyFunction_FastCallKeywords
        _PyEval_EvalFrameDefault
        _PyFunction_FastCallKeywords
        _PyEval_EvalFrameDefault
        _PyEval_EvalCodeWithName
        _PyFunction_FastCallKeywords
        _PyEval_EvalFrameDefault
        _PyEval_EvalCodeWithName
        PyEval_EvalCodeEx
        PyEval_EvalCode

        PyRun_StringFlags
        PyRun_SimpleStringFlags

        _Py_UnixMain
        __libc_start_main

*** End stack trace ***
Failed to connect to client mesh master: xiowei-2vm-red-vm.c.tpu-pytorch.internal:44145
Traceback (most recent call last):
  File "/anaconda3/envs/torch-xla-1.13/lib/python3.7/site-packages/torch_xla/distributed/xla_multiprocessing.py", line 331, in _mp_start_fn
    _start_fn(index, pf_cfg, fn, args)
  File "/anaconda3/envs/torch-xla-1.13/lib/python3.7/site-packages/torch_xla/distributed/xla_multiprocessing.py", line 324, in _start_fn
    _setup_replication()
  File "/anaconda3/envs/torch-xla-1.13/lib/python3.7/site-packages/torch_xla/distributed/xla_multiprocessing.py", line 316, in _setup_replication
    device = xm.xla_device()
  File "/anaconda3/envs/torch-xla-1.13/lib/python3.7/site-packages/torch_xla/core/xla_model.py", line 245, in xla_device
    devices = get_xla_supported_devices(devkind=devkind)
  File "/anaconda3/envs/torch-xla-1.13/lib/python3.7/site-packages/torch_xla/core/xla_model.py", line 139, in get_xla_supported_devices
    xla_devices = _DEVICES.value
  File "/anaconda3/envs/torch-xla-1.13/lib/python3.7/site-packages/torch_xla/utils/utils.py", line 32, in value
    self._value = self._gen_fn()
  File "/anaconda3/envs/torch-xla-1.13/lib/python3.7/site-packages/torch_xla/core/xla_model.py", line 21, in <lambda>
    _DEVICES = xu.LazyProperty(lambda: torch_xla._XLAC._xla_get_devices())
RuntimeError: tensorflow/compiler/xla/xla_client/mesh_service.cc:329 : Check failed: impl_->channel->WaitForConnected( std::chrono::system_clock::now() + std::chrono::seconds(connect_wait_seconds)) 
*** Begin stack trace ***
        tsl::CurrentStackTrace[abi:cxx11]()
        xla::service::MeshClient::MeshClient(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)
        xla::service::MeshClient::Get()
        xla::ComputationClient::Create()

        xla::ComputationClient::Get()

        _PyMethodDef_RawFastCallKeywords
        _PyCFunction_FastCallKeywords
        _PyEval_EvalFrameDefault
        _PyFunction_FastCallKeywords
        _PyEval_EvalFrameDefault
        _PyFunction_FastCallDict

        _PyObject_GenericGetAttrWithDict
        _PyEval_EvalFrameDefault
        _PyEval_EvalCodeWithName
        _PyFunction_FastCallKeywords
        _PyEval_EvalFrameDefault

        _PyEval_EvalFrameDefault
        _PyFunction_FastCallKeywords
        _PyEval_EvalFrameDefault
        _PyFunction_FastCallKeywords
        _PyEval_EvalFrameDefault
        _PyFunction_FastCallDict
        _PyEval_EvalFrameDefault
        _PyFunction_FastCallDict
        _PyEval_EvalFrameDefault
        _PyFunction_FastCallKeywords
        _PyEval_EvalFrameDefault
        _PyFunction_FastCallKeywords
        _PyEval_EvalFrameDefault
        _PyFunction_FastCallKeywords
        _PyEval_EvalFrameDefault
        _PyEval_EvalCodeWithName
        _PyFunction_FastCallKeywords
        _PyEval_EvalFrameDefault
        _PyEval_EvalCodeWithName
        PyEval_EvalCodeEx
        PyEval_EvalCode

        PyRun_StringFlags
        PyRun_SimpleStringFlags

        _Py_UnixMain
        __libc_start_main

*** End stack trace ***
Failed to connect to client mesh master: xiowei-2vm-red-vm.c.tpu-pytorch.internal:44145
Exception in device=TPU:6: tensorflow/compiler/xla/xla_client/mesh_service.cc:329 : Check failed: impl_->channel->WaitForConnected( std::chrono::system_clock::now() + std::chrono::seconds(connect_wait_seconds)) 
*** Begin stack trace ***
        tsl::CurrentStackTrace[abi:cxx11]()
        xla::service::MeshClient::MeshClient(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)
        xla::service::MeshClient::Get()
        xla::ComputationClient::Create()

        xla::ComputationClient::Get()

        _PyMethodDef_RawFastCallKeywords
        _PyCFunction_FastCallKeywords
        _PyEval_EvalFrameDefault
        _PyFunction_FastCallKeywords
        _PyEval_EvalFrameDefault
        _PyFunction_FastCallDict

        _PyObject_GenericGetAttrWithDict
        _PyEval_EvalFrameDefault
        _PyEval_EvalCodeWithName
        _PyFunction_FastCallKeywords
        _PyEval_EvalFrameDefault

        _PyEval_EvalFrameDefault
        _PyFunction_FastCallKeywords
        _PyEval_EvalFrameDefault
        _PyFunction_FastCallKeywords
        _PyEval_EvalFrameDefault
        _PyFunction_FastCallDict
        _PyEval_EvalFrameDefault
        _PyFunction_FastCallDict
        _PyEval_EvalFrameDefault
        _PyFunction_FastCallKeywords
        _PyEval_EvalFrameDefault
        _PyFunction_FastCallKeywords
        _PyEval_EvalFrameDefault
        _PyFunction_FastCallKeywords
        _PyEval_EvalFrameDefault
        _PyEval_EvalCodeWithName
        _PyFunction_FastCallKeywords
        _PyEval_EvalFrameDefault
        _PyEval_EvalCodeWithName
        PyEval_EvalCodeEx
        PyEval_EvalCode

        PyRun_StringFlags
        PyRun_SimpleStringFlags

        _Py_UnixMain
        __libc_start_main

*** End stack trace ***
Failed to connect to client mesh master: xiowei-2vm-red-vm.c.tpu-pytorch.internal:44145
Traceback (most recent call last):
Exception in device=TPU:4: tensorflow/compiler/xla/xla_client/mesh_service.cc:329 : Check failed: impl_->channel->WaitForConnected( std::chrono::system_clock::now() + std::chrono::seconds(connect_wait_seconds)) 
*** Begin stack trace ***
        tsl::CurrentStackTrace[abi:cxx11]()
        xla::service::MeshClient::MeshClient(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)
        xla::service::MeshClient::Get()
        xla::ComputationClient::Create()

        xla::ComputationClient::Get()

        _PyMethodDef_RawFastCallKeywords
        _PyCFunction_FastCallKeywords
        _PyEval_EvalFrameDefault
        _PyFunction_FastCallKeywords
        _PyEval_EvalFrameDefault
        _PyFunction_FastCallDict

        _PyObject_GenericGetAttrWithDict
        _PyEval_EvalFrameDefault
        _PyEval_EvalCodeWithName
        _PyFunction_FastCallKeywords
        _PyEval_EvalFrameDefault

        _PyEval_EvalFrameDefault
        _PyFunction_FastCallKeywords
        _PyEval_EvalFrameDefault
        _PyFunction_FastCallKeywords
        _PyEval_EvalFrameDefault
        _PyFunction_FastCallDict
        _PyEval_EvalFrameDefault
        _PyFunction_FastCallDict
        _PyEval_EvalFrameDefault
        _PyFunction_FastCallKeywords
        _PyEval_EvalFrameDefault
        _PyFunction_FastCallKeywords
        _PyEval_EvalFrameDefault
        _PyFunction_FastCallKeywords
        _PyEval_EvalFrameDefault
        _PyEval_EvalCodeWithName
        _PyFunction_FastCallKeywords
        _PyEval_EvalFrameDefault
        _PyEval_EvalCodeWithName
        PyEval_EvalCodeEx
        PyEval_EvalCode

        PyRun_StringFlags
        PyRun_SimpleStringFlags

        _Py_UnixMain
        __libc_start_main

*** End stack trace ***
Failed to connect to client mesh master: xiowei-2vm-red-vm.c.tpu-pytorch.internal:44145
Traceback (most recent call last):
  File "/anaconda3/envs/torch-xla-1.13/lib/python3.7/site-packages/torch_xla/distributed/xla_multiprocessing.py", line 331, in _mp_start_fn
    _start_fn(index, pf_cfg, fn, args)
  File "/anaconda3/envs/torch-xla-1.13/lib/python3.7/site-packages/torch_xla/distributed/xla_multiprocessing.py", line 324, in _start_fn
    _setup_replication()
  File "/anaconda3/envs/torch-xla-1.13/lib/python3.7/site-packages/torch_xla/distributed/xla_multiprocessing.py", line 316, in _setup_replication
    device = xm.xla_device()
  File "/anaconda3/envs/torch-xla-1.13/lib/python3.7/site-packages/torch_xla/core/xla_model.py", line 245, in xla_device
    devices = get_xla_supported_devices(devkind=devkind)
  File "/anaconda3/envs/torch-xla-1.13/lib/python3.7/site-packages/torch_xla/core/xla_model.py", line 139, in get_xla_supported_devices
    xla_devices = _DEVICES.value
  File "/anaconda3/envs/torch-xla-1.13/lib/python3.7/site-packages/torch_xla/utils/utils.py", line 32, in value
    self._value = self._gen_fn()
  File "/anaconda3/envs/torch-xla-1.13/lib/python3.7/site-packages/torch_xla/core/xla_model.py", line 21, in <lambda>
    _DEVICES = xu.LazyProperty(lambda: torch_xla._XLAC._xla_get_devices())
RuntimeError: tensorflow/compiler/xla/xla_client/mesh_service.cc:329 : Check failed: impl_->channel->WaitForConnected( std::chrono::system_clock::now() + std::chrono::seconds(connect_wait_seconds)) 
*** Begin stack trace ***
        tsl::CurrentStackTrace[abi:cxx11]()
        xla::service::MeshClient::MeshClient(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)
        xla::service::MeshClient::Get()
        xla::ComputationClient::Create()

        xla::ComputationClient::Get()

        _PyMethodDef_RawFastCallKeywords
        _PyCFunction_FastCallKeywords
        _PyEval_EvalFrameDefault
        _PyFunction_FastCallKeywords
        _PyEval_EvalFrameDefault
        _PyFunction_FastCallDict

        _PyObject_GenericGetAttrWithDict
        _PyEval_EvalFrameDefault
        _PyEval_EvalCodeWithName
        _PyFunction_FastCallKeywords
        _PyEval_EvalFrameDefault

        _PyEval_EvalFrameDefault
        _PyFunction_FastCallKeywords
        _PyEval_EvalFrameDefault
        _PyFunction_FastCallKeywords
        _PyEval_EvalFrameDefault
        _PyFunction_FastCallDict
        _PyEval_EvalFrameDefault
        _PyFunction_FastCallDict
        _PyEval_EvalFrameDefault
        _PyFunction_FastCallKeywords
        _PyEval_EvalFrameDefault
        _PyFunction_FastCallKeywords
        _PyEval_EvalFrameDefault
        _PyFunction_FastCallKeywords
        _PyEval_EvalFrameDefault
        _PyEval_EvalCodeWithName
        _PyFunction_FastCallKeywords
        _PyEval_EvalFrameDefault
        _PyEval_EvalCodeWithName
        PyEval_EvalCodeEx
        PyEval_EvalCode

        PyRun_StringFlags
        PyRun_SimpleStringFlags

        _Py_UnixMain
        __libc_start_main

*** End stack trace ***
Failed to connect to client mesh master: xiowei-2vm-red-vm.c.tpu-pytorch.internal:44145
  File "/anaconda3/envs/torch-xla-1.13/lib/python3.7/site-packages/torch_xla/distributed/xla_multiprocessing.py", line 331, in _mp_start_fn
    _start_fn(index, pf_cfg, fn, args)
  File "/anaconda3/envs/torch-xla-1.13/lib/python3.7/site-packages/torch_xla/distributed/xla_multiprocessing.py", line 324, in _start_fn
    _setup_replication()
  File "/anaconda3/envs/torch-xla-1.13/lib/python3.7/site-packages/torch_xla/distributed/xla_multiprocessing.py", line 316, in _setup_replication
    device = xm.xla_device()
  File "/anaconda3/envs/torch-xla-1.13/lib/python3.7/site-packages/torch_xla/core/xla_model.py", line 245, in xla_device
    devices = get_xla_supported_devices(devkind=devkind)
  File "/anaconda3/envs/torch-xla-1.13/lib/python3.7/site-packages/torch_xla/core/xla_model.py", line 139, in get_xla_supported_devices
    xla_devices = _DEVICES.value
  File "/anaconda3/envs/torch-xla-1.13/lib/python3.7/site-packages/torch_xla/utils/utils.py", line 32, in value
    self._value = self._gen_fn()
  File "/anaconda3/envs/torch-xla-1.13/lib/python3.7/site-packages/torch_xla/core/xla_model.py", line 21, in <lambda>
    _DEVICES = xu.LazyProperty(lambda: torch_xla._XLAC._xla_get_devices())
RuntimeError: tensorflow/compiler/xla/xla_client/mesh_service.cc:329 : Check failed: impl_->channel->WaitForConnected( std::chrono::system_clock::now() + std::chrono::seconds(connect_wait_seconds)) 
*** Begin stack trace ***
        tsl::CurrentStackTrace[abi:cxx11]()
        xla::service::MeshClient::MeshClient(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)
        xla::service::MeshClient::Get()
        xla::ComputationClient::Create()

        xla::ComputationClient::Get()

        _PyMethodDef_RawFastCallKeywords
        _PyCFunction_FastCallKeywords
        _PyEval_EvalFrameDefault
        _PyFunction_FastCallKeywords
        _PyEval_EvalFrameDefault
        _PyFunction_FastCallDict

        _PyObject_GenericGetAttrWithDict
        _PyEval_EvalFrameDefault
        _PyEval_EvalCodeWithName
        _PyFunction_FastCallKeywords
        _PyEval_EvalFrameDefault

        _PyEval_EvalFrameDefault
        _PyFunction_FastCallKeywords
        _PyEval_EvalFrameDefault
        _PyFunction_FastCallKeywords
        _PyEval_EvalFrameDefault
        _PyFunction_FastCallDict
        _PyEval_EvalFrameDefault
        _PyFunction_FastCallDict
        _PyEval_EvalFrameDefault
        _PyFunction_FastCallKeywords
        _PyEval_EvalFrameDefault
        _PyFunction_FastCallKeywords
        _PyEval_EvalFrameDefault
        _PyFunction_FastCallKeywords
        _PyEval_EvalFrameDefault
        _PyEval_EvalCodeWithName
        _PyFunction_FastCallKeywords
        _PyEval_EvalFrameDefault
        _PyEval_EvalCodeWithName
        PyEval_EvalCodeEx
        PyEval_EvalCode

        PyRun_StringFlags
        PyRun_SimpleStringFlags

        _Py_UnixMain
        __libc_start_main

*** End stack trace ***
Failed to connect to client mesh master: xiowei-2vm-red-vm.c.tpu-pytorch.internal:44145
Traceback (most recent call last):
  File "/usr/share/torch-xla-1.12/tpu-examples/deps/dlrm/dlrm_tpu_runner.py", line 15, in <module>
    xmp.spawn(main, args=(), nprocs=pre_spawn_flags.tpu_cores)
  File "/anaconda3/envs/torch-xla-1.13/lib/python3.7/site-packages/torch_xla/distributed/xla_multiprocessing.py", line 399, in spawn
    start_method=start_method)
  File "/anaconda3/envs/torch-xla-1.13/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 198, in start_processes
    while not context.join():
  File "/anaconda3/envs/torch-xla-1.13/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 154, in join
    exit_code=exitcode
torch.multiprocessing.spawn.ProcessExitedException: process 2 terminated with exit code 17
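
Every worker in the trace above fails at the same point: the mesh client cannot reach the mesh master address within its connect timeout. One quick way to narrow this down is to check whether that host and port accept a plain TCP connection from the VM. The following is a minimal sketch, not part of torch_xla; `parse_host_port` and `can_connect` are hypothetical helper names, and a successful TCP connect is only a necessary condition, since it does not prove the gRPC mesh service itself is healthy.

```python
import socket
from typing import Tuple


def parse_host_port(address: str) -> Tuple[str, int]:
    """Split a 'host:port' string, such as the one in the error message."""
    host, _, port = address.rpartition(":")
    if not host or not port.isdigit():
        raise ValueError(f"expected 'host:port', got {address!r}")
    return host, int(port)


def can_connect(host: str, port: int, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers DNS failures, refused connections, and timeouts.
        return False
```

For example, `can_connect(*parse_host_port("xiowei-2vm-red-vm.c.tpu-pytorch.internal:44145"))` could be run on the VM while the test is starting up. A `False` result would point toward a networking or service-startup problem rather than the dlrm script itself.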

To Reproduce

In the tpu-pytorch project, find the 2vm red image xiowei-2vm-red-image and use it to create a red VM. Inside the VM, run:

vm:~$ gcloud compute tpus create your-2vm-greenvm \
> --zone=us-central1-b \
> --network=default \
> --version=pytorch-1.13  \
> --accelerator-type=v3-8

vm:~$ conda activate torch-xla-1.13

vm:~$ gcloud compute tpus describe your-2vm-greenvm --zone=us-central1-b

vm:~$ export TPU_IP_ADDRESS="replace with your ip address"

vm:~$ export XRT_TPU_CONFIG="tpu_worker;0;$TPU_IP_ADDRESS:8470"

vm:~$ pip install onnx

# torch-xla-1.12 version of the test should work.
vm:~$ python /usr/share/torch-xla-1.12/tpu-examples/deps/dlrm/dlrm_tpu_runner.py
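
The `XRT_TPU_CONFIG` value in the steps above has the form `tpu_worker;0;<ip>:8470`. If the placeholder IP is left in by mistake, the failure only shows up later as a connection error. A minimal sketch of building and validating the value in Python instead of the shell, assuming the same format as the `export` line above; `build_xrt_tpu_config` is a hypothetical helper, not a torch_xla API:

```python
import ipaddress
import os


def build_xrt_tpu_config(tpu_ip: str, port: int = 8470, worker_index: int = 0) -> str:
    """Build an XRT_TPU_CONFIG string in the format used by the steps above."""
    # Raises ValueError for anything that is not a valid IP address,
    # for example a placeholder string left unreplaced.
    ipaddress.ip_address(tpu_ip)
    return f"tpu_worker;{worker_index};{tpu_ip}:{port}"


def export_xrt_tpu_config(tpu_ip: str) -> str:
    """Set XRT_TPU_CONFIG for the current process and return the value."""
    value = build_xrt_tpu_config(tpu_ip)
    os.environ["XRT_TPU_CONFIG"] = value
    return value
```

The variable presumably needs to be set before the first XLA device is requested, so a call such as `export_xrt_tpu_config("10.35.41.18")` would go at the top of the script, before `xmp.spawn`.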

Expected behavior

The dlrm test should run to completion without crashing.

JackCaoG commented 2 years ago

Can you get resnet50 running with fake data on the same TPU VM? I want to make sure this is not caused by a setup issue.

vanbasten23 commented 2 years ago

On the same red VM, running resnet50 with fake data succeeds:

(base) xiowei@xioweicloudtop1:~/pytorch/xla$ gcloud compute ssh xiowei-dlrm-tutorial --zone=us-central1-a
Last login: Mon Oct 24 17:20:00 2022 from 216.239.45.216
xiowei@xiowei-dlrm-tutorial:~$ conda activate torch-xla-1.13
(torch-xla-1.13) xiowei@xiowei-dlrm-tutorial:~$ export TPU_IP_ADDRESS=10.35.41.18
(torch-xla-1.13) xiowei@xiowei-dlrm-tutorial:~$ export XRT_TPU_CONFIG="tpu_worker;0;$TPU_IP_ADDRESS:8470"
(torch-xla-1.13) xiowei@xiowei-dlrm-tutorial:~$ git clone --recursive https://github.com/pytorch/xla.git
(torch-xla-1.13) xiowei@xiowei-dlrm-tutorial:~$ python xla/test/test_train_mp_imagenet.py --fake_data --model=resnet50 --num_epochs=1
==> Preparing data..
Epoch 1 train begin 18:25:04
==> Preparing data..
==> Preparing data..
==> Preparing data..
==> Preparing data..
==> Preparing data..
==> Preparing data..
==> Preparing data..
| Training Device=xla:0/7 Epoch=1 Step=0 Loss=6.89059 Rate=4.06 GlobalRate=4.06 Time=18:25:39
| Training Device=xla:0/1 Epoch=1 Step=0 Loss=6.89059 Rate=4.09 GlobalRate=4.09 Time=18:25:39
| Training Device=xla:0/3 Epoch=1 Step=0 Loss=6.89059 Rate=3.99 GlobalRate=3.99 Time=18:25:39
| Training Device=xla:0/6 Epoch=1 Step=0 Loss=6.89059 Rate=3.93 GlobalRate=3.93 Time=18:25:39
| Training Device=xla:0/4 Epoch=1 Step=0 Loss=6.89059 Rate=3.97 GlobalRate=3.97 Time=18:25:39
| Training Device=xla:0/5 Epoch=1 Step=0 Loss=6.89059 Rate=4.11 GlobalRate=4.11 Time=18:25:39
...
| Training Device=xla:0/3 Epoch=1 Step=1160 Loss=0.00136 Rate=590.95 GlobalRate=452.56 Time=18:30:35
Epoch 1 train end 18:30:38
| Test Device=xla:0/3 Step=0 Epoch=1 Time=18:30:42
| Test Device=xla:0/7 Step=0 Epoch=1 Time=18:30:42
| Test Device=xla:1/0 Step=0 Epoch=1 Time=18:30:42
| Test Device=xla:0/2 Step=0 Epoch=1 Time=18:30:42
| Test Device=xla:0/4 Step=0 Epoch=1 Time=18:30:42
| Test Device=xla:0/6 Step=0 Epoch=1 Time=18:30:42
| Test Device=xla:0/1 Step=0 Epoch=1 Time=18:30:42
| Test Device=xla:0/5 Step=0 Epoch=1 Time=18:30:42
| Test Device=xla:0/4 Step=20 Epoch=1 Time=18:30:48
| Test Device=xla:0/1 Step=20 Epoch=1 Time=18:30:48
| Test Device=xla:0/3 Step=20 Epoch=1 Time=18:30:48
| Test Device=xla:0/2 Step=20 Epoch=1 Time=18:30:48
| Test Device=xla:1/0 Step=20 Epoch=1 Time=18:30:48
| Test Device=xla:0/5 Step=20 Epoch=1 Time=18:30:48
| Test Device=xla:0/7 Step=20 Epoch=1 Time=18:30:48
| Test Device=xla:0/6 Step=20 Epoch=1 Time=18:30:48
| Test Device=xla:0/1 Step=40 Epoch=1 Time=18:30:49
| Test Device=xla:0/3 Step=40 Epoch=1 Time=18:30:49
| Test Device=xla:1/0 Step=40 Epoch=1 Time=18:30:49
| Test Device=xla:0/6 Step=40 Epoch=1 Time=18:30:49
| Test Device=xla:0/2 Step=40 Epoch=1 Time=18:30:49
| Test Device=xla:0/5 Step=40 Epoch=1 Time=18:30:49
| Test Device=xla:0/7 Step=40 Epoch=1 Time=18:30:49
| Test Device=xla:0/4 Step=40 Epoch=1 Time=18:30:49
Epoch 1 test end 18:30:49, Accuracy=100.00
Max Accuracy: 100.00%
(torch-xla-1.13) xiowei@xiowei-dlrm-tutorial:~$