pytorch / xla

Enabling PyTorch on XLA Devices (e.g. Google TPU)
https://pytorch.org/xla

`TestExpandSymInt` fails due to incorrect dynamism report by `isDynamic()` #3680

Closed: miladm closed this issue 2 years ago

miladm commented 2 years ago

While running TestExpandSymInt (ref PR), I ran into the following error. The error suggests that is_symbolic() returns an incorrect value when toSymbolicIntNode() is called, which in turn suggests the upstream API call needs investigation. @Gamrix wdyt?

(base) $ source  xlaCppTest.sh ExpandSymInt
Note: Google Test filter = AtenXlaTensorTest.TestExpandSymInt
[==========] Running 1 test from 1 test suite.
[----------] Global test environment set-up.
[----------] 1 test from AtenXlaTensorTest
[ RUN      ] AtenXlaTensorTest.TestExpandSymInt
2022-07-04 06:27:20.555229: I  178569 tensorflow/core/tpu/tpu_initializer_helper.cc:253] Libtpu path is: libtpu.so
2022-07-04 06:27:20.555808: I  178569 tensorflow/compiler/xla/xla_client/xrt_local_service.cc:55] libtpu status: OK
2022-07-04 06:27:20.555859: I  178569 tensorflow/compiler/xla/xla_client/xrt_local_service.cc:41] Peer localservice 1 {localhost:40934}
2022-07-04 06:27:20.556054: I  178569 tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  SSE3 SSE4.1 SSE4.2 AVX AVX2 AVX512F FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2022-07-04 06:27:20.573048: W  178569 tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcuda.so.1'; dlerror: libcuda.so.1: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/cuda/lib64:/usr/local/nvidia/lib:/usr/local/nvidia/lib64
2022-07-04 06:27:20.573096: W  178569 tensorflow/stream_executor/cuda/cuda_driver.cc:269] failed call to cuInit: UNKNOWN ERROR (303)
2022-07-04 06:27:20.573133: I  178569 tensorflow/stream_executor/cuda/cuda_diagnostics.cc:156] kernel driver does not appear to be running on this host (bc9148e3e599): /proc/driver/nvidia/version does not exist
2022-07-04 06:27:20.620601: I  178569 tensorflow/compiler/xla/service/service.cc:174] XLA service 0x17ad290 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2022-07-04 06:27:20.620662: I  178569 tensorflow/compiler/xla/service/service.cc:182]   StreamExecutor device (0): Host, Default Version
2022-07-04 06:27:20.672979: I  178569 tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:272] Initialize GrpcChannelCache for job localservice -> {0 -> localhost:40934}
2022-07-04 06:27:20.674373: I  178569 tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:438] Started server with target: grpc://localhost:40934
2022-07-04 06:27:20.766387: I  179241 tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:354] MLIR V1 optimization pass is not enabled
2022-07-04 06:27:20.776891: I  178865 tensorflow/compiler/jit/xla_device.cc:429] XLA_GPU and XLA_CPU devices are deprecated and will be removed in subsequent releases. Instead, use either @tf.function(jit_compile=True) for must-compile semantics, or run with TF_XLA_FLAGS=--tf_xla_auto_jit=2 for auto-clustering best-effort compilation.
unknown file: Failure
C++ exception with description "Expected is_symbolic() to be true, but got false.  (Could this error message be improved?  If so, please report an enhancement request to PyTorch.)
Exception raised from toSymbolicIntNode at /workspace/pytorch/c10/core/SymInt.cpp:9 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x7d (0x7f048e8bd01d in /workspace/pytorch/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, char const*) + 0xdd (0x7f048e8bb84d in /workspace/pytorch/torch/lib/libc10.so)
frame #2: <unknown function> + 0x1dd41 (0x7f048e8a5d41 in /workspace/pytorch/torch/lib/libc10.so)
frame #3: torch_xla::SymIntElements::SetSymIntNodeElements(c10::SymInt&) + 0x1b (0x7f048e6f9f8b in /workspace/pytorch/xla/build/lib.linux-x86_64-3.7/libptxla.so)
frame #4: torch_xla::SymIntElements::SymIntElements(c10::SymIntArrayRef&) + 0x9c (0x7f048e29399c in /workspace/pytorch/xla/build/lib.linux-x86_64-3.7/libptxla.so)
frame #5: torch_xla::XLANativeFunctions::expand_symint(at::Tensor const&, c10::SymIntArrayRef, bool) + 0x5a (0x7f048e24aaca in /workspace/pytorch/xla/build/lib.linux-x86_64-3.7/libptxla.so)
frame #6: <unknown function> + 0x24da98 (0x7f048e31ca98 in /workspace/pytorch/xla/build/lib.linux-x86_64-3.7/libptxla.so)
frame #7: at::_ops::expand_SymInt::redispatch(c10::DispatchKeySet, at::Tensor const&, c10::SymIntArrayRef, bool) + 0x80 (0x7f0463f6edf0 in /workspace/pytorch/torch/lib/libtorch_cpu.so)
frame #8: <unknown function> + 0x3b4be4a (0x7f04660e2e4a in /workspace/pytorch/torch/lib/libtorch_cpu.so)
frame #9: at::_ops::expand_SymInt::redispatch(c10::DispatchKeySet, at::Tensor const&, c10::SymIntArrayRef, bool) + 0x80 (0x7f0463f6edf0 in /workspace/pytorch/torch/lib/libtorch_cpu.so)
frame #10: <unknown function> + 0x324d68b (0x7f04657e468b in /workspace/pytorch/torch/lib/libtorch_cpu.so)
frame #11: at::_ops::expand_SymInt::call(at::Tensor const&, c10::SymIntArrayRef, bool) + 0x156 (0x7f0463f6eac6 in /workspace/pytorch/torch/lib/libtorch_cpu.so)
frame #12: /workspace/pytorch/xla/test/cpp/build/test_ptxla() [0x6bb281]
frame #13: torch_xla::cpp_test::ForEachDevice(absl::lts_20211102::Span<torch_xla::DeviceType const>, std::function<void (c10::Device const&)> const&) + 0x140 (0x592110 in /workspace/pytorch/xla/test/cpp/build/test_ptxla)
frame #14: torch_xla::cpp_test::AtenXlaTensorTest_TestExpandSymInt_Test::TestBody() + 0xf7 (0x607507 in /workspace/pytorch/xla/test/cpp/build/test_ptxla)
frame #15: void testing::internal::HandleSehExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) + 0x7e (0x7bee5e in /workspace/pytorch/xla/test/cpp/build/test_ptxla)
frame #16: void testing::internal::HandleExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) + 0x7b (0x7a37db in /workspace/pytorch/xla/test/cpp/build/test_ptxla)
frame #17: testing::Test::Run() + 0xd9 (0x77d3e9 in /workspace/pytorch/xla/test/cpp/build/test_ptxla)
frame #18: testing::TestInfo::Run() + 0x10d (0x77e19d in /workspace/pytorch/xla/test/cpp/build/test_ptxla)
frame #19: testing::TestSuite::Run() + 0x110 (0x77ea00 in /workspace/pytorch/xla/test/cpp/build/test_ptxla)
frame #20: testing::internal::UnitTestImpl::RunAllTests() + 0x473 (0x78f6c3 in /workspace/pytorch/xla/test/cpp/build/test_ptxla)
frame #21: bool testing::internal::HandleSehExceptionsInMethodIfSupported<testing::internal::UnitTestImpl, bool>(testing::internal::UnitTestImpl*, bool (testing::internal::UnitTestImpl::*)(), char const*) + 0x7e (0x7c2a6e in /workspace/pytorch/xla/test/cpp/build/test_ptxla)
frame #22: bool testing::internal::HandleExceptionsInMethodIfSupported<testing::internal::UnitTestImpl, bool>(testing::internal::UnitTestImpl*, bool (testing::internal::UnitTestImpl::*)(), char const*) + 0x7b (0x7a616b in /workspace/pytorch/xla/test/cpp/build/test_ptxla)
frame #23: testing::UnitTest::Run() + 0xd4 (0x78f204 in /workspace/pytorch/xla/test/cpp/build/test_ptxla)
frame #24: main + 0x1c (0x58fc6c in /workspace/pytorch/xla/test/cpp/build/test_ptxla)
frame #25: __libc_start_main + 0xe7 (0x7f0461c0ac87 in /lib/x86_64-linux-gnu/libc.so.6)
frame #26: _start + 0x2a (0x58fb8a in /workspace/pytorch/xla/test/cpp/build/test_ptxla)
" thrown in the test body.
[  FAILED  ] AtenXlaTensorTest.TestExpandSymInt (245 ms)
[----------] 1 test from AtenXlaTensorTest (245 ms total)

[----------] Global test environment tear-down
[==========] 1 test from 1 test suite ran. (245 ms total)
[  PASSED  ] 0 tests.
[  FAILED  ] 1 test, listed below:
[  FAILED  ] AtenXlaTensorTest.TestExpandSymInt

FWIW, TestExpand runs successfully as expected.

CC @Krovatkin

Gamrix commented 2 years ago

I suspect the error arises from the use of toSymbolicIntNode() in LazyNativeFunctions::narrow_copy_symint(), as it is the only call site that does not first check whether the node is symbolic.
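
For reference, a minimal sketch of the failing pattern; only is_symbolic() and toSymbolicIntNode() come from the trace above, the surrounding names are illustrative:

#include <c10/core/SymInt.h>

// Unguarded unwrap: toSymbolicIntNode() TORCH_CHECKs is_symbolic() first,
// so a SymInt holding a plain integer trips exactly the
// "Expected is_symbolic() to be true, but got false." check in the log.
void UnguardedUnwrap(c10::SymInt& size) {
  auto node = size.toSymbolicIntNode();  // throws for a concrete int
  (void)node;
}

// Guarded unwrap: check the tag before unwrapping, and fall back to the
// concrete-int path otherwise.
void GuardedUnwrap(c10::SymInt& size) {
  if (size.is_symbolic()) {
    auto node = size.toSymbolicIntNode();  // safe: tag was checked
    (void)node;
  } else {
    // treat `size` as a static dimension instead
  }
}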

miladm commented 2 years ago

Here is the code snippet that produces the error.

miladm commented 2 years ago

@Gamrix are you referring to this code?

IIUC, you think we are missing a check that determines whether a node is symbolic; the error message, however, suggests the check is performed but returns an incorrect result. How would a missing check explain that?

Curious to hear your guidance.

miladm commented 2 years ago

This issue has been addressed by adding the following logic to SymIntElements::SetSymIntNodeElements() in torch_util.cpp:

if (size.is_symbolic()) {
  // handle the symbolic int
} else {
  // create an IR node representing a constant for the concrete int
}

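For completeness, a hedged sketch of what the guarded SetSymIntNodeElements() could look like; `dynamic_dims_` and the class shell are illustrative stand-ins, not the actual torch_xla fields:

#include <vector>
#include <c10/core/SymInt.h>

class SymIntElementsSketch {
 public:
  // Mirrors the shape of the fix: branch on is_symbolic() before calling
  // toSymbolicIntNode(), and record a plain constant for concrete dims.
  void SetSymIntNodeElements(c10::SymInt& size) {
    if (size.is_symbolic()) {
      auto node = size.toSymbolicIntNode();  // safe: tag checked above
      dynamic_dims_.push_back(true);
      (void)node;  // the real code builds an XLA IR size node from this
    } else {
      // Concrete dim: no SymbolicIntNode exists, so unwrapping here is
      // what used to raise the exception; record a static dim instead.
      dynamic_dims_.push_back(false);
    }
  }

 private:
  std::vector<bool> dynamic_dims_;
};
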
miladm commented 2 years ago

This commit has the implementation.

miladm commented 2 years ago

Closing.