Open zplizzi opened 2 years ago
Don't use `equal` to compare floating point tensors. This is expected behavior, see https://pytorch.org/docs/stable/notes/numerical_accuracy.html. Channels-last and non-channels-last convolutions use different kernels and produce slightly different results (so `allclose` would work, but `equal` would not).
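To illustrate why two kernels can disagree in the last bits: floating point addition is not associative, so a kernel that accumulates the same terms in a different order can round differently. The following is only a minimal pure-Python sketch of that effect with arbitrary values; it does not run any PyTorch convolution kernel.

```python
import math

# Same three addends, accumulated in two different orders.
left_to_right = (0.1 + 0.2) + 0.3
right_to_left = 0.1 + (0.2 + 0.3)

# An exact comparison fails, while a tolerance-based comparison passes.
exactly_equal = left_to_right == right_to_left
approximately_equal = math.isclose(left_to_right, right_to_left, rel_tol=1e-9)

print(exactly_equal, approximately_equal)  # prints: False True
```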
`equal` is appropriate for comparing results that you expect to be deterministic, is it not? This is the whole point of running code in deterministic mode - ensuring exactly identical results. My point in raising this issue is to note that Conv2d is not deterministic in a case that many people may expect it would be - and that this behavior could be documented (or fixed, if running in deterministic mode) to save others the confusion that I had when discovering this.
Also, perhaps `equal` could have a flag to also check whether the memory format of the two tensors is identical? Or at least a note in the docs that it doesn't compare memory formats. Because it is very confusing to get different results from two tensors that are `equal`.
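Until something like that exists, a layout-aware check can be approximated in user code. This is only a sketch: `equal_with_layout` is a hypothetical helper name, not a PyTorch API, and comparing strides is one possible definition of "same memory format".

```python
import torch


def equal_with_layout(a: torch.Tensor, b: torch.Tensor) -> bool:
    """Hypothetical helper: same shape and values (torch.equal) AND same strides."""
    return torch.equal(a, b) and a.stride() == b.stride()


x = torch.randn(2, 3, 4, 4)
# Same values, but stored with channels-last (NHWC) strides.
x_cl = x.contiguous(memory_format=torch.channels_last)

values_match = torch.equal(x, x_cl)        # compares shape and values only
layout_match = equal_with_layout(x, x_cl)  # additionally compares strides

print(values_match, layout_match)  # prints: True False
```

Comparing strides is stricter than comparing memory formats, so two tensors that differ only in how a size-1 dimension is strided would also be reported as different.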
And for what it's worth, the code example I gave above also fails `allclose` with default tolerances; the numerical difference is slightly larger than the default tolerance of `allclose`.
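For reference, the documented element-wise criterion for `allclose` is |input - other| <= atol + rtol * |other|, with defaults rtol=1e-05 and atol=1e-08. The sketch below restates that check in plain Python (`within_allclose_tolerance` is a hypothetical name) to show one plausible way a small error can still fail: for values near zero only `atol` is left. The thread does not say which elements failed in the reporter's example, so the numbers here are purely illustrative.

```python
def within_allclose_tolerance(a: float, b: float,
                              rtol: float = 1e-5, atol: float = 1e-8) -> bool:
    """Element-wise criterion documented for torch.allclose."""
    return abs(a - b) <= atol + rtol * abs(b)


# Values of order 1: a difference of about 1e-6 is within tolerance.
near_one = within_allclose_tolerance(1.0, 1.000001)

# Values near zero: rtol * |b| is negligible, so effectively only atol = 1e-8
# remains, and a difference of about 1e-7 is rejected.
near_zero = within_allclose_tolerance(3.0e-7, 4.0e-7)

print(near_one, near_zero)  # prints: True False
```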
I agree the docs could be clearer; we don't clearly define what "input" means:
Sets whether PyTorch operations must use “deterministic” algorithms. That is, algorithms which, given the same input, and when run on the same software and hardware, always produce the same output. When enabled, operations will use deterministic algorithms when available, and if only nondeterministic algorithms are available they will throw a RuntimeError when called.
🐛 Describe the bug
I would expect that if I pass two identical tensors through a Conv2d in deterministic mode, they would produce an identical output. However, this is not the case if the tensors are identical in every way except their stride - in that case the output is different. A difference in stride doesn't cause the two tensors to not be `torch.equal`, which is especially confusing - two "equal" tensors can produce a different output.
Ideally this should be fixed such that differences in stride don't affect the output of otherwise-deterministic operations, but at minimum the page on reproducibility should mention this.
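The situation described above can be sketched roughly as follows. This is not the reporter's original code, which is not reproduced in this text; the layer sizes, input shape, and seed are arbitrary assumptions, and whether the two outputs actually differ depends on the device and backend kernels, so no particular output is claimed.

```python
import os

# Documented setting for deterministic cuBLAS on CUDA; harmless on CPU.
os.environ.setdefault("CUBLAS_WORKSPACE_CONFIG", ":4096:8")

import torch

torch.manual_seed(0)
torch.use_deterministic_algorithms(True)

device = "cuda" if torch.cuda.is_available() else "cpu"
conv = torch.nn.Conv2d(3, 8, kernel_size=3, padding=1).to(device)

x = torch.randn(2, 3, 16, 16, device=device)
# Same shape and values as x, but with channels-last strides.
x_cl = x.contiguous(memory_format=torch.channels_last)

same_values = torch.equal(x, x_cl)
same_strides = x.stride() == x_cl.stride()

with torch.no_grad():
    y = conv(x)
    y_cl = conv(x_cl)

outputs_identical = torch.equal(y, y_cl)
max_abs_diff = (y - y_cl).abs().max().item()

print("inputs torch.equal:", same_values)
print("inputs share strides:", same_strides)
print("outputs torch.equal:", outputs_identical)
print("max abs output difference:", max_abs_diff)
```

If `outputs_identical` prints `False` while `same_values` is `True`, the run shows the reported behavior: inputs that compare as equal but take different kernel paths because of their strides.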
Versions
Collecting environment information...
PyTorch version: 1.13.0+cu116
Is debug build: False
CUDA used to build PyTorch: 11.6
ROCM used to build PyTorch: N/A

OS: Ubuntu 20.04.5 LTS (x86_64)
GCC version: (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0
Clang version: Could not collect
CMake version: version 3.16.3
Libc version: glibc-2.31

Python version: 3.9.5 (default, Nov 23 2021, 15:27:38) [GCC 9.3.0] (64-bit runtime)
Python platform: Linux-5.13.0-30-generic-x86_64-with-glibc2.31
Is CUDA available: True
CUDA runtime version: Could not collect
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration:
GPU 0: NVIDIA GeForce RTX 3090
GPU 1: NVIDIA GeForce RTX 3090
GPU 2: NVIDIA GeForce RTX 3090

Nvidia driver version: 470.103.01
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

Versions of relevant libraries:
[pip3] geotorch==0.2.0
[pip3] mypy==0.971
[pip3] mypy-boto3-ec2==1.17.41.0
[pip3] mypy-boto3-s3==1.17.41.0
[pip3] mypy-extensions==0.4.3
[pip3] numpy==1.22.2
[pip3] pytorch-lightning==1.7.3
[pip3] torch==1.13.0+cu116
[pip3] torchmetrics==0.7.0
[pip3] torchvision==0.13.0+cu113
[conda] Could not collect