isl-org / Open3D

Open3D: A Modern Library for 3D Data Processing
http://www.open3d.org

Inconsistent output between CPU & CUDA for Tensor::To and/or Tensor::Sum #5577

Open Algomorph opened 1 year ago

Algomorph commented 1 year ago

Describe the issue

On the v0.15.1 release, working in C++, I get inconsistent output from the exact same functions (e.g. Tensor::To, Tensor::Sum, and/or the t::geometry::Image constructor) when the only difference is the device.

Steps to reproduce the bug

Here is a sample of my code, almost a minimal reproducible example (MRE), that points to the bug. It should be very easy to convert into an MRE. If it isn't obvious, try changing the device of the input tensors and observing the difference in output.

// The tensors being compared are two contiguous open3d::core::Int32 tensors
// (e.g. of size 480 x 640 x 1) on the same device.
// Side note: in my case, the "_ground_truth" tensor was loaded from a NumPy array
// and converted to the device in question using Tensor::To. The other tensor was generated.
auto diff = pixel_face_indices.IsClose(pixel_face_indices_ground_truth).LogicalNot();

// Feel free to change the output path to some simple local path.
// With CPU tensors the image is pure black, which is inconsistent even with the CPU Sum() output.
auto diff_image = open3d::t::geometry::Image(diff.To(open3d::core::UInt8) * 255);
open3d::t::io::WriteImage(test::generated_image_test_data_directory.ToString() + "/" + mesh_name + "_diff_mask.png", diff_image);

std::cout << diff.NonZero().GetShape().ToString() << std::endl; // correct for both CPU and CUDA devices
std::cout << diff.GetShape().ToString() << std::endl; // same for both device types, as it should be
std::cout << diff.GetDtype().ToString() << std::endl; // same for both device types, as it should be
std::cout << diff.GetDevice().ToString() << std::endl; // prints the proper device for each device type
std::cout << diff.To(open3d::core::Int64).Sum({0, 1, 2}).ToString() << std::endl; // not sure what's going on, but the output is wrong for CPU here

Error message

No error messages, but the output is incorrect / inconsistent between CPU and CUDA tensors.

Expected behavior

The result for CUDA seems correct. The image is also correct: white in the areas that differ between the two input tensors. Output with CUDA tensors:

{3, 1536}
{480, 640, 1}
Bool
CUDA:0
1536

Here's output of the same code with CPU tensors:

{3, 1536}
{480, 640, 1}
Bool
CPU:0
391680

And the image generated is pure black, which makes no sense...
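
One possible reading of the two outputs above, not confirmed anywhere in this thread: 391680 equals 1536 x 255, which is the sum one would get if every true element of the CPU Bool tensor were stored as the raw byte 255 instead of 1 and the cast to Int64 copied that raw value. Under the same assumption, diff.To(UInt8) * 255 would wrap around to 1 in UInt8 arithmetic, which would render as a nearly black image. The standalone sketch below only checks that arithmetic. It makes no Open3D calls, and MakeMask, SumAsInt64, and ScaleToPixel are hypothetical helper names introduced for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: models a Bool mask as raw bytes, where the first
// `num_true` entries hold `true_byte`. A value of 1 is the usual encoding;
// 255 is the ASSUMPTION being examined here, not a confirmed Open3D behavior.
std::vector<std::uint8_t> MakeMask(std::size_t total, std::size_t num_true,
                                   std::uint8_t true_byte) {
    assert(num_true <= total);
    std::vector<std::uint8_t> mask(total, 0);
    for (std::size_t i = 0; i < num_true; ++i) {
        mask[i] = true_byte;
    }
    return mask;
}

// Models a cast to Int64 followed by a full sum, assuming the cast copies
// the raw byte value rather than normalizing it to 0 or 1.
std::int64_t SumAsInt64(const std::vector<std::uint8_t>& mask) {
    std::int64_t sum = 0;
    for (std::uint8_t v : mask) {
        sum += static_cast<std::int64_t>(v);
    }
    return sum;
}

// Models one pixel of "mask as UInt8, times 255" with 8-bit wraparound.
std::uint8_t ScaleToPixel(std::uint8_t raw) {
    return static_cast<std::uint8_t>(raw * 255u);
}
```

With a 480 x 640 mask containing 1536 true entries, SumAsInt64 returns 1536 for a true byte of 1 and 391680 for a true byte of 255, matching the CUDA and CPU outputs respectively. ScaleToPixel(1) gives 255 (white), while ScaleToPixel(255) gives 1 (almost black). Whether the CPU tensor really holds 255 would still need to be checked on a small tensor.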

Open3D, Python and System information

- Operating system: Ubuntu 20.04
- Python version: you don't need this here
- Open3D version: 0.15.1+ed30e3b6
- System architecture: arm64
- Is this a remote workstation?: no
- How did you install Open3D?: build from source
- Compiler version (if built from source): gcc 9.4

Additional information

It might make sense to prepare the bug template for C++ bug reports: I had to manually go back and edit the "Steps" entry, since it was only set up for Python.

ssheorey commented 1 year ago

Hi @Algomorph, it's not clear which of the function calls is erroneous. Can you test with small tensors (say 4x6) and print out the results of each function call to narrow the issue down to a single call? Alternatively, provide the data (the pixel_face_* tensors) needed to reproduce this.

Algomorph commented 1 year ago

@ssheorey I won't have time for that this week or next, maybe longer.