isl-org / Open3D

Open3D: A Modern Library for 3D Data Processing
http://www.open3d.org

Segmentation Fault after registration process in Reconstruction pipeline #2107

Closed zainmehdi closed 3 years ago

zainmehdi commented 4 years ago

I am trying to run the reconstruction pipeline with the provided Python scripts. I recorded my data using a RealSense D435i. When I run the script it is able to register the fragments, but just after this phase it gives me a segmentation fault.

To Reproduce: I used the sample dataset sequence 16 as provided in the tutorial. If you run the pipeline it should give a segmentation fault. The screenshot below is of my custom data, but the results were the same in both cases.

python run_system.py config/realsense.json --make --register --refine --integrate

Expected behavior
Segmentation fault

Screenshots: (screenshot of the crash attached)

I am running it in the conda base environment.


griegler commented 4 years ago

@theNded can you have a look at this please.

zainmehdi commented 4 years ago

Ok, so there is a definite memory leak somewhere. I have a Core i7 2.6 GHz with 16 GB of RAM, and it's consuming all the memory; CPU usage is very high as well. Here is the screenshot:

theNded commented 4 years ago

There are some issues with Python multithreading regarding memory usage. Can you try turning it off in the config file and running again? Another potential issue is a failure of global registration. In the global registration I can see there are 19 nodes and 18 edges, which means no valid loop closure is detected. This may cause incorrect integration into an unreasonably large space.
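For reference, a minimal sketch of flipping that flag programmatically (the path and key name follow config/realsense.json as printed later in this thread; adjust to your setup):

import json

# Load the reconstruction config, disable Python multithreading,
# and write it back.
config_path = "config/realsense.json"
with open(config_path) as f:
    config = json.load(f)

config["python_multi_threading"] = False  # process fragments sequentially

with open(config_path, "w") as f:
    json.dump(config, f, indent=4)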

zainmehdi commented 4 years ago

@theNded Thanks for the reply. I tried setting the multithreading option to false, and even then it causes a segmentation fault if the voxel size is smaller than 5 cm and the depth in the images goes beyond 3 m.

theNded commented 4 years ago

Do 5 cm and 3 m work? In fact, the voxel size in the config file is a little bit confusing -- the real voxel size is 5cm / 512 (see https://github.com/intel-isl/Open3D/blob/master/examples/python/ReconstructionSystem/integrate_scene.py#L24). So it would be good to check the result of this default configuration.
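As an illustrative sketch of that construction (the parameter values here are assumptions taken from the config printed later in this thread; the linked script is the authoritative version, and the module path assumes Open3D >= 0.11):

import open3d as o3d

# The effective TSDF voxel length is a configured cube size divided by 512,
# i.e. much smaller than the "voxel_size" entry used for downsampling.
cubic_size = 3.0  # tsdf_cubic_size from the config dump below
volume = o3d.pipelines.integration.ScalableTSDFVolume(
    voxel_length=cubic_size / 512.0,   # ~5.9 mm in this example
    sdf_trunc=0.04,                    # assumed truncation distance
    color_type=o3d.pipelines.integration.TSDFVolumeColorType.RGB8)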

Another suggestion is to check the reconstruction of the fragments in the fragment folder using https://github.com/intel-isl/Open3D/blob/master/examples/python/ReconstructionSystem/debug/visualize_fragments.py.
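A minimal sketch of that check, assuming the dataset layout from the config printed later in this thread:

import glob
import open3d as o3d

# Load each fragment point cloud and display it one by one; badly
# distorted or duplicated geometry points to failed odometry.
for ply_path in sorted(glob.glob("dataset/realsense/fragments/fragment_*.ply")):
    pcd = o3d.io.read_point_cloud(ply_path)
    print(ply_path, pcd)
    o3d.visualization.draw_geometries([pcd])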

zainmehdi commented 4 years ago

Yes, 5 cm and 3 m work. And I checked the results of the fragments; it produces .ply files. Here are a few more interesting observations.

theNded commented 4 years ago
  • If I try to run the fragmentation part with multithreading, it crashes. Running it without multithreading works with the above-mentioned voxel and depth values; otherwise it fails even in single-threading mode.

For multithreading you may want to tune MAX_THREAD here: https://github.com/intel-isl/Open3D/blob/master/examples/python/ReconstructionSystem/make_fragments.py#L181 (see the sketch at the end of this comment).

It makes sense if the resolution is very high; otherwise it will be interesting to look at the log.

  • The registration part doesn't work with a single thread, no matter what the voxel and depth values are. If I change the config to multithreading, it works.

This is weird. Can you put the log here?
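Regarding the MAX_THREAD tuning mentioned above, here is a rough sketch of the parallel fragment loop (process_single_fragment and n_fragments are placeholders here, not the actual code); a smaller worker count trades speed for peak memory:

import multiprocessing
from joblib import Parallel, delayed

def process_single_fragment(fragment_id):
    # Placeholder for the real per-fragment work in make_fragments.py.
    print("building fragment", fragment_id)

n_fragments = 13
# Cap the number of joblib workers; reduce further (e.g. to 2) on
# memory-constrained machines.
MAX_THREAD = min(multiprocessing.cpu_count(), n_fragments)
Parallel(n_jobs=MAX_THREAD)(
    delayed(process_single_fragment)(i) for i in range(n_fragments))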

zainmehdi commented 4 years ago

Here is the log for the second case. For the first case I will do more tests and share some logs with you. Configuration:


                                    name : Captured frames using Realsense
                            path_dataset : dataset/realsense/
                          path_intrinsic : dataset/realsense/camera_intrinsic.json
                               max_depth : 3.0
                              voxel_size : 0.1
                          max_depth_diff : 0.07
        preference_loop_closure_odometry : 0.1
    preference_loop_closure_registration : 5.0
                         tsdf_cubic_size : 3.0
                              icp_method : color
                     global_registration : ransac
                  python_multi_threading : False
                          depth_map_type : redwood
                   n_frames_per_fragment : 100
                 n_keyframes_per_n_frame : 5
                               min_depth : 0.3
                         folder_fragment : fragments/
             template_fragment_posegraph : fragments/fragment_%03d.json
   template_fragment_posegraph_optimized : fragments/fragment_optimized_%03d.json
            template_fragment_pointcloud : fragments/fragment_%03d.ply
                            folder_scene : scene/
               template_global_posegraph : scene/global_registration.json
     template_global_posegraph_optimized : scene/global_registration_optimized.json
              template_refined_posegraph : scene/refined_registration.json
    template_refined_posegraph_optimized : scene/refined_registration_optimized.json
                    template_global_mesh : scene/integrated.ply
                    template_global_traj : scene/trajectory.log
                              debug_mode : False
Filament library loaded.
register fragments.
reading dataset/realsense/fragments/fragment_000.ply ...
Segmentation fault (core dumped)
theNded commented 4 years ago

It would be nice if you could share your data somewhere; I think I have to take a closer look...

zainmehdi commented 4 years ago

https://drive.google.com/file/d/1O0CRAu7GJaY1J19l7Qk_Gnh9qfzfmxcG/view?usp=sharing Here is the data

theNded commented 4 years ago

Sorry for coming back late. There are some problems with the data:

  1. You are not using the default intrinsic matrix -- you need to modify path_intrinsic and point it to your specific intrinsic config file. The config seems correct, but somehow the fragments look distorted...
  2. More than 50% of the scene is glass -- this is very challenging for RGB-D sensors. Your sensor will either return invalid values or doubled distances (caused by mirror reflections).
zainmehdi commented 4 years ago

Thanks a lot. I will look into it. I tried with an Azure Kinect and the results were better.

devernay commented 4 years ago

0.11.0 has this same memory leak issue with the reconstruction_system tutorial (using the data from 016.zip), so I don't think the data is at fault here. It fails on a machine with 256 GB of RAM (keeping the default config parameters). open3d.pipelines.integration.ScalableTSDFVolume.integrate seems to be responsible for this leak, as suggested by the crash report above, which is similar to the one I got, with a more explicit error message (I checked using htop, and it fills up the 256 GB plus swap pretty quickly):

...
Fragment 002 / 013 :: integrate rgbd frame 287 (88 of 100).
Traceback (most recent call last):
  File "run_system.py", line 68, in <module>
    make_fragments.run(config)
  File ".../Open3D/examples/python/reconstruction_system/make_fragments.py", line 182, in run
    Parallel(n_jobs=MAX_THREAD)(delayed(process_single_fragment)(
  File ".../lib/python3.8/site-packages/joblib/parallel.py", line 1061, in __call__
    self.retrieve()
  File ".../lib/python3.8/site-packages/joblib/parallel.py", line 940, in retrieve
    self._output.extend(job.get(timeout=self.timeout))
  File ".../lib/python3.8/site-packages/joblib/_parallel_backends.py", line 542, in wrap_future_result
    return future.result(timeout=timeout)
  File ".../lib/python3.8/concurrent/futures/_base.py", line 439, in result
    return self.__get_result()
  File ".../lib/python3.8/concurrent/futures/_base.py", line 388, in __get_result
    raise self._exception
joblib.externals.loky.process_executor.TerminatedWorkerError: A worker process managed by the executor was unexpectedly terminated. This could be caused by a segmentation fault while calling the function or by an excessive memory usage causing the Operating System to kill the worker.

The exit codes of the workers are {SIGKILL(-9)}

Tested with 0.10.0: works fine. Anyone encountering this problem should downgrade Open3D for now.

nanotuxi commented 4 years ago

Can confirm this behaviour. I started experimenting with downgrading Open3D to 0.10 and running the reconstruction on the 0.11.1 codebase. The behaviour is this: running the same reconstruction with images from a D435i
  • runs fine with v0.10 and multithreading activated in the config,
  • shows memory-leak behaviour on a 16 GB machine with 4 GB of swap with 0.11.1,
  • runs fine with multithreading set to false in the config file with 0.11.1.

This only works with python run_system.py config/realsense.json --make, as TransformationEstimationForColoredICP is new in 0.11. So this bug seems to have been re-introduced with this PR? [#2497]

nanotuxi commented 4 years ago

This solution does not work, so I reverted to Open3D v0.10 as @devernay suggested. ~~So the solution for me is to include the patch from PR #2562 and set "python_multi_threading": false in config/realsense.json. After that the reconstruction pipeline works ...fast ;-) Thanks to @nachovizzo and @stachnis for #2497.~~

nanotuxi commented 4 years ago

@nachovizzo Is it possible to reduce the amount of memory used for refinement? Running this on a recorded set of images captured with a D435i, the machine uses nearly all of the physical memory (16 GB) and an additional 4-5 GB of swap space. I did not try (and maybe never will) v0.11.x on a Jetson Nano, as this wouldn't work. So can we tweak settings to reduce the amount of memory and make this work on an edge device like the Jetson?

nachovizzo commented 4 years ago

@nanotuxi I guess you got the wrong person :) I haven't been working on the reconstruction pipeline for Open3D. The PR you mention (#2497) is just an extension of the robust kernels we've added to the Open3D registration pipeline, not the reconstruction. All the information about robust kernels can be found at http://www.open3d.org/docs/release/tutorial/pipelines/robust_kernels.html. But I guess this won't help you!

Best luck!

devernay commented 4 years ago

@theNded can you confirm that you observe the same regression between 0.10 and 0.11 on the tutorial data? Any idea where the leak may come from?

devernay commented 4 years ago

Just FYI, this is not a memory leak, as everything is freed correctly on exit, but valgrind --tool=massif shows huge vector allocations (2.5 GB for 10 640x480 depth frames) by the constructor of UniformTSDFVolume, which is called from ScalableTSDFVolume::OpenVolumeUnit() and probably not freed until the ScalableTSDFVolume is deleted.

nanotuxi commented 4 years ago

Yes. It seems that the problem appears when the RGBD images are loaded completely into memory, which makes refining a scene unusable at the moment (for me :-). Maybe related to #2372? And maybe this diff?

devernay commented 4 years ago

Here's a sample run with 200 frames, grouped into 10 fragments of 20 frames. You can see that a fragment takes almost 8 GB while being built. I observed 2.5 GB with 10 frames, so I guess it may be roughly quadratic in the number of frames per fragment, which explains why it crashes at about 88 frames on a 256 GB machine. (screenshots attached)

theNded commented 4 years ago

Apologies for the late reply. I have reproduced the problem and will be actively working on it. Please stay tuned.

theNded commented 4 years ago

I can observe that the pose graph of a fragment is irregular (see screenshot below). This causes the ScalableTSDFVolume to continuously activate new blocks in unobserved 3D space (an expected behavior), which causes the memory explosion. So this problem converges to the initial issue -- when the camera poses are wrong, the memory gets consumed. (screenshot: ScreenCapture_2020-11-03-00-25-34)
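A back-of-the-envelope sketch of why runaway block activation exhausts memory (the per-voxel layout and the counts below are assumptions for illustration, not values measured from Open3D):

# Each activated volume unit allocates a dense grid of voxels.
volume_unit_resolution = 16           # voxels per side of one unit (assumed)
bytes_per_voxel = 4 + 4 + 3 * 4       # tsdf + weight + RGB floats (assumed)
unit_bytes = volume_unit_resolution ** 3 * bytes_per_voxel

# With correct poses a room touches a bounded set of units; with wrong
# poses every misplaced frame activates fresh units in new space.
activated_units = 500_000             # hypothetical runaway count
print(f"{activated_units * unit_bytes / 1e9:.1f} GB")  # ~41 GB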

Now the question is why the RGBD odometry failed, producing a wrong pose graph. #2497 touched this part, so it is possible that the problem was introduced in this PR. I'm investigating this.

Also, this is a good warning. I will put it on my task list to add an exit flag to the reconstruction system: when poses are wrong and too much memory is allocated for TSDF volumes, we exit and post a message telling the user that there might be an issue with the data and/or odometry/registration.

nanotuxi commented 4 years ago

@theNded Thanks.

theNded commented 4 years ago

Update: I reverted to d72d2a7, the commit before #2497, and the problem is still there. So @nachovizzo is innocent :-)

This is the incorrect RGBD odometry result for two adjacent frames. Apparently the transformation is supposed to be close to the identity. (screenshot: ScreenCapture_2020-11-03-00-52-33)

Update: the identified problem is with pybind; the C++ example works as expected.

nanotuxi commented 4 years ago

@theNded Thanks :-) But this was not a judgement; it was an attempt to find the reason for the problem.

theNded commented 4 years ago

Confirmed: this bug was introduced by #2497. While I am re-reviewing the code, it would be nice if @nachovizzo could help me check for potential problems. Any help from active users is greatly appreciated :-)

Here we estimate RGB-D odometry between frame 1 and frame 2 for the tutorial dataset.

Results for 30329b2d44ccea682f088b3bc8909e956ae02ac4 (after introducing #2497) ScreenCapture_2020-11-03-01-47-20

 0.655147 -0.746888 -0.113757  0.215351
 0.576672  0.591647 -0.563386   1.49306
  0.48809  0.303501  0.818325  -1.47575
        0         0         0         1

Incorrect results for both C++ and Python.

Results for d72d2a721d5d269bf93b430ed28661d8c6e5ecdd (before introducing #2497) ScreenCapture_2020-11-03-01-58-49

    0.999995  -0.00316241  0.000960094  -0.00110673
  0.00316085     0.999994   0.00161349    0.0002658
-0.000965191  -0.00161045     0.999998 -0.000378882
           0            0            0            1

Correct results for both C++ and Python.

Correct fragment reconstruction for the reconstruction_system snapshot00
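For anyone who wants to reproduce this two-frame check, here is a minimal sketch (the file paths and intrinsics are assumptions; it assumes Open3D >= 0.11). Adjacent frames should give a transformation close to the identity:

import numpy as np
import open3d as o3d

intrinsic = o3d.camera.PinholeCameraIntrinsic(
    o3d.camera.PinholeCameraIntrinsicParameters.PrimeSenseDefault)

def read_rgbd(i):
    # Assumed file layout; point these at two adjacent dataset frames.
    color = o3d.io.read_image(f"color/{i:06d}.jpg")
    depth = o3d.io.read_image(f"depth/{i:06d}.png")
    return o3d.geometry.RGBDImage.create_from_color_and_depth(
        color, depth, convert_rgb_to_intensity=True)

ok, trans, info = o3d.pipelines.odometry.compute_rgbd_odometry(
    read_rgbd(1), read_rgbd(2), intrinsic, np.identity(4),
    o3d.pipelines.odometry.RGBDOdometryJacobianFromHybridTerm(),
    o3d.pipelines.odometry.OdometryOption())
print(ok)
print(trans)  # should be near np.identity(4) for adjacent frames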

theNded commented 4 years ago

Bug found: r and w are flipped in this line: https://github.com/intel-isl/Open3D/blob/master/cpp/open3d/pipelines/odometry/Odometry.cpp#L436

nachovizzo commented 4 years ago

Thanks @theNded, good catch! And apologies for the flip! We could prevent this in the future by creating types for the weights and residuals, although I'm not a big fan of that.

nanotuxi commented 4 years ago

Can confirm that flipping r and w works with multithreading activated in the config. Thanks ;-) ...and for me the first fix in #2567 only works if I typecast with if str(config["python_multi_threading"]).lower() == "true": ... but the amount of memory used when processing the whole pipeline still reaches about 7-8 GB temporarily. This is still too much for edge devices like a Jetson Nano running from an SD card, as this would corrupt the card sooner or later. Not everyone is running the OS from a USB device. Maybe that could be improved in the future?
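Regarding the typecast, a small hedged helper (as_bool is hypothetical, not part of the repo) that accepts either a JSON boolean or a legacy "true"/"false" string:

def as_bool(value):
    # Accept real booleans as well as legacy string values.
    if isinstance(value, bool):
        return value
    return str(value).strip().lower() in ("true", "1", "yes")

# Usage sketch:
# if as_bool(config["python_multi_threading"]):
#     ...  # take the joblib-parallel path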

nanotuxi commented 4 years ago

@nachovizzo Glad I could help with finding a bug :-)

theNded commented 4 years ago

Thanks everyone for identifying the bug! Please try make install-pip-package from the master branch now to check the fix. It is also recommended to run --make, --register, --refine, and --integrate as separate passes, since joblib is not fast at releasing the memory pool.

This issue will be kept open, as we have received many constructive suggestions and analyses in this thread. Related problems can be posted here for us to monitor and improve the system.

@nanotuxi this line has been changed to config["python_multi_threading"] == True. It was a legacy issue from when the config file stored strings instead of booleans.

devernay commented 4 years ago

Thanks @theNded! I will test and report back later today.

devernay commented 4 years ago

I also think maybe all the tutorials should be run as an integration test before any release

nanotuxi commented 4 years ago

@theNded Yes. Thanks for fixing it. @devernay Good idea.

theNded commented 4 years ago

I also think maybe all the tutorials should be run as an integration test before any release

Agreed. I take responsibility for this issue. In the next release, we will refactor the reconstruction system and provide a C++ counterpart. Please stay tuned.

devernay commented 4 years ago

Much nicer! (screenshot attached)

devernay commented 4 years ago

Final results look good too! @theNded, are you going to push a bugfix release with this change soon?

devernay commented 4 years ago

@theNded in commit https://github.com/intel-isl/Open3D/pull/2567/commits/18fb759f2b13247c1569c193b32c267bd9bb271e from https://github.com/intel-isl/Open3D/pull/2567 you forgot to change the default value for python_multi_threading here: https://github.com/intel-isl/Open3D/blob/master/examples/python/reconstruction_system/initialize_config.py#L28

theNded commented 4 years ago

@theNded in commit 18fb759 from #2567 you forgot to change the default value for python_multi_threading here: https://github.com/intel-isl/Open3D/blob/master/examples/python/reconstruction_system/initialize_config.py#L28

Thanks, it is fixed. We are working on building a new wheel; hopefully a new release will be available as 0.11.2 soon.

theNded commented 4 years ago

FYI, another potential memory leak mentioned in #1787 can be circumvented by simply creating a new Python environment. Please let us know if you can still reproduce that problem in a new Python environment.

nanotuxi commented 4 years ago

Agreed. I take responsibility for this issue.

No need for that. Everybody is happy you found this (not easy to find) bug so fast. Thanks for the hints concerning all the other little bugs, too.

nachovizzo commented 4 years ago

Agreed. I take responsibility for this issue.

No need for that. Everybody is happy you found this (not easy to find) bug so fast. Thanks for the hints concerning all the other little bugs, too.

I agree with @nanotuxi. There is no need to take any "responsibility". We are not operating on a human heart; don't stress yourself that much! Thanks for the great help and support!

tan-may16 commented 1 year ago

Hi, I am facing the same issue right now. The fragments are created only if the depth is less than 3 m; otherwise the reconstruction is killed during integration. Can anyone please help me resolve the issue?

Thanks