Closed zainmehdi closed 3 years ago
@theNded can you have a look at this please.
Ok, so there is a definite memory leak somewhere. I have a Core i7 at 2.6 GHz with 16 GB of RAM, and the process consumes all the memory; CPU usage is very high as well. Here is the screenshot.
There are some known issues with Python multithreading regarding memory usage. Can you try turning it off in the config file and running again? Another potential cause is a failure of global registration: the log shows 19 nodes and 18 edges, which means no valid loop closure was detected. This can cause incorrect integration into an unreasonably large space.
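The "19 nodes and 18 edges" observation can be read directly off the pose graph: a fragment with n nodes needs n-1 sequential odometry edges, so any extra edges are loop closures. Here is a minimal pure-Python sketch of that diagnostic; the `(source, target)` pair representation is an illustration (Open3D's `PoseGraphEdge` stores these as `source_node_id`/`target_node_id`):

```python
# Pure-Python sketch: split pose graph edges into sequential odometry edges
# and loop-closure edges. The (source, target) pair representation here is
# illustrative; Open3D's PoseGraphEdge uses source_node_id/target_node_id.

def count_edge_types(edges):
    odometry = [e for e in edges if e[1] == e[0] + 1]
    loops = [e for e in edges if e[1] != e[0] + 1]
    return len(odometry), len(loops)

# 19 nodes joined only by a chain of 18 odometry edges -> no loop closures,
# matching the situation described above.
chain = [(i, i + 1) for i in range(18)]
print(count_edge_types(chain))  # (18, 0)
```

If the second count is zero, global registration found no loop closures and pose-graph optimization has nothing to correct drift with.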
@theNded Thanks for the reply. I tried setting the multithreading option to false, but it still causes a segmentation fault if the voxel size is smaller than 5 cm or the depth range exceeds 3 m.
Do 5 cm and 3 m work? In fact, the voxel size in the config file is a little confusing -- the real voxel size used during integration is tsdf_cubic_size / 512 (see https://github.com/intel-isl/Open3D/blob/master/examples/python/ReconstructionSystem/integrate_scene.py#L24). So it would be good to check the result of this default configuration.
Another suggestion is to check the reconstruction of the fragments in the fragments folder using https://github.com/intel-isl/Open3D/blob/master/examples/python/ReconstructionSystem/debug/visualize_fragments.py.
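For reference, the linked integrate_scene.py derives the integration voxel length by dividing the TSDF cube size by 512, so with the tsdf_cubic_size of 3.0 from the config posted later in this thread the voxels are millimetre-scale. A quick sketch of the arithmetic:

```python
# The effective TSDF voxel length during integration, following the linked
# integrate_scene.py (voxel_length = tsdf_cubic_size / 512.0). The 3.0 value
# is taken from the config posted in this thread.
tsdf_cubic_size = 3.0  # metres
voxel_length = tsdf_cubic_size / 512.0
print(f"{voxel_length * 1000:.2f} mm")  # 5.86 mm
```

The `voxel_size` entry in the config, by contrast, controls point cloud downsampling during registration, not the integration resolution.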
Yes, 5 cm and 3 m work, and I checked the results of the fragments; it produces PLY files. Here are a few more interesting observations.
- If I run the fragment-making step with multithreading, it crashes. Running it without multithreading works with the above-mentioned voxel and depth values; otherwise it fails even in single-threaded mode.
For multithreading you may want to tune MAX_THREAD here: https://github.com/intel-isl/Open3D/blob/master/examples/python/ReconstructionSystem/make_fragments.py#L181. That would make sense if the resolution is very high; otherwise it would be interesting to look at the log.
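One way to cap the worker count, loosely following what make_fragments.py does around the linked line; the exact formula in the repository may differ, so treat this as an illustrative sketch:

```python
import multiprocessing

def choose_max_thread(n_fragments, reserve=1):
    """Cap parallel fragment jobs: never more workers than fragments, and
    leave `reserve` cores free for the rest of the system. Illustrative
    helper, not the exact expression used in make_fragments.py."""
    cores = max(1, multiprocessing.cpu_count() - reserve)
    return max(1, min(cores, n_fragments))

print(choose_max_thread(n_fragments=13))
```

Each worker holds its own RGBD frames and TSDF volume in memory, so lowering the cap trades speed for a smaller peak footprint.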
- The registration part doesn't work with a single thread, no matter what the voxel and depth values are. If I change the config to multithreading, it works.
This is weird; can you post the log here?
Here is the log for the second case. For the first case I will do more tests and share some logs with you.

Configuration:
name : Captured frames using Realsense
path_dataset : dataset/realsense/
path_intrinsic : dataset/realsense/camera_intrinsic.json
max_depth : 3.0
voxel_size : 0.1
max_depth_diff : 0.07
preference_loop_closure_odometry : 0.1
preference_loop_closure_registration : 5.0
tsdf_cubic_size : 3.0
icp_method : color
global_registration : ransac
python_multi_threading : False
depth_map_type : redwood
n_frames_per_fragment : 100
n_keyframes_per_n_frame : 5
min_depth : 0.3
folder_fragment : fragments/
template_fragment_posegraph : fragments/fragment_%03d.json
template_fragment_posegraph_optimized : fragments/fragment_optimized_%03d.json
template_fragment_pointcloud : fragments/fragment_%03d.ply
folder_scene : scene/
template_global_posegraph : scene/global_registration.json
template_global_posegraph_optimized : scene/global_registration_optimized.json
template_refined_posegraph : scene/refined_registration.json
template_refined_posegraph_optimized : scene/refined_registration_optimized.json
template_global_mesh : scene/integrated.ply
template_global_traj : scene/trajectory.log
debug_mode : False
Filament library loaded.
register fragments.
reading dataset/realsense/fragments/fragment_000.ply ...
Segmentation fault (core dumped)
It would be nice if you can share your data somewhere, I think I have to take a look closely...
Sorry for coming back late. There are some problems with the data: you need to set path_intrinsic and point it to your specific intrinsic config file.

Thanks a lot, I will look into it. I tried with an Azure Kinect and the results were better.
0.11.0 has this same memory-leak issue with the reconstruction_system tutorial (using the data from 016.zip), so I don't think the data is faulty here.
It fails on a machine with 256 GB of RAM (keeping the default config parameters). open3d.pipelines.integration.ScalableTSDFVolume.integrate seems to be responsible for this leak, as suggested by the crash report above, which is similar to the one I got, with a more explicit error message (I checked using htop; it fills up the 256 GB plus swap pretty quickly):
...
Fragment 002 / 013 :: integrate rgbd frame 287 (88 of 100).
Traceback (most recent call last):
File "run_system.py", line 68, in <module>
make_fragments.run(config)
File ".../Open3D/examples/python/reconstruction_system/make_fragments.py", line 182, in run
Parallel(n_jobs=MAX_THREAD)(delayed(process_single_fragment)(
File ".../lib/python3.8/site-packages/joblib/parallel.py", line 1061, in __call__
self.retrieve()
File ".../lib/python3.8/site-packages/joblib/parallel.py", line 940, in retrieve
self._output.extend(job.get(timeout=self.timeout))
File ".../lib/python3.8/site-packages/joblib/_parallel_backends.py", line 542, in wrap_future_result
return future.result(timeout=timeout)
File ".../lib/python3.8/concurrent/futures/_base.py", line 439, in result
return self.__get_result()
File ".../lib/python3.8/concurrent/futures/_base.py", line 388, in __get_result
raise self._exception
joblib.externals.loky.process_executor.TerminatedWorkerError: A worker process managed by the executor was unexpectedly terminated. This could be caused by a segmentation fault while calling the function or by an excessive memory usage causing the Operating System to kill the worker.
The exit codes of the workers are {SIGKILL(-9)}
Tested with 0.10.0: works fine. Anyone encountering this problem should downgrade Open3D for now.
Can confirm this behaviour. I started experimenting by downgrading Open3D to 0.10 and running reconstruction on the 0.11.1 codebase. Running the same reconstruction with images from a D435i:
- runs fine with v0.10 and multithreading activated in the config;
- shows memory-leak behaviour on a 16 GB machine with 4 GB swap with 0.11.1;
- runs fine with multithreading set to false in the config file with 0.11.1.
This only works with python run_system.py config/realsense.json --make, as TransformationEstimationForColoredICP is new in 0.11.
So this bug seems to have been re-introduced with this PR? #2497
This solution does not work. So I reverted to open3d v0.10 as @devernay suggested.
~~So the solution for me was to include the patch from PR #2562 and set "python_multi_threading": false in config/realsense.json. After that, the reconstruction pipeline works ... fast! ;-) Thanks to @nachovizzo and @stachnis for #2497.~~
@nachovizzo Is it possible to reduce the amount of memory used for refinement? Running this on a recorded set of images captured with a D435i, the machine uses nearly all of the physical memory (16 GB) plus an additional 4-5 GB of swap space. I did not try (and maybe never will) v0.11.x on a Jetson Nano, as that wouldn't work. Can we tweak settings to reduce the amount of memory, so this works on an edge device like the Jetson?
@nanotuxi I guess you got the wrong person :) I haven't been working on the reconstruction pipeline for Open3D. The PR you mention (#2497) is just an extension of the robust kernels we added to the Open3D registration pipeline, not the reconstruction pipeline. All the information about robust kernels can be found at http://www.open3d.org/docs/release/tutorial/pipelines/robust_kernels.html, but I guess this won't help you!
Best of luck!
@theNded can you confirm that you observe the same regression between 0.10 and 0.11 on the tutorial data? Any idea where the leak may come from?
Just FYI, this is not a memory leak, as everything is freed correctly on exit, but valgrind --tool=massif shows there are huge allocations of vectors.
Yes. It seems the problem appears because the RGBD images are being loaded completely into memory, which makes refining a scene unusable at the moment (for me :-) ). Maybe related to #2372? And maybe this diff?
Here's a sample run with 200 frames, grouped into 10 fragments of 20 frames. You can see that a fragment takes almost 8 GB while being built. I observed 2.5 GB with 10 frames, so I guess memory use may be roughly quadratic in the number of frames per fragment, which would explain why it crashes at about 88 frames on a 256 GB machine.
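A back-of-the-envelope check of the quadratic hypothesis, using only the figures quoted above (2.5 GB at 10 frames, ~8 GB at 20 frames, a crash near frame 88 on a 256 GB machine):

```python
# Back-of-the-envelope check of the "roughly quadratic" hypothesis, scaling
# from the 2.5 GB observed at 10 frames per fragment.
base_gb, base_frames = 2.5, 10

def projected_gb(frames):
    return base_gb * (frames / base_frames) ** 2

print(round(projected_gb(20), 1))  # 10.0 -- close to the ~8 GB observed
print(round(projected_gb(88), 1))  # 193.6 -- in the right range for 256 GB
```

The projection is rough, but it lands in the right order of magnitude for both observations, which is consistent with the quadratic guess.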
Apologies for the late reply. I have reproduced the problem and will be actively working on it. Please stay tuned.
I can observe that the pose graph of a fragment is irregular (see below). This causes the ScalableTSDFVolume to continuously activate new blocks in unobserved 3D space (an expected behavior), which leads to the memory explosion. So this problem reduces to the initial issue: when the camera poses are wrong, memory gets consumed.
Now the issue is why RGBD odometry failed, producing a wrong pose graph. #2497 touched this part, so it is possible the problem was introduced in that PR. I'm investigating this.
Also, this is a good warning. I will put it on my task list to add an exit flag to the reconstruction system: when poses are wrong and too much memory is allocated for TSDF volumes, we exit and post a message telling the user that there might be an issue with the data and/or the odometry/registration.
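Such an exit flag could be sketched with only the Python standard library, along these lines. The function name and threshold are hypothetical, not Open3D API, and the `resource` module is Unix-only:

```python
import resource
import sys

# Hypothetical sketch of the proposed exit flag: bail out of integration once
# the process has grown past a memory budget, which usually hints at bad
# camera poses. Names and threshold are illustrative, not Open3D API.
MEMORY_BUDGET_MB = 8192

def check_memory_budget(budget_mb=MEMORY_BUDGET_MB):
    """Return current peak RSS in MB, exiting if the budget is exceeded."""
    # ru_maxrss is reported in kilobytes on Linux (bytes on macOS).
    used_mb = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024
    if used_mb > budget_mb:
        sys.exit(f"TSDF integration used {used_mb:.0f} MB (> {budget_mb} MB); "
                 "check the input data and the odometry/registration results.")
    return used_mb

# A caller could invoke this between frames, e.g. after each volume.integrate().
print(check_memory_budget() >= 0)  # True while under budget
```

Checking between frames would turn a silent OOM kill into an actionable error message.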
@theNded Thanks.
Update: I reverted to d72d2a7, the commit before #2497, and the problem is still there. So @nachovizzo is innocent :-)
This is the incorrect RGBD odometry result for two adjacent frames. The transformation should obviously be close to the identity.
Update: the identified problem is with pybind; the C++ example works as expected.
@theNded Thanks :-) But this was not a judgement; it was an attempt to find the reason for the problem.
Confirmed this bug was introduced by #2497. While I re-review the code, it would be nice if @nachovizzo could help me check for potential problems. Any help from active users is greatly appreciated :-)
Here we estimate RGB-D odometry between frame 1 and frame 2 of the tutorial dataset.
Results for 30329b2d44ccea682f088b3bc8909e956ae02ac4 (after introducing #2497)
0.655147 -0.746888 -0.113757 0.215351
0.576672 0.591647 -0.563386 1.49306
0.48809 0.303501 0.818325 -1.47575
0 0 0 1
Incorrect results for both C++ and Python.
Results for d72d2a721d5d269bf93b430ed28661d8c6e5ecdd (before introducing #2497)
0.999995 -0.00316241 0.000960094 -0.00110673
0.00316085 0.999994 0.00161349 0.0002658
-0.000965191 -0.00161045 0.999998 -0.000378882
0 0 0 1
Correct results for both C++ and Python.
Correct fragment reconstruction for the reconstruction_system
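The "should be close to the identity" check is easy to automate. Here is a small dependency-free sketch using the two matrices reported above (the deviation measure is the Frobenius norm of T - I):

```python
# Sanity check: RGBD odometry between adjacent frames should be close to the
# identity. Matrices copied from the results above; deviation is the
# Frobenius norm of T - I (pure Python, no dependencies).

def deviation_from_identity(T):
    return sum((T[i][j] - (1.0 if i == j else 0.0)) ** 2
               for i in range(4) for j in range(4)) ** 0.5

T_after_2497 = [[0.655147, -0.746888, -0.113757, 0.215351],
                [0.576672, 0.591647, -0.563386, 1.49306],
                [0.48809, 0.303501, 0.818325, -1.47575],
                [0, 0, 0, 1]]
T_before_2497 = [[0.999995, -0.00316241, 0.000960094, -0.00110673],
                 [0.00316085, 0.999994, 0.00161349, 0.0002658],
                 [-0.000965191, -0.00161045, 0.999998, -0.000378882],
                 [0, 0, 0, 1]]

print(deviation_from_identity(T_after_2497) > 0.1)   # True: clearly wrong
print(deviation_from_identity(T_before_2497) < 0.1)  # True: near identity
```

A threshold like 0.1 on adjacent-frame odometry would have flagged the regression automatically, e.g. in an integration test.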
Bug found: r and w are flipped in this line: https://github.com/intel-isl/Open3D/blob/master/cpp/open3d/pipelines/odometry/Odometry.cpp#L436
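A toy 1-D Gauss-Newton example of why flipping the residual r and the robust weight w matters: the swap changes the normal-equation term sum(w·J·J), so the update step comes out wrong. This is an illustrative sketch with made-up numbers, not the actual Open3D odometry code:

```python
# Toy 1-D Gauss-Newton step: delta = (sum w*J*J)^-1 * (sum w*r*J).
# Swapping r and w leaves sum(w*r*J) unchanged but corrupts sum(w*J*J),
# producing a different (wrong) update. Values are made up for illustration.
jacobians = [1.0, 2.0, 0.5]
residuals = [0.1, 0.4, -0.2]
weights = [1.0, 0.25, 1.0]  # e.g. robust-kernel weights downweighting outliers

def gn_step(J, r, w):
    JtJ = sum(wi * ji * ji for wi, ji in zip(w, J))
    Jtr = sum(wi * ri * ji for wi, ri, ji in zip(w, r, J))
    return Jtr / JtJ

correct = gn_step(jacobians, residuals, weights)
flipped = gn_step(jacobians, r=weights, w=residuals)  # r and w swapped
print(abs(correct - flipped) > 1e-6)  # True: the flip changes the update
```

A consistently wrong step at every iteration explains the badly diverged odometry matrices posted above.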
Thanks @theNded, good catch! And apologies for the flip! We could prevent this in the future by creating distinct types for the weights and residuals, although I'm not a big fan of that approach.
Can confirm that flipping r and w back works with multithreading activated in the config. Thanks ;-)
...and for me the first fix in #2567 only works if I typecast with

if str(config["python_multi_threading"]).lower() == "true":

... but the amount of memory used while processing the whole pipeline still temporarily reaches about 7-8 GB. This is still too much for edge devices like a Jetson Nano running from an SD card, as this would corrupt the card sooner or later. Not everyone runs the OS from a USB device. Maybe that could be improved in the future?
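A tolerant reader for such config values could look like the sketch below; the accepted string spellings are an assumption for illustration, not Open3D behaviour:

```python
# Sketch: accept config values stored either as JSON booleans or as legacy
# strings such as "True"/"true". The accepted spellings are an assumption.
def as_bool(value):
    if isinstance(value, bool):
        return value
    if isinstance(value, str):
        return value.strip().lower() in ("true", "1", "yes")
    return bool(value)

for v in (True, False, "True", "false", "0"):
    print(repr(v), "->", as_bool(v))
```

Normalizing at load time keeps comparisons like `config["python_multi_threading"] == True` from silently succeeding on the non-empty string "False".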
@nachovizzo Glad I could help with finding a bug :-)
Thanks everyone for identifying the bug! Please try make install-pip-package from the master branch now to check the fix. It is also recommended to run --make, --register, --refine, and --integrate in separate passes, since joblib is slow at releasing its memory pool.
This issue will be kept open, as we have received many constructive suggestions and analyses in this thread. Related problems can be posted here for us to monitor and improve the system.
@nanotuxi this line has been changed to config["python_multi_threading"] == True. It was a legacy issue from when the config file stored strings instead of booleans.
Thanks @theNded! I will test and report back later today.
I also think maybe all the tutorials should be run as an integration test before any release
@theNded Yes. Thanks for fixing it. @devernay Good idea.
I also think maybe all the tutorials should be run as an integration test before any release
Agreed. I take responsibility for this issue. In the next release, we will refactor the reconstruction system and provide a C++ counterpart. Please stay tuned.
Much nicer!
Final results looks good too! @theNded are you going to push a bugfix release with this change soon?
@theNded in commit https://github.com/intel-isl/Open3D/pull/2567/commits/18fb759f2b13247c1569c193b32c267bd9bb271e from https://github.com/intel-isl/Open3D/pull/2567 you forgot to change the default value for python_multi_threading here: https://github.com/intel-isl/Open3D/blob/master/examples/python/reconstruction_system/initialize_config.py#L28
Thanks, it is fixed. We are working on building a new wheel; hopefully a new release will be available as 0.11.2 soon.
FYI, another potential memory leak mentioned in #1787 can be circumvented by simply creating a new Python environment. Please let us know if you can still reproduce that problem in a new Python environment.
Agreed. I take responsibility for this issue.

No need for this. Everybody is happy you found this (not easy to find) bug so fast. Thanks for the hints concerning all the other little bugs, too.
I agree with @nanotuxi. There is no need to take any "responsibility". We are not operating on a human heart, so don't stress yourself that much! Thanks for the great help and support!
Hi, I am facing the same issue right now. The fragments are created only if the depth is less than 3 m; otherwise the reconstruction is killed during integration. Can anyone please help me resolve this?
Thanks
I am trying to run the reconstruction pipeline with the provided Python scripts. I recorded my data using a RealSense D435i. When I run the script, it is able to register the fragments, but just after this phase it gives me a segmentation fault.

To Reproduce: I used the sample dataset sequence 16 as provided in the tutorial; running the pipeline should give a segmentation fault. The screenshot is of my custom data, but the results were the same in both cases.

python run_system.py config/realsense.json --make --register --refine --integrate

Additional context: I am running it in the conda base environment.