norlab-ulaval / libnabo

A fast K Nearest Neighbor library for low-dimensional spaces
http://norlab-ulaval.github.io/libnabo/
BSD 3-Clause "New" or "Revised" License
440 stars 144 forks source link

Problems since latest commits #81

Closed bochen87 closed 6 years ago

bochen87 commented 6 years ago

Since the latest commits we're having issues with ICP_Mapping in ROS. When reverting back to commit 243a708e84969b8611200dfd68af7f2cd4a49a0b (HEAD~3 currently) it works fine again.

I'm suspecting it has something to do with the commits in 34e2cc0fba58c15f64c24aa3a7c4184235fba1b9

See GDB stacktrace here https://paste.ee/p/var2y

HannesSommer commented 6 years ago

Thanks for this issue. Maybe there is more code dependent on the old invalid index value than we thought. Could you please try with #82 (branch : fix/revertInvalidIndexToZero) ? This way we could be sure that it is indeed the new invalid index value causing your issue.

bochen87 commented 6 years ago

Really appreciate the fast reply! Will test out first thing tomorrow morning.

magehrig commented 6 years ago

Does this issue appear consistently?

bochen87 commented 6 years ago

Yes, appears consistently. Seems to appear when fast moving dynamic objects appear.

simonlynen commented 6 years ago

@bochen87 When you evaluate the results of libnabo, it's important to use distance(i, j) != std::numeric_limits<float>::infinity to check if you have a valid result. It seems you aren't doing this and thus use invalid results in your code.

bochen87 commented 6 years ago

@simonlynen I'm using the dynamic_mapper.cpp from https://github.com/ethz-asl/ethzasl_icp_mapping/tree/reintegrate/master_into_indigo_devel The GDB stack trace suggests that this code causes the problem: https://github.com/ethz-asl/ethzasl_icp_mapping/blob/ec96db832be5ad4cf95e2551ef33797903a39bbb/ethzasl_icp_mapper/src/dynamic_mapper.cpp#L847

could it be a fix to check whether

tmp_map.features.rows() > 1

simonlynen commented 6 years ago

Hi @bochen87 yes, from the code it makes sense that this has issues when you don't get readings from the sensors.

For instance here the check if the returned result is valid (i.e. non infinity) is missing; i.e. the code just always uses the results even if there are no points or none of the points falls within max-radius.

magehrig commented 6 years ago

@simonlynen According to the stack trace posted by @bochen87 , the crash happens when executing create and more specifically, when a Node is push_backed inside buildNodes. Nonetheless, I agree with you that icp_mapper code should check for invalid results anyway. However, it seems that the crash is not related to the actual query of the kd-tree but building the tree itself, which indicates that these are separate issues.

@bochen87 An outside check is not needed as it is checked internally. Unfortunately, I cannot reproduce this bug with the provided tests. Would it be possible for you to provide input data that causes the crash?

bochen87 commented 6 years ago

@magehrig unfortunately, input data is pretty big - we're using velodyne VLP-16. I can try and record something maybe tonight or tomorrow. The crash happens pretty consistently, usually with faster moving objects in front of the vehicle, i.e. humans moving into closer range quickly or bicycles. If you have a setup with a velodyne, it should be easy to reproduce.

magehrig commented 6 years ago

@bochen87 I do not have the resources to record my own dataset. We only need the Eigen Matrix that causes the crash. You could use the ofstream class of the fstream header for a quick, although inefficient (not sure if feasible in your case) logging of the matrix:

{
  std::ofstream file("tmp-map-features.txt");
  if (file.is_open())
    file << tmp_map.features;
}

If you put this in front of the create function, you should be able to retrieve the Eigen Matrix.

If you or somebody else has a better approach for debugging this, let me know.

bochen87 commented 6 years ago

Ok, thanks for providing this more efficient method! I'll try to record something tonight or tomorrow.

magehrig commented 6 years ago

Maybe this line could be the source of the crash as well.

HannesSommer commented 6 years ago

@magehrig , I agree, that could be the other position to call into NNS::create.

@bochen87 , if you do the instrumentation with the matrix dump, please do it at both positions.

If the suggested code happens to be too slow, you could use this answer's first function for a binary write : https://stackoverflow.com/a/25389481 (there is no need to put it in the Eigen namespace. Any other should be fine). It should be much faster. (I assume you are working on x86-64 architecture? Then the binary data should be as easy for us to use.)

Thanks for helping hunting the bug!

bochen87 commented 6 years ago

Created the files that retrieves the Eigen Matrix, please download here: https://ufile.io/i8mqa

Will do a new one tomorrow for the other section

bochen87 commented 6 years ago

Yes, I'm working on x86-64 architecture. Two files in binary named after each of the variables. https://drive.google.com/drive/folders/1H1EaBVcUSEGT4iWR2ggMiNjPTvlHhJEJ?usp=sharing

HannesSommer commented 6 years ago

Thanks a lot! I'll have a look.

HannesSommer commented 6 years ago

Reading the matrices worked well. But unfortunately I could not reproduce the issue :(.

@bochen87, are you really writing the matrix directly before the create call? If indeed we might need to start thinking of exotic issues.

bochen87 commented 6 years ago

I've uploaded the CPP file also into the drive, see line 697 and 857

Basically, if I want to reproduce the problem, I start the robot, give it a goal and jump in front of it suddenly (being a dynamic object).

magehrig commented 6 years ago

I just looked at the stack trace again and GDB is signaled from thread 1 (line 210). Unless I am mistaken, we have looked at the wrong part of the code. Libnabo does not appear to be the direct source of this issue because the segfault appears somewhere in libpointmatcher.

HannesSommer commented 6 years ago

Awesome @magehrig ! That's it. Interesting how we all got fooled by the stack traces. Makes much more sense. It probably crashes in here : https://github.com/ethz-asl/libpointmatcher/blob/75044815d40ff934fe0bb7e05ed8bbf18c06493b/pointmatcher/OutlierFiltersImpl.cpp#L253 in case the idRef is now -1 instead of 0.

We should probably "move" this issue over to libpointmatcher .

bochen87 commented 6 years ago

@magehrig @HannesSommer gosh! 👍 I guess that also explains the behavior happening with "fast moving objects"