UNIT_agent_simulation_builder_test failed in gcc but not with clang.

agalbachicar commented 3 years ago

Found the error in CI --> https://github.com/ToyotaResearchInstitute/dsim-repos-index/pull/169

The extract of the log follows:

build/delphyne/Testing/20210210-1401/Test.xml: 89 tests, 0 errors, 1 failure, 0 skipped
- UNIT_agent_simulation_builder_test
  <<< failure message
    -- run_test.py: invoking following command in '/__w/dsim-repos-index/dsim-repos-index/maliput_ws/src/delphyne':
     - /__w/dsim-repos-index/dsim-repos-index/maliput_ws/build/delphyne/test/regression/cpp/UNIT_agent_simulation_builder_test --gtest_output=xml:/__w/dsim-repos-index/dsim-repos-index/maliput_ws/build/delphyne/test_results/delphyne/UNIT_agent_simulation_builder_test.gtest.xml
    Running main() from /opt/ros/dashing/src/gtest_vendor/src/gtest_main.cc
    [==========] Running 14 tests from 1 test case.
    [----------] Global test environment set-up.
    [----------] 14 tests from AgentSimulationTest
    [ RUN      ] AgentSimulationTest.TestGetVisualScene
    [NON-XML-CHAR-0x1B][1;33mWarning [Model.cc:210][NON-XML-CHAR-0x1B][0m Non-unique names detected in XML children of model with name[prius_1].
    [NON-XML-CHAR-0x1B][1;33mWarning [Model.cc:261][NON-XML-CHAR-0x1B][0m Joint with name [body] in model with name [prius_1] has a name collision, changing joint name to [body_joint].
    [       OK ] AgentSimulationTest.TestGetVisualScene (167 ms)
    [ RUN      ] AgentSimulationTest.BasicTest
    [NON-XML-CHAR-0x1B][1;33mWarning [Model.cc:210][NON-XML-CHAR-0x1B][0m Non-unique names detected in XML children of model with name[prius_1].
    [NON-XML-CHAR-0x1B][1;33mWarning [Model.cc:261][NON-XML-CHAR-0x1B][0m Joint with name [body] in model with name [prius_1] has a name collision, changing joint name to [body_joint].
    [NON-XML-CHAR-0x1B][1;33mWarning [Model.cc:210][NON-XML-CHAR-0x1B][0m Non-unique names detected in XML children of model with name[prius_1].
    [NON-XML-CHAR-0x1B][1;33mWarning [Model.cc:261][NON-XML-CHAR-0x1B][0m Joint with name [body] in model with name [prius_1] has a name collision, changing joint name to [body_joint].
    [       OK ] AgentSimulationTest.BasicTest (157 ms)
    [ RUN      ] AgentSimulationTest.TestPriusSimpleCar
    [NON-XML-CHAR-0x1B][1;33mWarning [Model.cc:210][NON-XML-CHAR-0x1B][0m Non-unique names detected in XML children of model with name[prius_1].
    [NON-XML-CHAR-0x1B][1;33mWarning [Model.cc:261][NON-XML-CHAR-0x1B][0m Joint with name [body] in model with name [prius_1] has a name collision, changing joint name to [body_joint].
    [       OK ] AgentSimulationTest.TestPriusSimpleCar (186 ms)
    [ RUN      ] AgentSimulationTest.TestPriusUnicycleCar
    [NON-XML-CHAR-0x1B][1;33mWarning [Model.cc:210][NON-XML-CHAR-0x1B][0m Non-unique names detected in XML children of model with name[prius_1].
    [NON-XML-CHAR-0x1B][1;33mWarning [Model.cc:261][NON-XML-CHAR-0x1B][0m Joint with name [body] in model with name [prius_1] has a name collision, changing joint name to [body_joint].
    -- run_test.py: return code -11
    -- run_test.py: generate result file '/__w/dsim-repos-index/dsim-repos-index/maliput_ws/build/delphyne/test_results/delphyne/UNIT_agent_simulation_builder_test.gtest.xml' with failed test
    -- run_test.py: verify result file '/__w/dsim-repos-index/dsim-repos-index/maliput_ws/build/delphyne/test_results/delphyne/UNIT_agent_simulation_builder_test.gtest.xml'
  >>>
build/delphyne/test_results/delphyne/UNIT_agent_simulation_builder_test.gtest.xml: 1 test, 0 errors, 1 failure, 0 skipped
- delphyne UNIT_agent_simulation_builder_test.gtest.missing_result
  <<< failure message
    The test did not generate a result file:

    Running main() from /opt/ros/dashing/src/gtest_vendor/src/gtest_main.cc
    [==========] Running 14 tests from 1 test case.
    [----------] Global test environment set-up.
    [----------] 14 tests from AgentSimulationTest
    [ RUN      ] AgentSimulationTest.TestGetVisualScene
    ?[1;33mWarning [Model.cc:210]?[0m Non-unique names detected in XML children of model with name[prius_1].
    ?[1;33mWarning [Model.cc:261]?[0m Joint with name [body] in model with name [prius_1] has a name collision, changing joint name to [body_joint].
    [       OK ] AgentSimulationTest.TestGetVisualScene (167 ms)
    [ RUN      ] AgentSimulationTest.BasicTest
    ?[1;33mWarning [Model.cc:210]?[0m Non-unique names detected in XML children of model with name[prius_1].
    ?[1;33mWarning [Model.cc:261]?[0m Joint with name [body] in model with name [prius_1] has a name collision, changing joint name to [body_joint].
    ?[1;33mWarning [Model.cc:210]?[0m Non-unique names detected in XML children of model with name[prius_1].
    ?[1;33mWarning [Model.cc:261]?[0m Joint with name [body] in model with name [prius_1] has a name collision, changing joint name to [body_joint].
    [       OK ] AgentSimulationTest.BasicTest (157 ms)
    [ RUN      ] AgentSimulationTest.TestPriusSimpleCar
    ?[1;33mWarning [Model.cc:210]?[0m Non-unique names detected in XML children of model with name[prius_1].
    ?[1;33mWarning [Model.cc:261]?[0m Joint with name [body] in model with name [prius_1] has a name collision, changing joint name to [body_joint].
    [       OK ] AgentSimulationTest.TestPriusSimpleCar (186 ms)
    [ RUN      ] AgentSimulationTest.TestPriusUnicycleCar
    ?[1;33mWarning [Model.cc:210]?[0m Non-unique names detected in XML children of model with name[prius_1].
    ?[1;33mWarning [Model.cc:261]?[0m Joint with name [body] in model with name [prius_1] has a name collision, changing joint name to [body_joint].
  >>>

It is strange but the test passes with clang.

francocipollone commented 3 years ago

This test seems to be failing not because of the implementation of the tests in the agent_simulation_builder_test.cc file but because of some bad parsing in the message that prints the warnings.(it is printed by a third library )

I couldn't reproduce this error and at the moment we run out of minutes in CI so I couldn't get to the bottom of the issue however I believe that if we resolve those warnings this error won't be shown anymore and to solve that we only have to change one line in the simple_prius.sdf file. (Check change)

WDYT @agalbachicar if I add that minor modification and we check later on if it is needed to continue debugging or not?

agalbachicar commented 3 years ago

WDYT @agalbachicar if I add that minor modification and we check later on if it is needed to continue debugging or not?

The change seems good to me. I doubt that is the only reason why the test failed. Let's start with that.

francocipollone commented 3 years ago

WDYT @agalbachicar if I add that minor modification and we check later on if it is needed to continue debugging or not?

The change seems good to me. I doubt that is the only reason why the test failed. Let's start with that.

Yeah, for what the issue description says technically the return code of the run_test.py command is -11 which in unix mean SEGFAULT. At the moment I couldn't reproduce it locally. I am still trying though.

francocipollone commented 3 years ago

I was barely able to reproduce it, just a few times. (reducing the cpu usage of the container and doing a bit of magic).

I can tell two things so far:

The body of the test is correctly executed. I've added prints at the end of the test and there is no problems up to there.
run_test.py (from /opt/ros/dashing/share/ament_cmake_test/cmake): I added some prints there just to check if it is an issue located in this script when executing the test. But no, the return code -11 is returned by the test(gtest)

So I should investigate a bit and understand why the gtest tool is returning -11 when test is finished with no fails. An error when destroying the objects after the test ends?. Odd.

Extra: I've checked the test_result (build/delphyne/test_results/delphyne/UNIT_agent_simulation_builder_test.gtest.xml) after having the error and it shows:

<?xml version="1.0" encoding="UTF-8"?>
<testsuite name="delphyne" tests="1" failures="1" time="0" errors="0" skipped="0">
  <testcase classname="delphyne" name="UNIT_agent_simulation_builder_test.gtest.missing_result" time="0">
    <failure message="The test did not generate a result file:&#10;&#10;Running main() from /opt/ros/dashing/src/gtest_vendor/src/gtest_main.cc&#10;[==========] Running 5 tests from 1 test case.&#10;[----------] Global test environment set-up.&#10;[----------] 5 tests from AgentSimulationTest&#10;[ RUN      ] AgentSimulationTest.TestGetVisualScene&#10;[       OK ] AgentSimulationTest.TestGetVisualScene (799 ms)&#10;[ RUN      ] AgentSimulationTest.BasicTest&#10;[       OK ] AgentSimulationTest.BasicTest (1301 ms)&#10;[ RUN      ] AgentSimulationTest.TestPriusSimpleCar&#10;-1&#10;0&#10;1&#10;2&#10;3&#10;4&#10;5&#10;6&#10;7&#10;8&#10;9&#10;10&#10;11&#10;12&#10;13&#10;14&#10;15&#10;16&#10;17&#10;18&#10;19&#10;20&#10;21&#10;22&#10;23&#10;24&#10;"/>
  </testcase>
</testsuite>

Which is in essence is the same failure message shown in the issue's description.

agalbachicar commented 3 years ago

I was barely able to reproduce it, just a few times. (reducing the cpu usage of the container and doing a bit of magic).

Consider commenting out tests to do a one vs all evaluation and find the test that has the segfault. Another alternative is to split tests across multiple buildable units.

francocipollone commented 3 years ago

I was able to reproduce the error by doing the following:

I am running 3 containers at the same time.
Each container was run with the --cpus 0.1 command in order to use 10% of only one cpu.
The test is executed in a loop 100 times.
Just for not wasting time in other tests I commented out other tests and focused it on TestPriusSimpleCar and TestPriusUnicycleCar which they both had presented the aforementioned error.

By doing this I was able to enforce the error between 10 and 20 times (out of 300 runs.)

Once I could reproduce it in a "controlled" way I commented out the lines related to ignition In particular I removed
```
test::IgnMonitor<ignition::msgs::AgentState_V> ign_monitor(kStateTopicName);
```
and all the lines related to the ign messages. ---> No error was thrown.

The next test I did was to just left the creation of the ignMonitor object:

test::IgnMonitor<ignition::msgs::AgentState_V> ign_monitor(kStateTopicName);

in the TestPriusSimpleCar test. ---> run_test.py: return code -11

I guess that when the ign_montior object's resources are meant to be deallocated the error comes up.

So now the problem is way more narrowed down than before.

agalbachicar commented 3 years ago

Excellent investigation @francocipollone !

francocipollone commented 3 years ago

Update:

Context: IgnMonitor class subscribes to a topic and attaches a callback method to count and save every new message. The problem comes up when the resources are being deallocated, It happens that, IgnMonitor's object is destroyed and after this, the callback method is called which leads to a SEGFAULT.

I tried destroying the IgnMonitor object at the end of the execution and adding a node_.Unsubscribe() in the destructor method but the results are the same, the callback is still being called after the class is destroyed(and the node is unsubscribed)

I'll continue working on a solution

francocipollone commented 3 years ago

Solved. Further discussion if any should continue in #760

maliput / delphyne

UNIT_agent_simulation_builder_test failed in gcc but not with clang. #751