huawei-noah / SMARTS

Scalable Multi-Agent RL Training School for Autonomous Driving

CI tests have non-deterministic results #1230

Open sah-huawei opened 2 years ago

sah-huawei commented 2 years ago

BUG REPORT

High Level Description
This week I've noticed our CI tests failing more often. However, when the tests are re-run, they then pass. We need to track down the source of this non-determinism.

SMARTS version
latest (~0.5) on the develop branch

This happened to me on PRs #1222, #1219, and #1205. Each failed on one of the last commits before merging, but then passed after the tests were re-run.

At least some of the failures seem to have to do with gRPC timeouts.

For example, here's one where the first attempt (linked) failed, but then it passed on re-submission: https://github.com/huawei-noah/SMARTS/actions/runs/1665218258/attempts/1

Here's the failing test summary from that:

=========================== short test summary info ============================
FAILED smarts/env/tests/test_frame_stack.py::test_frame_stack[2]
FAILED smarts/env/tests/test_parallel_env.py::test_sync_async_episodes[True]
FAILED smarts/env/tests/test_parallel_env.py::test_reset[2]
FAILED smarts/env/tests/test_parallel_env.py::test_sync_async_episodes[False]
============ 4 failed, 19 passed, 40 warnings in 105.14s (0:01:45) ============
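For reference, a gRPC timeout surfaces in the Python client as an RpcError with code DEADLINE_EXCEEDED. Below is a minimal, generic sketch (not SMARTS code; the helper name is made up) of how a heavily loaded CI runner can turn a merely slow worker into a timeout:

```python
import grpc

def call_with_deadline(stub_method, request, timeout_s=1.0):
    """Call a unary RPC with a deadline. On a busy CI runner the worker may
    just be slow rather than dead, but the caller still sees a timeout."""
    try:
        return stub_method(request, timeout=timeout_s)
    except grpc.RpcError as e:
        if e.code() == grpc.StatusCode.DEADLINE_EXCEEDED:
            raise TimeoutError(f"RPC exceeded the {timeout_s}s deadline") from e
        raise
```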
Gamenot commented 2 years ago

It happened again. https://github.com/huawei-noah/SMARTS/runs/4742297574?check_suite_focus=true#step:5:995

Note that the error is simulator-state based, so the retry of reset() does not help, as can be seen here:

...
(pid=478) ERROR:SMARTS:Failed to successfully reset after 2 times.
...
File "/__w/SMARTS/SMARTS/.venv/lib/python3.7/site-packages/ray/util/iter.py", line 471, in base_iterator
    yield ray.get(futures, timeout=timeout)
...
  File "/__w/SMARTS/SMARTS/smarts/core/remote_agent.py", line 100, in terminate
    manager_pb2.Port(num=self._worker_address[1])
...
grpc._channel._InactiveRpcError: <_InactiveRpcError of RPC that terminated with:
    status = StatusCode.INVALID_ARGUMENT
    details = "Trying to stop nonexistent worker with a port 44645."

The context seems to come from:

https://github.com/huawei-noah/SMARTS/blob/57b8462c868a96ec8c542471bdaf7e0901d9627c/smarts/zoo/manager_servicer.py#L72-L78

and

https://github.com/huawei-noah/SMARTS/blob/57b8462c868a96ec8c542471bdaf7e0901d9627c/smarts/core/remote_agent.py#L97-L109
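For readers without the source open, here is a rough sketch of the kind of check that produces the INVALID_ARGUMENT above, assuming the manager tracks workers by port. Names and the reply type are placeholders, not the actual smarts/zoo/manager_servicer.py code:

```python
import grpc

class EmptyReply:
    """Placeholder for the real protobuf reply type (assumption)."""

class ManagerServicer:
    """Rough sketch only; not the actual SMARTS manager servicer."""

    def __init__(self):
        self._workers = {}  # port -> worker process handle

    def stop_worker(self, request, context):
        if request.num not in self._workers:
            # If RemoteAgent.terminate() races with manager-side cleanup, the
            # port may already be gone and the second caller sees this error.
            context.abort(
                grpc.StatusCode.INVALID_ARGUMENT,
                f"Trying to stop nonexistent worker with a port {request.num}.",
            )
        worker = self._workers.pop(request.num)
        worker.terminate()
        return EmptyReply()
```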

sah-huawei commented 2 years ago

After much trial-and-error, I've finally made some progress in diagnosing this.

It looks like the problem causing the test in test_frame_stack.py to sometimes-but-not-always fail is the following:

Note that reset() triggers multiple steps (about 11) for this use of the loop scenario because it doesn't return until there are observations for an ego agent, that is, until one or both agents capture a vehicle or their trap patience expires.
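Conceptually, the reset path behaves like the loop below; the function and method names are illustrative only, not the actual SMARTS internals:

```python
def reset_until_ego_observed(sim, max_steps=50):
    """Illustrative sketch: keep stepping until an ego agent has observations,
    i.e. until a trap captures a vehicle or its patience expires."""
    observations = {}
    steps = 0
    while not observations and steps < max_steps:
        observations = sim.step_traffic_and_check_traps()  # hypothetical call
        steps += 1
    return observations
```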

Although I still don't know the source of the non-determinism, the failure path seems more likely to be taken when multiple CI jobs are running simultaneously (which is what I've been doing in order to get any information about it at all), presumably because they bog down the CPUs on the server instance where the CI runs. I don't yet understand why this would affect the positioning of the Sumo-controlled vehicles on each step, but at least I now have a clue about where to look.

sah-huawei commented 2 years ago

Also, regarding the failures coming from the tests in test_parallel_env.py: I've noticed that the seed passed to the sumo command when starting it in SumoTrafficSimulation is different for one of the two parallel processes than for the other process or for the single-env case. This may be relevant, or a red herring; I still need to investigate further.

Gamenot commented 2 years ago

@sah-huawei the difference in seed between the processes should still be OK as long as it is constant. If it is changing then that is an issue.
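To illustrate the distinction being made here (a sketch under my own assumptions, not the SMARTS seeding code): a constant per-process offset stays reproducible, whereas anything derived from run-to-run state does not:

```python
def sumo_seed_for_env(base_seed: int, env_index: int) -> int:
    # A fixed offset per parallel env keeps traffic reproducible even though
    # the two processes end up with different seeds.
    return base_seed + env_index

# By contrast, deriving a seed from os.getpid() or time.time() would differ
# between CI runs and make the traffic non-deterministic.
```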

sah-huawei commented 2 years ago

> @sah-huawei the difference in seed between the processes should still be OK as long as it is constant. If it is changing then that is an issue.

Yeah, I think you're right -- it's a red herring.

sah-huawei commented 2 years ago

To clarify something I wrote above: it's actually not patience that's causing it to take ~11 steps for a vehicle to be captured or a new one created (depending on the pass/fail path); rather, the mission start time is 0.1 and the step time is 0.01, so it takes 10 steps before the trap is "ready".

Once it is ready, either the vehicle will be immediately captured (in the passing case) or a new one will be immediately created (in the failing case) -- or at least creation will be attempted. So the problem to solve is why the vehicle isn't captured when the test fails.
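For the record, the arithmetic behind the ~11-step reset:

```python
mission_start_time = 0.1  # seconds
step_time = 0.01          # seconds per simulation step
steps_until_trap_ready = int(round(mission_start_time / step_time))
assert steps_until_trap_ready == 10
# One further step then either captures an existing vehicle (pass) or tries to
# create a new one (fail), giving the ~11 steps observed during reset().
```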

To that end, it appears the vehicle's position is different in each case. On the crucial step, it's:

Edit: The above position values for the case where it fails to capture turned out NOT to be from car-flow-random-route-17fc695a-7772817341121806087--4-0.0 (that vehicle is somewhere else entirely!) but from some other Sumo vehicle. This implies to me that the scenarios/loop/traffic/basic.rou.xml file generated by the CI job in the failing case may be different.
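One cheap way to test that hypothesis would be to fingerprint the generated route file in CI and compare the digests between passing and failing runs. This is just a suggestion, not existing tooling in the repo:

```python
import hashlib
from pathlib import Path

def route_file_digest(path="scenarios/loop/traffic/basic.rou.xml"):
    """Hash the generated SUMO route file so CI logs from passing and failing
    runs can be compared for byte-level differences."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()

if __name__ == "__main__":
    # Print the digest in the CI job log for later comparison.
    print(route_file_digest())
```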