sah-huawei opened 2 years ago
It happened again. https://github.com/huawei-noah/SMARTS/runs/4742297574?check_suite_focus=true#step:5:995
I can note that the error is simulator-state based, so the retry of reset() does not help, as can be seen here:
...
(pid=478) ERROR:SMARTS:Failed to successfully reset after 2 times.
...
File "/__w/SMARTS/SMARTS/.venv/lib/python3.7/site-packages/ray/util/iter.py", line 471, in base_iterator
yield ray.get(futures, timeout=timeout)
...
File "/__w/SMARTS/SMARTS/smarts/core/remote_agent.py", line 100, in terminate
manager_pb2.Port(num=self._worker_address[1])
...
grpc._channel._InactiveRpcError: <_InactiveRpcError of RPC that terminated with:
status = StatusCode.INVALID_ARGUMENT
details = "Trying to stop nonexistent worker with a port 44645."
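For the gRPC error above, one option would be for the terminate() path to treat "stop nonexistent worker" as benign during teardown. A minimal sketch of that idea (not the actual remote_agent.py code; the stop_worker RPC name is assumed from the traceback, only the public grpc API is relied on):

```python
import grpc

def stop_worker_quietly(manager_stub, port_msg):
    """Hypothetical wrapper around the stop-worker RPC shown in the traceback.

    If the worker has already been cleaned up, the manager answers with
    INVALID_ARGUMENT ("Trying to stop nonexistent worker ..."), which could
    arguably be ignored during teardown instead of crashing the rollout.
    """
    try:
        manager_stub.stop_worker(port_msg)  # RPC name assumed from the traceback's terminate() call
    except grpc.RpcError as error:
        if error.code() == grpc.StatusCode.INVALID_ARGUMENT:
            return  # worker already gone; nothing left to stop
        raise
```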
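Relatedly, because the "Failed to successfully reset after 2 times" error appears to be rooted in simulator state, a retry that merely calls reset() again on the same instance can't recover. A hedged sketch of the alternative (build_smarts is a hypothetical factory; reset()/destroy() are assumed simulator methods):

```python
def reset_with_rebuild(build_smarts, scenario, attempts=2):
    """Retry reset() by rebuilding the simulator instead of reusing bad state.

    build_smarts: hypothetical factory returning a fresh simulator instance.
    """
    last_error = None
    for _ in range(attempts):
        smarts = build_smarts()  # fresh core + traffic-sim state on every attempt
        try:
            return smarts, smarts.reset(scenario)
        except Exception as error:  # deliberately broad: this is only a sketch
            last_error = error
            smarts.destroy()  # assumed teardown hook; discard the corrupted state
    raise RuntimeError(f"Failed to successfully reset after {attempts} times.") from last_error
```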
After much trial-and-error, I've finally made some progress in diagnosing this.
It looks like the problem causing the test to sometimes-but-not-always fail in test_stack_frame.py is this: the Sumo vehicle that "Agent-002" is expected to capture (car-flow-random-route-17fc695a-7772817341121806087--4-0.0) shows up right at the point when its patience is also expiring for its trap; but car-flow-random-route-17fc695a-7772817341121806087--4-0.0 is apparently not captured (but now I'm not sure, see below), so "Agent-002" will not get a vehicle at all on that step, and the result of the reset() call will only show one vehicle ("Agent-001"). Note that reset() triggers multiple steps (about 11) for this use of the loop scenario because it doesn't return until there are observations for an ego agent, i.e. until one or both agents capture a vehicle or their trap patience expires.
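For reference, that ~11-step behavior can be sketched roughly like this (a simplified sketch, not SMARTS's actual reset() implementation; both callables are hypothetical stand-ins for internal machinery):

```python
def reset_until_ego_observed(step_simulation, collect_ego_observations, max_steps=1000):
    """Keep stepping the freshly-reset episode until an ego agent has observations.

    step_simulation(): hypothetical hook that advances traffic/physics one step.
    collect_ego_observations(): hypothetical hook returning {agent_id: obs} for
    agents that currently control a vehicle (captured, or created after patience expired).
    """
    for _ in range(max_steps):
        step_simulation()
        observations = collect_ego_observations()
        if observations:  # at least one agent got a vehicle, so reset() can return
            return observations
    raise RuntimeError("no ego agent received a vehicle during reset()")
```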
Although I still don't know the source of the non-determinism, it seems the failure path is more likely to be taken when multiple CI jobs are running simultaneously (which is what I've been doing in order to get any info about it at all), presumably "bogging down" the CPUs on the server instance where the CI runs (?). I don't yet understand why this would affect the Sumo-controlled vehicles' positioning on each step, but at least I now have a clue as to where to look.
Also, regarding the failures coming from the tests in test_parallel_env.py, I've noticed that the seed passed to the sumo command when starting it in SumoTrafficSimulation is different for one of the two parallel processes than is used for the other process or for the single-env case. This may be relevant, or a red herring... I still need to investigate this further.
@sah-huawei the difference in seed between the processes should still be OK as long as it is constant. If it is changing then that is an issue.
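To make "different but constant" concrete, here is a hedged sketch of how a per-env seed could be derived deterministically (the helper and the mapping are hypothetical, not how SumoTrafficSimulation actually receives its seed):

```python
def sumo_seed_for_env(base_seed: int, env_index: int) -> int:
    """Hypothetical: give each parallel env a distinct but reproducible seed.

    The two parallel processes get different seeds (so their traffic differs),
    but the value for a given (base_seed, env_index) never changes between runs,
    so each env's Sumo traffic stays deterministic.
    """
    return base_seed + 997 * env_index  # any fixed, non-colliding mapping works


# base_seed=42 -> env 0 always gets 42, env 1 always gets 1039
per_env_seeds = [sumo_seed_for_env(42, i) for i in range(2)]
```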
Yeah, I think you're right -- it's a red herring.
To clarify something I wrote above: It's actually not patience that's causing it to take ~11 steps for a new vehicle to be created or captured (depending on the pass/fail path), but rather the mission start time is 0.1 and the step time is 0.01, so it takes 10 steps before the trap is "ready".
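The step count falls out of the timing directly (a small worked check using the numbers quoted above):

```python
mission_start_time = 0.1  # seconds before the trap becomes active
timestep = 0.01           # seconds simulated per step

steps_until_trap_ready = round(mission_start_time / timestep)  # == 10
# plus roughly one more step for the capture (or vehicle creation) itself,
# which matches the ~11 steps observed inside reset()
```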
Once it is ready, at that point either the vehicle will be immediately captured (in the passing case) or one will be immediately created (in the failing case) -- or at least attempted. So the problem to solve is why the vehicle isn't captured when it fails.
To that end, it appears its position is different in each case. On the crucial step, it's [146.97441763 49.46865622] when it fails to capture, and [149.00032637 51.81457216] when it captures.

Edit: The above position values for when it fails to capture turned out to NOT be from car-flow-random-route-17fc695a-7772817341121806087--4-0.0 -- it's somewhere else entirely! -- but from some other Sumo vehicle. This implies to me that the scenarios/loop/traffic/basic.rou.xml file generated by the CI job in the failing case may be different.
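One way to check that hypothesis would be to log a digest of the generated route file in both passing and failing CI runs and compare them (a hedged sketch; the path comes from above, the hashing approach is my own suggestion):

```python
import hashlib
from pathlib import Path

def route_file_digest(path: str = "scenarios/loop/traffic/basic.rou.xml") -> str:
    """Hash the generated Sumo route file so CI runs can be compared byte-for-byte."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()

# Printing this in both the passing and failing jobs would show whether the
# scenario build itself is a source of the non-determinism.
print(route_file_digest())
```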
BUG REPORT
High Level Description: This week I've noticed our CI tests failing more often. However, when the tests are re-run, they then pass. We need to track down the source of this non-determinism.
SMARTS version: latest (~0.5) on the develop branch.
This happened to me on PRs #1222, #1219, and #1205. Each failed on one of the last commits before merging, but then passed after the tests were re-run.
At least some of the failures seem to have to do with gRPC timeouts.
For example, here's one where the first attempt (linked) failed, but then it passed on re-submission: https://github.com/huawei-noah/SMARTS/actions/runs/1665218258/attempts/1
Here's the failing test summary from that: