Closed ThomasCassimon closed 1 year ago
thanks for bringing this up @ThomasCassimon. We'll try to get a fix out for this in the coming week.
Quick note, looking at the release notes for 2.3, it seems that the exception thrown was changed to ValueError
(See: https://github.com/ray-project/ray/pull/30255)
@avnishn do you expect this fix will make it into the next release of ray (2.4), or will it take longer to fix this issue?
I'm attempting to repro your bug right now. If it is in fact a bug, the fix will land for 2.4.
ok the fix is up. Thank you for making such a great repro script @ThomasCassimon
What happened + What you expected to happen
Recent changes in the Replay Buffer APIs cause Apex DQN to crash while trying to add a sample to its replay buffer.
The reproduction script below uses TensorFlow and CartPole-v1, but I have observed the same behaviour with PyTorch and a custom environment.
When I run the reproduction script below, I get the following output:
At some point in the output, you can see the line
DeprecationWarning: `add_batch` has been deprecated. Use `ReplayBuffer.add()` instead.
This causes an exception to be thrown in the worker. The exception propagates up to `ray`'s exception handling, which tries to access a `config` member on the object that threw the exception (a `MultiAgentPrioritizedReplayBuffer`). That access fails with `AttributeError: 'MultiAgentPrioritizedReplayBuffer' object has no attribute 'config'`.
After this point, the tune trial gets stuck. Tune claims the trial is running, but no progress is made.
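To illustrate the failure pattern described above, here is a minimal sketch in plain Python. It does not use Ray and is not Ray's actual code: `FakeReplayBuffer`, `DeprecatedCallError`, `handle_worker_error` and `run_once` are all hypothetical stand-ins. It only shows how an error handler that assumes a `config` attribute can replace the original deprecation error with an `AttributeError`.

```python
class DeprecatedCallError(Exception):
    """Hypothetical stand-in for the error raised by a deprecated method."""


class FakeReplayBuffer:
    """Hypothetical stand-in for a buffer object that has no `config` attribute."""

    def add_batch(self, batch):
        # Mimics a deprecated method that raises instead of doing any work.
        raise DeprecatedCallError(
            "`add_batch` has been deprecated. Use `add()` instead."
        )


def handle_worker_error(obj, exc):
    # Hypothetical handler that assumes every object carries a `config`.
    # For FakeReplayBuffer this attribute lookup raises AttributeError.
    return f"{exc} (config={obj.config})"


def run_once(buffer):
    try:
        buffer.add_batch([1, 2, 3])
    except DeprecatedCallError as exc:
        try:
            return handle_worker_error(buffer, exc)
        except AttributeError as secondary:
            # The secondary error hides the original deprecation message.
            return f"masked by: {secondary}"


result = run_once(FakeReplayBuffer())
print(result)
```

In this sketch the deprecation message never reaches the caller; only the secondary `AttributeError` is reported, which resembles the symptom in the output above.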
I have recreated the circumstances for this bug in the reproduction script below (using the `throw_replay_buffer_error` function) and can confirm that this throws an exception and doesn't print `Bla!`.
Expected behaviour: I expect to be able to train an Apex DQN agent on the CartPole problem without crashes.
Versions / Dependencies
OS:
Ray version:
Python version:
Reproduction script
Issue Severity
High: It blocks me from completing my task.