Open Zandew opened 11 months ago
@c21 @Zandew can you follow up on target ray release priority and size?
I think we were going to revisit this once LazyBlockList is deprecated, which is scheduled for 2.10, although I think this error can potentially happen in other places as well.
priority @Zandew and sizing of work here?
What happened + What you expected to happen
Copied from #41299:
When adding a lock to _get_or_create_stats_actor(), the result was a deadlock. Somehow it looks like the DataIterator.__del__ of an old iterator belonging to a previous test is triggered somewhere while calling _StatsActor.remote(). I don't know why it wasn't gc'd before this. This is the stack trace of the main thread; I labelled the two entrances:

I'm not sure how it jumps from fn(*args, **kwargs) to StatsManager.clear_iteration_metrics(self._get_dataset_tag()). This part is flaky too: sometimes __del__ is called from someplace in pickle.dumps while serializing the _StatsActor.

This is sometimes reproducible by running:
python -m pytest test_map.py::test_map_batches_basic test_map.py::test_map_batches_extra_args
The DataIterator that is deleted belongs to the last iterator in test_map_batches_basic:

This error does not happen if we remove this part of the test or make it not raise an error. It also doesn't happen if we call gc.collect() after this.

This was temporarily solved by using an rlock.
Versions / Dependencies
https://github.com/ray-project/ray/commit/ebc7a39808878050ff94af6d7106d47e93a2e1e3
Reproduction script
above
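
A Ray-free sketch of the suspected mechanism, in case it helps: a finalizer that needs a lock runs on the same thread that already holds it, because garbage collection is triggered inside the critical section. The names Finalizable and finalizer_can_acquire are hypothetical, and the explicit gc.collect() stands in for whatever allocation triggers collection in the real code. The finalizer uses a non-blocking acquire so the sketch reports the outcome instead of hanging.

```python
import gc
import threading


class Finalizable:
    """Hypothetical object whose finalizer needs the shared lock."""

    def __init__(self, lock, log):
        self._lock = lock
        self._log = log
        self._cycle = self  # self-reference: only the cyclic GC can free this

    def __del__(self):
        # A blocking acquire here would hang forever with a plain Lock
        # when the current thread already holds it.
        acquired = self._lock.acquire(blocking=False)
        self._log.append(acquired)
        if acquired:
            self._lock.release()


def finalizer_can_acquire(lock):
    """Run the finalizer while `lock` is held by the current thread."""
    log = []
    was_enabled = gc.isenabled()
    gc.disable()  # make sure collection happens only where we ask for it
    try:
        Finalizable(lock, log)  # unreachable right away, kept alive by its cycle
        with lock:
            gc.collect()  # stands in for an allocation that triggers GC
    finally:
        if was_enabled:
            gc.enable()
    return log


if __name__ == "__main__":
    print("Lock: ", finalizer_can_acquire(threading.Lock()))
    print("RLock:", finalizer_can_acquire(threading.RLock()))
```

With threading.Lock the finalizer cannot take the lock (a blocking acquire would deadlock), while threading.RLock lets the same thread re-enter, which matches the temporary fix. An RLock only hides the re-entrancy, though; the finalizer still runs in the middle of the critical section.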
Issue Severity
Low: It annoys or frustrates me.