This is not a high-severity issue, since a new job is still able to run fine, but it's a bit confusing to see the object store memory usage remain on the dashboard after the job finishes, as no actors hold references to anything in the object store anymore.
Repro script:
import time

import ray
import ray.data
import ray.train
from ray.train.torch import TorchTrainer

file_uri = "s3://air-example-data-2/100G-xgboost-data.parquet/d9ef953e9a7347db8793f9e772357e68_000888.parquet"
num_copies = 24_000
ds = ray.data.read_parquet([file_uri for i in range(num_copies)])

def train_fn(config):
    print("training started...")
    ds = ray.train.get_dataset_shard("train")
    for batch in ds.iter_batches(batch_size=32):
        print(batch)
        time.sleep(2)

trainer = TorchTrainer(train_fn, scaling_config=ray.train.ScalingConfig(num_workers=4), datasets={"train": ds})
trainer.fit()
See the object store memory usage after the job finishes at 17:15.
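For what it's worth, here is a minimal sketch (assuming Ray 2.x with the state API available, i.e. ray[default] installed) of double-checking that nothing still holds object references once the job is done:

import ray
from ray.util.state import list_objects

ray.init(address="auto")  # attach to the already-running cluster
# After trainer.fit() returns, this should come back (nearly) empty,
# since no actors or tasks hold references to the dataset blocks anymore.
print(list_objects(limit=10))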
Ah, this is because the Ray object store allocates objects within the shared memory pool. As long as Ray is still up, you will see this shared memory usage.
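If you want to verify that yourself, here is a minimal sketch (Linux-specific, assuming a local cluster started with ray.init(); /dev/shm is where the object store's shared-memory pool lives by default):

import shutil
import ray

def shm_used_gb():
    # The Ray object store is backed by tmpfs at /dev/shm on Linux.
    return shutil.disk_usage("/dev/shm").used / 1e9

ray.init()
print(f"/dev/shm used while Ray is up: {shm_used_gb():.2f} GB")

# Stopping Ray tears down the shared-memory pool, so the usage shown
# on the dashboard disappears along with it.
ray.shutdown()
print(f"/dev/shm used after shutdown: {shm_used_gb():.2f} GB")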