ray-project / ray

Ray is an AI compute engine. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
https://ray.io
Apache License 2.0

[core][data] Object store shared memory not getting cleared after using Ray Data #44610

Open justinvyu opened 7 months ago

justinvyu commented 7 months ago

This is not a high-severity issue, since a new job can still run fine, but it is confusing to see on the dashboard, as no actors hold references to anything in the object store anymore.

Repro script:

import time

import ray
from ray.train.torch import TorchTrainer

file_uri = "s3://air-example-data-2/100G-xgboost-data.parquet/d9ef953e9a7347db8793f9e772357e68_000888.parquet"
num_copies = 24_000

ds = ray.data.read_parquet([file_uri for _ in range(num_copies)])

def train_fn(config):
    print("training started...")
    ds = ray.train.get_dataset_shard("train")
    for batch in ds.iter_batches(batch_size=32):
        print(batch)
        time.sleep(2)

trainer = TorchTrainer(
    train_fn,
    scaling_config=ray.train.ScalingConfig(num_workers=4),
    datasets={"train": ds},
)
trainer.fit()

See the object store memory usage after the job finishes at 17:15.

[Screenshot: Ray dashboard showing object store memory usage, 2024-04-09]
stephanie-wang commented 7 months ago

Ah, this is because the Ray object store allocates objects within a shared-memory pool. As long as Ray is still up, you will see this shared-memory usage.
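
To illustrate the general mechanism with only the standard library (this is an analogy, not Ray's actual plasma implementation): a shared-memory segment stays allocated by the OS until its owner explicitly closes and unlinks it, regardless of whether any Python references to the data remain. Ray's object store pool behaves similarly, persisting until the cluster is torn down.

```python
from multiprocessing import shared_memory

# Allocate a shared-memory segment, loosely analogous to the
# pool Ray's object store carves objects out of.
shm = shared_memory.SharedMemory(create=True, size=1024)
shm.buf[:5] = b"hello"

# Copy the data out. Even after all Python-level references to the
# buffer contents are gone, the segment itself remains allocated
# until it is closed and unlinked -- just as the object store's
# shared memory stays visible while Ray is still up.
data = bytes(shm.buf[:5])

shm.close()
shm.unlink()  # loosely comparable to shutting down the Ray cluster

print(data)
```

The key point is that the memory usage shown on the dashboard reflects the lifetime of the pool, not of the individual objects that were placed in it.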