pfnet / pfio

IO library to access various filesystems with unified API
https://pfio.readthedocs.io/
MIT License

Memory leak related to MultiprocessFileCache? #339

Open dshintani-pfn opened 1 month ago

dshintani-pfn commented 1 month ago

I observed a possible memory leak (~1 GB/h) related to MultiprocessFileCache during training.

I defined a dataset class with a cache, following the tutorial.

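from typing import Any

from pfio.cache import MultiprocessFileCache

# File, TrainData, and TrainAdamCommonConfig are defined elsewhere (not shown in this report).


class CachedDataset:
    def __init__(
        self,
        common_config: TrainAdamCommonConfig,
    ) -> None:
        # One reader per dataset, kept open for the lifetime of this object.
        self._reader_dict = {
            dataset.name: File(dataset.name, mode="a") for dataset in common_config.datasets
        }
        # len(self) relies on a __len__ method that is omitted from this snippet.
        self._cache = MultiprocessFileCache(len(self), do_pickle=True)

    def _load_from_disk(self, i: int) -> TrainData:
        return ...

    def __getitem__(self, i: int) -> Any:
        # Return the cached sample, or load it from disk and cache it on a miss.
        return self._cache.get_and_cache(i, self._load_from_disk)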

I then used this CachedDataset as the dataset for training:

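import torch
from torch.utils.data import DataLoader

# dataset is an instance of CachedDataset.
n_train = int(len(dataset) * train_set_ratio)
train_set, val_set = torch.utils.data.random_split(
    dataset,
    [n_train, len(dataset) - n_train],
)

# num_workers is left at its default (0), so samples are loaded in the main process.
train_loader = DataLoader(
    train_set, batch_size=train_args.batch_size, shuffle=True, collate_fn=collate_fn
)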

The leak went away when I stopped using MultiprocessFileCache.

It might be due to incorrect usage of MultiprocessFileCache on my side, but do you have any idea what could cause this leak?
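
To narrow this down, one option is to read the same indices from the dataset with and without the cache and compare how much memory stays allocated afterwards. Below is a minimal sketch using only the standard library's tracemalloc. The helper `traced_growth` and the two toy datasets are hypothetical names made up for illustration; they are not part of pfio or of my training code. Note that tracemalloc only sees Python-level allocations in the current process, so this fits the setup above (DataLoader with the default `num_workers=0`) but would not catch growth in native buffers or in the OS page cache.

```python
import tracemalloc
from typing import Any, Iterable


def traced_growth(dataset: Any, indices: Iterable[int]) -> int:
    """Return the net number of bytes still allocated after reading `indices`.

    Hypothetical helper: a large positive value suggests that something
    keeps references to the loaded samples (or to per-sample bookkeeping).
    """
    tracemalloc.start()
    before, _peak = tracemalloc.get_traced_memory()
    for i in indices:
        dataset[i]  # result is discarded immediately
    after, _peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return after - before


class PlainDataset:
    """Toy stand-in for a dataset that loads a sample and keeps nothing."""

    def __getitem__(self, i: int) -> bytes:
        return bytes(10_000)


class LeakyDataset:
    """Toy stand-in for a dataset that accidentally retains every sample."""

    def __init__(self) -> None:
        self._kept = []

    def __getitem__(self, i: int) -> bytes:
        item = bytes(10_000)
        self._kept.append(item)  # simulated leak: the sample is never released
        return item


if __name__ == "__main__":
    print("plain:", traced_growth(PlainDataset(), range(200)), "bytes")
    print("leaky:", traced_growth(LeakyDataset(), range(200)), "bytes")
```

Passing the real CachedDataset and a cache-free variant to the same helper should show whether the growth comes from Python objects held in the training process. If both report similar numbers, the growth is more likely outside Python's allocator.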