pytorch / data

A PyTorch repo for data loading and utilities to be shared by the PyTorch domain libraries.

Promise files for caching not reliably deleted #893

Open · PKizzle opened 1 year ago

PKizzle commented 1 year ago

🐛 Describe the bug

With version 0.5.0, the promise files for caching are no longer reliably deleted. Instead of deleting a promise file as soon as its cache file has been created, torchdata seems to wait until the end of the datapipe before attempting to delete it. I did not encounter this issue with version 0.4.1.

Versions

PyTorch version: 1.13.0
Is debug build: False
CUDA used to build PyTorch: None
ROCM used to build PyTorch: N/A

OS: macOS 13.1 (arm64)
GCC version: Could not collect
Clang version: 15.0.4
CMake version: version 3.24.3
Libc version: N/A

Python version: 3.10.8 (main, Oct 13 2022, 09:48:40) [Clang 14.0.0 (clang-1400.0.29.102)] (64-bit runtime)
Python platform: macOS-13.1-arm64-arm-64bit
Is CUDA available: False
CUDA runtime version: No CUDA
CUDA_MODULE_LOADING set to: N/A
GPU models and configuration: No CUDA
Nvidia driver version: No CUDA
cuDNN version: No CUDA
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

Versions of relevant libraries:
[pip3] numpy==1.23.4
[pip3] pytorch-lightning==1.8.1
[pip3] torch==1.13.0
[pip3] torchdata==0.5.0
[pip3] torchdistx==0.2.0+cpu
[pip3] torchmetrics==0.10.2
[pip3] torchtext==0.14.0
[pip3] torchvision==0.14.0

ejguan commented 1 year ago

Yeah. We did that to make sure the cache works with 1-to-N scenarios, such as caching the results of untarring an archive. It would be good if you could share a minimal script so we can understand your use case.
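For context, a minimal sketch of the 1-to-N pattern being referred to (the URL, cache directory, and member names below are hypothetical, not from this issue): one downloaded archive expands into several extracted files, each cached under its own path, so the cache for the archive is only complete once every derived file has been written.

import os

import torchdata.datapipes as dp

CACHE_DIR = os.path.expanduser("~/.cache/demo_archive")  # hypothetical location

def archive_fn(url):
    # One cache (and promise) file for the downloaded archive itself
    return os.path.join(CACHE_DIR, os.path.basename(url))

url_dp = dp.iter.IterableWrapper(["https://example.com/data.tar"])  # hypothetical URL
archive_dp = url_dp.on_disk_cache(filepath_fn=archive_fn)
archive_dp = dp.iter.HttpReader(archive_dp).end_caching(mode="wb", same_filepath_fn=True)

# 1-to-N step: one archive yields several extracted files, each cached on
# its own path; the cache only counts as complete once all of them exist.
member_dp = archive_dp.on_disk_cache(
    filepath_fn=lambda archive_path: [
        os.path.join(CACHE_DIR, name) for name in ("train.tsv", "test.tsv")  # expected members (hypothetical)
    ]
)
member_dp = dp.iter.FileOpener(member_dp, mode="b").load_from_tar()
member_dp = member_dp.end_caching(
    mode="wb",
    filepath_fn=lambda member_path: os.path.join(CACHE_DIR, os.path.basename(member_path)),
)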

PKizzle commented 1 year ago

It would be nice to have a flag to alter the behavior for each use case. For me, the current behavior just results in a caching timeout, and DataLoader2 only outputs the first element of the datapipe. I'll try to write a minimal script, but that may take some time.

PKizzle commented 1 year ago

This is my use case: I create a large TSV file using Polars, write it into a BytesIO buffer, and then it is cached and read back from the file system:

import os

import torchdata.datapipes as dp
import polars as pl

from torchdata.dataloader2 import DataLoader2, MultiProcessingReadingService
from io import BytesIO

def filepath_fn(file_name):
    # Resolve the on-disk cache location for a given pipeline item
    return os.path.join(os.path.expanduser('~/.cache/torch/text/datasets/demo'), file_name)

def demo_map(file_name):
    # Generate a large dummy TSV in memory and return it as bytes
    output = BytesIO()
    data = [{"a": 1, "b": 4} for _ in range(4000000)]
    df = pl.from_dicts(data)
    df.write_csv(output, sep="\t")
    output.seek(0)
    output_bytes = output.read()
    output.close()

    return file_name, [output_bytes]

if __name__ == '__main__':
    data_pipe = dp.iter.IterableWrapper(["demo"])
    # Cache the generated TSV on disk; a promise file marks the pending cache
    data_pipe = data_pipe.on_disk_cache(filepath_fn=filepath_fn)
    data_pipe = data_pipe.map(demo_map)
    data_pipe = data_pipe.end_caching(mode="wb", same_filepath_fn=True, timeout=10)
    # Read the cached file back from disk and parse it
    data_pipe = dp.iter.FileOpener(data_pipe, encoding="utf-8").parse_csv(delimiter="\t", skip_lines=1)

    multi_processing_reading_service = MultiProcessingReadingService(num_workers=2)
    data_loader = DataLoader2(data_pipe, reading_service=multi_processing_reading_service)

    # Drain the pipeline; with multiple workers this hits the caching timeout
    for data in data_loader:
        continue

It only works without MultiProcessingReadingService, or by adding a map after parse_csv that manually removes the promise file (see the sketch below).
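A hedged sketch of that workaround, reusing filepath_fn from the script above; the ".promise" suffix is an assumption about torchdata's internal promise-file naming, not a public API:

import os

def _remove_promise(row):
    # Hypothetical cleanup: delete the leftover promise file once rows
    # start flowing; the ".promise" suffix is assumed, not guaranteed.
    promise_file = filepath_fn("demo") + ".promise"
    if os.path.exists(promise_file):
        os.remove(promise_file)
    return row

data_pipe = data_pipe.map(_remove_promise)  # added after parse_csv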

While this example uses Polars 0.14.28 to generate dummy data, the same problem appears when reading a large (>10 MB) TSV file and writing it into the BytesIO buffer.

ejguan commented 1 year ago

Thank you for providing the pipeline. I will take a look!