huggingface / datasets

🤗 The largest hub of ready-to-use datasets for ML models with fast, easy-to-use and efficient data manipulation tools
https://huggingface.co/docs/datasets
Apache License 2.0

Cannot load the cache when mapping the dataset #7261

Open zhangn77 opened 3 weeks ago

zhangn77 commented 3 weeks ago

Describe the bug

I'm training the Flux ControlNet. train_dataset.map() takes a long time to finish. However, when I kill a training process and want to restart a new training run with the same dataset, I can't reuse the mapped result, even though I defined the cache dir for the dataset.

    with accelerator.main_process_first():
        from datasets.fingerprint import Hasher

        # fingerprint used by the cache for the other processes to load the result
        # details: https://github.com/huggingface/diffusers/pull/4038#discussion_r1266078401
        new_fingerprint = Hasher.hash(args)
        train_dataset = train_dataset.map(
            compute_embeddings_fn, batched=True, new_fingerprint=new_fingerprint, batch_size=10,
        )

Steps to reproduce the bug

Train the Flux ControlNet, kill the process, then start training again with the same dataset and cache dir.

Expected behavior

The dataset should load the mapped result from the cache instead of mapping again.

Environment info

latest diffusers