huggingface / datasets

🤗 The largest hub of ready-to-use datasets for ML models with fast, easy-to-use and efficient data manipulation tools
https://huggingface.co/docs/datasets
Apache License 2.0

cache can't be cleaned or disabled #7260

Open · charliedream1 opened 3 weeks ago

charliedream1 commented 3 weeks ago

Describe the bug

I tried the following approaches, but the cache can't be disabled.

I have about 2 TB of data, yet more than 2 TB of cache files are generated on top of it, which puts serious pressure on my storage. I need the cache either disabled entirely or cleaned immediately after processing. None of the approaches below work, please give some help!

from datasets import disable_caching, load_dataset
from transformers import AutoTokenizer

disable_caching()

tokenizer = AutoTokenizer.from_pretrained(args.tokenizer_path)

def tokenization_fn(examples):
    # The text column is named either 'text' or 'data' depending on the file
    column_name = 'text' if 'text' in examples else 'data'
    tokenized_inputs = tokenizer(
        examples[column_name], return_special_tokens_mask=True, truncation=False,
        max_length=tokenizer.model_max_length
    )
    return tokenized_inputs

# cache_dir=None only falls back to the default cache directory
data = load_dataset('json', data_files=save_local_path, split='train', cache_dir=None)
data.cleanup_cache_files()
updated_dataset = data.map(tokenization_fn, load_from_cache_file=False)
updated_dataset.cleanup_cache_files()
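
One possible workaround (an illustration added here, not part of the original report) is to hold everything in RAM so that map() never writes a cache file: both load_dataset() and Dataset.map() accept a keep_in_memory flag. This is only a sketch, assuming the data fits in memory; the tokenizer name and data file below are placeholders.

from datasets import load_dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('my-tokenizer')  # placeholder path

def tokenization_fn(examples):
    column_name = 'text' if 'text' in examples else 'data'
    return tokenizer(
        examples[column_name], return_special_tokens_mask=True,
        truncation=False, max_length=tokenizer.model_max_length
    )

# keep_in_memory=True loads the dataset into RAM; the initial JSON-to-Arrow
# conversion may still write under the cache directory, but map() below
# keeps its result in memory and writes no cache file for it.
data = load_dataset('json', data_files='data.jsonl', split='train',
                    keep_in_memory=True)  # 'data.jsonl' is a placeholder
updated_dataset = data.map(tokenization_fn, keep_in_memory=True,
                           load_from_cache_file=False)

With roughly 2 TB of data this only works on a machine with comparable RAM; otherwise, confining the cache to a disposable directory (sketched after the environment info below) may be more practical.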

Expected behavior

no cache file generated

Environment info

OS: Ubuntu 20.04.6 LTS
datasets: 3.0.2
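
A further hedged sketch, again not from the original report: even with disable_caching(), the initial JSON-to-Arrow conversion in load_dataset('json', ...) still writes files under the cache directory, so one practical option is to confine all cache files to a scratch directory via the cache_dir argument and delete that directory once processing is finished. The data file path below is a placeholder.

import shutil
import tempfile

from datasets import load_dataset

# Disposable scratch directory that will hold every cache file
scratch_dir = tempfile.mkdtemp(prefix='hf_datasets_cache_')

data = load_dataset('json', data_files='data.jsonl', split='train',
                    cache_dir=scratch_dir)  # 'data.jsonl' is a placeholder
# ... tokenize / process, saving results outside scratch_dir ...

# Drop references so the memory-mapped Arrow files are closed,
# then reclaim the space in one sweep
del data
shutil.rmtree(scratch_dir, ignore_errors=True)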