I tried the approaches below, but the cache cannot be disabled.
I have 2 TB of data, and processing it produced more than 2 TB of cache files, which is putting pressure on my storage. I need to either disable caching or have the cache cleaned up immediately after processing. None of the approaches below work. Please help!
from datasets import disable_caching, load_dataset
from transformers import AutoTokenizer

# Attempt 1: disable caching globally
disable_caching()

tokenizer = AutoTokenizer.from_pretrained(args.tokenizer_path)

def tokenization_fn(examples):
    column_name = 'text' if 'text' in examples else 'data'
    tokenized_inputs = tokenizer(
        examples[column_name], return_special_tokens_mask=True, truncation=False,
        max_length=tokenizer.model_max_length
    )
    return tokenized_inputs

# Attempt 2: pass cache_dir=None and clean up after loading
data = load_dataset('json', data_files=save_local_path, split='train', cache_dir=None)
data.cleanup_cache_files()

# Attempt 3: skip cached results in map() and clean up again afterwards
updated_dataset = data.map(tokenization_fn, load_from_cache_file=False)
updated_dataset.cleanup_cache_files()
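To put a number on the cache growth, a small helper like the one below can be run before and after the `map` call. This is only a sketch: `dir_size_bytes` and `assumed_cache_dir` are hypothetical names, not part of `datasets`, and the cache location is an assumption (the `HF_DATASETS_CACHE` environment variable if set, otherwise `~/.cache/huggingface/datasets`). With caching disabled, files may also be written to a temporary directory that this helper would not see.

```python
import os
from pathlib import Path


def dir_size_bytes(root):
    """Return the total size in bytes of all regular files under root.

    Returns 0 if root does not exist. Symlinks are skipped so that
    linked files are not counted twice.
    """
    root = Path(root).expanduser()
    if not root.is_dir():
        return 0
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = Path(dirpath) / name
            try:
                if not path.is_symlink():
                    total += path.stat().st_size
            except OSError:
                # The file may disappear while we walk; ignore it.
                pass
    return total


def assumed_cache_dir():
    """Assumed cache location: HF_DATASETS_CACHE, else the usual default."""
    default = "~/.cache/huggingface/datasets"
    return Path(os.environ.get("HF_DATASETS_CACHE", default)).expanduser()


if __name__ == "__main__":
    size_gib = dir_size_bytes(assumed_cache_dir()) / 1024**3
    print(f"{assumed_cache_dir()}: {size_gib:.2f} GiB")
```

Comparing the printed size before and after `data.map(...)` would show how much of the extra disk usage comes from this directory.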
Expected behavior
No cache files are generated.
Environment info
Ubuntu 20.04.6 LTS
datasets 3.0.2