Closed: Hassaan68 closed this issue 2 days ago
Remove `include_num_input_tokens_seen: True` from the config.
@hiyouga, thank you :) It has improved, but it is still not completely avoiding the cache. I am also using --overwrite_cache True
in my command, but the datasets library is still using a huge amount of cache space, as you can see below:
17G /home/sagemaker-user/.cache/huggingface/hub
43G /home/sagemaker-user/.cache/huggingface/datasets
60G /home/sagemaker-user/.cache/huggingface
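For anyone hitting the same disk pressure, a hedged sketch of redirecting the Hugging Face caches to a volume with more room. `HF_HOME` and `HF_DATASETS_CACHE` are the standard environment variables the libraries read; the target path here is only a placeholder (on SageMaker you would point it at a larger attached volume):

```shell
# Inspect what is occupying the default cache (ignore errors if a path is absent).
du -sh ~/.cache/huggingface/hub ~/.cache/huggingface/datasets 2>/dev/null || true

# Redirect both the hub and datasets caches to a bigger location.
# /tmp/hf-cache is a placeholder path; adjust for your instance.
export HF_HOME=/tmp/hf-cache
export HF_DATASETS_CACHE=/tmp/hf-cache/datasets
mkdir -p "$HF_DATASETS_CACHE"
```

After verifying nothing still needs the old cache, the original `~/.cache/huggingface/datasets` directory can be removed to reclaim the space.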
Reminder
System Info
Running out of disk space during tokenization. I tried streaming and face the same issue with streaming: true and max_steps: 10000. I am fine-tuning LLaVA on 93,000 images, and the tokenizer reports a "No space left on device" error after tokenizing around 52,000 images. I can see that my SageMaker cache reaches 75 GB at that point, filling the disk. How can I counter this issue?
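The underlying problem is that non-streaming preprocessing tokenizes the whole dataset up front and materializes the result as a cache on disk. A minimal, library-free sketch of the lazy behavior being asked for; `tokenize_record` and the record layout are hypothetical stand-ins for the real image-plus-text processor:

```python
from typing import Dict, Iterable, Iterator

def tokenize_record(record: Dict) -> Dict:
    # Hypothetical stand-in for the real multimodal processor:
    # here we just split the caption into "tokens".
    return {"image": record["image"], "input_ids": record["caption"].split()}

def lazy_tokenized(records: Iterable[Dict]) -> Iterator[Dict]:
    # Yield one tokenized example at a time instead of materializing
    # all 93k examples (and their cache files) before training starts.
    for record in records:
        yield tokenize_record(record)

# Simulated dataset: only one record is ever resident at a time.
dataset = ({"image": f"img_{i}.jpg", "caption": f"a photo number {i}"}
           for i in range(93000))
first = next(lazy_tokenized(dataset))
print(first["input_ids"])  # ['a', 'photo', 'number', '0']
```

This is the pattern streaming mode is meant to provide; when the cache still grows, it usually means some preprocessing step upstream of the iterator is still writing intermediate files to disk.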
The command
Reproduction
There should not be a memory issue; the model should tokenize each image on the fly as it is used, rather than tokenizing all images up front.
Expected behavior
It should be able to tokenize a large number of images.
Others
No response