Closed: arxyzan closed this issue 5 months ago.
Python is already able to clean things up on its own, and yes, the Rust backend also cleans up after itself (unless there's a bug).
What's more likely is that the various datasets contain different data, namely different sentence lengths, which trigger different kinds of memory usage.
Thanks @Narsil! Another question: do the following scenarios result in the same tokenizer:
Thanks in advance!
Hi, I'm trying to train a new tokenizer using a Llama fast tokenizer, following the instructions at https://huggingface.co/learn/nlp-course/chapter6/2#training-a-new-tokenizer. The problem is that even with batch iteration, I get an OOM error and the kernel crashes. I've seen other people hit this problem too, with no workaround found.
My solution is to train the tokenizer in multiple steps, on different shards of the dataset. (I'm not sure whether this results in the same tokenizer as training in one pass!)
Reproducible Code
The reproducible code is as follows:
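A minimal sketch of the sharded loop described above, assuming the `train_new_from_iterator` recipe from the linked course chapter (the dataset, checkpoint, and vocabulary size are illustrative, not the exact values from my setup):

```python
import gc

from datasets import load_dataset
from transformers import AutoTokenizer

# Illustrative dataset and checkpoint; any text dataset and fast tokenizer work
dataset = load_dataset("wikitext", "wikitext-103-raw-v1", split="train")
old_tokenizer = AutoTokenizer.from_pretrained("hf-internal-testing/llama-tokenizer")

num_shards = 10
batch_size = 1000

for shard_index in range(num_shards):
    shard = dataset.shard(num_shards=num_shards, index=shard_index)

    def batch_iterator():
        # Yield raw-text batches so the whole shard is never materialized at once
        for i in range(0, len(shard), batch_size):
            yield shard[i : i + batch_size]["text"]

    # Trains a fresh vocabulary on this shard, reusing the old tokenizer's pipeline
    new_tokenizer = old_tokenizer.train_new_from_iterator(
        batch_iterator(), vocab_size=32000
    )
    new_tokenizer.save_pretrained(f"./tokenizer-shard-{shard_index}")

    # Attempted cleanup at the end of each pass
    del new_tokenizer, shard
    gc.collect()
```

Note that each `train_new_from_iterator` call learns a fresh vocabulary from only the text it is given, so per-shard training is not equivalent to a single pass over the whole dataset.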
Problem
The problem is that still the RAM allocation gradually increases and depending on the full dataset size, OOM error can still happen. The object deletion at the end of each loop was meant to reduce memory on each pass but seems to have no effect since the tokenizer object and trainer reside in the Rust backend. I think if there'd be a way to also delete objects in the Rust backend from Python code, the problem would not arise anymore, OR maybe there is another workaround for this which I don't know!