LLamaTokenizer with `use_fast=True` / and `use_fast=False` causing memory leak when used with multiprocessing / `` #1495

Open michaelfeil opened 2 months ago

michaelfeil commented 2 months ago

When running a `map` with num_proc=16, I am unable to tokenize a ~45GB dataset on a machine with >200GB Vram. The dataset consists of ~30000 rows, each a string of 120-180k characters.

The memory increases linearly until it reaches the 200GB maximum, after just 2000 such iterations / 2000 lines.
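
For anyone trying to reproduce the linear growth described above, here is a minimal sketch of how per-process memory could be logged every few calls. It uses only the standard library `resource` module (Unix only, which matches the Ubuntu setup here). `peak_rss_mib` and `with_memory_log` are hypothetical helper names, not part of `datasets` or `tokenizers`.

```python
import resource
import sys


def peak_rss_mib() -> float:
    """Peak resident set size of the current process, in MiB."""
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # ru_maxrss is reported in kilobytes on Linux and in bytes on macOS.
    divisor = 1024 * 1024 if sys.platform == "darwin" else 1024
    return peak / divisor


def with_memory_log(fn, every: int = 100):
    """Wrap fn so that peak RSS is printed every `every` calls (hypothetical helper)."""
    calls = 0

    def wrapper(*args, **kwargs):
        nonlocal calls
        result = fn(*args, **kwargs)
        calls += 1
        if calls % every == 0:
            print(f"calls={calls} peak_rss={peak_rss_mib():.1f} MiB", flush=True)
        return result

    return wrapper
```

Wrapping the tokenize function (for example `with_memory_log(tokenize)`) in a single-process run first should show whether the peak keeps climbing with the number of rows. Since this reports the peak rather than the current RSS, it only tells you that memory grew, not whether it was later released.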

Other things I have tried:

import datasets
from transformers import AutoTokenizer

# Tokenizer is cached in a module-level global so each worker loads it only once.
tokenizer_tinyllama = None

def tokenize(example, rank: int = 0):
    global tokenizer_tinyllama

    # gc.collect()
    if tokenizer_tinyllama is None:
        tokenizer_tinyllama = AutoTokenizer.from_pretrained("TinyLlama/TinyLlama-1.1B-Chat-v1.0", use_fast=True)

    example["input_ids"] = tokenizer_tinyllama(example["content"], max_length=None)["input_ids"]
    example["n_tokens"] = len(example["input_ids"])
    example["content"] = None
    return example

def main():
    books3 = datasets.load_dataset("michael/set3_128k", streaming=False, keep_in_memory=False)  # jsonl file, around 45GB in jsonl
    # books3 = books3.shuffle()

    books3_updated = books3["train"].map(

if __name__ == "__main__":
    main()

### Env
OS: Ubuntu 22.04

PIP freeze

datasets==2.18.0
tokenizers==0.15.2
transformers==4.39.3

michaelfeil commented 2 months ago

Update: the following function does not seem to show this behavior.

from transformers import LlamaTokenizerFast

def tokenize(example, rank: int = 0):
    # global tokenizer_tinyllama

    # chat = [
    #     {"role": "user", "content": book},
    # ]
    # tokens = tokenizer_tinyllama.apply_chat_template(chat, tokenize=True)
    # if tokenizer_tinyllama is None:
    # The tokenizer is now re-created on every call instead of being cached in a global.
    tokenizer_tinyllama = LlamaTokenizerFast.from_pretrained("TinyLlama/TinyLlama-1.1B-Chat-v1.0", use_fast=True)

    example["input_ids"] =  tokenizer_tinyllama(example["content"], max_length=None)["input_ids"]
    example["n_tokens"] = len(example["input_ids"])
    example["content"] = None
    return example
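
Re-creating the tokenizer on every call is slow, so a possible middle ground is to rebuild it only every N calls. This is a sketch under the assumption that the growth is tied to the long-lived tokenizer instance, which is not confirmed in this thread. `RefreshingCache` is a hypothetical helper, not a `transformers` or `tokenizers` API.

```python
from typing import Callable, Generic, Optional, TypeVar

T = TypeVar("T")


class RefreshingCache(Generic[T]):
    """Holds one object per process and rebuilds it every `max_uses` calls."""

    def __init__(self, factory: Callable[[], T], max_uses: int = 100) -> None:
        self._factory = factory
        self._max_uses = max_uses
        self._obj: Optional[T] = None
        self._uses = 0

    def get(self) -> T:
        # Build on first use, or once the current instance has used up its budget.
        if self._obj is None or self._uses >= self._max_uses:
            self._obj = self._factory()  # the previous instance is dropped here
            self._uses = 0
        self._uses += 1
        return self._obj
```

Inside `tokenize`, the factory would be something like `lambda: LlamaTokenizerFast.from_pretrained("TinyLlama/TinyLlama-1.1B-Chat-v1.0")`, with `cache.get()` replacing the direct `from_pretrained` call. Whether this actually bounds memory in the multiprocessing case is untested.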

github-actions[bot] commented 1 month ago

michaelfeil commented 1 month ago

No, not stale!

noamgai21 commented 1 month ago

I also encounter a similar issue with 0.19.1.

noamgai21 commented 1 month ago

Opened a new issue with a more general reproduction; I believe this is a more common problem.

soldni commented 4 weeks ago

Same issue here.

ArthurZucker commented 4 weeks ago

Thanks all for these. Is the issue more with AutoTokenizer than LlamaTokenizerFast?