huggingface / tokenizers

💥 Fast State-of-the-Art Tokenizers optimized for Research and Production
https://huggingface.co/docs/tokenizers
Apache License 2.0
8.68k stars 745 forks

LLamaTokenizer with `use_fast=True` / and `use_fast=False` causing memory leak when used with multiprocessing / `dataset.map(num_proc)` #1495

Open michaelfeil opened 2 months ago

michaelfeil commented 2 months ago

When running a `dataset.map` with `num_proc=16`, I am unable to tokenize a ~45GB dataset on a machine with >200GB of RAM. The dataset consists of ~30,000 rows, each with a string of 120-180k characters.

Memory usage increases linearly until it hits the 200GB maximum, after just 2000 such iterations / 2000 rows.
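One way to confirm that each worker's memory grows with the number of rows it has processed (not part of the original report) is to log peak RSS from inside the map function using only the standard library:

```python
import resource

def log_rss(tag: str) -> int:
    """Print and return this process's peak RSS (on Linux, in KiB)."""
    rss_kb = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    print(f"{tag}: peak RSS = {rss_kb} KiB")
    return rss_kb

# Calling log_rss(f"rank {rank}") every few hundred rows inside the map
# function shows whether each worker's footprint keeps climbing.
```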

Other things I have tried:

```python
tokenizer_tinyllama = None

def tokenize(example, rank: int = 0):
    global tokenizer_tinyllama

    # gc.collect()
    if tokenizer_tinyllama is None:
        tokenizer_tinyllama = AutoTokenizer.from_pretrained("TinyLlama/TinyLlama-1.1B-Chat-v1.0", use_fast=True)

    example["input_ids"] = tokenizer_tinyllama(example["content"], max_length=None)["input_ids"]
    example["n_tokens"] = len(example["input_ids"])
    example["content"] = None
    return example
```

```python
def main():
    books3 = datasets.load_dataset("michael/set3_128k", streaming=False, keep_in_memory=False)  # jsonl file, around 45GB
    # books3 = books3.shuffle()

    books3_updated = books3["train"].map(
        tokenize,
        num_proc=N_PROCS,
        with_rank=True,
    )
    books3_updated.push_to_hub(
        "michael/books3_128k_tokenized"
    )

if __name__ == "__main__":
    main()
```
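The pattern above lazily builds one tokenizer per worker process by caching it in a module-level global: `dataset.map` workers are separate processes, so each gets its own copy of the global and constructs the tokenizer exactly once. A self-contained sketch of that caching pattern, with a hypothetical whitespace-splitting stand-in for the real `AutoTokenizer.from_pretrained(...)` call so it runs offline:

```python
from typing import Callable, List, Optional

def build_tokenizer() -> Callable[[str], List[str]]:
    # Hypothetical stand-in for the expensive-to-build tokenizer;
    # the real code calls AutoTokenizer.from_pretrained(...) here.
    return lambda text: text.split()

_tokenizer: Optional[Callable[[str], List[str]]] = None

def tokenize(example: dict, rank: int = 0) -> dict:
    # Lazily initialize once per process; each forked map worker has
    # its own copy of this global, so there is no cross-process sharing.
    global _tokenizer
    if _tokenizer is None:
        _tokenizer = build_tokenizer()
    example["input_ids"] = _tokenizer(example["content"])
    example["n_tokens"] = len(example["input_ids"])
    example["content"] = None  # drop the raw text to save memory
    return example
```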

### Env
OS: Ubuntu 22.04

`pip freeze`:

```text
aiohttp==3.9.4 aiosignal==1.3.1 async-timeout==4.0.3 attrs==21.2.0 Automat==20.2.0 Babel==2.8.0 bcrypt==3.2.0 blinker==1.4 certifi==2020.6.20 chardet==4.0.0 click==8.0.3 cloud-init==23.4.4 colorama==0.4.4 command-not-found==0.3 configobj==5.0.6 constantly==15.1.0 cryptography==3.4.8 datasets==2.18.0 dbus-python==1.2.18 decorator==4.4.2 devscripts===2.22.1ubuntu1 dill==0.3.8 distro==1.7.0 distro-info==1.1+ubuntu0.2 filelock==3.13.4 frozenlist==1.4.1 fsspec==2024.2.0 gpg==1.16.0 hf_transfer==0.1.6 httplib2==0.20.2 huggingface-hub==0.22.2 hyperlink==21.0.0 idna==3.3 importlib-metadata==4.6.4 incremental==21.3.0 jeepney==0.7.1 Jinja2==3.0.3 jsonpatch==1.32 jsonpointer==2.0 jsonschema==3.2.0 keyring==23.5.0 launchpadlib==1.10.16 lazr.restfulclient==0.14.4 lazr.uri==1.0.6 MarkupSafe==2.0.1 more-itertools==8.10.0 multidict==6.0.5 multiprocess==0.70.16 netifaces==0.11.0 numpy==1.26.4 oauthlib==3.2.0 packaging==24.0 pandas==2.2.2 pexpect==4.8.0 protobuf==5.26.1 ptyprocess==0.7.0 pyarrow==15.0.2 pyarrow-hotfix==0.6 pyasn1==0.4.8 pyasn1-modules==0.2.1 PyGObject==3.42.1 PyHamcrest==2.0.2 PyJWT==2.3.0 pyOpenSSL==21.0.0 pyparsing==2.4.7 pyrsistent==0.18.1 pyserial==3.5 python-apt==2.4.0+ubuntu3 python-dateutil==2.9.0.post0 python-debian==0.1.43+ubuntu1.1 python-linux-procfs==0.6.3 python-magic==0.4.24 pytz==2022.1 pyudev==0.22.0 pyxdg==0.27 PyYAML==5.4.1 regex==2023.12.25 requests==2.25.1 safetensors==0.4.3 screen-resolution-extra==0.0.0 SecretStorage==3.3.1 sentencepiece==0.2.0 service-identity==18.1.0 six==1.16.0 sos==4.5.6 ssh-import-id==5.11 systemd-python==234 tokenizers==0.15.2 tqdm==4.66.2 transformers==4.39.3 Twisted==22.1.0 typing_extensions==4.11.0 tzdata==2024.1 ubuntu-advantage-tools==8001 ufw==0.36.1 unattended-upgrades==0.1 unidiff==0.5.5 urllib3==1.26.5 wadllib==1.3.6 xdg==5 xkit==0.0.0 xxhash==3.4.1 yarl==1.9.4 zipp==1.0.0 zope.interface==5.4.0
```

michaelfeil commented 2 months ago

Update: the following function does not seem to exhibit this behavior.

```python
def tokenize(example, rank: int = 0):
    # global tokenizer_tinyllama

    gc.collect()
    # chat = [
    #     {"role": "user", "content": book},
    # ]
    # tokens = tokenizer_tinyllama.apply_chat_template(chat, tokenize=True)
    # if tokenizer_tinyllama is None:
    tokenizer_tinyllama = LlamaTokenizerFast.from_pretrained("TinyLlama/TinyLlama-1.1B-Chat-v1.0", use_fast=True)

    example["input_ids"] = tokenizer_tinyllama(example["content"], max_length=None)["input_ids"]
    example["n_tokens"] = len(example["input_ids"])
    example["content"] = None
    return example
```
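This workaround re-instantiates the tokenizer on every row, which bounds the leak but pays the construction cost per call. Running the map with `batched=True` would amortize that cost over many rows; a hypothetical, self-contained sketch (the `build_tokenizer` stand-in replaces the real `LlamaTokenizerFast.from_pretrained(...)` call):

```python
from typing import Callable, List

def build_tokenizer() -> Callable[[str], List[str]]:
    # Hypothetical stand-in for LlamaTokenizerFast.from_pretrained(...);
    # rebuilt once per batch rather than once per row.
    return lambda text: text.split()

def tokenize_batch(batch: dict, rank: int = 0) -> dict:
    tokenizer = build_tokenizer()  # one fresh tokenizer per *batch*
    ids = [tokenizer(text) for text in batch["content"]]
    return {
        "input_ids": ids,
        "n_tokens": [len(x) for x in ids],
        "content": [None] * len(ids),  # drop raw text, keep column lengths equal
    }

# With datasets, this would be used roughly as:
# books3["train"].map(tokenize_batch, batched=True, batch_size=100,
#                     num_proc=N_PROCS, with_rank=True)
```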

github-actions[bot] commented 1 month ago

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.

michaelfeil commented 1 month ago

No, not stale!

noamgai21 commented 1 month ago

I'm also encountering a similar issue with tokenizers 0.19.1.

noamgai21 commented 1 month ago

Opened a new issue with a more general reproduction; I believe this is a more common problem.

soldni commented 4 weeks ago

Same issue here.

ArthurZucker commented 4 weeks ago

Thanks all for these. Is the issue more with `AutoTokenizer` than with `LlamaTokenizerFast`?