michaelfeil opened this issue 2 months ago
Update: the following function does not seem to exhibit this behavior.
```python
import gc

from transformers import LlamaTokenizerFast

def tokenize(example, rank: int = 0):
    # global tokenizer_tinyllama
    gc.collect()
    # chat = [
    #     {"role": "user", "content": book},
    # ]
    # tokens = tokenizer_tinyllama.apply_chat_template(chat, tokenize=True)
    # if tokenizer_tinyllama is None:
    tokenizer_tinyllama = LlamaTokenizerFast.from_pretrained(
        "TinyLlama/TinyLlama-1.1B-Chat-v1.0", use_fast=True
    )
    example["input_ids"] = tokenizer_tinyllama(example["content"], max_length=None)["input_ids"]
    example["n_tokens"] = len(example["input_ids"])
    example["content"] = None
    return example
```
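For context, the per-process caching that the commented-out lines gesture at can be sketched with a stdlib stand-in; `DummyTokenizer` here is purely illustrative (the real object would be `LlamaTokenizerFast`, which is expensive to construct), but the caching pattern is the point:

```python
# Sketch: create one heavyweight object per worker process and reuse it,
# instead of re-creating it on every call. DummyTokenizer is a stand-in
# for a real tokenizer.
_tokenizer = None  # module-level cache, one copy per process

class DummyTokenizer:
    def __call__(self, text):
        # Fake "tokenization": one id per character.
        return {"input_ids": [ord(c) % 256 for c in text]}

def tokenize(example, rank: int = 0):
    global _tokenizer
    if _tokenizer is None:      # load once per worker process
        _tokenizer = DummyTokenizer()
    example["input_ids"] = _tokenizer(example["content"])["input_ids"]
    example["n_tokens"] = len(example["input_ids"])
    example["content"] = None   # drop the raw text so it can be freed
    return example

out = tokenize({"content": "abc"})
```

The surprising part of the report above is that re-creating the tokenizer on every call (the uncached version) is the variant that does *not* leak.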
This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.
No, not stale!
I also encounter a similar issue with 0.19.1.
Opened a new issue with a more general reproduction, I believe this is a more common problem.
Same issue here.
Thanks all for these. Is the issue more with `AutoTokenizer` than `LlamaTokenizerFast`?
When running a `dataset.map` with `num_proc=16`, I am unable to tokenize a ~45GB dataset on a machine with >200GB RAM. The dataset consists of ~30000 rows, each a string of 120-180k characters. Memory increases linearly until it hits the 200GB maximum, after just 2000 such iterations / 2000 rows.
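One way to quantify the linear growth described above is to sample the process's peak resident set size as rows are processed; a minimal sketch using the stdlib `resource` module (Unix-only; note `ru_maxrss` is reported in KiB on Linux but bytes on macOS):

```python
import resource

def peak_rss():
    # Peak resident set size of the current process.
    # Units: KiB on Linux, bytes on macOS.
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss

samples = []
for batch in range(3):
    # ... process a batch of rows here ...
    samples.append(peak_rss())

# In a leaking worker, these samples would grow roughly linearly
# with the number of rows processed.
```

In a `dataset.map` worker this could be logged every N rows to confirm which process is accumulating memory.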
Other things I have tried:

- 16 tokenizers in global scope, accessed via the `rank` parameter
- `gc.collect`
- `use_fast` (makes the script more efficient: it now takes ~10k lines instead of 2k to go OOM)

Reproduction script:
```python
tokenizer_tinyllama = None

def tokenize(example, rank: int = 0):
    global tokenizer_tinyllama
    ...

def main():
    ...

if __name__ == "__main__":
    main()
```
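For what it's worth, the "one tokenizer per rank" workaround from the list above can be sketched as follows; the placeholder dicts stand in for real tokenizer objects, and `NUM_PROC = 16` matches the report:

```python
NUM_PROC = 16

# One placeholder "tokenizer" per worker rank, created up front in the
# parent process so each forked worker indexes only its own slot.
tokenizers = [{"rank": r} for r in range(NUM_PROC)]

def tokenize(example, rank: int = 0):
    tok = tokenizers[rank]          # pick this worker's instance
    example["worker"] = tok["rank"]
    return example

row = tokenize({"content": "x"}, rank=3)
```

Per the report, this did not stop the memory growth either.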
```
aiohttp==3.9.4 aiosignal==1.3.1 async-timeout==4.0.3 attrs==21.2.0 Automat==20.2.0 Babel==2.8.0 bcrypt==3.2.0 blinker==1.4 certifi==2020.6.20 chardet==4.0.0 click==8.0.3 cloud-init==23.4.4 colorama==0.4.4 command-not-found==0.3 configobj==5.0.6 constantly==15.1.0 cryptography==3.4.8 datasets==2.18.0 dbus-python==1.2.18 decorator==4.4.2 devscripts===2.22.1ubuntu1 dill==0.3.8 distro==1.7.0 distro-info==1.1+ubuntu0.2 filelock==3.13.4 frozenlist==1.4.1 fsspec==2024.2.0 gpg==1.16.0 hf_transfer==0.1.6 httplib2==0.20.2 huggingface-hub==0.22.2 hyperlink==21.0.0 idna==3.3 importlib-metadata==4.6.4 incremental==21.3.0 jeepney==0.7.1 Jinja2==3.0.3 jsonpatch==1.32 jsonpointer==2.0 jsonschema==3.2.0 keyring==23.5.0 launchpadlib==1.10.16 lazr.restfulclient==0.14.4 lazr.uri==1.0.6 MarkupSafe==2.0.1 more-itertools==8.10.0 multidict==6.0.5 multiprocess==0.70.16 netifaces==0.11.0 numpy==1.26.4 oauthlib==3.2.0 packaging==24.0 pandas==2.2.2 pexpect==4.8.0 protobuf==5.26.1 ptyprocess==0.7.0 pyarrow==15.0.2 pyarrow-hotfix==0.6 pyasn1==0.4.8 pyasn1-modules==0.2.1 PyGObject==3.42.1 PyHamcrest==2.0.2 PyJWT==2.3.0 pyOpenSSL==21.0.0 pyparsing==2.4.7 pyrsistent==0.18.1 pyserial==3.5 python-apt==2.4.0+ubuntu3 python-dateutil==2.9.0.post0 python-debian==0.1.43+ubuntu1.1 python-linux-procfs==0.6.3 python-magic==0.4.24 pytz==2022.1 pyudev==0.22.0 pyxdg==0.27 PyYAML==5.4.1 regex==2023.12.25 requests==2.25.1 safetensors==0.4.3 screen-resolution-extra==0.0.0 SecretStorage==3.3.1 sentencepiece==0.2.0 service-identity==18.1.0 six==1.16.0 sos==4.5.6 ssh-import-id==5.11 systemd-python==234 tokenizers==0.15.2 tqdm==4.66.2 transformers==4.39.3 Twisted==22.1.0 typing_extensions==4.11.0 tzdata==2024.1 ubuntu-advantage-tools==8001 ufw==0.36.1 unattended-upgrades==0.1 unidiff==0.5.5 urllib3==1.26.5 wadllib==1.3.6 xdg==5 xkit==0.0.0 xxhash==3.4.1 yarl==1.9.4 zipp==1.0.0 zope.interface==5.4.0
```