Holistic Evaluation of Language Models (HELM), a framework to increase the transparency of language models (https://arxiv.org/abs/2211.09110). This framework is also used to evaluate text-to-image models in HEIM (https://arxiv.org/abs/2311.04287) and vision-language models in VHELM (https://arxiv.org/abs/2410.07112).
When running with the example run_spec `mmlu:subject=anatomy,model=openai/gpt2` and no caching, the HuggingFace client outputs the following warning on every call:

```
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation
```
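For reference, the warning can be reproduced outside HELM with a plain `transformers` generate call. A minimal sketch, assuming `transformers` is installed and the GPT-2 weights are available (the prompt here is arbitrary):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

# GPT-2 defines no pad token, so generate() falls back to eos_token_id
# and emits the warning above on every call.
inputs = tokenizer("The anatomy of the human heart", return_tensors="pt")
output = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(output[0]))
```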
This is annoying, but I'm not sure if it hurts anything. I think a fix could be to pass `pad_token_id` to `generate` with the value `tokenizer.eos_token_id`, as recommended in this stackoverflow. However, I'm not sure what impacts, if any, this could have, so I'm going to leave it to someone else to make it.
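For concreteness, a minimal sketch of the proposed change; the prompt and the direct `model.generate` call are illustrative, not the actual HELM client code:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
inputs = tokenizer("The anatomy of the human heart", return_tensors="pt")

# Same call as before, but with pad_token_id passed explicitly.
# generate() falls back to eos_token_id anyway, so the output should be
# unchanged; the per-call warning just goes away.
output = model.generate(
    **inputs,
    max_new_tokens=20,
    pad_token_id=tokenizer.eos_token_id,
)
print(tokenizer.decode(output[0]))
```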