mitchellgordon95 opened this issue 2 years ago
Can you try putting <|endoftext|> into bad_words_list and printing bad_words_ids in end_to_end_test.py? You may not be passing the correct input to the function.
$ cp tools/end_to_end_test.py tools/end_to_end_test_2.py
$ vim tools/end_to_end_test_2.py
$ diff tools/end_to_end_test.py tools/end_to_end_test_2.py
138c138
< ["Hawks, Hawks"],
---
> ["<|endoftext|>"],
145a146
> print(f'bad_words_list: {bad_words_list}')
171c172
< print(output0, output1, output2)
---
> print(f'BAD_WORDS_IDS: {output3}')
$ python tools/end_to_end_test_2.py
bad_words_list: [['<|endoftext|>']
['']
['']
['']
['']
['']
['']
['']]
============After preprocessing============
BAD_WORDS_IDS: [[[ 27 91 437 1659 5239 91 29]
[ 7 -1 -1 -1 -1 -1 -1]]
[[ 0 0 0 0 0 0 0]
[ -1 -1 -1 -1 -1 -1 -1]]
[[ 0 0 0 0 0 0 0]
[ -1 -1 -1 -1 -1 -1 -1]]
[[ 0 0 0 0 0 0 0]
[ -1 -1 -1 -1 -1 -1 -1]]
[[ 0 0 0 0 0 0 0]
[ -1 -1 -1 -1 -1 -1 -1]]
[[ 0 0 0 0 0 0 0]
[ -1 -1 -1 -1 -1 -1 -1]]
[[ 0 0 0 0 0 0 0]
[ -1 -1 -1 -1 -1 -1 -1]]
[[ 0 0 0 0 0 0 0]
[ -1 -1 -1 -1 -1 -1 -1]]]
===========================================
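For context, the tensor printed above packs, per batch entry, one row of concatenated token ids (zero-padded) and one row of cumulative end offsets (padded with -1). A minimal sketch of that packing (the helper name is mine, not from the repo):

```python
import numpy as np

def pack_bad_words(word_token_ids, pad_to):
    # word_token_ids: list of token-id lists for one batch entry.
    ids = [t for word in word_token_ids for t in word]          # flattened ids
    offsets = list(np.cumsum([len(w) for w in word_token_ids]))  # end offset of each word
    ids += [0] * (pad_to - len(ids))        # pad ids row with 0
    offsets += [-1] * (pad_to - len(offsets))  # pad offsets row with -1
    return np.array([ids, offsets], dtype=np.int32)

# The seven tokens of <|endoftext|> mis-encoded as <| / endoftext / |>:
print(pack_bad_words([[27, 91, 437, 1659, 5239, 91, 29]], pad_to=7))
# ids row [27 91 437 1659 5239 91 29], offsets row [7 -1 -1 -1 -1 -1 -1],
# matching the first batch entry in the output above.
```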
The offending line is probably this: https://github.com/triton-inference-server/fastertransformer_backend/blob/main/all_models/gptj/preprocessing/1/utils/gpt_token_encoder.py#L91, but I don't understand what it does.
$ python
Python 3.6.9 (default, Dec 8 2021, 21:08:43)
[GCC 8.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import regex as re
>>> pat = re.compile(r"""'s|'t|'re|'ve|'m|'ll|'d| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+""")
>>> re.findall(pat, "<|endoftext|>")
['<|', 'endoftext', '|>']
Because the regex splits <|endoftext|> into three parts, we can never BPE-encode the whole string as a single token:
https://github.com/triton-inference-server/fastertransformer_backend/blob/main/all_models/gptj/preprocessing/1/utils/gpt_token_encoder.py#L138
Hi, @mitchellgordon95. Thank you for the feedback. As you say, there are some issues in the current converter. A simple workaround is to replace the current tokenizer with the Hugging Face tokenizer directly. For example, replace
def to_word_list_format(word_dict):
    tokenizer = get_tokenizer()
with
from pathlib import Path
from transformers import AutoTokenizer

def to_word_list_format(word_dict):
    cache_dir = Path(__file__).parent / ".cache"
    tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-j-6B", cache_dir=cache_dir)
We are also considering a better long-term solution for tokenization.
Is there a solution to this? I am unable to prevent GPT-J from generating <|endoftext|> tokens right now.
We had a workaround for this, but I no longer have access to the codebase where I was working on it. I believe we ended up editing the to_word_list_format function to special-case the <|endoftext|> string, manually adding 50256 to the list of banned tokens whenever EOT appears in the list.
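I don't have the original code anymore, but the special case looked roughly like this (the helper name and the stand-in encoder are illustrative, not from the actual codebase; 50256 is GPT-J's <|endoftext|> id):

```python
EOT_STRING = "<|endoftext|>"
EOT_ID = 50256  # GPT-2/GPT-J vocabulary id for <|endoftext|>

def encode_bad_word(word, encode):
    # Special-case the EOT string, which the pre-tokenizer regex would
    # otherwise split into <| / endoftext / |> and mis-encode.
    if word == EOT_STRING:
        return [EOT_ID]
    return encode(word)

# Stand-in encoder for ordinary words (placeholder, not a real BPE encoder):
fake_encoder = lambda w: [len(w)]
print(encode_bad_word("<|endoftext|>", fake_encoder))  # [50256]
```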
Description
Actual behavior: BPE merges seem to be working correctly. However, during pre-tokenization, <|endoftext|> is broken up into <|, endoftext, and |>, with merges being applied to each of the parts separately. This seems incorrect, if we're using the Hugging Face implementation as reference. I came across this bug trying to ban <|endoftext|> using the bad_words parameter.