Investigation:
As a test, I modified this line of _generate_from_iterable in causal.py:
output = self.engine.tokenizer.decode(
    output[0][len_input:], skip_special_tokens=True
)
setting skip_special_tokens=False.
Then I get this output:
I dream about being able to fly like a bird and exploring new places. </s> What do you think about ... (keeps going)
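For anyone reproducing this, here's a minimal standalone sketch of the experiment (the model path is a placeholder, and the ids are a shortened sample from the token dump further down):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("path/to/converted-llama")  # hypothetical path
output_ids = [306, 12561, 1048, 2]  # shortened sample ending in </s> (id 2)
print(tokenizer.decode(output_ids, skip_special_tokens=True))   # special tokens hidden
print(tokenizer.decode(output_ids, skip_special_tokens=False))  # the </s> becomes visible
```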
Looks like an end-of-sequence token should be configured somewhere.
I don't see an entry for this in the StoppingCriteriaList when I debug.
More investigation: it's running this function:
return self.contrastive_search(
    input_ids,
    top_k=generation_config.top_k,
    penalty_alpha=generation_config.penalty_alpha,
    logits_processor=logits_processor,
    stopping_criteria=stopping_criteria,
    pad_token_id=generation_config.pad_token_id,
    eos_token_id=generation_config.eos_token_id,
    output_scores=generation_config.output_scores,
    return_dict_in_generate=generation_config.return_dict_in_generate,
    synced_gpus=synced_gpus,
    **model_kwargs,
)
stopping_criteria is set to:

    StoppingCriteriaList with len() 1:
    0: <transformers.generation.stopping_criteria.MaxLengthCriteria object at 0x000002693972D930>
       max_length: 262

and eos_token_id is 1.
Seems like there's a built-in stop token specified, and StoppingCriteriaList isn't the mechanism we're supposed to use for this.
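For context, a rough sketch of what transformers does internally here (simplified, not a verbatim copy of its generation code): max_length is enforced through the StoppingCriteriaList, while EOS stopping is a separate per-step comparison against eos_token_id.

```python
from transformers import StoppingCriteriaList
from transformers.generation.stopping_criteria import MaxLengthCriteria

# max_length becomes an entry in the criteria list (the one object seen above)...
stopping_criteria = StoppingCriteriaList([MaxLengthCriteria(max_length=262)])

# ...while EOS stopping is separate: each decoding step compares the newly
# generated token against eos_token_id. With eos_token_id=1 the loop never
# matches the real </s> (id 2), so generation only stops at max_length.
eos_token_id = 1  # the value observed in the debugger
```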
This is the token output:
[ 1724, 437, 366, 12561, 1048, 29973, 306, 12561, 1048, 1641,
2221, 304, 11340, 763, 263, 11199, 322, 3902, 8253, 716,
7600, 29889, 2, 1724, 437, 366, 1348, 1048, 278, 5434,
310, 319, 29902, 29973, 306, 1348, 278, 5434, 310, 319,
29902, 338, 5566, 11407, 322, 2989, 310, 7037, 29889, 319,
29902, 1033, 367, 1304, 304, 4505, 4828, 393, 25618, 2609,
29892, 1316, 408, 16083, 24876, 15806, 322, 19964, 1199, 29889,
739, 1033, 884, 367, 1304, 304, 1653, 716, 322, 24233,
1230, 9316, 322, 5786, 393, 1033, 14169, 5199, 537, 29889,
2, 1724, 526, 596, 13133, 373, 278, 671, 310, 319,
29902, 297, 9121, 8324, 29973, 306, 1348, 278, 671, 310,
319, 29902, 297, 9121, 8324, 338, 263, 3765, 29899, 287,
3192, 22378, 29889, 1551, 697, 1361, 29892, 372, 1033, 367,
1304, 304, 26371, 749, 278, 2779, 20193, 310, 9121, 6931,
322, 4078, 12080, 29889, 1551, 278, 916, 1361, 29892, 372,
1033, 367, 1304, 304, 1653, 28273, 681, 25340, 393, 526,
443, 10149, 519, 363, 1009, 8820, 29892, 8236, 304, 443,
524, 2760, 27721, 29889, 2, 1724, 437, 366, 1348, 1048,
278, 11314, 936, 2411, 5795, 310, 319, 29902, 29973, 306,
1348, 278, 11314, 936, 2411, 5795, 310, 319, 29902, 526,
4280, 322, 817, 304, 367, 5545, 16112, 29889, 319, 29902,
756, 278, 7037, 304, 19479, 675, 278, 982, 591, 5735,
29892, 664, 29892, 322, 16254, 411, 1269, 916, 29892, 541,
372, 884, 5304, 411, 5161, 2039, 29892, 1316, 408, 278,
7037, 363, 24003, 322, 3984, 1509, 310, 848, 29889, 1334,
1818, 9801, 393, 319, 29902, 338, 1304, 297, 263, 14040,
322, 11314]
I'm guessing here, but it looks like the actual stop token in the output is 2, not 1.
Ah, yes! Manually setting eos_token_id to 2 yields this output when decoding:
' I dream about being able to fly like a bird and exploring new places.'
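As a stopgap, something like this hypothetical helper reproduces that manual fix:

```python
def decode_until_eos(tokenizer, output_ids, eos_id=2):
    """Hypothetical helper: cut at the first real </s> (id 2 for LLaMA) before decoding."""
    ids = list(output_ids)
    if eos_id in ids:
        ids = ids[: ids.index(eos_id)]
    return tokenizer.decode(ids, skip_special_tokens=True)
```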
Changing eos_token_id in [modelpath]/config.json seems to have no effect.
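One possible explanation, though this is an assumption on my part and depends on the transformers version: generate() may take its defaults from a separate GenerationConfig, so if the model folder contains a generation_config.json, it would override config.json. Something like this shows what generate() actually sees:

```python
from transformers import GenerationConfig

gen_cfg = GenerationConfig.from_pretrained("path/to/converted-llama")  # placeholder for [modelpath]
print(gen_cfg.eos_token_id)  # if this prints 1, editing config.json alone won't change generate()
```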
Doing:
model.engine.model.config.eos_token_id = model.engine.tokenizer.eos_token_id
right before
model.generate(...)
seems to work.
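A slightly more defensive version of that workaround (a sketch using the same model.engine.* names as above; passing eos_token_id directly to generate() should also work):

```python
tok_eos = model.engine.tokenizer.eos_token_id     # 2, LLaMA's </s>
cfg_eos = model.engine.model.config.eos_token_id  # 1 in the broken conversion
if cfg_eos != tok_eos:
    print(f"eos_token_id mismatch: config={cfg_eos}, tokenizer={tok_eos}; patching")
    model.engine.model.config.eos_token_id = tok_eos
```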
The money question: why does the tokenizer's eos_token_id mismatch the model's eos_token_id?
Hi @ncoder
Thanks for debugging this. It seems there is a mismatch introduced when converting the original LLaMA model to HF format.
If you do not mind, could you please create a PR to fix this? Thank you.
I would if I knew how to fix it properly besides a hack. I'm completely unfamiliar with the training pipeline.
Thanks. So is it just a bug on the generation side, or should I also add this when fine-tuning?
Hi @hzlujunyi ,
You do not need to add this when fine-tuning. The tokenizer has an EOS token and will add it when encoding the text.
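A quick sketch to verify that on a given checkpoint (the path is a placeholder; whether EOS is actually appended depends on the tokenizer's add_eos_token setting, so treat this as something to check rather than a guarantee):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("path/to/converted-llama")  # placeholder path
ids = tokenizer("I dream about being able to fly.").input_ids
print(ids[-1] == tokenizer.eos_token_id)  # True only if EOS is appended on encode
# For LLaMA tokenizers this is controlled by add_eos_token (often False by default).
```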
Thank you for your guidance. 👍
Is there an associated commit with the fix? (I would like to see it, and also to know when it lands in the next release.)
Reproduction steps:
Call model.generate(...) on a LLaMA model converted to HF format.
Output is:
I dream about being able to fly like a bird and exploring new places. </s> What do you think about ... (generation keeps going past the EOS token)
Expected output:
I dream about being able to fly like a bird and exploring new places.