hiyouga / LLaMA-Factory

A WebUI for Efficient Fine-Tuning of 100+ LLMs (ACL 2024)
https://arxiv.org/abs/2403.13372
Apache License 2.0

Trainer is adding a blank character before response. #896

Closed xzuyn closed 8 months ago

xzuyn commented 10 months ago

I'm trying to do some SFT, but whenever I train, a space or some other blank character gets inserted right before the response. It happens with every template I tried, such as Vicuna and Alpaca. With Alpaca it looks like this:

### Instruction:
This is a test instruction.

### Response:
 This is a test response.

Or like this with Vicuna:

USER: This is a test message. ASSISTANT:  This is a test response.

I tried looking through template.py and preprocess.py, but I'm unsure how to fix it.

hiyouga commented 10 months ago

The llama tokenizer adds a prefix space before the sequences. https://github.com/huggingface/transformers/blob/main/src/transformers/models/llama/tokenization_llama.py

xzuyn commented 10 months ago

Quoting hiyouga: "The llama tokenizer adds a prefix space before the sequences." https://github.com/huggingface/transformers/blob/main/src/transformers/models/llama/tokenization_llama.py

What specifically adds that?

Also I don't have this issue with other trainers like LLaMa-LoRA-Tuner or Zeus-LLM-Trainer. It also doesn't seem to happen when training DPO.

As far as I can tell, it's only happening with SFT, and the unwanted space is only on the response.

hiyouga commented 10 months ago

The space may come from these pad tokens, and it will not be added in training. I will fix this problem.

https://github.com/hiyouga/LLaMA-Efficient-Tuning/blob/8857e4560219c4052bdb7c7dc1a014a5f5fd0163/src/llmtuner/dsets/preprocess.py#L143-L145

xzuyn commented 10 months ago

Quoting hiyouga: "it will not be added in training"

When generating results from the trained SFT QLoRA to use as DPO data, I could see it generating that space, so it is being trained on. So maybe it's not a pad token but an actual space.

hiyouga commented 10 months ago

There is currently no prefix space in supervised fine-tuning. Please update your code.

xzuyn commented 10 months ago

But my code is your code. Using the latest commit it inserts something.

input_ids:
[1, 13866, 338, 385, 15278, 393, 16612, 263, 3414, 29889, 14350, 263, 2933, 393, 7128, 2486, 1614, 2167, 278, 2009, 29889, 29871, 13, 13, 835, 2799, 4080, 29901, 13, 9314, 14816, 29903, 6778, 3492, 526, 385, 319, 29902, 1904, 1058, 4266, 7093, 297, 17558, 29889, 887, 1101, 11994, 29892, 694, 4383, 278, 4967, 19423, 829, 14816, 29903, 6778, 13, 13, 5328, 1258, 278, 10122, 310, 4628, 26532, 297, 278, 4688, 19859, 6602, 278, 12409, 322, 4978, 310, 25494, 322, 6501, 4383, 29973, 13, 13, 2277, 29937, 13291, 29901, 13, 450, 10122, 310, 4628, 26532, 297, 278, 4688, 19859, 5318, 263, 7282, 6297, 297, 278, 12409, 322, 4978, 310, 25494, 322, 6501, 4383, 29889, 6054, 26532, 29892, 10734, 2428, 25379, 573, 4628, 26532, 29892, 526, 13112, 304, 505, 8429, 297, 278, 4688, 19859, 515, 278, 24382, 310, 20364, 10489, 27091, 470, 278, 2778, 5743, 310, 7968, 4628, 26532, 29889, 4525, 4628, 26532, 28482, 278, 12409, 322, 14675, 310, 25494, 322, 6501, 4383, 297, 3196, 5837, 29901, 13, 13, 29896, 29889, 4989, 29894, 277, 1288, 1098, 13857, 29901, 6054, 26532, 29892, 1641, 20364, 3618, 29892, 429, 814, 263, 4549, 26618, 1288, 8206, 373, 1009, 8388, 618, 886, 29889, 910, 26618, 1288, 1098, 13857, 9213, 11705, 10489, 29892, 19786, 29892, 322, 6501, 4383, 29892, 10201, 8236, 304, 278, 12409, 310, 25494, 2820, 1438, 4628, 26532, 29889, 450, 4628, 26532, 13674, 27320, 408, 409, 5779, 363, 15400, 29891, 12409, 29889, 13, 13, 29906, 29889, 4831, 2267, 291, 322, 16705, 29901, 1094, 4383, 20074, 964, 263, 4628, 16188, 29892, 372, 7190, 385, 1035, 2267, 291, 8086, 2820, 278, 4628, 16188, 29889, 910, 1889, 27474, 263, 14586, 355, 681, 5253, 310, 5864, 297, 278, 883, 310, 27310, 29892, 607, 508, 12871, 701, 322, 16346, 675, 278, 18830, 10489, 29889, 910, 27310, 508, 884, 7899, 13988, 8805, 29879, 322, 432, 1691, 29892, 607, 508, 13031, 10489, 322, 19786, 714, 310, 278, 6555, 12786, 310, 278, 25391, 15400, 29891, 29889, 910, 16705, 1889, 508, 1072, 5987, 278, 14321, 310, 25494, 322, 278, 4978, 310, 
6501, 4383, 491, 4046, 292, 278, 5253, 310, 10489, 3625, 363, 5810, 12409, 29889, 13, 13, 29941, 29889, 4702, 5743, 322, 22060, 29901, 6054, 26532, 508, 6548, 491, 2778, 3460, 411, 916, 4628, 26532, 470, 491, 1035, 276, 1259, 4158, 515, 1009, 8388, 618, 886, 29889, 4525, 2778, 5743, 322, 22060, 508, 4556, 25494, 304, 10366, 408, 1532, 29892, 8236, 304, 278, 12409, 310, 7200, 25494, 322, 278, 2654, 391, 3224, 310, 6501, 4383, 2629, 963, 29889, 450, 2778, 3460, 310, 4628, 26532, 508, 884, 7738, 26618, 1288, 20037, 29892, 607, 508, 8677, 3448, 5864, 322, 6401, 19399, 29892, 6602, 292, 278, 19753, 310, 278, 15400, 29891, 322, 967, 6501, 4383, 4978, 29889, 13, 13, 29946, 29889, 15317, 4383, 8870, 359, 29901, 15317, 4383, 338, 13112, 304, 883, 8870, 359, 2820, 25494, 29892, 13138, 278, 26618, 1288, 885, 3470, 1025, 292, 363, 15400, 29891, 12409, 29889, 450, 10122, 310, 4628, 26532, 297, 278, 4688, 19859, 1033, 505, 28482, 278, 4978, 310, 6501, 4383, 491, 13978, 292, 372, 304, 1009, 9467, 13593, 29892, 4550, 528, 21430, 278, 6501, 4383, 8870, 359, 322, 6602, 292, 278, 12463, 4978, 310, 25494, 297, 278, 19859, 29889, 13, 13, 797, 15837, 29892, 4628, 26532, 297, 278, 4688, 19859, 5318, 263, 7618, 1455, 6297, 297, 278, 12409, 322, 4978, 310, 25494, 322, 6501, 4383, 29889, 11275, 26618, 1288, 9949, 29892, 1035, 2267, 291, 10174, 29892, 2778, 5743, 29892, 322, 22060, 411, 6501, 4383, 599, 26869, 304, 528, 21430, 278, 2919, 29899, 7052, 3829, 310, 278, 19859, 591, 14111, 9826, 29889, 2]
inputs:
<s> Below is an instruction that describes a task. Write a response that appropriately completes the request. 

 ### Instruction:
<<SYS>>You are an AI model who specializes in physics. You follow instructions, no matter the subject.<</SYS>>

How did the presence of black holes in the early universe affect the formation and distribution of galaxies and dark matter?

### Response:
 The presence of black holes in the early universe played a significant role in the formation and distribution of galaxies and dark matter. Black holes, particularly supermassive black holes, are believed to have formed in the early universe from the collapse of massive gas clouds or the mergers of smaller black holes. These black holes influenced the formation and evolution of galaxies and dark matter in several ways:

1. Gravitational attraction: Black holes, being massive objects, exert a strong gravitational pull on their surroundings. This gravitational attraction helped gather gas, dust, and dark matter, eventually leading to the formation of galaxies around these black holes. The black holes essentially acted as seeds for galaxy formation.

2. Accretion and feedback: As matter falls into a black hole, it forms an accretion disk around the black hole. This process releases a tremendous amount of energy in the form of radiation, which can heat up and ionize the surrounding gas. This radiation can also drive powerful winds and jets, which can blow gas and dust out of the central regions of the forming galaxy. This feedback process can regulate the growth of galaxies and the distribution of dark matter by limiting the amount of gas available for star formation.

3. Mergers and interactions: Black holes can grow by merging with other black holes or by accreting mass from their surroundings. These mergers and interactions can cause galaxies to merge as well, leading to the formation of larger galaxies and the redistribution of dark matter within them. The merging of black holes can also produce gravitational waves, which can carry away energy and angular momentum, affecting the dynamics of the galaxy and its dark matter distribution.

4. Dark matter halos: Dark matter is believed to form halos around galaxies, providing the gravitational scaffolding for galaxy formation. The presence of black holes in the early universe could have influenced the distribution of dark matter by attracting it to their vicinity, thus shaping the dark matter halos and affecting the overall distribution of galaxies in the universe.

In summary, black holes in the early universe played a crucial role in the formation and distribution of galaxies and dark matter. Their gravitational influence, accretion processes, mergers, and interactions with dark matter all contributed to shaping the large-scale structure of the universe we observe today.</s>
label_ids:
[-100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, 450, 10122, 310, 4628, 26532, 297, 278, 4688, 19859, 5318, 263, 7282, 6297, 297, 278, 12409, 322, 4978, 310, 25494, 322, 6501, 4383, 29889, 6054, 26532, 29892, 10734, 2428, 25379, 573, 4628, 26532, 29892, 526, 13112, 304, 505, 8429, 297, 278, 4688, 19859, 515, 278, 24382, 310, 20364, 10489, 27091, 470, 278, 2778, 5743, 310, 7968, 4628, 26532, 29889, 4525, 4628, 26532, 28482, 278, 12409, 322, 14675, 310, 25494, 322, 6501, 4383, 297, 3196, 5837, 29901, 13, 13, 29896, 29889, 4989, 29894, 277, 1288, 1098, 13857, 29901, 6054, 26532, 29892, 1641, 20364, 3618, 29892, 429, 814, 263, 4549, 26618, 1288, 8206, 373, 1009, 8388, 618, 886, 29889, 910, 26618, 1288, 1098, 13857, 9213, 11705, 10489, 29892, 19786, 29892, 322, 6501, 4383, 29892, 10201, 8236, 304, 278, 12409, 310, 25494, 2820, 1438, 4628, 26532, 29889, 450, 4628, 26532, 13674, 27320, 408, 409, 5779, 363, 15400, 29891, 12409, 29889, 13, 13, 29906, 29889, 4831, 2267, 291, 322, 16705, 29901, 1094, 4383, 20074, 964, 263, 4628, 16188, 29892, 372, 7190, 385, 1035, 2267, 291, 8086, 2820, 278, 4628, 16188, 29889, 910, 1889, 27474, 263, 14586, 355, 681, 5253, 310, 5864, 297, 278, 883, 310, 27310, 29892, 607, 508, 12871, 701, 322, 16346, 675, 278, 18830, 10489, 29889, 910, 27310, 508, 884, 7899, 13988, 8805, 29879, 322, 432, 1691, 29892, 607, 508, 13031, 10489, 322, 19786, 714, 310, 278, 6555, 12786, 310, 278, 25391, 15400, 29891, 29889, 910, 16705, 1889, 508, 1072, 5987, 278, 14321, 310, 25494, 322, 
278, 4978, 310, 6501, 4383, 491, 4046, 292, 278, 5253, 310, 10489, 3625, 363, 5810, 12409, 29889, 13, 13, 29941, 29889, 4702, 5743, 322, 22060, 29901, 6054, 26532, 508, 6548, 491, 2778, 3460, 411, 916, 4628, 26532, 470, 491, 1035, 276, 1259, 4158, 515, 1009, 8388, 618, 886, 29889, 4525, 2778, 5743, 322, 22060, 508, 4556, 25494, 304, 10366, 408, 1532, 29892, 8236, 304, 278, 12409, 310, 7200, 25494, 322, 278, 2654, 391, 3224, 310, 6501, 4383, 2629, 963, 29889, 450, 2778, 3460, 310, 4628, 26532, 508, 884, 7738, 26618, 1288, 20037, 29892, 607, 508, 8677, 3448, 5864, 322, 6401, 19399, 29892, 6602, 292, 278, 19753, 310, 278, 15400, 29891, 322, 967, 6501, 4383, 4978, 29889, 13, 13, 29946, 29889, 15317, 4383, 8870, 359, 29901, 15317, 4383, 338, 13112, 304, 883, 8870, 359, 2820, 25494, 29892, 13138, 278, 26618, 1288, 885, 3470, 1025, 292, 363, 15400, 29891, 12409, 29889, 450, 10122, 310, 4628, 26532, 297, 278, 4688, 19859, 1033, 505, 28482, 278, 4978, 310, 6501, 4383, 491, 13978, 292, 372, 304, 1009, 9467, 13593, 29892, 4550, 528, 21430, 278, 6501, 4383, 8870, 359, 322, 6602, 292, 278, 12463, 4978, 310, 25494, 297, 278, 19859, 29889, 13, 13, 797, 15837, 29892, 4628, 26532, 297, 278, 4688, 19859, 5318, 263, 7618, 1455, 6297, 297, 278, 12409, 322, 4978, 310, 25494, 322, 6501, 4383, 29889, 11275, 26618, 1288, 9949, 29892, 1035, 2267, 291, 10174, 29892, 2778, 5743, 29892, 322, 22060, 411, 6501, 4383, 599, 26869, 304, 528, 21430, 278, 2919, 29899, 7052, 3829, 310, 278, 19859, 591, 14111, 9826, 29889, 2]
labels:
The presence of black holes in the early universe played a significant role in the formation and distribution of galaxies and dark matter. Black holes, particularly supermassive black holes, are believed to have formed in the early universe from the collapse of massive gas clouds or the mergers of smaller black holes. These black holes influenced the formation and evolution of galaxies and dark matter in several ways:

1. Gravitational attraction: Black holes, being massive objects, exert a strong gravitational pull on their surroundings. This gravitational attraction helped gather gas, dust, and dark matter, eventually leading to the formation of galaxies around these black holes. The black holes essentially acted as seeds for galaxy formation.

2. Accretion and feedback: As matter falls into a black hole, it forms an accretion disk around the black hole. This process releases a tremendous amount of energy in the form of radiation, which can heat up and ionize the surrounding gas. This radiation can also drive powerful winds and jets, which can blow gas and dust out of the central regions of the forming galaxy. This feedback process can regulate the growth of galaxies and the distribution of dark matter by limiting the amount of gas available for star formation.

3. Mergers and interactions: Black holes can grow by merging with other black holes or by accreting mass from their surroundings. These mergers and interactions can cause galaxies to merge as well, leading to the formation of larger galaxies and the redistribution of dark matter within them. The merging of black holes can also produce gravitational waves, which can carry away energy and angular momentum, affecting the dynamics of the galaxy and its dark matter distribution.

4. Dark matter halos: Dark matter is believed to form halos around galaxies, providing the gravitational scaffolding for galaxy formation. The presence of black holes in the early universe could have influenced the distribution of dark matter by attracting it to their vicinity, thus shaping the dark matter halos and affecting the overall distribution of galaxies in the universe.

In summary, black holes in the early universe played a crucial role in the formation and distribution of galaxies and dark matter. Their gravitational influence, accretion processes, mergers, and interactions with dark matter all contributed to shaping the large-scale structure of the universe we observe today.</s>

It seems to add a space before ### Instruction:, and after ### Response:\n
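A minimal sketch of how the label_ids in the dump above relate to input_ids, assuming the usual convention that prompt positions are masked with -100 so the loss skips them. The helper name build_labels is hypothetical and this is not the code in preprocess.py; the token IDs are copied from the dump.

```python
IGNORE_INDEX = -100  # label value skipped by the loss; matches the -100 entries in the dump


def build_labels(prompt_ids, response_ids, ignore_index=IGNORE_INDEX):
    """Hypothetical helper: mask every prompt position, supervise every response position."""
    input_ids = list(prompt_ids) + list(response_ids)
    labels = [ignore_index] * len(prompt_ids) + list(response_ids)
    return input_ids, labels


# Tail of the prompt and head of the response, copied from the dump above.
prompt_tail = [2277, 29937, 13291, 29901, 13]  # the tokens right before the response
response_head = [450, 10122, 310]              # first tokens of the response

input_ids, labels = build_labels(prompt_tail, response_head)
print(input_ids)
print(labels)
```

Under this scheme the first supervised label is the first response token (450 in the dump), so whatever leading marker that token carries is part of what the model learns to emit.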

xzuyn commented 10 months ago

It seems the space is SPIECE_UNDERLINE or a token that is combined with it.

hiyouga commented 10 months ago

Sure, it is a SPIECE_UNDERLINE

xzuyn commented 10 months ago

I was wrong about it not happening with DPO training. Everything seems to be getting encoded with the underline. It seems to happen before each part of the template (prefix, prompt, system) too.

Am I meant to just ignore it? Is that just normal behaviour?

xzuyn commented 9 months ago

@hiyouga This is still an issue. Training results in exactly what I'm expecting, except the model still spits out a space, because that's what is being encoded and trained on. (This is not a problem with my inference setup: it carries over into GGUF with KoboldCPP, where I don't see this issue with other models.)

<|USER|>What is the capital of Canada?<|MODEL|> The capital of Canada is Ottawa.

The underline tokens need to get converted over to the equivalent token or tokens without the underline before being used for training.

For example, the sample above encodes the start of the prompt as token ID 13866 (▁Below), but it should use token IDs 21140 and 340 (Bel and ow).
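To make the difference concrete, here is a toy sketch that uses only the three token IDs quoted above. The vocabulary dict and the pieces_to_text helper are hypothetical stand-ins, not the real tokenizer; the sketch only shows that a piece starting with the underline marker decodes with a leading space while the marker-free pieces do not.

```python
SPIECE_UNDERLINE = "\u2581"  # the "▁" word-boundary marker

# Toy vocabulary limited to the three IDs quoted in the comment above.
TOY_VOCAB = {
    13866: SPIECE_UNDERLINE + "Below",
    21140: "Bel",
    340: "ow",
}


def pieces_to_text(token_ids, vocab=TOY_VOCAB):
    """Join the pieces and turn every underline marker into a plain space."""
    return "".join(vocab[i] for i in token_ids).replace(SPIECE_UNDERLINE, " ")


with_marker = pieces_to_text([13866])
without_marker = pieces_to_text([21140, 340])
print(repr(with_marker), repr(without_marker))
```

Unlike this toy, a real decoder may strip the space at the very start of a sequence, which is why the marker is easy to miss when decoding a single piece on its own.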

Because of this, all my samples also start with <s> followed by a space (really the underline from the next mis-encoded word) instead of just <s>.

hiyouga commented 9 months ago

We are working on this problem.

xzuyn commented 9 months ago

I think this could be solved if add_prefix_space=False is added somewhere within the tokenizer, but I cannot figure it out yet.

hiyouga commented 9 months ago

The LLaMA tokenizer does not support this option: https://github.com/huggingface/transformers/blob/21dc5859421cf0d7d82d374b10f533611745a8c5/src/transformers/models/llama/tokenization_llama.py#L259

huangqingyi-code commented 8 months ago

@hiyouga Hi, I checked and this space is still there. How can I fix it?

hiyouga commented 8 months ago

@huangqingyi-code It does not affect training; you can ignore it.

huangqingyi-code commented 8 months ago

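I'm using the vicuna template: ' USER: XXX ASSISTANT: XXX'. There is a space in front of USER. 1. What causes this? 2. Does it affect inference? I use vLLM for inference and have to assemble the template myself. Do I also need to add a space in front of USER?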

hiyouga commented 8 months ago

@huangqingyi-code It has no effect; you don't need to add the space.

huangqingyi-code commented 8 months ago

@hiyouga Sorry, one more question: the vicuna template used to have a space after ASSISTANT. Why is it gone now? "USER: {{query}} ASSISTANT:"

huangqingyi-code commented 8 months ago

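@hiyouga input_ids = system_ids+query_ids+respond_ids. When system_ids is decoded on its own, the last character is not a space; when query_ids is decoded on its own, the first character is not a space. But when system_ids+query_ids are decoded together, an extra space appears between system and query. What causes this?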

hiyouga commented 8 months ago

@huangqingyi-code The vicuna template should not end with a space: https://github.com/lm-sys/FastChat/blob/aeec0e002bffe38f490be4d7360f42ef2d90ae45/fastchat/conversation.py#L77
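A minimal sketch of assembling one vicuna-style turn by hand, for example before sending a prompt to an inference server. It covers only the "USER: {{query}} ASSISTANT:" fragment quoted in this thread; the function name build_vicuna_turn is hypothetical, and any system prompt or multi-turn separators are left out.

```python
def build_vicuna_turn(query):
    """Format one user turn; the string ends at the colon, with no trailing space."""
    return "USER: {} ASSISTANT:".format(query)


prompt = build_vicuna_turn("What is the capital of Canada?")
print(repr(prompt))
```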

hiyouga commented 8 months ago

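@huangqingyi-code After encoding, system_ids and query_ids each get a prefix space at the front. Decoding either one on its own removes that space, but when they are decoded together the prefix space of query_ids is not removed, so it looks like there is one extra space.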
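The effect described here can be reproduced with a toy stand-in for the tokenizer. The functions toy_encode and toy_decode below are hypothetical and only mimic two behaviours assumed from this thread: encoding puts a prefix marker in front of each sequence, and decoding drops only the space at the very start of the sequence it is given.

```python
SPIECE_UNDERLINE = "\u2581"  # the "▁" word-boundary marker


def toy_encode(text):
    """Prepend a space, mark every space with the underline, split into word pieces."""
    marked = (" " + text).replace(" ", SPIECE_UNDERLINE)
    # Every chunk after the first (empty) one started with a marker; put it back.
    return [SPIECE_UNDERLINE + chunk for chunk in marked.split(SPIECE_UNDERLINE)[1:]]


def toy_decode(pieces):
    """Turn markers back into spaces and drop only the space at the very start."""
    text = "".join(pieces).replace(SPIECE_UNDERLINE, " ")
    return text[1:] if text.startswith(" ") else text


system_pieces = toy_encode("A chat.")
query_pieces = toy_encode("USER: hi")

alone_system = toy_decode(system_pieces)
alone_query = toy_decode(query_pieces)
together = toy_decode(system_pieces + query_pieces)
print(repr(alone_system), repr(alone_query), repr(together))
```

Decoded separately, neither string shows a space at its edge; decoded together, the prefix marker of the second sequence survives as the space between "chat." and "USER:".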

huangqingyi-code commented 8 months ago

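@hiyouga So there is no space during training, and it only shows up when the IDs are decoded for inspection, which means it has no effect? And at inference time I can simply assemble the prompt according to the template, right?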