pretrain data format is a little bit similar to the sft stage

shibing624 / MedicalGPT

MedicalGPT: Training Your Own Medical GPT Model with ChatGPT Training Pipeline. Trains medical large language models, implementing incremental pretraining (PT), supervised fine-tuning (SFT), RLHF, DPO, and ORPO.

Apache License 2.0

pretrain data format is a little bit similar to the sft stage #61

Closed chlinfeng1997 closed 1 year ago

chlinfeng1997 commented 1 year ago

Describe the Question

I notice that the pretraining data downloaded from the Hub is organized as JSON, with each line having a text field that contains a question and its answer. Is the purpose to make the pretrained model output a correctly formatted response directly at inference time? Why use question-answer pairs as input rather than document form, where each line is descriptive text?

shibing624 commented 1 year ago

1. If you want to use train_file_dir, you can manually download the HF dataset and parse the JSON into txt format yourself.

The HF Hub dataset is uploaded as JSON to keep the format uniform.
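
The conversion described above can be sketched roughly as follows. This is a minimal sketch, not the repository's own script: the function name `jsonl_to_txt` is hypothetical, and it assumes the downloaded file has one JSON object per line with a `text` field, as described in the question.

```python
import json


def jsonl_to_txt(jsonl_path, txt_path, field="text"):
    """Write the `field` value of each JSON line as one line of a txt file.

    Assumes one JSON object per line; `field` defaults to "text".
    Returns the number of lines written.
    """
    count = 0
    with open(jsonl_path, encoding="utf-8") as src, \
            open(txt_path, "w", encoding="utf-8") as dst:
        for line in src:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            record = json.loads(line)
            # Flatten embedded newlines so each record stays on one line.
            text = str(record.get(field, "")).replace("\n", " ").strip()
            if text:
                dst.write(text + "\n")
                count += 1
    return count
```

The resulting txt file can then be placed in the directory passed as train_file_dir.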

chlinfeng1997 commented 1 year ago

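Hello, perhaps my question was not clear. My question is why the medical pretraining data is organized so that each line is <question + answer>, for example XXXX?YYYYY, as shown in this screenshot: 1688436886753

rather than each line simply being medical text, as shown here: 1688437231080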

shibing624 commented 1 year ago

Either works. I used data from a medical Q&A encyclopedia, and that is the format it comes in.

chlinfeng1997 commented 1 year ago

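Another question: training splits the data into sequences by block_size. When I run inference with the resulting pretrained model, the output does not cut itself off and keeps going, whereas inference with the original baichuan7B weights stops by itself, with eos as the last token. For a pretrained model, is it better to stop on its own or to keep generating? Also, how does a model learn to emit the stop token: by manually appending a stop token to the end of each line during training, or by some other method?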

shibing624 commented 1 year ago

It stops once it reaches max length.
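
The two stop conditions discussed here (an eos token or the length cap) can be illustrated with a toy decoding loop. This is only a sketch: `generate` and `next_token_fn` are hypothetical stand-ins for a real model's sampling step, not the API of any library.

```python
def generate(next_token_fn, prompt_ids, max_length, eos_token_id=None):
    """Toy decoding loop: append tokens until eos is produced or max_length is hit.

    next_token_fn is a hypothetical callable mapping the current id list
    to the next token id.
    """
    ids = list(prompt_ids)
    while len(ids) < max_length:
        token = next_token_fn(ids)
        ids.append(token)
        if eos_token_id is not None and token == eos_token_id:
            break  # the model itself produced the stop token
    return ids


# A model that never emits eos only stops at the length cap.
never_stops = generate(lambda ids: 5, [1], max_length=8, eos_token_id=2)

# A model that has learned to emit eos stops early.
stops_early = generate(lambda ids: 2 if len(ids) >= 3 else 5, [1],
                       max_length=8, eos_token_id=2)
```

In this sketch the first call runs to the full 8 tokens, while the second ends as soon as token id 2 is produced.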

hardfish82 commented 1 year ago

ChatGLM2 has no eos_token after loading; I see that one is added explicitly during fine-tuning.

tokenizer.add_special_tokens({
    "eos_token": "</s>",
    "bos_token": "<sop>",
    "unk_token": "<unk>",
})

However, when I run inference with the fine-tuned model, it still does not stop automatically when it hits eos_token. What is the reason?

shibing624 commented 1 year ago

https://github.com/shibing624/MedicalGPT/commit/7dc86561b553a160641359b28b871f1b5b54fd28

valkryhx commented 1 year ago

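I also checked chatglm2's tokenizer.eos_token_id yesterday. It does have a value, 2, but it seems to have no effect? I also appended it to the end of the pretraining texts, but after continued pretraining the output is the same: it does not stop until it reaches the configured max length.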

hardfish82 commented 1 year ago

7dc8656

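Hello, I looked at this change but do not understand why it would fix this bug. At inference time I still get a large number of [token missing in the original text], and I can only intercept them by adding a split myself. By the way, could you explain why ChatGLM2's special tokens are set up so strangely? I find them rather confusing.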
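
The manual "split" workaround mentioned above could look roughly like this. It is a sketch under an assumption: the unwanted marker is taken to be the literal string `</s>` from the tokenizer settings quoted earlier in the thread, and `truncate_at_stop` is a hypothetical helper name.

```python
def truncate_at_stop(text, stop_strings=("</s>",)):
    """Cut generated text at the first occurrence of any stop string.

    The default stop string "</s>" is an assumption for illustration.
    """
    cut = len(text)
    for stop in stop_strings:
        idx = text.find(stop)
        if idx != -1:
            cut = min(cut, idx)  # keep the earliest stop position
    return text[:cut]


cleaned = truncate_at_stop("The answer is rest and fluids.</s></s>extra text</s>")
```

This only hides the symptom in the decoded text; the model still generates up to the length limit.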

hardfish82 commented 1 year ago

I keep feeling that ChatGLM2 has hidden pitfalls in its special tokens; it is hard to make sense of them.

shibing624 commented 1 year ago

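ChatGLM2-6b has already been through SFT intent alignment, so doing PT on it again is not recommended; I suggest using it directly for SFT on domain data.
Regarding special tokens, ChatGLM2-6b already has pad_token and eos_token added, and an SFT model will stop automatically. When you do PT, the next word prediction task is run over sequences of the maximum block size length, which loses the pad token, so generation does not stop and keeps going until max length.
add_special_tokens was added at the time because chatglm2-6b had a token bug when it was first released. The official release has since fixed it, so that code no longer has any effect.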
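
To make the block-size point concrete, here is a minimal sketch of the common concatenate-and-chunk step used in causal LM pretraining. It is not taken from the repository: `group_texts` here is a simplified, hypothetical version operating on plain lists of token ids, and the optional `eos_token_id` argument is added only to show the difference a document boundary marker makes.

```python
def group_texts(tokenized_docs, block_size, eos_token_id=None):
    """Concatenate tokenized documents and cut the stream into fixed-size blocks.

    If eos_token_id is given, it is appended after every document so the
    model sees a boundary marker; otherwise documents run together.
    The trailing remainder shorter than block_size is dropped.
    """
    stream = []
    for doc in tokenized_docs:
        stream.extend(doc)
        if eos_token_id is not None:
            stream.append(eos_token_id)
    usable = (len(stream) // block_size) * block_size
    return [stream[i:i + block_size] for i in range(0, usable, block_size)]


docs = [[10, 11, 12], [20, 21], [30, 31, 32, 33]]
without_eos = group_texts(docs, block_size=4)
with_eos = group_texts(docs, block_size=4, eos_token_id=2)
```

Without a boundary marker, no block contains a stop token, so under this setup the model has nothing to learn about where a text ends; with one, each document end appears in the training stream.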

valkryhx commented 1 year ago

Thank you for the explanation!