@Louis-y-nlp : MPT was trained with the bf16 datatype, and that's what you should use for inference. I tried it with bf16 and don't see garbled text.
sents = ['Explain to me the difference between nuclear fission and fusion.']
print(generate_text(sents)[0])
"""
Nuclear reactions are processes by which atomic nuclei undergo changes in their internal structure, resulting from either collisions or through controlled manipulation of energy released during these interactions with other particles such as photons (light) emitted when electrons move away due radioactive decay within atoms' nucleus after being excited into higher orbits around its core
"""
Thank you for your reply. However, it doesn't work for me; I still get garbled outputs, especially when I prompt the model to generate non-English characters such as Chinese. The same prompts work fine with Hugging Face's inference code.
The default parameter repetition_penalty=5 is unreasonable. By reducing it to around 1.1, I obtain the same results as torch. I will close this issue. Thanks again.
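
For anyone hitting the same problem, a hedged sketch of the fix, written with HF-style generate arguments (the FT example script may expose this parameter under a different name):

out = model.generate(
    **inputs,
    max_new_tokens=256,
    repetition_penalty=1.1,  # the default of 5 heavily distorts token probabilities
)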
@dskhudia Nice to see you again. I have a couple of other questions. First, have you measured the speedup from FT? I ran tests on V100 GPUs in the Docker environment recommended by NVIDIA. With 1 GPU and a batch size of 1, FT's inference speed is similar to FastChat's. However, with 2 GPUs the program hangs after answering a few questions. I can't pinpoint the cause, and any help would be greatly appreciated.
Second, I observed that when a maximum length is given, the generated outputs (gen_outputs) are always padded to the maximum length with EOS tokens. I'm unsure whether the model actually ran inference for that full length, generating EOS tokens over and over (which would waste a lot of time), or whether this is just padding. Thanks again for your assistance.
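
In case it helps others check the same thing, a small sketch (assuming HF-style token-ID tensors and a known eos_token_id; effective_length is a hypothetical helper, not part of any API) that measures how many tokens were actually produced before the first EOS:

import torch

def effective_length(output_ids: torch.Tensor, eos_token_id: int) -> int:
    """Number of tokens up to and including the first EOS in a 1-D ID tensor."""
    eos_positions = (output_ids == eos_token_id).nonzero(as_tuple=True)[0]
    if eos_positions.numel() == 0:
        return output_ids.numel()  # no EOS at all: generation ran to max length
    return int(eos_positions[0]) + 1

If wall-clock time grows with the maximum length even when the effective length stays short, the model is most likely still decoding EOS tokens rather than merely padding.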
Hey @dskhudia, are you able to help out here?
@Louis-y-nlp, sorry for the late reply. I missed it.
1) I am not familiar with FastChat and haven't run it. However, compared to HF generate we saw a >2x speedup at batch size 1 for the 7B model. We have run multi-GPU inference with FT successfully and without any hangs, so I'm not sure about the root cause.
2) Is this with FT?
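
A rough sketch of how one might time the HF side for such a comparison (illustrative only; a real benchmark should warm up the GPU and average over many runs):

import time
import torch

torch.cuda.synchronize()
start = time.perf_counter()
out = model.generate(**inputs, max_new_tokens=128)
torch.cuda.synchronize()
elapsed = time.perf_counter() - start
new_tokens = out.shape[1] - inputs["input_ids"].shape[1]
print(f"{new_tokens / elapsed:.1f} tokens/s")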
Closing due to inactivity. Please feel free to open a new issue if you are still encountering problems.
Thank you for your great work. I converted an MPT-7B-Instruct model to the FT format and ran inference successfully, but I obtained some unexpected results (usually garbled text, though not completely unrelated to the prompt), such as:
I assure you that I used the correct prompt format, and most of the parameters in the inference code were set to their default values. I'm not sure where the issue lies. I would greatly appreciate it if you could provide a correctly converted and tested model so I can determine whether the problem lies with my code or with the converted model. Additionally, here are my demo scripts.
Here are some arguments: