Thanks for your interest in our work. Yes, the generation uses greedy decoding. However, there are several potential causes for the discrepancy:

1. Error accumulation: even without quantization, floating-point error accumulation in CUDA kernels has been observed to change generation results (see, for example, the Medusa issue at https://github.com/FasterDecoding/Medusa/issues/56).
2. Quantization: for INT8 quantization, are you using W8A8 (LLM.int8) or another scheme? Due to casting and mixed-precision computation, error accumulation can be exacerbated under INT8 quantization. With FP16, the generation outputs from our benchmarking are mostly consistent; a quick way to isolate this is sketched right after this list. We might look into integrating PTQ for more efficient inference, or PEFT (like QLoRA) for more efficient training, in the future.
3. Random initialization of the n-token sequence: per the Jacobi decoding paper and our own experiments, the greedy generation outputs should match AR greedy generation. Under the hood, however, the n-token sequences are randomly initialized on every inference run, so the underlying matrix multiplications see different numerical values, which can introduce numerical errors (a toy sketch of the update also follows this list).
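For item 2, here is a rough sketch of the kind of check we have in mind. It is not part of this repo; the model path and prompt are placeholders, and it deliberately uses plain autoregressive model.generate so that quantization effects are isolated from Jacobi decoding:

```python
# Rough sketch (not part of this repo): check whether 8-bit quantization alone
# breaks greedy determinism, independent of Jacobi decoding.
# MODEL and PROMPT are placeholders -- substitute your own checkpoint/prompt.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

MODEL = "path/to/your/checkpoint"  # placeholder
PROMPT = "Write a SQL query that counts the rows in table t."

def greedy_ids(model, tok, prompt, max_new_tokens=64):
    inputs = tok(prompt, return_tensors="pt").to(model.device)
    out = model.generate(**inputs, do_sample=False, max_new_tokens=max_new_tokens)
    return out[0].tolist()

tok = AutoTokenizer.from_pretrained(MODEL)

# FP16 baseline: run the same greedy generation twice and compare.
fp16 = AutoModelForCausalLM.from_pretrained(
    MODEL, torch_dtype=torch.float16, device_map="auto"
)
a, b = greedy_ids(fp16, tok, PROMPT), greedy_ids(fp16, tok, PROMPT)
print("FP16 run-to-run identical:", a == b)
del fp16
torch.cuda.empty_cache()

# 8-bit weights via bitsandbytes (LLM.int8-style): same check.
int8 = AutoModelForCausalLM.from_pretrained(
    MODEL, quantization_config=BitsAndBytesConfig(load_in_8bit=True), device_map="auto"
)
c, d = greedy_ids(int8, tok, PROMPT), greedy_ids(int8, tok, PROMPT)
print("INT8 run-to-run identical:", c == d)
print("FP16 vs INT8 identical:", a == c)
```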
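For item 3, here is a toy illustration of the greedy Jacobi update, just to show where the random initialization enters. ToyCausalLM and jacobi_greedy_decode are made-up names for illustration and do not correspond to the code in this repo; in exact arithmetic the fixed point matches AR greedy decoding, but the randomly initialized block changes the intermediate numerics from run to run:

```python
# Toy illustration (not this repo's implementation) of greedy Jacobi decoding:
# the n-token block is randomly initialized, then all positions are refreshed
# in parallel with greedy argmax until a fixed point is reached.
import torch
import torch.nn as nn

torch.manual_seed(0)  # fixing the seed makes the random block init reproducible

class ToyCausalLM(nn.Module):
    """Stand-in model; any causal LM returning [batch, seq, vocab] logits works."""
    def __init__(self, vocab=100, dim=32):
        super().__init__()
        self.emb = nn.Embedding(vocab, dim)
        self.head = nn.Linear(dim, vocab)

    def forward(self, ids):
        # cumsum gives a causal dependency structure without real attention
        return self.head(torch.cumsum(self.emb(ids), dim=1))

@torch.no_grad()
def jacobi_greedy_decode(model, prompt_ids, n_tokens=16, max_iters=64):
    vocab = model.head.out_features
    # random initialization of the n-token sequence -- the source of the
    # run-to-run numerical differences discussed above
    block = torch.randint(vocab, (1, n_tokens))
    prompt_len = prompt_ids.shape[1]
    seq = torch.cat([prompt_ids, block], dim=1)
    for _ in range(max_iters):
        logits = model(seq)
        # parallel greedy update: position i of the block is predicted from
        # the logits at position (prompt_len - 1 + i)
        new_block = logits[:, prompt_len - 1 : prompt_len + n_tokens - 1].argmax(dim=-1)
        if torch.equal(new_block, seq[:, prompt_len:]):
            break  # fixed point; in exact arithmetic this equals AR greedy output
        seq = torch.cat([prompt_ids, new_block], dim=1)
    return seq

model = ToyCausalLM()
print(jacobi_greedy_decode(model, torch.tensor([[1, 2, 3]])))
```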
Regarding the generation speed tradeoff: yes, in general, for open-ended conversational questions the generated content is usually more diverse and the set of possible collocations is large, so it usually takes more training to obtain a significant speedup. You can try asking coding or math questions (with either the Spider checkpoint for text-to-SQL or the GSM8K checkpoint for math), where sensible repetitive tokens appear more often, to gain a more significant speedup; a rough tokens-per-second measurement is sketched below.
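If it helps, here is a rough sketch for measuring tokens per second across prompt types; generate_fn is a placeholder that can stand in for either plain AR generation or a Jacobi-based generation call:

```python
# Rough sketch for comparing generation speed (tokens/sec) across prompt types.
# generate_fn is a placeholder: it should take tokenized inputs plus
# max_new_tokens and return the full output id tensor of shape [1, seq_len].
import time
import torch

@torch.no_grad()
def tokens_per_second(generate_fn, tokenizer, prompt, max_new_tokens=128):
    inputs = tokenizer(prompt, return_tensors="pt")
    if torch.cuda.is_available():
        inputs = inputs.to("cuda")
        torch.cuda.synchronize()
    start = time.perf_counter()
    out = generate_fn(inputs, max_new_tokens=max_new_tokens)
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    elapsed = time.perf_counter() - start
    new_tokens = out.shape[1] - inputs["input_ids"].shape[1]
    return new_tokens / elapsed

# Example with plain autoregressive greedy decoding (model/tokenizer assumed loaded):
# ar_fn = lambda x, **kw: model.generate(**x, do_sample=False, **kw)
# print(tokens_per_second(ar_fn, tokenizer, "What is 12 * 7 + 5?"))
```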
Hi, thanks for sharing the code.
It looks like the generation is in greedy mode, judging from jacobi_forward(); however, I observe different outputs every time, and sometimes the outputs are just gibberish. I am using an RTX 3060 with 8-bit quantization, so I am not sure whether the quantization is the cause of the issue. If anyone has similar observations at full precision on more advanced GPUs, please let me know. Also, when the outputs are gibberish, the speedup is really high, presumably because there are a lot of repeating patterns; when the output is good, the speedup becomes lower.
Here I attach some output examples:
Good one:
Repeating at the end:
Repeating a lot: