hemingkx / SpecDec

Codes for our paper "Speculative Decoding: Exploiting Speculative Execution for Accelerating Seq2seq Generation" (EMNLP 2023 Findings)

is GAD faster than fairseq decode? #1

Closed weitaizhang closed 2 years ago

weitaizhang commented 2 years ago

Hi, I have two questions about your code for GAD and GAD++. 1) GAD and GAD++ do not support setting batch size > 1 in inference.py. 2) When I set batch size = 1, strategy="block" is much slower than strategy="fairseq". Is there something wrong with my experiments? I ran the provided inference.sh with my own AT and NAT models.

hemingkx commented 2 years ago

Hi! Thanks for your question! Here is some advice that may be useful:

  1. GAD supports batch implementation, and we will release the code soon.
  2. As described in the paper, we suggest that AT and NAT be distilled by the same teacher.
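To make point 2 concrete: sequence-level knowledge distillation means the same AT teacher translates the whole training set and the NAT drafter is trained on those translations instead of the gold references. A rough sketch, where `teacher_translate` and `train_drafter` are hypothetical placeholders rather than functions from this repo:

```python
# Sequence-level knowledge distillation (Kim & Rush, 2016), sketched:
# the AT teacher that will later act as the verifier translates every training
# source sentence, and the NAT drafter is trained on those outputs.
# `teacher_translate` / `train_drafter` are hypothetical, not from this repo.

def build_distilled_corpus(sources, teacher_translate, beam=5):
    """Pair each source sentence with the teacher's beam-search translation."""
    return [(src, teacher_translate(src, beam=beam)) for src in sources]

# distilled = build_distilled_corpus(train_sources, at_teacher.translate)
# train_drafter(distilled)  # the drafter now imitates the same model that verifies it
```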

Could you provide more details about your implementation, so that we can give more specific recommendations?

(By the way, we suggest using the architecture we propose in the paper as the NAT model.)

weitaizhang commented 2 years ago

@hemingkx thanks for your quick reply. By "my own AT and NAT models", I mean that I retrained the models on the WMT14 En-De dataset. The AT model is Transformer base, and the NAT model is trained with your train.sh script. I will read your code carefully today. For inference, I use the inference.sh script. Thanks.

hemingkx commented 2 years ago

Sure, thanks for the details! Did you use sequence-level knowledge distillation to train the NAT drafter? You can also run the pass_count.sh script; it will report the mean number of accepted tokens and the average number of iterations.
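Roughly, those two numbers relate to your observed speed as follows (an illustrative sketch; the record format and names are hypothetical, not the actual output of pass_count.sh):

```python
# Illustrative aggregation of speculative-decoding statistics.
# records: one (num_output_tokens, num_iterations) pair per test sentence
# (hypothetical format, not what pass_count.sh actually prints).

def summarize(records):
    total_tokens = sum(tokens for tokens, _ in records)
    total_iters = sum(iters for _, iters in records)
    mean_accepted = total_tokens / total_iters   # mean tokens accepted per decoding iteration
    avg_iterations = total_iters / len(records)  # average iterations per sentence
    return mean_accepted, avg_iterations

# Example: three sentences of 30/24/18 tokens decoded in 6/5/9 iterations
print(summarize([(30, 6), (24, 5), (18, 9)]))    # -> (3.6, ~6.67)
```

If the mean accepted tokens per iteration is close to 1, the drafter and verifier disagree most of the time, and block decoding can easily be slower than standard autoregressive decoding; that usually points to a distillation mismatch between the two models.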

weitaizhang commented 2 years ago

For your information, here are my experimental results:

  1. I use only the 4.2M sentence pairs of the WMT14 En-De task to train Transformer base and big (with pre-norm), with newstest2013 as the dev set and newstest2014 as the test set. BLEU is computed with sacrebleu: Transformer base BLEU = 26.27, Transformer big BLEU = 27.35.
  2. The best Transformer big checkpoint is used to generate translations y' for all 4.2M source-side sentences, and this sequence-level distilled corpus is used to train GAD and LevT (for comparison): LevT BLEU = 23.83, GAD BLEU = 24.85.
  3. The test set is decoded with batch size = 1 on a Tesla P40. Decoding time is 20'59 for base, 21'43 for big, 5'17 for LevT, and 8'39 for GAD.

The NAT models still lag behind the AT models in BLEU. If there are any methods to improve NAT model performance, I would be happy to hear them. Thanks for your code.
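For reference, converting those wall-clock times into speedups over Transformer base (a quick illustrative calculation, not one of the repo's scripts):

```python
# Convert the m'ss decoding times above into seconds and compute speedups
# relative to the Transformer base verifier (illustrative only).

def to_seconds(mmss: str) -> int:
    minutes, seconds = mmss.split("'")
    return int(minutes) * 60 + int(seconds)

times = {"base": "20'59", "big": "21'43", "LevT": "5'17", "GAD": "8'39"}
base = to_seconds(times["base"])
for name, t in times.items():
    print(f"{name}: {to_seconds(t)}s, speedup vs base = {base / to_seconds(t):.2f}x")
# GAD comes out at roughly 2.4x faster than the AT base model in this setup.
```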

hemingkx commented 2 years ago

Thanks for re-implementing our work! As discussed in our paper, the translation results of vanilla GAD should be exactly the same as those of the AT verifier under greedy decoding (i.e., beam=1). Could you provide more details about your inference process (the performance of the AT verifier and the inference script)? Btw, we have released our checkpoints here, which you can give a try ^_^
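To see why: at each iteration the verifier accepts the drafted tokens only up to the first position where they differ from its own greedy (argmax) prediction, and takes its own token at that position, so the final output cannot differ from plain greedy decoding. A rough sketch of that acceptance rule, where `draft_tokens` and `verifier_argmax` are illustrative names rather than the actual API of this repo:

```python
# Acceptance rule that makes vanilla GAD's output identical to greedy AT decoding.
# draft_tokens: block of tokens proposed by the NAT drafter.
# verifier_argmax: the AT verifier's greedy predictions at the same positions,
# conditioned on the accepted prefix plus the preceding draft tokens.
# (Illustrative names only, not this repo's actual functions.)

def verify(draft_tokens, verifier_argmax):
    accepted = []
    for drafted, greedy in zip(draft_tokens, verifier_argmax):
        if drafted == greedy:
            accepted.append(drafted)   # draft agrees with greedy decoding: keep it
        else:
            accepted.append(greedy)    # first mismatch: take the verifier's own token
            break                      # tokens after the mismatch are discarded
    return accepted
```

GAD++ relaxes this exact-match condition (drafted tokens are accepted if they are merely among the verifier's top candidates), which is why GAD++ outputs, unlike vanilla GAD's, can differ from greedy decoding.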

weitaizhang commented 2 years ago

Sorry, I would like to revise my experimental results: the "GAD" results above actually correspond to "GAD++" in your paper. The results for vanilla GAD are: BLEU = 26.65, decoding time = 13'25.

My results are as follows:

[screenshot of results attached]

hemingkx commented 2 years ago

Great! It seems to work fine. Here are some suggestions:

Btw, have you tried the checkpoints we released? Maybe this can offer you some insights.

hemingkx commented 2 years ago

This issue was closed because it has been inactive for 30 days. If there are any other questions, please open a new issue or send me an email.