efeiefei opened 6 years ago
If you look at the GNMT paper (https://arxiv.org/abs/1609.08144) you will see that the whole structure is constructed for fast inference. The Transformer model's main focus is on fast training and high performance. I have not run these experiments, but your results do not surprise me. The Transformer is not constructed in a way that naturally lends itself to inference at all, but through the caching mechanisms used in this implementation it has been possible to achieve speeds on the same order of magnitude as the naturally sequential RNN models.
Indeed, the inference speed is very slow, especially without GPUs. Pre-loading the system also costs a lot of time. I am not sure if the developers have a plan to speed up inference?
I heard that in February there were several works on non-autoregressive Transformers that are an order of magnitude faster?
Most of them have much lower BLEU (and sometimes they are both worse and slower than greedy decoding, but who cares - non-autoregressive is cool). That said, this one looks very promising: https://arxiv.org/pdf/1805.00631.pdf
Oh this one is new to me, thanks for sharing!!
Just out of curiosity, what is the expected/measured decoding speed people get on the newest T2T when decoding in batches on GPU? For the regular and big Transformer?
I heard from Sogo that they got 252 ms with TensorFlow; if you rewrite it in C, it gives you 78 ms.
Sogo's WMT17 system? They should use RNNs. An RNN decoder should be faster than a multi-head attention decoder, because RNNs only need the previous hidden state to calculate the next hidden state. In contrast, the t2t decoder uses all previous hidden states. Correct me if I am wrong.
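My back-of-the-envelope accounting of the per-token decoding cost (my own rough estimate, not anything measured), assuming hidden size $d$, output position $t$, and a single decoder layer:

```latex
\text{RNN step: } O(d^2) \quad \text{(needs only the previous hidden state)} \\
\text{cached self-attention step: } O(d^2 + t\,d) \quad \text{(projections plus attention over all } t \text{ cached states)}
```

So even with caching, the self-attention decoder pays an extra term that grows with the output length.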
You were right about their WMT17 system, but they are actually using the Transformer now. I heard this at their talk in Beijing some months ago, where they gave the above stats.
Do they release their C version Transformer? I guess not.
I didn't bother asking lol. ;)
My decoding speed with transformer_big on a single 1080Ti GPU, including 35 seconds for loading the model:
hparams | per 3000 sents | per 1 sentence |
---|---|---|
beam_size=1, alpha=0.6, batch_size=32 | 135 s | 45 ms |
**beam_size=4, alpha=0.6, batch_size=32** | **276 s** | **92 ms** |
beam_size=4, alpha=1.0, batch_size=32 | 398 s | 132 ms |
beam_size=4, alpha=0.6, batch_size=16 | 328 s | 109 ms |
The default decoding hparams are marked in bold. Increasing alpha increases the time (and on some datasets 1.0 is the optimal value). Decreasing the batch size increases the time, but on my GPU it seems that a batch size bigger than the default 32 is not significantly faster. And of course, increasing the beam size increases the time, but the default beam_size=4 seems optimal or almost optimal in most of my experiments. For reference, this is measured on the WMT en-cs newstest2013.
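For reference, the numbers above come from a plain `t2t-decoder` run, roughly like the sketch below (paths and the problem name are placeholders; flag spellings may differ slightly between T2T versions, e.g. older releases use `--problems` instead of `--problem`):

```bash
# Illustrative command for the beam_size=4, alpha=0.6, batch_size=32 row;
# paths and the problem name are placeholders.
t2t-decoder \
  --data_dir=$HOME/t2t_data \
  --problem=translate_encs_wmt32k \
  --model=transformer \
  --hparams_set=transformer_big \
  --output_dir=$HOME/t2t_train/encs_big \
  --decode_hparams="beam_size=4,alpha=0.6,batch_size=32" \
  --decode_from_file=newstest2013.en \
  --decode_to_file=newstest2013.cs
```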
Why not decode with multiple GPUs? When will this feature be implemented?
@yuananzhang: When translating a single sentence, the decoding is fast enough for my needs (and I am not sure if it can be parallelized on multiple GPUs given the autoregressive nature of the decoder). When translating thousands of sentences I always parallelize it on multiple GPUs (and multiple machines), but I don't need any T2T support for this, just two lines of Bash. So I am not sure what your question is about.
@martinpopel What do you mean by you don't need T2T support for translating thousands of sentences?
@lkluo: When translating thousands of sentences (actually 70 million in my case), I don't need T2T support for parallelization across multiple GPUs or machines. I just split the data into files of 10k sentences each (unix command `split`) and used an SGE array job which executed a separate task (on a single GPU or CPU) for each file. The tasks are completely independent, so the parallelization is embarrassingly simple. Even if T2T had some support, I would not use it (I wanted the array job to run with low priority, so I don't occupy the whole cluster and make my colleagues angry).
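Roughly, what I did looks like the following sketch (paths, the problem name, and the `decode_chunk.sh` wrapper are illustrative, and the exact qsub flags depend on your cluster setup):

```bash
# Split the input into files of 10k sentences each: chunk_000, chunk_001, ...
split -l 10000 -d -a 3 input.src chunk_

# Hypothetical wrapper: each array task picks the chunk matching its task id
# and translates it with t2t-decoder.
cat > decode_chunk.sh <<'EOF'
#!/bin/bash
FILE=$(printf "chunk_%03d" $((SGE_TASK_ID - 1)))
t2t-decoder \
  --data_dir=$HOME/t2t_data \
  --problem=translate_encs_wmt32k \
  --model=transformer \
  --hparams_set=transformer_big \
  --output_dir=$HOME/t2t_train/encs_big \
  --decode_hparams="beam_size=4,alpha=0.6,batch_size=32" \
  --decode_from_file=$FILE \
  --decode_to_file=$FILE.translated
EOF

# One low-priority SGE array task per chunk; the tasks are completely independent.
N=$(ls chunk_??? | wc -l)
qsub -cwd -t 1-$N -p -500 decode_chunk.sh
```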
@martinpopel: Thanks for your clear explanations.
Hi @martinpopel! That definitely seems like the best solution for batch use, but what about interactive translation? We are running experiments and investigating how we could reduce the decoding time of a single sentence. We were truly surprised by some of our results; let me explain.
First, we created a very simple test, trying to reduce the I/O and Python scripting as much as possible: a simple forever-loop that always decodes the same sentence. The decoding time per sentence is 443 ms. This is the exact time of the operator, so no other outer Python overhead is involved.
**CPU is a bottleneck.** Looking at the resources used during translation, we observe GPU usage that goes from ~50% up to ~80%. This was somewhat expected; what was not is the CPU: all CPUs are used at around 20%. Because Python is not (truly) multithreaded, this must be something at the C level. Do you have any idea what those threads are used for?
Moreover, we tried decoding on a machine with the same hardware but a slightly faster CPU clock: the result was a slightly faster decoding time (around 400 ms). This suggests to me that the CPU is limiting the translation speed on the GPU.
**Model size issue.** We then tried to reduce the size of the model; we expected decoding to be faster if the GPU has fewer operations to compute. The result was disappointing:
**Conclusions.** Can you give us some insight into what is happening and why we are observing these results? Also, can you give us some hints on how or where we should look in order to speed up the decoding process?
Thanks for your help, as always! Davide
For production (interactive) use, I think TF Serving is the recommended way. I don't have personal experience with it, but there are many related issues here on GitHub.
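From the T2T serving docs, the workflow is roughly the following (untested by me; binary names, flags, and the exact export path may differ between T2T / TF Serving versions, and the problem name and paths below are placeholders):

```bash
# 1. Export the trained model as a SavedModel for TF Serving.
t2t-exporter \
  --model=transformer \
  --hparams_set=transformer_big \
  --problem=translate_encs_wmt32k \
  --data_dir=$HOME/t2t_data \
  --output_dir=$HOME/t2t_train/encs_big

# 2. Serve the exported model (the export lands in a subdirectory of output_dir).
tensorflow_model_server \
  --port=9000 \
  --model_name=transformer \
  --model_base_path=$HOME/t2t_train/encs_big/export

# 3. Query it with the small client that ships with T2T.
t2t-query-server \
  --server=localhost:9000 \
  --servable_name=transformer \
  --problem=translate_encs_wmt32k \
  --data_dir=$HOME/t2t_data
```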
Hi @martinpopel
I admit I got a bit lost trying to follow the labyrinth of nested calls :) However, as far as I understood, this seems to be a library that exposes a trained model as an "interactive server" that can be used through an API. But besides this, I failed to see where and how that code speeds up the prediction process. Am I missing something? If so, can you please point me to an "entrypoint", or explain a bit which solutions this code employs to speed up translation?
Thank you very much!
> Most of them have much lower BLEU (and sometimes they are both worse and slower than greedy decoding, but who cares - non-autoregressive is cool). That said, this one looks very promising: https://arxiv.org/pdf/1805.00631.pdf
@martinpopel is this integrated into tensor2tensor, or are there any plans to do so? Also, is there any significant improvement in inference speed? Has anyone tried it?
@Raghava14: I am not aware of any plans to integrate this into T2T.
The speedup of 4x claimed in the literature is only in comparison with a Transformer that uses no caching strategy. Their implementation of the Transformer and AAN seems to show a speed difference of 20~30%. But the Marian implementation seems to achieve a speedup of about 2x.
Related discussions: https://github.com/bzhangGo/transformer-aan/issues/1 https://github.com/bzhangGo/transformer-aan/issues/4
I have exported a serving model as the backend of a RESTful server. I have tested the performance of the Transformer and GNMT, with transformer_big and beam_size 10. The Tensor2Tensor version is 1.4.4. The average source length is 7.5 tokens. Both run on a single P40 GPU. The Transformer needs 300 ms per sentence on average, but GNMT needs just 100 ms. Is this normal?
I think both should take about the same time.