efeiefei opened 6 years ago
If you look at the GNMT paper (https://arxiv.org/abs/1609.08144) you will see that the whole structure is constructed for fast inference. The Transformer model's main focus is on fast training and high performance. I have not run these experiments, but your results do not surprise me. The Transformer is not constructed in a way that naturally lends itself to inference at all, but through the caching mechanisms used in this implementation it has been possible to achieve speeds on the same order of magnitude as the naturally sequential RNN models.
Indeed, the inference speed is very slow, especially without GPUs. Pre-loading the system also costs a lot of time. I am not sure if the developers have a plan to speed up inference?
I heard that in February there were several works on non-autoregressive Transformers that are an order of magnitude faster?
Most of them have much lower BLEU (and sometimes they are both worse and slower than greedy decoding, but who cares - non-autoregressive is cool). That said, this one looks very promising: https://arxiv.org/pdf/1805.00631.pdf
Oh this one is new to me, thanks for sharing!!
Just out of curiosity, what is the expected/measured decoding speed people get on the newest T2T when decoding in batches on GPU? For the regular and big Transformer?
I heard from Sogo that they got 252 ms with TensorFlow; if you rewrite it in C, it gives you 78 ms.
Sogo's WMT17 system? They should use RNNs. An RNN decoder should be faster than a multi-head attention decoder, because RNNs only need the previous hidden state to calculate the next hidden state. In contrast, the t2t decoder uses all previous hidden states. Correct me if I am wrong.
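My back-of-the-envelope accounting of the per-token decoding cost (my own rough estimate, not anything measured), assuming hidden size $d$, output position $t$, and a single decoder layer:

```latex
\text{RNN step: } O(d^2) \quad \text{(needs only the previous hidden state)} \\
\text{cached self-attention step: } O(d^2 + t\,d) \quad \text{(projections plus attention over all } t \text{ cached states)}
```

So even with caching, the self-attention decoder pays an extra term that grows with the output length.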
You were right about their WMT17 system, but they are actually using the Transformer now. I heard this at their talk in Beijing some months ago, where they gave the above stats.
Do they release their C version Transformer? I guess not.
I didn't bother asking lol. ;)
My decoding speed with transformer_big on a single 1080Ti GPU, including 35 seconds for loading the model:
hparams | per 3000 sents | per 1 sentence |
---|---|---|
beam_size=1, alpha=0.6, batch_size=32 | 135 s | 45 ms |
**beam_size=4, alpha=0.6, batch_size=32** | **276 s** | **92 ms** |
beam_size=4, alpha=1.0, batch_size=32 | 398 s | 132 ms |
beam_size=4, alpha=0.6, batch_size=16 | 328 s | 109 ms |
The default decoding hparams are marked in bold. Increasing alpha increases the time (and on some datasets 1.0 is the optimal value). Decreasing the batch size increases the time, but on my GPU it seems that a batch size bigger than the default 32 is not significantly faster. And of course, increasing the beam size increases the time, but the default beam_size=4 seems optimal or almost optimal in most of my experiments. For reference, this is measured on the WMT en-cs newstest2013.
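For reference, the numbers above come from a plain `t2t-decoder` run, roughly like the sketch below (paths and the problem name are placeholders; flag spellings may differ slightly between T2T versions, e.g. older releases use `--problems` instead of `--problem`):

```bash
# Illustrative command for the beam_size=4, alpha=0.6, batch_size=32 row;
# paths and the problem name are placeholders.
t2t-decoder \
  --data_dir=$HOME/t2t_data \
  --problem=translate_encs_wmt32k \
  --model=transformer \
  --hparams_set=transformer_big \
  --output_dir=$HOME/t2t_train/encs_big \
  --decode_hparams="beam_size=4,alpha=0.6,batch_size=32" \
  --decode_from_file=newstest2013.en \
  --decode_to_file=newstest2013.cs
```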
Why not decode with multiple GPUs? When will this feature be implemented?
@yuananzhang: When translating a single sentence, the decoding is fast enough for my needs (and I am not sure if it can be parallelized on multiple GPUs given the autoregressive nature of the decoder). When translating thousands of sentences I always parallelize it on multiple GPUs (and multiple machines), but I don't need any T2T support for this, just two lines of Bash. So I am not sure what your question is about.
@martinpopel What do you mean by you don't need T2T support for translating thousands of sentences?
@lkluo: When translating thousands of sentences (actually 70 million in my case), I don't need T2T support for parallelization across multiple GPUs or machines. I just split the data into files of 10k sentences each (unix command `split`) and used an SGE array job which executed a separate task (on a single GPU or CPU) for each file. The tasks are completely independent, so the parallelization is embarrassingly simple. Even if T2T had some support, I would not use it (I wanted the array job to run with low priority, so I don't occupy the whole cluster and make my colleagues angry).
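Roughly, what I did looks like the following sketch (paths, the problem name, and the `decode_chunk.sh` wrapper are illustrative, and the exact qsub flags depend on your cluster setup):

```bash
# Split the input into files of 10k sentences each: chunk_000, chunk_001, ...
split -l 10000 -d -a 3 input.src chunk_

# Hypothetical wrapper: each array task picks the chunk matching its task id
# and translates it with t2t-decoder.
cat > decode_chunk.sh <<'EOF'
#!/bin/bash
FILE=$(printf "chunk_%03d" $((SGE_TASK_ID - 1)))
t2t-decoder \
  --data_dir=$HOME/t2t_data \
  --problem=translate_encs_wmt32k \
  --model=transformer \
  --hparams_set=transformer_big \
  --output_dir=$HOME/t2t_train/encs_big \
  --decode_hparams="beam_size=4,alpha=0.6,batch_size=32" \
  --decode_from_file=$FILE \
  --decode_to_file=$FILE.translated
EOF

# One low-priority SGE array task per chunk; the tasks are completely independent.
N=$(ls chunk_??? | wc -l)
qsub -cwd -t 1-$N -p -500 decode_chunk.sh
```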
@martinpopel: Thanks for your clear explanations.
Hi @martinpopel! That definitely seems like the best solution for batch use, but what about interactive translation? We are running experiments and investigating how we could reduce the decoding time of a single sentence. We were truly surprised by some of our results; let me explain.
First, we created a very simple test, trying to reduce the I/O and Python scripting as much as possible: a simple forever-loop that always decodes the same sentence. The decoding time per sentence is 443 ms. This is the exact time of the operator, so no other outer Python overhead is involved.
**CPU is a bottleneck.** Looking at the resources used during translation, we observe GPU usage that goes from ~50% up to ~80%. This was somewhat expected; what was not is the CPU: all CPUs are used at around 20%. Because Python is not (truly) multithreaded, this must be something at the C level. Do you have any idea what those threads are used for?
Moreover, we tried decoding on a machine with the same hardware but a slightly faster CPU clock: the result was a slightly faster decoding time (around 400 ms). This suggests to me that the CPU is limiting the translation speed on the GPU.
**Model size issue.** We then tried to reduce the size of the model; we expected decoding to be faster if the GPU has fewer operations to compute. The result was disappointing:
**Conclusions.** Can you give us some insight into what is happening and why we are observing these results? Also, can you give us some hints on how or where we should look in order to speed up the decoding process?
Thanks for your help, as always! Davide
For production (interactive) use, I think TF Serving is the recommended way. I don't have personal experience with it, but there are many related issues here on GitHub.
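From the T2T serving docs, the workflow is roughly the following (untested by me; binary names, flags, and the exact export path may differ between T2T / TF Serving versions, and the problem name and paths below are placeholders):

```bash
# 1. Export the trained model as a SavedModel for TF Serving.
t2t-exporter \
  --model=transformer \
  --hparams_set=transformer_big \
  --problem=translate_encs_wmt32k \
  --data_dir=$HOME/t2t_data \
  --output_dir=$HOME/t2t_train/encs_big

# 2. Serve the exported model (the export lands in a subdirectory of output_dir).
tensorflow_model_server \
  --port=9000 \
  --model_name=transformer \
  --model_base_path=$HOME/t2t_train/encs_big/export

# 3. Query it with the small client that ships with T2T.
t2t-query-server \
  --server=localhost:9000 \
  --servable_name=transformer \
  --problem=translate_encs_wmt32k \
  --data_dir=$HOME/t2t_data
```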
Hi @martinpopel
I admit I got a bit lost trying to follow the labyrinth of nested calls :) However, as far as I understood, this seems to be a library that exposes a trained model as an "interactive server" that can be used through an API. But besides this, I failed to see where and how that code speeds up the prediction process. Am I missing something? If so, can you please point me to an "entrypoint", or explain a bit which solutions this code employs to speed up translation?
Thank you very much!
> Most of them have much lower BLEU (and sometimes they are both worse and slower than greedy decoding, but who cares - non-autoregressive is cool). That said, this one looks very promising: https://arxiv.org/pdf/1805.00631.pdf
@martinpopel is this integrated into tensor2tensor, or are there any plans to do so? Also, is there any significant improvement in inference speed? Has anyone tried it?
@Raghava14: I am not aware of any plans to integrate this into T2T.
The speedup of 4x claimed in the literature is only in comparison with a Transformer that uses no caching strategy. Their implementation of the Transformer and AAN seems to show a speed difference of 20~30%. But the Marian implementation seems to achieve a speedup of about 2x.
Related discussions: https://github.com/bzhangGo/transformer-aan/issues/1 https://github.com/bzhangGo/transformer-aan/issues/4
I have exported a serving model as the backend of a RESTful server. I have tested the performance of the Transformer and GNMT, with transformer_big and beam_size 10. The Tensor2Tensor version is 1.4.4. The average source length is 7.5 tokens. Both run on a single P40 GPU. The Transformer needs 300 ms per sentence on average, but GNMT needs just 100 ms. Is this normal?
I think both should take about the same time.