microsoft / torchscale

Foundation Architecture for (M)LLMs
https://aka.ms/GeneralAI
MIT License

Is there some example of the paper? e.g., a comparison of the inference latency #53

Closed LiZeng001 closed 1 year ago

LiZeng001 commented 1 year ago

Hi, thank you for your great work! We are interested in the capabilities of RetNet. However, when we look through this repository, we can't find the code corresponding to the paper's experiments: for example, generating text of the same length with RetNet and a Transformer-based LLM of similar size (such as Llama-7b) to compare inference latency, an example of long-sequence inference, and so on.

So, can you provide some basic example code for training/inference that compares RetNet with a Transformer-based LLM, without Fairseq?

Any response would be greatly helpful to us!
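
For concreteness, a minimal wall-clock harness for the kind of comparison described above might look like the sketch below. This is not code from the repository; generate_fn is a placeholder for whatever greedy decoding loop each model under test uses.

import time
import torch

@torch.no_grad()
def seconds_per_token(generate_fn, prompt, new_tokens=128, warmup=1):
    """Average wall-clock seconds per generated token for one decoding run.

    generate_fn is a placeholder: any callable that decodes `new_tokens`
    tokens from `prompt` with the model under test.
    """
    for _ in range(warmup):
        generate_fn(prompt, new_tokens)      # warm up kernels and allocator
    if torch.cuda.is_available():
        torch.cuda.synchronize()             # exclude queued GPU work from the timed run
    start = time.perf_counter()
    generate_fn(prompt, new_tokens)
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    return (time.perf_counter() - start) / new_tokens

Calling this once per model with the same prompt and token budget gives a like-for-like latency number.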

sunyt32 commented 1 year ago

Yeah, the experiments are based on our private data and pipelines, which are not appropriate for open-sourcing. Following the guidelines in our README, RetNet is easy to integrate into your own training procedure without Fairseq (the import method is identical to Transformer's). For inference speed, our experiments use incremental_state for auto-regressive decoding. Here is a pseudocode example:

incremental_state = {}
# Token ids; bsz, tgt_len, vocab_size, and model are assumed to be defined elsewhere.
net_input = torch.randint(0, vocab_size, (bsz, tgt_len))
for index in range(net_input.shape[1] - 1):
    generation_net_input = net_input[:, :(index + 1)]
    # incremental_state caches the past state, so only the newest position is computed.
    generation_net_output, _ = model(generation_net_input, incremental_state=incremental_state)
    # Greedily pick the next token from the logits at the last position.
    net_input[:, index + 1] = torch.argmax(generation_net_output[:, -1], dim=-1)

In every step, incremental_state stores the past state (the k/v cache for the Transformer, the recurrent kv state for RetNet). You can adapt it to your own inference pipeline.
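
To build the two models without Fairseq, the import pattern from the README can be followed directly. The sketch below uses placeholder config values and only constructs the decoders and compares their parameter counts, as a sanity check that the two architectures are of similar size before timing them with the loop above; details such as whether an embed_tokens module must be passed for a forward call should be checked against the current torchscale source.

import torch
from torchscale.architecture.config import DecoderConfig, RetNetConfig
from torchscale.architecture.decoder import Decoder
from torchscale.architecture.retnet import RetNetDecoder

def param_count(model: torch.nn.Module) -> int:
    return sum(p.numel() for p in model.parameters())

# Illustrative configs; a fair latency comparison should match depth, width, and vocab size.
transformer = Decoder(DecoderConfig(vocab_size=64000))
retnet = RetNetDecoder(RetNetConfig(vocab_size=64000))

print(f"Transformer params: {param_count(transformer):,}")
print(f"RetNet params:      {param_count(retnet):,}")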