huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

Batching not speeding up Transformer-XL #4752

Closed · tommccoy1 closed this issue 4 years ago

tommccoy1 commented 4 years ago

I have modified the example [run_generation.py](https://github.com/huggingface/transformers/blob/master/examples/text-generation/run_generation.py) so that it can use batches. My code (pared down for the example) is below, called batch_gen.py:

#!/usr/bin/env python3
# coding=utf-8

import argparse
import logging

import numpy as np
import torch

from transformers import (
    GPT2LMHeadModel,
    GPT2Tokenizer,
    TransfoXLLMHeadModel,
    TransfoXLTokenizer,
)

logging.basicConfig(
    format="%(asctime)s - %(levelname)s - %(name)s -   %(message)s", datefmt="%m/%d/%Y %H:%M:%S", level=logging.INFO,
)
logger = logging.getLogger(__name__)

MAX_LENGTH = int(10000)  # Hardcoded max length to avoid infinite loop
MODEL_CLASSES = {
    "gpt2": (GPT2LMHeadModel, GPT2Tokenizer),
    "transfo-xl": (TransfoXLLMHeadModel, TransfoXLTokenizer),
}

# Convert a list of prompts (strings) into batches (lists of strings,
# where each list is of size batch_size). The final batch might be
# smaller than batch_size
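# e.g. batchify_prompts(["a", "b", "c", "d", "e"], 2) -> [["a", "b"], ["c", "d"], ["e"]]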
def batchify_prompts(prompt_list, batch_size):
    batches = []

    this_batch = []
    for prompt in prompt_list:
        this_batch.append(prompt)
        if len(this_batch) == batch_size:
            batches.append(this_batch[:])
            this_batch = []
    if len(this_batch) > 0:
        batches.append(this_batch)

    return batches

parser = argparse.ArgumentParser()
parser.add_argument("--model_type",default=None,type=str,required=True,help="Model type selected in the list: " + ", ".join(MODEL_CLASSES.keys()),)
parser.add_argument("--model_name_or_path",default=None,type=str,required=True,help="Path to pre-trained model or shortcut name selected in the list: " + ", ".join(MODEL_CLASSES.keys()),)
parser.add_argument("--length", type=int, default=20)
parser.add_argument("--prompt_file", type=str, default=None, help="File of prompts, 1 prompt per line.")
parser.add_argument("--batch_size", type=int, default=10, help="Number of prompts to include in a batch.")
args = parser.parse_args()

args.device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
args.n_gpu = torch.cuda.device_count()

# Create file to print to
output_filename = "_".join([str(x) for x in [args.model_type, args.prompt_file.split("/")[-1]]]) + ".generated"
fo = open(output_filename, "w", encoding="utf-8")

args.model_type = args.model_type.lower()
model_class, tokenizer_class = MODEL_CLASSES[args.model_type]

tokenizer = tokenizer_class.from_pretrained(args.model_name_or_path)
model = model_class.from_pretrained(args.model_name_or_path)
model.to(args.device)

# Read in prompts from file
prompt_file = open(args.prompt_file, "r", encoding="utf-8")
prompt_list = []
for prompt_line in prompt_file:
    prompt_list.append(prompt_line)

prompt_batches = batchify_prompts(prompt_list, args.batch_size)

# Generate text for each prompt
for prompt_batch in prompt_batches:
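    # Left-pad the batch so every prompt ends at the right edge and generation continues from real tokens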
    tokenizer.pad_token = "<PADDINGTOKEN>"
    tokenizer.padding_side = "left"
    encoding = tokenizer.batch_encode_plus(prompt_batch, add_special_tokens=False, return_tensors="pt", pad_to_max_length=True, add_space_before_punct_symbol=True)
    encoded_prompt = encoding["input_ids"]

    # Attention mask is not automatically returned by batch_encode_plus, so here we generate it manually
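    # (1 marks a real token, 0 marks a padding position)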
    attention_mask = 1 - (encoded_prompt == tokenizer.pad_token_id).type(torch.LongTensor)

    encoded_prompt = encoded_prompt.to(args.device)

    # An empty prompt encodes to a zero-length tensor; pass None so generate() starts from scratch
    if encoded_prompt.size()[-1] == 0:
        input_ids = None
    else:
        input_ids = encoded_prompt

    output_sequences = model.generate(
        input_ids=input_ids,
        max_length=50 + len(encoded_prompt[0]),
        min_length=50 + len(encoded_prompt[0]),
        temperature=1.0,
        top_k=40,
        top_p=1,
        repetition_penalty=1.0,
        do_sample=True,
        num_return_sequences=1,
        attention_mask=attention_mask,
    )

    # Write the generations to the output file
    for generated_sequence_idx, generated_sequence in enumerate(output_sequences):
        fo.write("=== PROMPT ===\n")
        generated_sequence = generated_sequence.tolist()

        # Decode text
        text = tokenizer.decode(generated_sequence, clean_up_tokenization_spaces=True)

        # Strip the decoded prompt (including any padding) from the front of the generated text;
        # the original prompt is written separately below
        generated_sequence = (
            text[len(tokenizer.decode(encoded_prompt[0], clean_up_tokenization_spaces=True)) :]
        )

        fo.write(prompt_batch[generated_sequence_idx] + "\n=== GENERATED ===\n")

        fo.write(generated_sequence + "\n\n")

To test the speedup provided by batching, I use a text file called prompts.txt with the following prompts:

The accompanying music video , directed by Vaughan Arnell ,
Inspired by the Beach Boys , cult surfing films ,
Premiering worldwide on Vevo on 7 January 2013 , the
The video features scenes reminiscent of the films South Pacific 
The music video garnered 10 @.@ 4 million views in
Despite a 34 % gain in weekly activity to their
191 @,@ 000 Twitter followers added contributed to their overall
Rebecca <unk> of E ! Online praised its " intentionally
Molly Chance , writing for Zap2it , was convinced that
Mikael Wood , the critic for Los Angeles Times ,
It is said that when he died in Osaka during
A variety of styles have been used in efforts to
As Burton Watson remarks in The Selected Poems of Du
The translators have had to contend with bringing out
One extreme on each issue is represented by Kenneth Rexroth
His are free translations , which seek to conceal the <unk>
Other translators have placed much greater weight on trying to
Vikram Seth in Three Chinese Poets uses English @-@ style
In The Selected Poems of Du Fu , Burton Watson follows the
Traditional Chinese literary criticism emphasized the life of the author
Since many of Du Fu 's poems feature morality and
Another reason , identified by the Chinese historian William Hung
For modern Western readers , " The less accurately we
Stephen Owen suggests a third factor particular to Du Fu
Most of what is known of Du Fu 's life
His paternal grandfather was Du <unk> , a noted politician
Du Fu was born in 712 ; the exact birthplace
In later life , he considered himself to belong to
He also had three half brothers and one half sister
The son of a minor scholar @-@ official , his

The following command is used to run the code with GPT-2:

python batch_gen.py --model_type=gpt2 --model_name_or_path=gpt2 --prompt_file prompts.txt --batch_size 10

With GPT-2, batching speeds up the runtime as expected: Each batch takes approximately 1 second, regardless of whether the batch size is 1, 5, or 10. However, with Transformer-XL, this is not the case. Here is the command to run with Transformer-XL:

python batch_gen.py --model_type=transfo-xl --model_name_or_path=transfo-xl-wt103 --prompt_file prompts.txt --batch_size 1

With a batch size of 1, each batch takes 3 seconds. With a batch size of 5, each batch takes 12 seconds. With a batch size of 10, each batch takes 21 seconds. Thus, batching is not providing much of a speedup compared to generating examples serially. (You can see the amount of time each batch takes by looking at the time stamps on the log messages that are printed out).
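For reference, the per-batch time can also be measured directly rather than read off the log timestamps. The snippet below is only a sketch of a hypothetical addition to the loop in batch_gen.py (it reuses model, input_ids, attention_mask, encoded_prompt, and prompt_batch from above and trims the generate arguments); it is not part of the script as posted:

import time

batch_start = time.perf_counter()
output_sequences = model.generate(
    input_ids=input_ids,
    attention_mask=attention_mask,
    max_length=50 + len(encoded_prompt[0]),
    do_sample=True,
    top_k=40,
)
if torch.cuda.is_available():
    torch.cuda.synchronize()  # wait for queued GPU work to finish before reading the clock
print("Batch of %d prompts took %.2f seconds" % (len(prompt_batch), time.perf_counter() - batch_start))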

Therefore, I am wondering: is there a bug in the batching for Transformer-XL, or is there some reason why the architecture cannot support efficient batching?

I am running this code on a P100 GPU, on Ubuntu 18.04, with PyTorch 1.5.0 and Python 3.7.7.

Thank you!

TevenLeScao commented 4 years ago

Hey! I've also observed this with the CMU Transformer-XL codebase. The main difference from other Transformers is the adaptive softmax, so that's the first thing I'd look at: does an XL model with a normal projection layer also have problems with batching?

I was actually planning to investigate a suspected bug in HF's Transformer-XL training performance tomorrow morning, so if it's not urgent I can also take a look at that at the same time.

tommccoy1 commented 4 years ago

It's encouraging to hear that someone else has observed this! Thanks for the suggestion - I just tried turning off the adaptive softmax (by changing the line model = model_class.from_pretrained(args.model_name_or_path) to model = model_class.from_pretrained(args.model_name_or_path, adaptive=False)), but that did not change the runtimes.

It's not urgent, so it would be much appreciated if you can take a look!

TevenLeScao commented 4 years ago

So here are my observations for now, running on my laptop's RTX 2070 (transformers 2.11.0, torch 1.5.0, python 3.6.9, CUDA 10.2, no mixed precision) at training time for that other bug hunt:

| batch size | Adaptive XL | Linear XL | GPT-2 |
|---|---|---|---|
| 1 | 33.27 it/s | 29.16 it/s | 35.06 it/s |
| 2 | 31.06 it/s | 19.93 it/s | 24.86 it/s |
| 4 | 29.30 it/s | 13.63 it/s | 14.87 it/s |
| 8 | 23.03 it/s | 7.85 it/s | 8.49 it/s |

So that's pretty strange. What is your version of transformers? I'll be looking at inference time now, as it may be different from training. EDIT: this is also the case for me at inference time:

| batch size | Adaptive XL | Linear XL | GPT-2 |
|---|---|---|---|
| 1 | 286.92 it/s | 197.25 it/s | 216.45 it/s |
| 2 | 264.54 it/s | 102.02 it/s | 109.74 it/s |
| 4 | 214.71 it/s | 56.27 it/s | 59.91 it/s |
| 8 | 148.69 it/s | 30.35 it/s | 31.97 it/s |

Another lead is the einsum function: it's used in Transformer-XL but doesn't seem to be used in GPT-2, and I know it can sometimes behave poorly, especially in mixed-precision settings. Are you using apex?
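One way to test that lead outside of either codebase (a standalone sketch with made-up shapes, not code from the repo) is to time the same attention-style contraction with torch.einsum and with an equivalent matmul across batch sizes:

import time
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
seq_len, n_head, d_head = 128, 16, 64

for batch in (1, 2, 4, 8):
    # attention-score-style contraction: (b, h, q, d) x (b, h, k, d) -> (b, h, q, k)
    q = torch.randn(batch, n_head, seq_len, d_head, device=device)
    k = torch.randn(batch, n_head, seq_len, d_head, device=device)
    for name, fn in [
        ("einsum", lambda: torch.einsum("bnid,bnjd->bnij", q, k)),
        ("matmul", lambda: q @ k.transpose(-1, -2)),
    ]:
        if device == "cuda":
            torch.cuda.synchronize()
        start = time.perf_counter()
        for _ in range(100):
            fn()
        if device == "cuda":
            torch.cuda.synchronize()
        print("batch=%d %s: %.3f ms/call" % (batch, name, (time.perf_counter() - start) / 100 * 1e3))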

tommccoy1 commented 4 years ago

Interesting!

I'm using transformers 2.10.0, and am not using apex.

If you're able to share the code you were using to measure inference time, that would be helpful, so I can try it and see whether it's my code or my environment that's giving us different results.

TevenLeScao commented 4 years ago

I cleaned up the code a bit and uploaded it to Google Drive. It uses Lightning and operates on real wt103 data (included in the zip), so it's not quite minimal.

Another (more remote) possibility is an issue in batching: looking at my dataloader code again, it is a bit more complex than usual in order to support Transformer-XL memories.

tommccoy1 commented 4 years ago

Thanks for the code! The main difference I see between your code and mine is that I am using the generate function, whereas you are not. After looking into the generate function for Transformer-XL, I believe I have found a bug.

Here is code that uses greedy generation without the generate function:

from transformers import TransfoXLLMHeadModel, TransfoXLTokenizer
import torch

tokenizer = TransfoXLTokenizer.from_pretrained("transfo-xl-wt103")
model = TransfoXLLMHeadModel.from_pretrained('transfo-xl-wt103')

generated = tokenizer.encode("The Manhattan Bridge")
context = torch.tensor([generated])
mems = None

for i in range(100):
    print(i)
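    # The model returns (prediction scores, updated mems, ...); feeding mems back in lets each step process only the newest token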
    output, mems = model(context, mems=mems)[:2]
    token = torch.argmax(output[..., -1, :])

    generated += [token.tolist()]
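    # Next step's input is just the newly generated token; the mems carry the earlier context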
    context = token.unsqueeze(0).unsqueeze(0)

sequence = tokenizer.decode(generated)

print(sequence)

This generates the following text:

The Manhattan Bridge, = = = = The Bridge = = = = The bridge over the Delaware River was built in the late 19th century by the Delaware and Hudson Canal Company. The bridge was built in the style of a drawbridge, with a single span of 1 @,@ 200 feet ( 370 m ). The bridge was designed by John Roebling, who also designed the Delaware River Bridge. The bridge was built in the style of a drawbridge, with a single span of 1 @,@ 200 feet ( 370 m

The code below should also generate the same text, just using the generate function:

from transformers import TransfoXLLMHeadModel, TransfoXLTokenizer
import torch

tokenizer = TransfoXLTokenizer.from_pretrained("transfo-xl-wt103")
model = TransfoXLLMHeadModel.from_pretrained('transfo-xl-wt103')
model.to("cuda")

generated = tokenizer.encode("The Manhattan Bridge")
context = torch.tensor([generated]).to("cuda")
mems = None

print(context)

output_sequences = model.generate(
        input_ids=context,
        max_length=100 + len(generated),
        min_length=100 + len(generated),
        eos_token_id=267734,
        #temperature=1.0,
        #top_k=1,
        #top_p=1.0,
        #do_sample=True,
        #num_return_sequences=1,
)

sequence = tokenizer.decode(output_sequences[0])

print(sequence)

However, it does not give the same output; instead, it generates:

The Manhattan Bridge, the Brooklyn Bridge, the Brooklyn Bridge, the Brooklyn Bridge, the Brooklyn Bridge, the Brooklyn Bridge, the Brooklyn Bridge, the Brooklyn Bridge, the Brooklyn Bridge, = = = The Manhattan Bridge, =, " the the the the.. the , , The, The The, The New York Bridge, is a double @-@ A @-@ A @-@ The Manhattan Bridge, the Brooklyn Bridge,

I was able to fix the discrepancy by changing the prepare_inputs_for_generation function of Transformer-XL to the code below (similar to the code used for that function in GPT-2):

    def prepare_inputs_for_generation(self, input_ids, past, **model_kwargs):
        inputs = {}

        # if past is defined in model kwargs then use it for faster decoding
        if past:
            inputs["mems"] = past
            inputs["input_ids"] = input_ids[:, -1].unsqueeze(-1)
        else:
            inputs["input_ids"] = input_ids

        return inputs

With this change, the generate function gives the same output as the explicit for-loop. It also speeds up generation substantially, presumably because only the newly generated token (plus the cached mems) is fed through the model at each step rather than the entire sequence so far: my use case is generating 500-token continuations from 512-token prompts, and that now takes about 30 seconds per prompt, down from 3 minutes. Batching is also now more helpful than before: still not as helpful as I would expect, but that doesn't matter much, because generation is now fast enough to be perfectly usable for me.
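If it helps anyone double-check the fix, here is a rough sanity check. It assumes the patched prepare_inputs_for_generation is installed and that the for-loop snippet above was just run, so model, tokenizer, and sequence are still in scope:

# Hypothetical check: greedy generate() should now reproduce the explicit mems loop.
prompt_ids = tokenizer.encode("The Manhattan Bridge")
output_sequences = model.generate(
    input_ids=torch.tensor([prompt_ids]),
    max_length=100 + len(prompt_ids),
    eos_token_id=267734,
)
print(tokenizer.decode(output_sequences[0]) == sequence)  # expected to print True with the fix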

I've made a draft pull request here: https://github.com/huggingface/transformers/pull/4826. But I'm not sure if it's ready to be submitted (I've never submitted a pull request before): some of the tests in make test fail, and I'm not sure what is required for step 5 of the pull request checklist ("Add high-coverage tests.").

TevenLeScao commented 4 years ago

Fixed by #4826