huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for PyTorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

In RWForCausalLM.prepare_inputs_for_generation, the past_key_values are always None. #24701

Closed: KexinFeng closed this issue 1 year ago

KexinFeng commented 1 year ago

System Info

Who can help?

@ArthurZucker and @younesbelkada

Information

Tasks

Reproduction

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_name = "tiiuae/falcon-7b"
    tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
    model = AutoModelForCausalLM.from_pretrained(
        model_name, trust_remote_code=True, device_map="auto")

    # encode the context the generation is conditioned on
    input_ids = tokenizer.encode('The new movie that got Oscar this year', return_tensors='pt')

    # move the inputs to the model's device
    device = "cuda" if torch.cuda.is_available() else "cpu"
    input_ids = input_ids.to(device)
    # model = model.to(device)  # not needed with device_map="auto"

    # Greedy search: generate text until the output length
    # (which includes the context length) reaches 50
    greedy_output = model.generate(input_ids, max_length=50)

    print("\nOutput:\n" + 100 * '-')
    print(tokenizer.decode(greedy_output[0], skip_special_tokens=True))

    # Contrastive search (penalty_alpha + top_k)
    output = model.generate(input_ids, penalty_alpha=0.01, top_k=4, max_length=50)

    print("\nOutput:\n" + 100 * '-')
    print(tokenizer.decode(output[0], skip_special_tokens=True))

Expected behavior

In transformers/generation/utils.py#L2329

model_inputs = self.prepare_inputs_for_generation(input_ids, **model_kwargs)

RWForCausalLM.prepare_inputs_for_generation() always returns past_key_values as None, so generation does not seem to use the kv_cache at all. On the other hand, RWForCausalLM.prepare_inputs_for_generation() does contain tensor shape conversion code for the cache. Is it intentional by design that past_key_values is always None?
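
For comparison, here is a rough sketch of what I expected, following the usual prepare_inputs_for_generation pattern in transformers causal LMs (this is not the RW custom code, just the generic pattern): the cache is passed through, and only the most recent token is fed once a cache exists.

    # Rough sketch of the usual prepare_inputs_for_generation pattern in
    # transformers causal LMs; NOT the RW custom code, just what I expected.
    def prepare_inputs_for_generation(self, input_ids, past_key_values=None, **kwargs):
        if past_key_values is not None:
            # With a cache, only the newest token needs to be passed to the model.
            input_ids = input_ids[:, -1:]
        return {
            "input_ids": input_ids,
            "past_key_values": past_key_values,
            "use_cache": kwargs.get("use_cache"),
            "attention_mask": kwargs.get("attention_mask"),
        }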

The output text is also weird:

Output(greedy)
----------------------------------------------------------------------------------------------------
The new movie that got Oscar this year is a movie about a man who is a genius and a man who is a genius.
The movie is called “The Imitation Game” and it is about a man who is a genius and a

Output(contrastive with penalty_alpha=0.001)
----------------------------------------------------------------------------------------------------
The new movie that got Oscar this year is a (Source:
- (Source:
- (Source:
- (Source:
- (Source:
- (Source:
- (Source:
younesbelkada commented 1 year ago

Hi @KexinFeng, there is ongoing work to port Falcon into transformers here: https://github.com/huggingface/transformers/pull/24523. Looking at that PR, I believe your issue will be fixed once it is merged. cc @Rocketknight1 in case I missed something!

Rocketknight1 commented 1 year ago

Sorry for the delay, and yes! There is an issue with the custom code version of Falcon, which means that past_key_values are frequently not actually used in generation. This results in much lower generation speed (roughly 3X slower for short-to-medium sequences).

This issue will be fixed once we add Falcon as a full library model in transformers, and we're hoping to merge that PR extremely soon.
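
If you want to check the cache's impact yourself, a rough way is to time generate with and without use_cache (reusing model and input_ids from the reproduction above; the numbers are only illustrative and depend on hardware and sequence length):

    # Rough, illustrative timing check: compare generation with and without
    # the KV cache, reusing `model` and `input_ids` from the reproduction above.
    import time

    import torch

    for use_cache in (True, False):
        if torch.cuda.is_available():
            torch.cuda.synchronize()
        start = time.perf_counter()
        model.generate(input_ids, max_length=50, use_cache=use_cache)
        if torch.cuda.is_available():
            torch.cuda.synchronize()
        print(f"use_cache={use_cache}: {time.perf_counter() - start:.2f}s")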

github-actions[bot] commented 1 year ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

Rocketknight1 commented 1 year ago

This is imminent, by the by, and sorry for the delay! Should be in within the next day or two.

kimborgen commented 1 year ago

If anyone can't wait a few days, you can use the model here: https://github.com/kimborgen/falcon-llm

@Rocketknight1 Do you know if the transformers library takes advantage of the parallel MLP/attention layer architecture and automatically computes these two layers in parallel if there is enough capacity on the GPU? Or how could I enable such behaviour?

github-actions[bot] commented 1 year ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

Rocketknight1 commented 1 year ago

Hi @kimborgen, yes, MLP and attention are parallel paths on the newer Falcon models, rather than sequential as in older transformer architectures. You can see this in the code for FalconDecoderLayer: when parallel_attn or new_decoder_architecture is set, the layer norms and the MLP/attention blocks follow separate, parallel paths. On the oldest Falcon models (e.g. falcon-rw-1b) I believe they're still sequential.

Note that you should not change these settings in the config of an existing model! You'll get different outputs and the pretrained weights will be useless to you. They can only be set when the model is first initialized.
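
Schematically, the two layouts look roughly like this (a simplified sketch, not the actual FalconDecoderLayer code; depending on the config, the two parallel branches may share a single layer norm instead of having two):

    # Simplified sketch of the two decoder-layer layouts; not the actual
    # FalconDecoderLayer implementation.
    def sequential_block(x, ln1, ln2, attn, mlp):
        # Older layout (e.g. falcon-rw-1b): attention, then MLP, one after the other.
        x = x + attn(ln1(x))
        x = x + mlp(ln2(x))
        return x

    def parallel_block(x, ln_attn, ln_mlp, attn, mlp):
        # Newer layout (parallel_attn / new_decoder_architecture): the attention
        # and MLP branches are independent of each other in the dataflow and are
        # added to the residual together.
        return x + attn(ln_attn(x)) + mlp(ln_mlp(x))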

Also, since Falcon has now been fully ported into transformers, the original issue here has been resolved and I'm going to close this issue!