Closed: KexinFeng closed this issue 1 year ago.
Hi @KexinFeng, there is ongoing work to port Falcon to transformers here: https://github.com/huggingface/transformers/pull/24523. Looking at that PR, I believe your issue will be fixed once it is merged. cc @Rocketknight1 in case I missed something!
Sorry for the delay, and yes! There is an issue with the custom code version of Falcon, which means that past_key_values are frequently not actually used in generation. This results in much lower generation speed (~3X slower for short-to-medium sequences). This issue will be fixed once we add Falcon as a full library model in transformers, and we're hoping to merge that PR extremely soon.
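In the meantime, a rough way to see the cache effect yourself (just a sketch, not a rigorous benchmark; the checkpoint and generation settings below are placeholders):

```python
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# placeholder checkpoint; the custom-code Falcon checkpoints also need trust_remote_code=True
model_id = "tiiuae/falcon-7b"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"  # device_map needs accelerate
)

inputs = tokenizer("The Falcon models are", return_tensors="pt").to(model.device)

# compare greedy generation with and without the KV cache; without a working cache,
# every step recomputes attention over the full sequence, hence the large slowdown
for use_cache in (True, False):
    start = time.perf_counter()
    model.generate(**inputs, max_new_tokens=128, do_sample=False, use_cache=use_cache)
    print(f"use_cache={use_cache}: {time.perf_counter() - start:.1f}s")
```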
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.
This is imminent, by the by, and sorry for the delay! Should be in within the next day or two.
If anyone can't wait a few days, you can use the model here: https://github.com/kimborgen/falcon-llm
@Rocketknight1 Do you know if the transformers library takes advantage of the parallel MLP/attention layer architecture and automatically computes these two layers in parallel if there is enough capacity on the GPU? Or how could I enable such behaviour?
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.
Hi @kimborgen, yes, MLP and attention are parallel paths in the newer Falcon models, rather than sequential as in older transformer models. You can see this in the code for FalconDecoderLayer: when parallel_attn or new_decoder_architecture is set, the layer norms and the MLP/attention follow separate, parallel paths. On the oldest Falcon models (e.g. falcon-rw-1b) I believe they're still sequential.
Note that you should not change these settings in the config of an existing model! You'll get different outputs and the pretrained weights will be useless to you. They can only be set when the model is first initialized.
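In case it helps, here is a minimal sketch (not the actual FalconDecoderLayer code) of how the two layouts differ, plus how to check which one a checkpoint uses:

```python
# Simplified stand-ins for the real submodules; only the wiring is the point here.

def sequential_block(x, ln_attn, attention, ln_mlp, mlp):
    # older layout (e.g. falcon-rw-1b): the MLP runs on top of the attention output
    x = x + attention(ln_attn(x))
    x = x + mlp(ln_mlp(x))
    return x

def parallel_block(x, ln_attn, attention, ln_mlp, mlp):
    # parallel_attn / new_decoder_architecture layout: attention and MLP both read the
    # same residual stream, so they are mathematically independent; whether they actually
    # overlap on the GPU is down to kernel scheduling, not a transformers setting
    return x + attention(ln_attn(x)) + mlp(ln_mlp(x))

# checking which layout a checkpoint uses (attribute names from the ported FalconConfig)
from transformers import AutoConfig

cfg = AutoConfig.from_pretrained("tiiuae/falcon-7b")  # example checkpoint
print(cfg.parallel_attn, cfg.new_decoder_architecture)
```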
Also, since Falcon has now been fully ported into transformers, the original issue here has been resolved and I'm going to close this issue!
System Info
transformers version: 4.30.2

Who can help?
@ArthurZucker and @younesbelkada

Information
Tasks
examples folder (such as GLUE/SQuAD, ...)

Reproduction
Expected behavior
In transformers/generation/utils.py#L2329,
model_inputs = self.prepare_inputs_for_generation(input_ids, **model_kwargs)
RWForCausalLM.prepare_inputs_for_generation() always returns None for past_key_values, so the result doesn't seem to utilize the kv_cache at all. On the other hand, RWForCausalLM.prepare_inputs_for_generation() does contain tensor shape conversion code. Is it intentional by design that past_key_values is always None? Also, the output text is weird:
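For reference, here is a rough sketch of what prepare_inputs_for_generation would need to return for the cache to be used (simplified; not the actual RW custom code):

```python
def prepare_inputs_for_generation(self, input_ids, past_key_values=None, **kwargs):
    # if a cache is passed back in, only the most recent token needs to be re-encoded;
    # returning past_key_values=None instead forces a full forward pass on every step
    if past_key_values is not None:
        input_ids = input_ids[:, -1:]
    return {
        "input_ids": input_ids,
        "past_key_values": past_key_values,
        "use_cache": kwargs.get("use_cache"),
        "attention_mask": kwargs.get("attention_mask"),
    }
```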