🧹 This is a tracker regarding the move of prepare_inputs_for_generation into the generation mixin 🧹
Why?
prepare_inputs_for_generation is not part of the core modeling, but rather a utility for generate
it should greatly reduce the need to touch modeling code when generate changes. Fewer modeling changes -> improved model stability
it also greatly reduces the number of lines of code 🙏
Tracker
Kinda ordered list of tasks:
[x] 1. Fix related slow tests before we start — all llama, generate, and cache_utils [except sink cache, broken atm] slow tests should be passing to ensure we don’t break anything (#33138)
[x] 2. PreTrainedModel doesn't inherit from GenerationMixin, so that can_generate() becomes independent of prepare_inputs_for_generation being overridden or not (#33203) (see the can_generate() sketch after this list)
[x] 3. Move llama’s prepare_inputs_for_generation to the generation mixin. This implies moving one function that prepares the 4D mask too (the one that is called there) (#33677)
[x] 4. Add tests for prepare_inputs_for_generation — currently we don’t test it directly, and we should (decoder-only llms: https://github.com/huggingface/transformers/pull/33870, encoder-decoder llms: https://github.com/huggingface/transformers/pull/34048)
[x] 5. Address the case of synced_gpus in generate: when synced_gpus is set and cache_position is out of bounds, take the latest available input_ids for the dummy computations (https://github.com/huggingface/transformers/pull/34095) (see the clamping sketch after this list)
[x] 6. Remove prepare_inputs_for_generation from as many models as possible. There may be merge conflicts here, due to the 4D mask function. Try to iron out as many trivial cases as possible (decoder-only llms: https://github.com/huggingface/transformers/pull/33870, encoder-decoder llms: https://github.com/huggingface/transformers/pull/34048)
[x] 7. Change prepare_inputs_for_generation to forward **kwargs from its input to its output. With minimal changes, this should enable most VLMs to use the shared function -- they forward pixel_values from the input to the output (support for **kwargs: https://github.com/huggingface/transformers/pull/33870) (see the kwargs-forwarding sketch after this list)
[x] 8. By this point most cases of prepare_inputs_for_generation should have been removed 🤗 We would need to check the remaining ones individually; there may be further simplification patterns available!
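
For step 2, a minimal sketch of the idea, not the actual transformers implementation: once a model opts into generation by explicitly inheriting from GenerationMixin, can_generate() can key on that inheritance instead of checking whether prepare_inputs_for_generation was overridden. The ToyModel class below is a hypothetical stand-in for a PreTrainedModel subclass.

```python
from transformers import GenerationMixin


class ToyModel(GenerationMixin):  # hypothetical stand-in for a PreTrainedModel subclass
    @classmethod
    def can_generate(cls) -> bool:
        # The model advertises generation support through explicit inheritance,
        # independently of whether it overrides prepare_inputs_for_generation.
        return issubclass(cls, GenerationMixin)


assert ToyModel.can_generate()
```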
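For step 5, a minimal sketch of the clamping idea (the helper name and signature are hypothetical; the real fix lives inside generate and the shared prepare_inputs_for_generation): under synced_gpus, a GPU that has finished generating keeps running dummy forward passes so collective ops stay in sync, and its cache_position can then point past the tokens it actually holds. Clamping to the last available position lets it reuse the latest input_ids for the dummy computation.

```python
import torch


def clamp_inputs_for_dummy_forward(input_ids: torch.LongTensor, cache_position: torch.LongTensor):
    """Hypothetical helper (illustration only): keep dummy forwards valid under synced_gpus."""
    # If cache_position ran past the end of input_ids, fall back to the latest
    # available token positions so the dummy computation still indexes valid data.
    last_valid = input_ids.shape[1] - 1
    cache_position = cache_position.clamp(max=last_valid)
    return input_ids[:, cache_position], cache_position


# Example: one sequence of length 5, but cache_position already points to step 7
ids = torch.arange(5).unsqueeze(0)   # shape (1, 5)
pos = torch.tensor([7])              # out of bounds
dummy_ids, pos = clamp_inputs_for_dummy_forward(ids, pos)
print(dummy_ids, pos)                # tensor([[4]]) tensor([4])
```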
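For step 7, a minimal sketch of the kwargs-forwarding idea, with an assumed simplified signature and no attention-mask or 4D-mask handling (not the actual shared implementation): any model-specific argument the function does not handle itself, such as pixel_values for VLMs, is passed through from input to output.

```python
def prepare_inputs_for_generation(self, input_ids, past_key_values=None,
                                  attention_mask=None, cache_position=None, **kwargs):
    # With a cache, only feed the tokens that have not been processed yet
    if past_key_values is not None and cache_position is not None:
        input_ids = input_ids[:, cache_position]

    model_inputs = {
        "input_ids": input_ids,
        "past_key_values": past_key_values,
        "attention_mask": attention_mask,
        "cache_position": cache_position,
    }
    # Forward everything else (e.g. `pixel_values` for VLMs) from input to output,
    # so models with extra inputs can reuse the shared function unchanged.
    for key, value in kwargs.items():
        model_inputs.setdefault(key, value)
    return model_inputs
```

Forwarding unknown kwargs instead of enumerating them per model is what should let most VLMs drop their own override with minimal changes.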