Suggestion / Feature Request
Pyvene is a library built around interchange interventions. Its datasets frequently contain two sets of input_ids and, optionally, two sets of labels. When we train with batched datasets in pyvene, a collator problem arises: no existing collator can pad both sets of input_ids at the same time when their lengths differ.
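The padding collators in Hugging Face Transformers pad only the "input_ids" entries (and companion fields such as "attention_mask").
DataCollatorForSeq2Seq additionally pads "labels", but nothing else.
As a result, dataset entries such as "source_input_ids" are left unpadded, so examples of different lengths cannot be stacked into a batch.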
Adding a utility that supports this would make batched training with pyvene easier in general.
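To make the request concrete, here is a rough sketch of what such a utility could look like. It is not an existing pyvene or Transformers API: the function name `pad_multi_key_batch`, the `pad_values` mapping, and the pad token id of 0 are all assumptions made for illustration. It works on plain Python lists, so converting to tensors is left to the caller.

```python
from typing import Any, Dict, List


def pad_multi_key_batch(
    features: List[Dict[str, Any]],
    pad_values: Dict[str, int],
    padding_side: str = "right",
) -> Dict[str, List[Any]]:
    """Hypothetical helper: pad every key listed in `pad_values` to the
    longest sequence found for that key in the batch. Keys that are not
    listed are collected unchanged."""
    if padding_side not in ("left", "right"):
        raise ValueError("padding_side must be 'left' or 'right'")

    batch: Dict[str, List[Any]] = {}
    for key in features[0]:
        column = [list(f[key]) if key in pad_values else f[key] for f in features]
        if key in pad_values:
            # Each key gets its own maximum length, so "input_ids" and
            # "source_input_ids" may end up with different widths.
            max_len = max(len(seq) for seq in column)
            padded = []
            for seq in column:
                filler = [pad_values[key]] * (max_len - len(seq))
                padded.append(seq + filler if padding_side == "right" else filler + seq)
            column = padded
        batch[key] = column
    return batch


# Example with two sets of input_ids of different lengths.
features = [
    {"input_ids": [5, 6, 7], "source_input_ids": [8, 9], "labels": [5, 6, 7]},
    {"input_ids": [5], "source_input_ids": [8, 9, 10, 11], "labels": [5]},
]
pad_values = {
    "input_ids": 0,         # assumed pad token id
    "source_input_ids": 0,  # assumed pad token id
    "labels": -100,         # ignore index commonly used for loss masking
}
batch = pad_multi_key_batch(features, pad_values)
```

Attention masks for each set (for example a hypothetical "source_attention_mask") could be padded the same way with a pad value of 0. A function like this could be passed as `collate_fn` to a data loader after wrapping it to convert the lists to tensors.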