stanfordnlp / pyvene

Stanford NLP Python Library for Understanding and Improving PyTorch Models via Interventions
http://pyvene.ai
Apache License 2.0
559 stars 50 forks

[P2] Add a new huggingface collator for working with Pyvene models #112

Open PinetreePantry opened 5 months ago

PinetreePantry commented 5 months ago

Suggestion / Feature Request

pyvene is a library built around interchange interventions. It frequently needs to process datasets that contain two sets of input_ids and (possibly) two sets of labels. When we train these models on batched datasets, a collator problem arises: no existing collator pads both sets of input_ids, which may have different lengths, at the same time.

The Hugging Face transformers collators only pad the standard "input_ids" entries in the dataset.

Beyond that, DataCollatorForSeq2Seq additionally pads only "labels".

As a result, dataset entries such as "source_input_ids" are left unpadded, which breaks batching whenever those sequences differ in length.

Adding a collator utility that supports this would make it easier to train pyvene models on batched data.
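
To make the request concrete, here is a rough sketch of what such a collator could look like. It is not an existing pyvene or transformers API: the function names `pad_to_longest` and `collate_with_sources`, the "source_labels" key, the derived "source_attention_mask" key, and the default pad id of 0 are all assumptions for illustration. It pads on the right and returns plain Python lists, which could then be converted to tensors. The label pad value of -100 follows the usual Hugging Face convention for positions ignored by the loss.

```python
from typing import Dict, List, Sequence


def pad_to_longest(seqs: Sequence[Sequence[int]], pad_value: int) -> List[List[int]]:
    """Right-pad every sequence to the length of the longest one."""
    max_len = max(len(s) for s in seqs)
    return [list(s) + [pad_value] * (max_len - len(s)) for s in seqs]


def collate_with_sources(
    features: List[Dict[str, List[int]]],
    pad_token_id: int = 0,      # assumption: use the tokenizer's real pad id
    label_pad_id: int = -100,   # conventional "ignore" value for labels
    id_keys: Sequence[str] = ("input_ids", "source_input_ids"),
    label_keys: Sequence[str] = ("labels", "source_labels"),  # hypothetical keys
) -> Dict[str, List[List[int]]]:
    """Pad each group of id and label fields independently of the others."""
    batch: Dict[str, List[List[int]]] = {}
    for key in id_keys:
        if key not in features[0]:
            continue
        seqs = [f[key] for f in features]
        batch[key] = pad_to_longest(seqs, pad_token_id)
        # Build a matching mask: 1 for real tokens, 0 for padding.
        mask_key = key.replace("input_ids", "attention_mask")
        batch[mask_key] = pad_to_longest([[1] * len(s) for s in seqs], 0)
    for key in label_keys:
        if key in features[0]:
            batch[key] = pad_to_longest([f[key] for f in features], label_pad_id)
    return batch


# Example: base and source inputs have different lengths within the batch.
features = [
    {"input_ids": [5, 6, 7], "source_input_ids": [8, 9], "labels": [5, 6, 7]},
    {"input_ids": [5], "source_input_ids": [8, 9, 10, 11], "labels": [5]},
]
batch = collate_with_sources(features)
```

Each field is padded to its own longest sequence, so "input_ids" and "source_input_ids" can end up with different widths in the same batch.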