Confused about dataset creation

vgel / repeng

A library for making RepE control vectors

https://vgel.me/posts/representation-engineering/

MIT License

435 stars, 31 forks

Confused about dataset creation #26

Closed Hellisotherpeople closed 4 months ago

Hellisotherpeople commented 4 months ago

Why is it that the current dataset code creates a larger dataset out of a small dataset by basically creating a new example for each token in the small dataset?

Couldn't I take a larger dataset and simply create the control vector from just the sentence or examples in my document?

The make dataset code seems overly and needlessly complicated unless there is some motivation for why it's being done the way you are doing it right now.

Also, this package is AWESOME! Thank you so much for making it!

vgel commented 4 months ago

Yes, you should be able to do that (assuming you have appropriate negative examples). The notebooks are doing a kind of data augmentation to expand the tiny dataset (the 1-3 persona pairs) into a few thousand examples to get enough data to train a vector on. If you have a large dataset already, then that data augmentation isn't needed—you can just train the vectors on the paired examples directly.
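To make the difference concrete, here is a rough sketch of the two approaches in plain Python. It is illustrative only and is not the notebook code: `PairedExample`, `truncated_prefixes`, `augment_small_dataset`, `pairs_from_large_dataset`, and the "Pretend you are ..." template are all hypothetical names, and whitespace splitting stands in for a real tokenizer.

```python
from dataclasses import dataclass


@dataclass
class PairedExample:
    # Hypothetical stand-in for one positive/negative training pair.
    positive: str
    negative: str


def truncated_prefixes(text):
    # Assumption: whitespace splitting stands in for a real tokenizer.
    # Returns every prefix of the text, one per token.
    tokens = text.split()
    return [" ".join(tokens[:i]) for i in range(1, len(tokens) + 1)]


def augment_small_dataset(persona_pairs, seed_texts):
    # Data augmentation: every token-level prefix of every seed text is
    # combined with every (positive, negative) persona pair, so a handful
    # of pairs expands into many paired examples.
    examples = []
    for text in seed_texts:
        for prefix in truncated_prefixes(text):
            for positive_persona, negative_persona in persona_pairs:
                examples.append(
                    PairedExample(
                        positive=f"Pretend you are {positive_persona}. {prefix}",
                        negative=f"Pretend you are {negative_persona}. {prefix}",
                    )
                )
    return examples


def pairs_from_large_dataset(positive_texts, negative_texts):
    # No augmentation: each positive text is paired directly with the
    # negative text at the same position.
    if len(positive_texts) != len(negative_texts):
        raise ValueError("need one negative example per positive example")
    return [PairedExample(p, n) for p, n in zip(positive_texts, negative_texts)]


augmented = augment_small_dataset(
    persona_pairs=[("happy", "sad")],
    seed_texts=["I love sunny days", "What a week"],
)
direct = pairs_from_large_dataset(
    positive_texts=["Best day ever!", "This is wonderful."],
    negative_texts=["Worst day ever.", "This is miserable."],
)
```

Here one persona pair and two short seed texts give 4 + 3 = 7 augmented examples, while the large-dataset path yields exactly one example per supplied pair. Either list could then be converted into whatever paired-example format the training call expects.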

Hellisotherpeople commented 4 months ago

Sounds good! Thank you for the quick and clear response! Yeah, the dataset augmentation strategy makes sense as a way to get as much as possible out of a small dataset.

I will work to either integrate this into llm_steer-oobabooga or create a separate "oobabooga" extension for this technique, since this comment makes it far more straightforward to let users bring their own dataset.