steffencruz closed this issue 1 year ago
It may be of interest to look for exploits in the reward model by searching for inputs that reliably produce higher rewards. This is an attack vector and should be mitigated. The plots below give a simple example of how the sentence embedding space (after dimensionality reduction) could help identify regions of semantic space that are vulnerable to attack. With this knowledge it is in principle possible to generate sentences with the same or similar embeddings and thereby hijack the reward model.
In `analysis.py` there is a placeholder function that should take the results dataframe and create embeddings for each sentence. Let's get this working and demoed with a new example config file. Brownie points if we can make a plot for show and tell, like a UMAP scatter of the sentence embeddings colored by model score or something similar.