synapse-alpha / mirror-neuron

Experiments on bittensor reward models to find exploits
BSD 2-Clause "Simplified" License

implement sentence embeddings in analysis #39

Closed steffencruz closed 1 year ago

steffencruz commented 1 year ago

In analysis.py there is a placeholder function that should take the results dataframe and create embeddings for each sentence. Let's get this working and demoed with a new example config file. Brownie points if we can make a plot for show and tell, like a UMAP scatter plot of the sentence embeddings colored by model score or something like that.
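For discussion, here is a minimal sketch of what the placeholder could look like. Everything named here is an assumption: the function name `add_sentence_embeddings`, the text column `completion`, and the `encoder` argument are hypothetical and not taken from analysis.py. The default encoder is a hashed bag-of-words stand-in so the snippet runs without downloading a model. It is not a real sentence embedding, and a proper sentence encoder would be passed in instead.

```python
import hashlib

import numpy as np
import pandas as pd


def hashed_bow_encoder(sentences, dim=64):
    """Stand-in encoder (assumption): hashed bag-of-words, L2-normalised.

    This only exists so the sketch is runnable; it is NOT a semantic
    sentence embedding model.
    """
    out = np.zeros((len(sentences), dim), dtype=np.float32)
    for i, sentence in enumerate(sentences):
        for token in sentence.lower().split():
            # md5 gives a hash that is stable across Python processes
            digest = hashlib.md5(token.encode("utf-8")).hexdigest()
            out[i, int(digest, 16) % dim] += 1.0
    norms = np.linalg.norm(out, axis=1, keepdims=True)
    return out / np.maximum(norms, 1e-12)


def add_sentence_embeddings(df, text_col="completion", encoder=hashed_bow_encoder):
    """Return a copy of df with one embedding vector per row in 'embedding'.

    `text_col` is a hypothetical column name; `encoder` is any callable
    mapping a list of strings to an array of shape (n_sentences, dim).
    """
    sentences = df[text_col].astype(str).tolist()
    embeddings = np.asarray(encoder(sentences))
    if embeddings.shape[0] != len(df):
        raise ValueError("encoder must return one vector per sentence")
    out = df.copy()
    # Store each row vector as a single object-valued cell
    out["embedding"] = pd.Series(list(embeddings), index=out.index)
    return out
```

The encoder is injected so the same function could be driven by whichever model the new config example names, and the resulting vectors could then be stacked with `np.stack(df["embedding"])` and handed to a dimensionality reduction step for the scatter plot.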

steffencruz commented 1 year ago

Further explanation of method

It may be of interest to find exploits in the reward model by looking for inputs that reliably produce higher rewards. Such inputs would be an attack vector and should be avoided. The plots below give a very simple example of how the sentence embedding space (after dimensionality reduction) could be used to help identify regions of the semantic space that are vulnerable to attack. With this knowledge it is, in principle, possible to generate sentences with the same or similar embeddings and thereby hijack the reward model.

[Figures, images not preserved: newplot (3), newplot (4)]
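The idea of looking for regions of the reduced embedding space with reliably high reward can be illustrated with a crude grid-binning sketch. This is only an illustration under stated assumptions, not the method used to produce the plots: the function name `high_reward_regions` and the parameters `cell_size`, `min_count`, and `threshold` are hypothetical, and the input is assumed to be 2D points (e.g. the output of a dimensionality reduction) paired with reward scores.

```python
from collections import defaultdict
from statistics import mean


def high_reward_regions(points, rewards, cell_size=1.0, min_count=3, threshold=0.8):
    """Flag grid cells of a 2D embedding whose mean reward is high.

    points:  iterable of (x, y) coordinates in the reduced space
    rewards: reward model score for each point
    Returns {cell: mean_reward}, sorted from highest to lowest mean reward,
    keeping only cells with at least `min_count` samples and a mean reward
    of at least `threshold`.
    """
    if len(points) != len(rewards):
        raise ValueError("points and rewards must have the same length")

    # Group rewards by the square grid cell each point falls into
    cells = defaultdict(list)
    for (x, y), reward in zip(points, rewards):
        key = (int(x // cell_size), int(y // cell_size))
        cells[key].append(reward)

    flagged = {
        key: mean(values)
        for key, values in cells.items()
        if len(values) >= min_count and mean(values) >= threshold
    }
    return dict(sorted(flagged.items(), key=lambda item: item[1], reverse=True))
```

Requiring a minimum number of samples per cell is a simple guard against flagging a region on the strength of a single lucky completion; a cell that survives both filters would be a candidate region to probe with more sentences.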