steffencruz closed this issue 1 year ago
It may be of interest to look for exploits in the reward model by searching for inputs that reliably produce higher rewards. This is an attack vector and should be mitigated. The plots below give a simple example of how the sentence embedding space (after dimensionality reduction) could help identify regions of semantic space that are vulnerable to attack. With this knowledge it is in principle possible to generate sentences with the same or similar embeddings and thereby hijack the reward model.
In `analysis.py` there is a placeholder function that should take the results dataframe and create embeddings for each sentence. Let's get this working and demoed with a new example config file. Brownie points if we can make a plot for show and tell, like a UMAP scatter of the sentence embeddings colored by model score or something similar.