This PR removes the weight encoder. The model with the weight encoder trains just about as well as the version without it and reaches a similar loss. However, the version without it is not affected by masked regions either, likely because the masking already enters training through the loss, where those regions have zero weight.
So with the weight encoder we would be using something that is 50% slower for no good reason. Worse, the attention coming from the weights does not actually modulate the signal attention in the expected way: low weights should reduce the attention in those regions, but this is not the case:
We can see that the region with the largest weights, in the center, generates low attention.
This imprints itself on the signal attention, which strongly favors the very red and very blue ends of the spectrum (for all examples I have checked):
If we look only at the signal attention (here for all channels), it does show significant variation in the parts of the spectrum with lines and breaks:
I conclude from this that the weight attention probably confused the signal attention and likely caused us to miss more important features in the middle part of the spectrum.
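
To illustrate why the version without the weight encoder can ignore masked regions anyway, here is a minimal sketch of the zero-weight argument. It assumes a weighted mean-squared-error loss, which may differ from the loss actually used in this repository; `weighted_mse`, `weighted_mse_grad`, and the toy arrays are hypothetical names introduced only for this example.

```python
import numpy as np


def weighted_mse(pred, target, weights):
    """Weighted MSE (assumed loss form): zero-weight pixels contribute nothing."""
    return np.sum(weights * (pred - target) ** 2) / np.sum(weights)


def weighted_mse_grad(pred, target, weights):
    """Analytic gradient of weighted_mse with respect to pred."""
    return 2.0 * weights * (pred - target) / np.sum(weights)


rng = np.random.default_rng(0)
target = rng.normal(size=8)   # toy "spectrum"
pred = rng.normal(size=8)     # toy model output
weights = np.ones(8)
weights[2:5] = 0.0            # masked region: zero weight

# Gradient is exactly zero inside the masked region.
grad = weighted_mse_grad(pred, target, weights)

# Changing the prediction inside the masked region leaves the loss unchanged.
loss_before = weighted_mse(pred, target, weights)
pred_perturbed = pred.copy()
pred_perturbed[2:5] += 100.0
loss_after = weighted_mse(pred_perturbed, target, weights)
```

Under this assumption, no gradient flows from masked pixels, so the model is never penalized or rewarded for its output there, with or without a dedicated weight encoder.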