tsurumeso / vocal-remover

Vocal Remover using Deep Neural Networks
MIT License

Multi-resolution #97

Open aufr33 opened 2 years ago

aufr33 commented 2 years ago

I have a suggestion for improving the separation quality. You probably already know about my implementation of multiband spectrograms, but I would be glad if you implemented it at the neural network level.

I am not well versed in neural network architecture, so I decided not to touch the network code and instead combined several spectrograms into one. Such spectrograms take up less memory and have better time-frequency resolution than single-band spectrograms. However, it could work even better if each band were connected to a separate network:

b1_1 = self.stg1_1st_band_net(x['band1'])
b2_1 = self.stg1_2nd_band_net(x['band2'])
b3_1 = self.stg1_3rd_band_net(x['band3'])
b4_1 = self.stg1_4th_band_net(x['band4'])

b1_2_in = torch.cat([x['band1'], b1_1], dim=1)
...

Here x is a dictionary containing several spectrograms at different resolutions. For example:

1st band: sr=7350, n_fft=640, hop_length=80
2nd band: sr=7350, n_fft=320, hop_length=80
3rd band: sr=14700, n_fft=512, hop_length=160
4th band: sr=44100, n_fft=960, hop_length=480
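
One property of these settings is worth spelling out: although the sample rates and hop lengths differ, the ratio sr / hop_length is the same for every band, so all four spectrograms have the same number of frames per second and can share a time axis. The sketch below only checks that arithmetic. The dictionary keys follow the `x['band1']` naming from the snippet above, the helper names are made up for illustration, and the window length is assumed to equal n_fft (a common STFT default, not something stated here).

```python
from fractions import Fraction

# Band settings copied from the list above: (sample rate in Hz, n_fft, hop_length).
BANDS = {
    "band1": (7350, 640, 80),
    "band2": (7350, 320, 80),
    "band3": (14700, 512, 160),
    "band4": (44100, 960, 480),
}


def frame_rate(sr, hop_length):
    """STFT frames per second of audio, kept as an exact fraction."""
    return Fraction(sr, hop_length)


def window_ms(sr, n_fft):
    """Duration of one analysis window in milliseconds (assumes win_length == n_fft)."""
    return 1000 * n_fft / sr


frame_rates = {name: frame_rate(sr, hop) for name, (sr, _, hop) in BANDS.items()}
window_lengths = {name: window_ms(sr, n_fft) for name, (sr, n_fft, _) in BANDS.items()}

for name in BANDS:
    print(f"{name}: {float(frame_rates[name]):.3f} frames/s, "
          f"window {window_lengths[name]:.1f} ms")
```

Every band comes out at 91.875 frames per second, while the analysis window shrinks from roughly 87 ms in the 1st band to roughly 22 ms in the 4th.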

Unused frequencies are cut out. The difficulty in implementation is that each band will contain a different number of bins.
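
To make the bin-count mismatch concrete, here is a minimal sketch that computes how many frequency bins a one-sided STFT produces for each band (n_fft // 2 + 1) and how wide each bin is in Hz. These are the counts before any cropping. The crop ranges are not given in this post, so the real per-band sizes would be smaller, and the helper names are hypothetical.

```python
# Band settings from the example above: (sample rate in Hz, n_fft, hop_length).
BANDS = {
    "band1": (7350, 640, 80),
    "band2": (7350, 320, 80),
    "band3": (14700, 512, 160),
    "band4": (44100, 960, 480),
}


def stft_bins(n_fft):
    """Number of frequency bins in a one-sided STFT of a real signal."""
    return n_fft // 2 + 1


def bin_width_hz(sr, n_fft):
    """Spacing between adjacent frequency bins in Hz."""
    return sr / n_fft


# Bin counts before unused frequencies are cut out of each band.
bin_counts = {name: stft_bins(n_fft) for name, (_, n_fft, _) in BANDS.items()}
bin_widths = {name: bin_width_hz(sr, n_fft) for name, (sr, n_fft, _) in BANDS.items()}

for name in BANDS:
    print(f"{name}: {bin_counts[name]} bins, {bin_widths[name]:.2f} Hz per bin")
```

Because the four heights differ, the bands cannot simply be stacked into one batch tensor along a new dimension. Each per-band network would have to accept its own input height, or the outputs would need to be brought to a common shape before being merged.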