tsurumeso / vocal-remover

Vocal Remover using Deep Neural Networks
MIT License
1.54k stars 221 forks

Jetson nano implementation #69

Closed nkcdy closed 3 years ago

nkcdy commented 3 years ago

I tried to run inference on the Jetson Nano platform. For a 28-second music clip, it took 35 seconds to separate the song into vocals and instrumental. The network seems too big for real-time processing on a lightweight platform. Is there any effort underway to reduce the network size?

aufr33 commented 3 years ago

Try increasing the window size to 768. You will not be able to reduce the size of the model; this will require retraining.
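A note on why a larger window can help: inference processes the spectrogram in fixed-size windows of frames (the repo's `--window_size` option), and a fixed offset (128 frames in the model code shown later in this thread) is trimmed from each edge of every window, so a larger window means fewer forward passes per clip, at the cost of more memory per pass. A minimal sketch of that trade-off, assuming this sliding-window scheme (`split_into_windows` is a hypothetical helper, not the repo's actual function):

```python
import numpy as np

def split_into_windows(spec, window_size, offset=128):
    """Count the forward passes needed for chunked inference.

    spec: spectrogram of shape (channels, bins, frames).
    Each pass yields window_size - 2 * offset usable frames,
    because `offset` frames are trimmed from each edge.
    """
    n_frames = spec.shape[2]
    roi = window_size - 2 * offset
    return int(np.ceil(n_frames / roi))

spec = np.zeros((2, 1024, 2048))          # illustrative clip-sized spectrogram
print(split_into_windows(spec, 512))      # more, smaller passes
print(split_into_windows(spec, 768))      # fewer, larger (more memory-hungry) passes
```

This also explains the out-of-memory kill reported below: each pass at 768 frames needs roughly 1.5x the activation memory of a 512-frame pass.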

nkcdy commented 3 years ago

> Try increasing the window size to 768. You will not be able to reduce the size of the model; this will require retraining.

The inference process is killed by the Linux kernel after the window_size is changed from 512 to 784... So either the capability of the Nano is limited or the network is too big. I can accept retraining the model; could you offer any advice?

aufr33 commented 3 years ago

Not 784, but 768 (512+256).

To train the model yourself, you need a GPU with 11 GB VRAM. In addition, it will require code changes.

nkcdy commented 3 years ago

> Not 784, but 768 (512+256). To train the model yourself, you need a GPU with 11 GB VRAM. In addition, it will require code changes.

My fault, I'll try 768 later. I have a Tesla V100 GPU with 16 GB of VRAM on hand, which should be enough for retraining. The question is how to change the code. :D

nkcdy commented 3 years ago

It would be very useful if the network could be implemented on a microcontroller such as an Arm Cortex-M4 for real-time applications.

aufr33 commented 3 years ago

Unfortunately, I cannot help you with the code. The only way I know is to decrease the sample rate (in inference and training).
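To make the sample-rate suggestion concrete: with a fixed FFT size and hop length, halving the sample rate roughly halves the number of STFT frames the network has to process for the same clip duration. A back-of-the-envelope sketch (the exact frame count depends on padding, which this ignores; `spectrogram_shape` is a hypothetical helper):

```python
def spectrogram_shape(duration_s, sr, n_fft=2048, hop_length=1024):
    """Approximate (bins, frames) of an STFT spectrogram, ignoring padding."""
    n_samples = int(duration_s * sr)
    bins = n_fft // 2 + 1
    frames = n_samples // hop_length + 1
    return bins, frames

print(spectrogram_shape(28, 44100))   # full rate
print(spectrogram_shape(28, 22050))   # half rate: about half the frames
```

The sample rate must be changed consistently in both training and inference, as noted above, or the model will see spectrograms with a different frequency layout than it was trained on.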

tsurumeso commented 3 years ago

Sorry for the late reply. The easiest way to reduce the network size is to reduce the channel size. The code below is a sample with the channel size halved. Please try replacing the original code (https://github.com/tsurumeso/vocal-remover/blob/master/lib/nets.py#L40-L60) with this (untested):

class CascadedASPPNet(nn.Module):

    def __init__(self, n_fft):
        super(CascadedASPPNet, self).__init__()
        # All channel widths below are half the original values
        self.stg1_low_band_net = BaseASPPNet(2, 8)
        self.stg1_high_band_net = BaseASPPNet(2, 8)

        self.stg2_bridge = layers.Conv2DBNActiv(10, 4, 1, 1, 0)
        self.stg2_full_band_net = BaseASPPNet(4, 8)

        self.stg3_bridge = layers.Conv2DBNActiv(18, 8, 1, 1, 0)
        self.stg3_full_band_net = BaseASPPNet(8, 16)

        self.out = nn.Conv2d(16, 2, 1, bias=False)
        self.aux1_out = nn.Conv2d(8, 2, 1, bias=False)
        self.aux2_out = nn.Conv2d(8, 2, 1, bias=False)

        self.max_bin = n_fft // 2
        self.output_bin = n_fft // 2 + 1

        self.offset = 128
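Why halving channels helps so much: a conv layer's weight count scales with in_channels x out_channels, so halving both cuts each layer's parameters (and multiply-accumulates) roughly fourfold. A quick check, using illustrative channel widths rather than the network's actual layer list:

```python
def conv_params(in_ch, out_ch, k=3, bias=False):
    """Parameter count of a k x k Conv2d layer."""
    return in_ch * out_ch * k * k + (out_ch if bias else 0)

full = conv_params(16, 32)   # example original widths
half = conv_params(8, 16)    # both widths halved, as in the snippet above
print(full, half)            # the halved layer is ~4x smaller
```

Note that a model resized this way cannot load the released pretrained weights; it must be retrained from scratch, which is why retraining came up earlier in the thread.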

nkcdy commented 3 years ago

> Sorry for the late reply. The easiest way to reduce the network size is to reduce the channel size. [...]

Thanks for your reply. But the first step should be to reproduce your results with the original network and the original dataset. May I ask what dataset you used for your training?

nkcdy commented 3 years ago

I trained the model with my own dataset (400 pairs) and got very good results after 50 epochs. The loss is still decreasing, and the total is 300 epochs, so I'll keep training to see what the final loss will be.

nkcdy commented 3 years ago

BTW, it's really an excessive memory consumer: it takes nearly 50 GB of RAM to do the training. Why not do some preprocessing and load the "*.npy" files from disk instead of reading them all into RAM?
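One common way to get this behavior with numpy, without changing the on-disk format: memory-map the saved arrays instead of loading them eagerly, so only the slices a training step actually reads get paged in. A minimal sketch (the file name and shapes here are made up, not the repo's actual cache layout):

```python
import os
import tempfile

import numpy as np

# Pretend this is a preprocessed spectrogram cached to disk.
tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, "track_0000.npy")
np.save(path, np.random.rand(2, 1025, 512).astype(np.float32))

# mmap_mode="r" opens the file lazily instead of reading it all into RAM.
spec = np.load(path, mmap_mode="r")
# Copy out only the crop needed for one training example.
window = np.array(spec[:, :, 128:384])
print(spec.shape, window.shape)
```

A PyTorch `Dataset` whose `__getitem__` does the `np.load(..., mmap_mode="r")` and crop would keep peak RAM proportional to the batch, not the whole dataset.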

nkcdy commented 3 years ago

I retrained the size-reduced network with the method @tsurumeso mentioned. On the Jetson Nano platform, the same 28-second clip now takes around 20 seconds to separate, with no significant quality degradation. So the reduced model runs slightly faster than real time.

One thing to note: on my Linux server, the same clip takes only 15 seconds to separate, even on CPU (not GPU). So I guess the bottleneck may not be the parallel computation but the waveform preprocessing.

Anyway, thanks for the great work. @tsurumeso