madebyollin / acapellabot

Acapella Extraction with a ConvNet
http://madebyoll.in/posts/cnn_acapella_extraction/

network construction question #7

Open kgoodboy opened 6 years ago

kgoodboy commented 6 years ago

Hi,

I came across your post below: http://madebyoll.in/posts/cnn_acapella_extraction/

I am wondering how you came up with the neural network below:

mashup = Input(shape=(None, None, 1), name='input')
convA = Conv2D(64, 3, activation='relu', padding='same')(mashup)
conv = Conv2D(64, 4, strides=2, activation='relu', padding='same', use_bias=False)(convA)
conv = BatchNormalization()(conv)

convB = Conv2D(64, 3, activation='relu', padding='same')(conv)
conv = Conv2D(64, 4, strides=2, activation='relu', padding='same', use_bias=False)(convB)
conv = BatchNormalization()(conv)

conv = Conv2D(128, 3, activation='relu', padding='same')(conv)
conv = Conv2D(128, 3, activation='relu', padding='same', use_bias=False)(conv)
conv = BatchNormalization()(conv)
conv = UpSampling2D((2, 2))(conv)

conv = Concatenate()([conv, convB])
conv = Conv2D(64, 3, activation='relu', padding='same')(conv)
conv = Conv2D(64, 3, activation='relu', padding='same', use_bias=False)(conv)
conv = BatchNormalization()(conv)
conv = UpSampling2D((2, 2))(conv)

conv = Concatenate()([conv, convA])
conv = Conv2D(64, 3, activation='relu', padding='same')(conv)
conv = Conv2D(64, 3, activation='relu', padding='same')(conv)
conv = Conv2D(32, 3, activation='relu', padding='same')(conv)
conv = Conv2D(1, 3, activation='relu', padding='same')(conv)
acapella = conv

Is there any reference starting point or is there any reasoning behind this?

Can you leave me an email address, so we can discuss more easily? Mine is zstarstu@gmail.com

kgoodboy commented 6 years ago

Also, the input shape is set to (None, None, 1). Is there a more detailed explanation for this?

As far as I know, the default librosa stft output has 1025 as its first dimension, and the second dimension is the frame index along the time axis. For the third dimension you have 1, which I assume is the spectrogram amplitude? But then wouldn't it lose the phase of the stft? It seems the conv network does not take complex numbers.

madebyollin commented 6 years ago

The neural network architecture was roughly based off of pix2pix https://arxiv.org/pdf/1611.07004.pdf, although it's fairly generic. There are probably better architectural choices–I didn't do a thorough hyperparameter sweep!

The input shape is (None, None, 1) since the width and height of inputs are allowed to vary. The phase is discarded and reconstructed later–this results in some artifacts, but for human voices it's not too bad. You could try modifying the network to input/output phase information as well (so, 2 input channels and 2 output channels)–I haven't tried this, but it might improve the quality a bit.
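One practical consequence of the variable input shape, sketched here as a rough illustration rather than a description of the project's actual preprocessing: the quoted network downsamples twice with stride-2 convolutions (`padding='same'`) and then concatenates the upsampled features with `convB` and `convA`, so each spatial dimension has to be a multiple of 4 for the shapes to line up. The helper names below are hypothetical, and padding up to the next multiple of 4 is only one possible fix (cropping would also work).

```python
import math


def same_pad_stride2(n):
    """Output length of a stride-2 convolution with padding='same' (ceil(n / 2))."""
    return math.ceil(n / 2)


def skip_shapes_match(n):
    """Check whether both Concatenate() calls see equal sizes along an axis of length n."""
    a = n                     # convA: full resolution
    b = same_pad_stride2(a)   # convB: half resolution
    c = same_pad_stride2(b)   # bottleneck: quarter resolution
    up1 = 2 * c               # size after the first UpSampling2D((2, 2))
    up2 = 2 * b               # size after the second UpSampling2D((2, 2))
    return up1 == b and up2 == a


def pad_to_multiple_of_4(n):
    """Smallest multiple of 4 that is >= n (hypothetical padding target)."""
    return math.ceil(n / 4) * 4


for size in (1024, 1025, pad_to_multiple_of_4(1025)):
    print(size, skip_shapes_match(size))
```

With these definitions, an axis of length 1025 (the librosa default mentioned above) fails the check, while 1024 and 1028 pass.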

kgoodboy commented 6 years ago

Thanks!

I was also trying to use a conv net for some audio processing. The trouble I have now is that the model does not converge.

For the spectrogram, do you use a regular stft or do you use the melspectrogram?

Thanks,

madebyollin commented 6 years ago

The model in this project uses a regular stft (and, as mentioned earlier, only the amplitude).
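To make the amplitude-only point concrete, here is a minimal sketch for a single complex STFT bin using only the standard library. It splits the bin into amplitude and phase, changes the amplitude (standing in for a network prediction), and rebuilds a complex value. Reusing the mixture's phase in the last step is an assumption made for illustration; the thread only says the phase is "reconstructed later" and does not specify how. The function names are hypothetical.

```python
import cmath


def split_bin(z):
    """Return (amplitude, phase) of one complex STFT bin."""
    return cmath.polar(z)


def merge_bin(amplitude, phase):
    """Rebuild a complex STFT bin from amplitude and phase."""
    return cmath.rect(amplitude, phase)


mix_bin = complex(3.0, 4.0)            # one bin of the mixture's STFT
amplitude, phase = split_bin(mix_bin)  # the network would only see `amplitude`

predicted = 0.5 * amplitude            # stand-in for the network's output amplitude
rebuilt = merge_bin(predicted, phase)  # assumption: borrow the mixture's phase
print(amplitude, rebuilt)
```

Feeding phase to the network as well, as suggested earlier in the thread, would mean stacking two real-valued channels per bin instead of one.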