madebyollin / acapellabot

Acapella Extraction with a ConvNet
http://madebyoll.in/posts/cnn_acapella_extraction/

CorrMM issue #6

Open wawang250 opened 6 years ago

wawang250 commented 6 years ago

Hi, I tried running your code but this error was raised. I searched everywhere but couldn't find a useful solution. I'm a newbie, so please help me with it. Thanks a lot. Since it's a memory issue, I should mention that I'm running this in a VM with 8 GB of memory on a deepin system.

Using Theano backend.
WARNING (theano.tensor.blas): Using NumPy C-API based implementation for BLAS functions.
 Model has 668225 params 
 Weights provided; performing inference on ['gem.wav']... 
 Loading weights 
 Attempting to isolate vocals from gem.wav 
 Retrieved spectrogram; processing... 
Traceback (most recent call last):
  File "/usr/local/lib/python3.5/dist-packages/theano/compile/function_module.py", line 903, in __call__
    self.fn() if output_subset is None else\
RuntimeError: CorrMM failed to allocate working memory of 1 x 1024 x 2603755
During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/wawang250/PycharmProjects/acapellabot/acapellabot.py", line 147, in <module>
    acapellabot.isolateVocals(f, args.fft, args.phase)
  File "/home/wawang250/PycharmProjects/acapellabot/acapellabot.py", line 98, in isolateVocals
    predictedSpectrogramWithBatchAndChannels = self.model.predict(expandedSpectrogramWithBatchAndChannels)
  File "/usr/local/lib/python3.5/dist-packages/keras/engine/training.py", line 1790, in predict
    verbose=verbose, steps=steps)
  File "/usr/local/lib/python3.5/dist-packages/keras/engine/training.py", line 1299, in _predict_loop
    batch_outs = f(ins_batch)
  File "/usr/local/lib/python3.5/dist-packages/keras/backend/theano_backend.py", line 1224, in __call__
    return self.function(*inputs)
  File "/usr/local/lib/python3.5/dist-packages/theano/compile/function_module.py", line 917, in __call__
    storage_map=getattr(self.fn, 'storage_map', None))
  File "/usr/local/lib/python3.5/dist-packages/theano/gof/link.py", line 325, in raise_with_op
    reraise(exc_type, exc_value, exc_trace)
  File "/usr/lib/python3/dist-packages/six.py", line 685, in reraise
    raise value.with_traceback(tb)
  File "/usr/local/lib/python3.5/dist-packages/theano/compile/function_module.py", line 903, in __call__
    self.fn() if output_subset is None else\
RuntimeError: CorrMM failed to allocate working memory of 1 x 1024 x 2603755

Apply node that caused the error: CorrMM{half, (2, 2), (1, 1), 1 False}(InplaceDimShuffle{0,3,1,2}.0, Subtensor{::, ::, ::int64, ::int64}.0)
Toposort index: 81
Inputs types: [TensorType(float32, 4D), TensorType(float32, 4D)]
Inputs shapes: [(1, 64, 769, 13525), (64, 64, 4, 4)]
Inputs strides: [(2662585600, 41602900, 54100, 4), (4, 256, -65536, -16384)]
Inputs values: ['not shown', 'not shown']
Outputs clients: [[Subtensor{int64:int64:int8, int64:int64:int8, int64:int64:int8, :int64:}(CorrMM{half, (2, 2), (1, 1), 1 False}.0, ScalarFromTensor.0, ScalarFromTensor.0, Constant{1}, Constant{0}, Constant{64}, Constant{1}, ScalarFromTensor.0, ScalarFromTensor.0, Constant{1}, ScalarFromTensor.0)]]

Backtrace when the node is created(use Theano flag traceback.limit=N to make it longer):
  File "/home/wawang250/PycharmProjects/acapellabot/acapellabot.py", line 130, in <module>
    acapellabot = AcapellaBot()
  File "/home/wawang250/PycharmProjects/acapellabot/acapellabot.py", line 31, in __init__
    conv = Conv2D(64, 4, strides=2, activation='relu', padding='same', use_bias=False)(convA)
  File "/usr/local/lib/python3.5/dist-packages/keras/engine/topology.py", line 603, in __call__
    output = self.call(inputs, **kwargs)
  File "/usr/local/lib/python3.5/dist-packages/keras/layers/convolutional.py", line 164, in call
    dilation_rate=self.dilation_rate)
  File "/usr/local/lib/python3.5/dist-packages/keras/backend/theano_backend.py", line 1913, in conv2d
    filter_dilation=dilation_rate)

HINT: Use the Theano flag 'exception_verbosity=high' for a debugprint and storage map footprint of this apply node.
madebyollin commented 6 years ago

It's an out-of-memory error; the code currently tries to process songs all at once (rather than splitting them up and processing segments individually), and that's a problem for long songs!

A quick-and-dirty workaround is just to split up your input file into separate files and run it on each of those separately, then join the outputs. https://unix.stackexchange.com/questions/280767/how-do-i-split-an-audio-file-into-multiple https://superuser.com/questions/571463/how-do-i-append-a-bunch-of-wav-files-while-retaining-not-zero-padded-numeric
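
If you'd rather stay in Python, the same split/rejoin idea looks roughly like this (it uses scipy.io.wavfile, which the repo doesn't depend on, and the helper names are made up, so treat it as a standalone sketch):

import numpy as np
from scipy.io import wavfile

def splitWav(path, chunkSeconds=60):
    # Write fixed-length chunks of a WAV file to disk and return their paths.
    rate, data = wavfile.read(path)
    chunkLength = rate * chunkSeconds
    paths = []
    for i in range(0, len(data), chunkLength):
        outPath = "%s.part%03d.wav" % (path, i // chunkLength)
        wavfile.write(outPath, rate, data[i:i + chunkLength])
        paths.append(outPath)
    return paths

def joinWavs(paths, outPath):
    # Concatenate the processed chunks back into a single file.
    rate, first = wavfile.read(paths[0])
    pieces = [first] + [wavfile.read(p)[1] for p in paths[1:]]
    wavfile.write(outPath, rate, np.concatenate(pieces, axis=0))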

The better fix is to have the code automatically slice up the input until a slice fits in memory, run each slice through the network, and reassemble the outputs for you.

Try adding a predict method around line 88 of acapellabot.py, something like:

def predict(self, spectrogram):
    # Pad the spectrogram so its dimensions are a multiple of the network's
    # total downscaling factor, then process it in time slices, halving the
    # slice size whenever a slice is still too large to fit in memory.
    expandedSpectrogram = conversion.expandToGrid(spectrogram, self.peakDownscaleFactor)
    sliceSizeTime = 6144
    predictedSpectrogramWithBatchAndChannels = None
    while sliceSizeTime >= self.peakDownscaleFactor and predictedSpectrogramWithBatchAndChannels is None:
        try:
            slices = conversion.chop(expandedSpectrogram, sliceSizeTime, expandedSpectrogram.shape[0])
            outputSlices = []
            for s in slices:
                sWithBatchAndChannels = s[np.newaxis, :, :, :]
                outputSlices.append(self.model.predict(sWithBatchAndChannels))
            # Reassemble the predicted slices along the time axis
            predictedSpectrogramWithBatchAndChannels = np.concatenate(outputSlices, axis=2)
        except (RuntimeError, MemoryError):
            console.info(sliceSizeTime, "is too large; trying", sliceSizeTime // 2)
            sliceSizeTime = sliceSizeTime // 2
    # Drop the batch dimension and trim off the padding
    predictedSpectrogram = predictedSpectrogramWithBatchAndChannels[0, :, :, :]
    newSpectrogram = predictedSpectrogram[:spectrogram.shape[0], :spectrogram.shape[1]]
    return newSpectrogram

and replacing lines 95-103 with:

newSpectrogram = self.predict(spectrogram)
newAudio = conversion.spectrogramToAudioFile(newSpectrogram, fftWindowSize=fftWindowSize, phaseIterations=phaseIterations)

(this is copy-pasted from my v2 code, which uses stereo instead of mono, so there might be some issues actually getting it to run; I'll try to test it later...)
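
In case it helps, here's roughly what a chop helper in conversion.py could look like. This is a guess at the shape handling (it assumes (frequency, time, channels) spectrograms and that the second and third arguments are the slice sizes along time and frequency), not the actual implementation:

def chop(matrix, sliceSizeTime, sliceSizeFreq):
    # Split a (freq, time, channels) spectrogram into full-height slices of
    # width sliceSizeTime; assumes the time axis was already padded to a
    # multiple of sliceSizeTime (e.g. via expandToGrid).
    slices = []
    for t in range(0, matrix.shape[1], sliceSizeTime):
        slices.append(matrix[:sliceSizeFreq, t:t + sliceSizeTime, :])
    return slices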

wawang250 commented 6 years ago

I tried splitting the song into much smaller pieces, and it worked perfectly! Thanks a lot for your quick reply!

Besides, I'm wondering what would happen if I ran the full song on a server with a lot of memory.

Actually, what I'm trying to do is separate two or more people's voices from each other. I think training this model with my own training set might be necessary. Do you have any tips for that?

madebyollin commented 6 years ago

The network is fully convolutional, so there's not much of a difference between running it on segments and running it on the whole thing (the one possible difference is artifacts at the boundaries between sections).
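
If you do notice seams, one quick hack (not in the current code) is to predict slices that overlap by a bit and then crossfade neighboring outputs along the time axis, something like:

import numpy as np

def blendSlices(slices, overlap):
    # Crossfade adjacent (freq, time, channels) output slices whose inputs
    # overlapped by `overlap` time frames. Hypothetical helper, untested.
    out = slices[0].copy()
    fade = np.linspace(0.0, 1.0, overlap)[np.newaxis, :, np.newaxis]
    for s in slices[1:]:
        out[:, -overlap:, :] = (1 - fade) * out[:, -overlap:, :] + fade * s[:, :overlap, :]
        out = np.concatenate([out, s[:, overlap:, :]], axis=1)
    return out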

Multi-speaker source separation (from a monophonic source) is something this architecture will probably do poorly at. The way this architecture is designed right now, it only really makes judgments about individual harmonics, which isn't enough to separate speakers. For example, here's a vocal over sine/square wave chords; it's incredibly easy for the network to identify the vocals, since all you need to do is filter out all of the straight lines:

[Screenshot, 2017-12-14: spectrogram of a vocal over sine/square wave chords]

I would suggest using a deeper U-net architecture (to take a larger context into account) if you want to do multi-speaker separation. Even that will only be able to succeed by memorizing facts about specific speakers, though... a better approach might be to feed two input spectrograms into a large U-net: the multi-speaker spectrogram, and a "reference" spectrogram of one of the speakers, with the target output being that speaker's separated audio. Generating data for that is still pretty easy (you can probably even use my same script, just run it on lots of single-speaker recordings), but getting a good network is the tricky part. It might be worthwhile to start on a simpler case like decomposing saw waves/square waves, where it's more obvious what the network is (and should be) doing.
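
For concreteness, a two-input model along those lines could be sketched in Keras roughly like this (untested; it assumes channels_last spectrograms with the reference clip padded/cropped to the same shape as the mixture, and the depths/kernel sizes are placeholders):

from keras.layers import Input, Conv2D, Concatenate, UpSampling2D
from keras.models import Model

mix = Input(shape=(None, None, 1))   # multi-speaker mixture spectrogram
ref = Input(shape=(None, None, 1))   # reference spectrogram of the target speaker
x = Concatenate()([mix, ref])        # stack the two spectrograms as channels

# encoder
c1 = Conv2D(32, 3, strides=2, activation='relu', padding='same')(x)
c2 = Conv2D(64, 3, strides=2, activation='relu', padding='same')(c1)
c3 = Conv2D(128, 3, strides=2, activation='relu', padding='same')(c2)

# decoder with skip connections
u2 = Conv2D(64, 3, activation='relu', padding='same')(Concatenate()([UpSampling2D((2, 2))(c3), c2]))
u1 = Conv2D(32, 3, activation='relu', padding='same')(Concatenate()([UpSampling2D((2, 2))(u2), c1]))
out = Conv2D(1, 3, activation='relu', padding='same')(UpSampling2D((2, 2))(u1))  # target speaker's spectrogram

model = Model(inputs=[mix, ref], outputs=out)
model.compile(loss='mean_squared_error', optimizer='adam')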

I found visualizing spectrograms (Audacity works well for this, as does Sonic Visualiser) really helpful in understanding what the network should and shouldn't be able to do: if you can't tell the two speakers apart in spectrogram view (again, on a monophonic file), then it's unlikely that an image-to-image network in spectrogram space will be able to either.
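
If you'd rather do that in Python than in a GUI, something like this works (assumes librosa and matplotlib are installed; the filename and FFT size are just placeholders):

import numpy as np
import librosa
import librosa.display
import matplotlib.pyplot as plt

# Plot a log-frequency spectrogram of a (placeholder) multi-speaker recording
audio, sampleRate = librosa.load("multi_speaker.wav", sr=None, mono=True)
spectrogram = np.abs(librosa.stft(audio, n_fft=1536))
librosa.display.specshow(librosa.amplitude_to_db(spectrogram, ref=np.max),
                         sr=sampleRate, x_axis='time', y_axis='log')
plt.colorbar(format='%+2.0f dB')
plt.show()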

jabelman commented 6 years ago

Any chance you could post your v2 or at least the implementation of the chop function you put in conversion.py, please?