wawang250 opened this issue 6 years ago
It's an out-of-memory error; the code currently tries to process songs all at once (rather than splitting them up and processing segments individually), and that's a problem for long songs!
A quick-and-dirty workaround is just to split up your input file into separate files and run it on each of those separately, then join the outputs. https://unix.stackexchange.com/questions/280767/how-do-i-split-an-audio-file-into-multiple https://superuser.com/questions/571463/how-do-i-append-a-bunch-of-wav-files-while-retaining-not-zero-padded-numeric
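If you'd rather script it than use the command line, here's a minimal Python sketch of that same workaround, assuming librosa and soundfile are installed (the chunk length and the file names are just placeholders):

```python
# Minimal sketch of the split-then-join workaround.
# Assumes librosa and soundfile are installed; the chunk length and the
# file names ("song.wav", "song_partNNN.wav") are placeholders.
import librosa
import numpy as np
import soundfile as sf

CHUNK_SECONDS = 60  # shrink this if a chunk still runs out of memory

audio, sampleRate = librosa.load("song.wav", sr=None, mono=True)
samplesPerChunk = CHUNK_SECONDS * sampleRate
numChunks = 0
for start in range(0, len(audio), samplesPerChunk):
    sf.write("song_part{:03d}.wav".format(numChunks),
             audio[start:start + samplesPerChunk], sampleRate)
    numChunks += 1

# ...run the bot on each part, then join its outputs (assumed here to be
# named "song_part{:03d}_acapella.wav") back into one file:
parts = [librosa.load("song_part{:03d}_acapella.wav".format(i), sr=None, mono=True)[0]
         for i in range(numChunks)]
sf.write("song_acapella.wav", np.concatenate(parts), sampleRate)
```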
The proper fix is to have the code automatically slice up the input until a slice fits in memory, run each slice through the model, and reassemble the results for you.
Try adding a `predict` method in `acapellabot.py` around line 88, something like:
```python
def predict(self, spectrogram):
    # Pad the spectrogram so its dimensions divide evenly through the network
    expandedSpectrogram = conversion.expandToGrid(spectrogram, self.peakDownscaleFactor)
    sliceSizeTime = 6144
    predictedSpectrogramWithBatchAndChannels = None
    # Halve the slice size until the slices fit in memory
    while sliceSizeTime >= self.peakDownscaleFactor and predictedSpectrogramWithBatchAndChannels is None:
        try:
            slices = conversion.chop(expandedSpectrogram, sliceSizeTime, expandedSpectrogram.shape[0])
            outputSlices = []
            for s in slices:
                # Add the batch dimension the model expects
                sWithBatchAndChannels = s[np.newaxis, :, :, :]
                outputSlices.append(self.model.predict(sWithBatchAndChannels))
            # Reassemble the slices along the time axis
            predictedSpectrogramWithBatchAndChannels = np.concatenate(outputSlices, axis=2)
        except (RuntimeError, MemoryError):
            console.info(sliceSizeTime, "is too large; trying", sliceSizeTime // 2)
            sliceSizeTime = sliceSizeTime // 2
    # Drop the batch dimension and trim the padding back off
    predictedSpectrogram = predictedSpectrogramWithBatchAndChannels[0, :, :, :]
    newSpectrogram = predictedSpectrogram[:spectrogram.shape[0], :spectrogram.shape[1]]
    return newSpectrogram
```
and replacing lines 95-103 with:
```python
newSpectrogram = self.predict(spectrogram)
newAudio = conversion.spectrogramToAudioFile(newSpectrogram, fftWindowSize=fftWindowSize, phaseIterations=phaseIterations)
```
(this is copy-pasted from my v2 code, which uses stereo instead of mono, so there might be some issues actually getting it to run; I'll try to test it later...)
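The `chop` helper referenced above isn't shown here. A plausible sketch, with the argument order inferred from the call site (not necessarily the actual v2 implementation), would be:

```python
# Hypothetical sketch of conversion.chop; not necessarily the actual v2 code.
# Argument order (time size, then frequency size) is inferred from the call above.
# Tiles a (freq, time, channels) spectrogram into slices covering the whole input;
# with freqScale equal to the full height, the slices are strips along the time axis.
def chop(matrix, timeScale, freqScale):
    slices = []
    for freqStart in range(0, matrix.shape[0], freqScale):
        for timeStart in range(0, matrix.shape[1], timeScale):
            slices.append(matrix[freqStart:freqStart + freqScale,
                                 timeStart:timeStart + timeScale])
    return slices
```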
I tried splitting the song into much smaller pieces, and it worked perfectly! Thanks a lot for your quick reply!
Also, I'm wondering what would happen if I ran the full song on a server with a lot of memory.
Actually, what I'm trying to do is separate two or more people's voices from each other. I think training this model on my own dataset might be necessary. Got any tips for that?
The network is fully convolutional, so there's not much of a difference between running it on segments and running it on the whole thing (the one possible difference is artifacts at the boundaries between sections).
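One way to convince yourself of that: a model built only from convolutions accepts inputs of arbitrary size, so the same weights see the same local context whether you feed in a segment or the whole song. A toy Keras sketch (not the acapellabot architecture; the layer sizes are arbitrary):

```python
# Toy illustration that a fully convolutional model is input-size agnostic.
# Not the acapellabot architecture; layer count and filter sizes are arbitrary.
import numpy as np
from tensorflow.keras import layers, models

inp = layers.Input(shape=(None, None, 1))  # any (freq, time) size
x = layers.Conv2D(8, 3, padding="same", activation="relu")(inp)
out = layers.Conv2D(1, 3, padding="same")(x)
model = models.Model(inp, out)

wholeSong = model.predict(np.random.rand(1, 128, 512, 1))  # "whole song"
segment = model.predict(np.random.rand(1, 128, 256, 1))    # one "segment"
print(wholeSong.shape, segment.shape)  # output shapes simply follow the inputs
```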
Multi-speaker source separation (from a monophonic source) is something this architecture will probably do poorly at. The way this architecture is designed right now, it only really makes judgments about individual harmonics, which isn't enough to separate speakers. For example, here's a vocal over sine/square-wave chords; it's incredibly easy for the network to identify the vocals, since all you need to do is filter out all of the straight lines:
I would suggest using a deeper U-net architecture (to take a larger context into account) if you want to do multi-speaker separation. Even that will only be able to succeed by memorizing facts about specific speakers, though... a better implementation might have two input spectrograms to a large U-net: the multi-speaker spectrogram, and a "reference" spectrogram of one of the speakers, with the target output being that speaker's separated audio. Generating data for that is still pretty easy (you can probably even use my same script, just run it on lots of single-speaker recordings) but getting a good network is the tricky part. It might be worthwhile to start on a simpler case like decomposing saw waves/square waves, where it's more obvious what the network is (and should be) doing.
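To make the two-input idea a bit more concrete, here's a rough Keras sketch of the shape such a network could take. The depth, filter counts, and input names are placeholders, it assumes inputs are padded to a multiple of 4 in both dimensions, and it hasn't been trained or tested:

```python
# Rough, untested sketch of a reference-conditioned separation network.
# Depth, filter counts, and input names are placeholders; inputs are assumed
# to be padded so both spectrogram dimensions are multiples of 4.
from tensorflow.keras import layers, models

mixInput = layers.Input(shape=(None, None, 1), name="mixture_spectrogram")
refInput = layers.Input(shape=(None, None, 1), name="reference_spectrogram")
x = layers.Concatenate()([mixInput, refInput])

# Encoder: strided convolutions widen the context beyond individual harmonics
skip = layers.Conv2D(16, 3, padding="same", activation="relu")(x)
x = layers.Conv2D(32, 3, strides=2, padding="same", activation="relu")(skip)
x = layers.Conv2D(64, 3, strides=2, padding="same", activation="relu")(x)

# Decoder: upsample back to full resolution with a U-Net-style skip connection
x = layers.Conv2DTranspose(32, 3, strides=2, padding="same", activation="relu")(x)
x = layers.Conv2DTranspose(16, 3, strides=2, padding="same", activation="relu")(x)
x = layers.Concatenate()([x, skip])

# Target: the reference speaker's isolated spectrogram
out = layers.Conv2D(1, 3, padding="same", activation="relu")(x)

model = models.Model([mixInput, refInput], out)
model.compile(optimizer="adam", loss="mean_squared_error")
```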
I found visualizing spectrograms (Audacity works well for this, as does Sonic Visualiser) was really helpful in understanding what the network should and shouldn't be able to do–if you can't tell the two speakers apart in spectrogram view (again, on a monophonic file), then it's unlikely that an image-to-image network in spectrogram space will be able to either.
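If you want to do the same from Python rather than Audacity or Sonic Visualiser, something like this works (assuming librosa and matplotlib are installed; "mix.wav" is a placeholder filename):

```python
# Quick spectrogram view of a mono file, as an alternative to Audacity or
# Sonic Visualiser. Assumes librosa and matplotlib; "mix.wav" is a placeholder.
import librosa
import librosa.display
import matplotlib.pyplot as plt
import numpy as np

audio, sr = librosa.load("mix.wav", sr=None, mono=True)
spectrogram = np.abs(librosa.stft(audio))
librosa.display.specshow(librosa.amplitude_to_db(spectrogram, ref=np.max),
                         sr=sr, x_axis="time", y_axis="log")
plt.colorbar(format="%+2.0f dB")
plt.title("mix.wav")
plt.show()
```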
Any chance you could post your v2 or at least the implementation of the chop function you put in conversion.py, please?
Hi, I tried running your code but this error was raised. I searched everywhere but couldn't find a useful solution. I'm a complete beginner, so please help me with it. Thanks a lot. Since it's a memory issue, for reference I'm running this in a VM with 8 GB of memory, on the deepin system.