sigsep / norbert

Painless Wiener filters for audio separation
https://sigsep.github.io/norbert
MIT License
180 stars 27 forks

Store peak amplitude value in EXIF part of image? #9

Closed: StefanUhlich-sony closed this 5 years ago

StefanUhlich-sony commented 6 years ago

This is just an idea, but it would be great if the absolute peak value of an audio signal were not lost when it is processed with norbert.

I just did an experiment where I ran encode-decode.py, but now I cannot directly compute the difference (in order to see the compression artefacts) as the scale is different.

Maybe we could store this information in the EXIF part of the JPG? There is, e.g., a UserComment field.

@faroit What do you think?

faroit commented 6 years ago

Maybe we could store this information in the EXIF part of the JPG? There is, e.g., a UserComment field.

haha, that was exactly my plan :-) I already implemented the code to dump any dict/json into the UserComment EXIF tag. See the code here. I will add the reader and some unit tests later.
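For reference, a minimal sketch of the idea with piexif (an illustration, not necessarily the linked code; the 8-byte encoding prefix is required by the EXIF spec for UserComment):

```python
import json
import piexif

def write_user_comment(jpg_path, metadata):
    # serialize the dict as JSON; UserComment starts with an encoding marker
    payload = b"ASCII\x00\x00\x00" + json.dumps(metadata).encode("ascii")
    exif_bytes = piexif.dump({"Exif": {piexif.ExifIFD.UserComment: payload}})
    piexif.insert(exif_bytes, jpg_path)  # write the EXIF block into the file

def read_user_comment(jpg_path):
    exif = piexif.load(jpg_path)
    payload = exif["Exif"][piexif.ExifIFD.UserComment]
    return json.loads(payload[8:].decode("ascii"))  # strip the 8-byte prefix
```

So e.g. `write_user_comment("spec.jpg", {"max": 0.83})` would survive the JPG round trip (file name and key are made up here).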

StefanUhlich-sony commented 6 years ago

ok, great - I didn't see this :)

faroit commented 6 years ago

I just did an experiment where I ran encode-decode.py, but now I cannot directly compute the difference (in order to see the compression artefacts) as the scale is different.

as long as the modules are initialized per track, the inverse functions will apply the correct max that was used for the forward operation. So there should not be any issue with the encode-decode script. But again, unit tests are still missing for the whole reconstruction pipeline. And let's wait for @aliutkus for the filter module.
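Roughly, the contract is this (a simplified sketch, not the actual module code):

```python
import numpy as np

class Scaler:
    """Per-track scaler: scale() records the peak so unscale() can undo it."""

    def scale(self, X):
        self.max = np.max(np.abs(X))  # remembered for the inverse operation
        return X / self.max

    def unscale(self, X):
        return X * self.max  # restores the original absolute scale
```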

faroit commented 6 years ago

unit tests and implementation are added in 62ee1db3e42244601bee24bb4ac460d4ad989a81, works nicely. Currently you still have to set the max manually after loading, though...

also, returning X, user_comment for every decode is maybe not very user friendly, because probably not many people will use it. So maybe we could just save it as a class member variable?
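Something like this is what I have in mind (names are hypothetical, and `_decode_jpg` just stands in for the existing decoding code):

```python
class Codec:
    """Sketch: decode() returns only the array; the EXIF metadata is kept
    as an instance attribute for the few users who actually need it."""

    def decode(self, jpg_path):
        X, user_comment = self._decode_jpg(jpg_path)  # internal decode
        self.user_comment = user_comment              # stashed, not returned
        return X

    def _decode_jpg(self, jpg_path):
        raise NotImplementedError  # placeholder for the actual JPG decoding
```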

faroit commented 6 years ago

Is there any information other than the max that we want to save in the metadata of the musmag dataset? I would like to finalize the dataset creation script soon and upload the dataset to zenodo.

@TE-StefanUhlich @aliutkus ?

StefanUhlich-sony commented 6 years ago

No, I cannot think of more right now. Just two questions/ideas:

In case of stereo audio: Should we use different scalings for the left/right channels and store both max-abs values?

For the dataset: Should we use different scalings for the instruments/mixture?

Both might have the drawback that they need to read out the EXIF part of the JPG, but they could reduce quantization artefacts.

faroit commented 6 years ago

In case of stereo audio: Should we use different scalings for the left/right channels and store both max-abs values?

That would make sense, yes.

For the dataset: Should we use different scalings for the instruments/mixture?

Actually, I'm not really sure if we need this max value at all, as long as the dataset is created so that all targets are scaled to the max of the mixture. In the training pipeline I would just ignore the max and train with the scaled data. The max would be needed for inference, but there we use the full musdb audio signals anyway. Am I wrong here?
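In other words, the dataset creation would do something like this (a hypothetical sketch; `scale_to_mixture` is a made-up name):

```python
import numpy as np

def scale_to_mixture(mixture, targets):
    """Scale the mixture and all target spectrograms by the mixture's peak,
    so the relative levels between the sources are preserved."""
    peak = np.max(np.abs(mixture))
    return mixture / peak, {name: t / peak for name, t in targets.items()}
```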

StefanUhlich-sony commented 6 years ago

Actually, I'm not really sure if we need this max value at all, as long as the dataset is created so that all targets are scaled to the max of the mixture. In the training pipeline I would just ignore the max and train with the scaled data. The max would be needed for inference, but there we use the full musdb audio signals anyway. Am I wrong here?

No, you are right and this is probably the easiest.

Allowing for different scales has the advantage that we reduce the quantization error. Sometimes, e.g. in jazz recordings, the drums are very low in amplitude; then, using the same scale induced from the mixture might introduce quite some quantization error for the drums. But for training this might not be a problem.
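A toy calculation of that concern (numbers made up for illustration):

```python
# 8-bit JPEG leaves 255 quantization steps between 0 and the stored peak.
mixture_peak = 1.0
drums_peak = 0.05         # a quiet drum track in a jazz recording
step = mixture_peak / 255
print(drums_peak / step)  # ~12.75: drums use only about 13 of the 255 levels
```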

aliutkus commented 6 years ago

Hi, sorry for my silence, I'm on vacation now. I have been thinking about this stereo idea. Right now the plan is to use only the (mono) power spectral density, even for stereo signals, so that the left/right scale ends up in the spatial covariance matrix, which will be estimated at test time. This has the advantage of halving the size of the dataset, since only one spectrogram is stored (and estimated) even for stereo signals. Of course, another option would be to predict left and right magnitudes, but it would cost lots of bytes and maybe should be kept for more advanced implementations?
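For illustration, here is a simplified sketch of how the mono PSDs v_j combine with per-source spatial covariances R_j in a multichannel Wiener filter, y_j = v_j R_j (sum_k v_k R_k)^{-1} x (a sketch of the idea, not the exact norbert implementation):

```python
import numpy as np

def multichannel_wiener(v, R, X, eps=1e-10):
    """v: mono PSDs (frames, bins, sources); R: spatial covariances
    (bins, sources, 2, 2); X: stereo mixture STFT (frames, bins, 2).
    Returns per-source stereo estimates (frames, bins, 2, sources)."""
    # mixture covariance: sum_j v_j(t, f) * R_j(f)
    Cx = np.einsum('tfj,fjab->tfab', v, R) + eps * np.eye(2)
    inv_Cx = np.linalg.inv(Cx)
    Y = []
    for j in range(v.shape[-1]):
        # Wiener gain for source j: v_j * R_j * Cx^{-1}
        W = np.einsum('tf,fab,tfbc->tfac', v[..., j], R[:, j], inv_Cx)
        Y.append(np.einsum('tfab,tfb->tfa', W, X))  # apply gain to the mix
    return np.stack(Y, axis=-1)
```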

faroit commented 6 years ago

I think we may need to clarify the various use cases of norbert and the musmag dataset. I see the following ways to handle single-channel vs. multi-channel:

Mono Separation

1. model: Mono -> Mono, filtering: mono -> mono

Examples: ratio mask on mono signals

Stereo Separation

2. model: Mono -> Mono, filtering: 2x mono -> mono

Examples: Train on the left and right channels individually. For inference, we predict a mask or spectrogram for each channel to get two mono outputs that are both filtered using a ratio mask.

3. model: Mono -> Mono, filtering: Stereo -> Stereo

Ratio mask applied on both channels (a quick sketch of this case follows below).

4. model: Mono -> Mono, filtering: Stereo -> Stereo

Multichannel Wiener filter

5. model: Stereo -> Mono, filtering: Stereo -> Stereo

Model uses spatial cues to improve the mono mask, then a multichannel Wiener filter for extraction. Maybe not so much used in practice?

6. model: Stereo -> Stereo, filtering: Stereo -> Stereo

True end-to-end multichannel models

Upmixing

7. model: Mono -> Stereo

Downmixing

8. model: Stereo -> Mono

Cases 1-4 would be ok if we save just the single-channel magnitudes; for 5-8 the stereo signals would be required. However, I am not so sure about 5); has that been used successfully in the past?
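To make case 3 concrete, a minimal sketch (the eps and averaging the two channels for the mask denominator are my own choices here):

```python
import numpy as np

def mono_mask_on_stereo(v_target, X_stereo, eps=1e-10):
    """Case 3: one mono magnitude estimate applied to both stereo channels.
    v_target: magnitudes (frames, bins); X_stereo: STFT (frames, bins, 2)."""
    v_mix = np.abs(X_stereo).mean(axis=-1)  # mono magnitude of the mixture
    mask = v_target / (v_mix + eps)         # single-channel ratio mask
    return X_stereo * mask[..., None]       # broadcast the mask to L and R
```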

As we are targeting education and beginners, it might be better to go with just the mono dataset, as this already enables enough possibilities. Also, we could use the three (RGB) channels to save mixture/vocals/accompaniment in one single jpg...

I'm also okay with two datasets: musmag and musmag-2ch