swharden / Spectrogram

.NET library for creating spectrograms (visual representations of frequency spectrum over time)
https://nuget.org/packages/Spectrogram
MIT License

Saving SFF with Complex data #19

Closed · KJ-Waller closed this 4 years ago

KJ-Waller commented 4 years ago

Hi again! Thanks for fixing the reading of Apple-formatted WAV files.

I'm trying to read SFF files in Python as shown in the demo; however, it seems the complex values aren't saved in the SFF file. I get `isComplex = False` when reading my SFF files in Python using the sffLib module. Is there something I need to do to save the SFF files with complex values in C#?

Regards, Kevin

swharden commented 4 years ago

Hey Kevin, thanks for reporting this! I'll get this fixed today or tomorrow. I recently updated the SFF file format to support Mel spectrograms, and while I updated the C# SFF reading code, I didn't go back and update the Python script, so my guess is that Python isn't reading the file properly. Regardless of the cause, I'll take a look at this shortly.

Just curious: do you prefer SFF files to store complex data or flat (magnitude) data? Off the top of my head I can't think of an application for complex spectrograms... but I'm interested to learn more about your use case!

Scott

KJ-Waller commented 4 years ago

Thanks for the quick reply Scott.

I'm building speech emotion recognition software using deep learning in PyTorch. I have a couple of datasets I want to generate spectrograms for, to train deep learning models that will work in a C# Windows Forms demo. I have decent models trained in PyTorch; however, the PyTorch spectrograms differ slightly from the spectrograms generated in C#, so I want to train my deep learning models with spectrograms from your library rather than PyTorch's STFT implementation, to make sure the demo gives the same results as in Python.
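(For context, small differences between STFT implementations usually come down to a handful of parameters. Here's a sketch of the knobs on the PyTorch side, assuming a recent torch.stft; all values are purely illustrative:)

```python
import torch

waveform = torch.randn(44100)  # stand-in for one second of real audio

# Mismatches between STFT implementations usually come down to these
# parameters; the values below are illustrative, not library defaults.
spec = torch.stft(
    waveform,
    n_fft=4096,                      # FFT size
    hop_length=700,                  # step size in samples
    window=torch.hann_window(4096),  # window function must match the C# side
    center=False,                    # torch pads and centers frames by default
    return_complex=True,
)
magnitudes = spec.abs()              # drop phase, keep magnitudes
```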

Thanks for making this library; it's been a great help in building my demo. I'll be happy to share the demo with you once it's done, although it's my first time working in C#, so it's quite basic.

Regards, Kevin

swharden commented 4 years ago

> I want to train my deep learning models with spectrograms from your library rather than PyTorch's STFT implementation, to make sure the demo gives the same results as in Python.

> I'll be happy to share the demo with you once it's done

Very cool application! I'd love to see it when you're done, or even as you're working on it! Definitely let me know if you put it up on GitHub.

Looking closer, I see that saving SFF files with complex numbers is not yet supported:

https://github.com/swharden/Spectrogram/blob/3ea73cc8797fabfb48ea6672c9180e3db2a4700d/src/Spectrogram/SFF.cs#L96-L97

However, this is probably okay for your application. I left in the potential to support saving complex spectrograms because they can be useful for applications which apply special filtering (in the real and imaginary domains) and then run the inverse FFT to convert the spectrogram back into the original waveform.

For applications like the one you describe, complex SFF files are probably not what you want to train a model on; instead you'd train the model on spectrograms of plain floating-point magnitudes rather than complex numbers. Does this sound reasonable? If so, fixing the demo Python script will resolve this issue, and we'll leave saving complex data in SFF files unsupported for now... what do you think?
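To illustrate the filter-then-invert workflow I mean, here's a rough sketch using scipy.signal (purely for illustration; the 5 kHz cutoff and variable names are arbitrary):

```python
import numpy as np
from scipy.signal import stft, istft

fs = 44100
x = np.random.randn(fs)  # stand-in for a real one-second waveform

# Complex spectrogram: preserves magnitude AND phase
f, t, Zxx = stft(x, fs=fs, nperseg=4096)

# Filter in the frequency domain, e.g. zero out everything above 5 kHz
Zxx[f > 5000, :] = 0

# Because the phase was kept, we can invert back to a listenable waveform
_, x_filtered = istft(Zxx, fs=fs)
```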

swharden commented 4 years ago

Oh man, I just realized the Python module is outputting complex numbers whether the SFF file is complex or not! Now I understand the core issue causing confusion here. I'll make sf.values a simple NumPy array of regular floating-point numbers.

To simplify this I'm going to rip out all the complex number support, make all the demos clean and easy, then come back later and add complex support if/when a real-life use case ever presents itself.

EDIT: I got very confused by NumPy's poorly-aligned output using scientific notation 😝 I think the module is working properly, but I can still improve it...
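(For anyone else who trips over this: telling NumPy to print fixed-point instead of scientific notation makes the array much easier to eyeball.)

```python
import numpy as np

# Show small floats as 0.000 instead of 0.000e+00 so columns line up
np.set_printoptions(suppress=True, precision=3)
```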

swharden commented 4 years ago

I think this is in a good place now! Let me know if the demo doesn't work for you, or if you have any further questions about the SFF reading module or ideas for how it could be improved. I look forward to hearing more about the progress of your application.

Demo Application

https://github.com/swharden/Spectrogram/blob/a98f0facc6e1ddc0b0fccd651a57e97f975f14cb/dev/sff/sffDemo.py#L1-L31

Demo Application Output


```
PS C:\Users\scott\Documents\GitHub\Spectrogram\dev\sff> python .\sffDemo.py

Spectrogram from file: hal.sff
SFF version: 1.1

Sample rate: 44100 Hz
Step size: 700 samples
Step count: 232 steps

FFT size: 4096
FFT first index: 0
FFT height: 185
FFT offset: 0 Hz

Values per point: 1
Complex values: False
Bytes per point: 8
Decibels: False

Mel bin count: 0
image width: 232
image height: 185

Time Resolution: 0.015873015873015872 sec/px
Frequency Resolution: 10.7666015625 Hz/px

First data byte: 256

Loaded hal.sff (42,920 values) in 27.69 ms

[[0.000e+00 0.000e+00 0.000e+00 ... 0.000e+00 0.000e+00 0.000e+00]
 [0.000e+00 0.000e+00 0.000e+00 ... 0.000e+00 0.000e+00 0.000e+00]
 [0.000e+00 0.000e+00 0.000e+00 ... 0.000e+00 0.000e+00 0.000e+00]
 ...
 [1.374e-05 1.378e-05 1.390e-05 ... 4.056e-05 4.062e-05 4.066e-05]
 [0.000e+00 0.000e+00 0.000e+00 ... 0.000e+00 0.000e+00 0.000e+00]
 [0.000e+00 0.000e+00 0.000e+00 ... 0.000e+00 0.000e+00 0.000e+00]]
```

All those zeros at the start and end are correct, because the source WAV file truly contains a bit of silence at the start and end of the file.

KJ-Waller commented 4 years ago

Hi Scott. Technically, complex values are not required for my use case.

However, I want to use some data augmentation functions from PyTorch's torchaudio library, specifically the TimeStretch function, which takes in spectrograms with complex values.

I was thinking of implementing my own time-stretch function that works without complex values, but I'm not sure it's as simple as stretching the spectrogram along the time axis, as the PyTorch documentation states that TimeStretch "Stretch[es] stft in time without modifying pitch for a given rate."

Would it be hard to add support for complex values in SFF in the short term, or do you have a suggestion for how I could implement the time-stretch function myself on regular spectrograms without modifying the pitch?
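For reference, this is roughly how I'd use it; a sketch assuming a recent torchaudio (the exact tensor layout TimeStretch expects varies by version, and the parameter values and file name are illustrative):

```python
import torch
import torchaudio

n_fft, hop = 4096, 700
waveform, sample_rate = torchaudio.load("speech.wav")  # placeholder file name

# Complex spectrogram, shape (channel, n_fft // 2 + 1, frames)
spec = torch.stft(waveform, n_fft=n_fft, hop_length=hop,
                  window=torch.hann_window(n_fft), return_complex=True)

# TimeStretch is a phase vocoder: it needs the phase (the complex values)
# to stretch time while preserving pitch.
stretch = torchaudio.transforms.TimeStretch(hop_length=hop, n_freq=n_fft // 2 + 1)
slower = stretch(spec, 0.9)  # rate < 1 -> slower speech
faster = stretch(spec, 1.1)  # rate > 1 -> faster speech
```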

As for sharing the demo, I'm doing this project as part of an internship, so I will ask my employer if I can share it with you. Thank you again for this library and your help; it is much appreciated.

Stay healthy & regards, Kevin

swharden commented 4 years ago

Hey Kevin,

Help me fill in the blanks if I'm missing something, but I think what you describe is stretching the audio signal, not stretching a spectrogram. The difference is that the TimeStretch function returns audio (a 1D series of points), but a spectrogram is 2D data often presented as an image. I suspect they're unrelated topics.

If your goal is to use machine learning on spectrograms (2D data) by treating them as images, I don't see the utility of running a TimeStretch function on a 1D audio signal. Stretching a 2D spectrogram image is as easy as opening it in MS Paint and clicking resize. While simple, I don't see the utility in doing that either, though...

What is the ultimate goal of the application you are trying to develop?

KJ-Waller commented 4 years ago

Hi Scott,

The TimeStretch function I linked to from PyTorch takes in a spectrogram and outputs a spectrogram. It's used to make the deep learning algorithm more robust to speakers talking at different rates. So it doesn't take in or return a 1D audio signal, but rather takes in and outputs a complex spectrogram.

I also thought that simply stretching a 2D spectrogram would produce the same results, but I do see that the PyTorch documentation states that it stretches the spectrogram without modifying the pitch. As I'm not sure whether simply stretching the spectrogram along the time axis would maintain the pitch, I was hoping I could still use the PyTorch TimeStretch implementation to do the augmentation for me, which is why I would need complex values for the spectrograms.

The ultimate goal is just to see where speech emotion recognition is in terms of performance in a real-time demo application. I have been able to train my deep learning models in PyTorch using the datasets available to me, but the spectrograms generated by PyTorch's torchaudio library are hard to make exactly match those from the C# demo. This is why I'm trying to generate all the spectrograms using your library in C# first, and then train the deep learning models on those spectrograms, to make sure the model works in the demo as it should.

I hope this clarifies things.

Regards, Kevin

swharden commented 4 years ago

> It's used to make the deep learning algorithm more robust to speakers talking at different rates.

That makes sense 🤔

> So it doesn't take in or return a 1D audio signal, but rather takes in and outputs a complex spectrogram.

Thank you for clarifying this point! I definitely misunderstood its function on the first pass.

> As I'm not sure whether simply stretching the spectrogram along the time axis would maintain the pitch

It actually does! It really is that easy. Pitch (frequency) is on the vertical axis, so stretching the image horizontally changes the timing without touching the vertical axis, and therefore without modifying frequency. Stretching horizontally modifies the spectrogram while preserving pitch.

Rather than thinking about your task as wrangling complex numbers... I think it really is as simple as just stretching your image horizontally.

Keeping the data complex is probably only useful if your intent is to later use the inverse FFT to create a listenable audio signal from that stretched spectrogram. If you're only working with images though, I think you can drop the complex numbers, and work exclusively with magnitudes and images.

I guess my remaining question is what you're using to train your ML models: audio signals (1D simple data), complex spectrograms (2D complex data), or spectrogram images (2D simple data)? That answer will determine your workflow. If you intend to train your model on spectrogram images (2D simple data), stretching the spectrogram really is as easy as stretching the image, as in the sketch below.
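Here's a sketch of what I mean (using scipy.ndimage just for illustration; the function name is mine):

```python
import numpy as np
from scipy.ndimage import zoom

def stretch_spectrogram(spec: np.ndarray, rate: float) -> np.ndarray:
    """Stretch a (freq, time) magnitude spectrogram along the time axis.

    rate > 1 shortens it (faster speech), rate < 1 lengthens it (slower).
    The frequency axis is untouched, so pitch is preserved.
    """
    return zoom(spec, (1.0, 1.0 / rate), order=1)  # linear interpolation

# e.g. augmented = stretch_spectrogram(magnitudes, rate=1.2)
```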

Hope it helps! Scott

KJ-Waller commented 4 years ago

Hi Scott,

Thanks so much for clarifying that. Intuitively it made sense to me that stretching the image would preserve pitch, but the documentation's emphasis on preserving pitch made me doubt myself. I'll continue without complex values, then.

My deep learning models are trained simply on the 2D spectrograms/images, without the complex values, so with TimeStretch clarified, I won't need complex values. The datasets I'm using are IEMOCAP, RAVDESS, TESS, SAVEE, EMOVO, and EMODB, which are all speech emotion datasets that label sentences with around seven emotional states, so the deep learning problem is basically an image classification problem.

Thanks for being so responsive. Regards, Kevin

swharden commented 4 years ago

This is such a cool project! I'd love to learn how it turns out. Feel free to follow up again later when you have it all figured out, and if it turns into a publication and/or a GitHub repo I'd love to see how it looks when it's all completed. Feel free to reach out in the meantime if any questions pop up.

Good luck with your project! Scott