redouane-dziri / deep-music-classification

Classify music into genres using GLCM on mel-maps and CNNs

(2) Pre-process Data #2

Closed redouane-dziri closed 4 years ago

redouane-dziri commented 4 years ago

From the reference paper, to do the above:

arnaudstiegler commented 4 years ago

Do you have any tools/packages in mind that we should use? I've seen that you can split audio files with the AudioSegment class (from pydub), and mel maps can be computed using librosa I think (melspectrogram)

redouane-dziri commented 4 years ago

Haven't looked into that at all yet; maybe check whether the paper mentions any packages, but nothing comes to mind. I've seen librosa used for spectrograms a while back, so that's probably the one we'll use for spectrograms + mel maps :))

arnaudstiegler commented 4 years ago

I couldn't find any in the paper, so we will have to choose!

redouane-dziri commented 4 years ago

Let's try librosa for the maps and scikit-image for GLCMs

arnaudstiegler commented 4 years ago

Above merge:

redouane-dziri commented 4 years ago

A couple of issues with the feature engineering that will probably hinder our ability to reproduce results from the paper:

arnaudstiegler commented 4 years ago

I'll try to find some literature about GLCMs for music, since most of our current issues come from our lack of understanding of those matrices!

@redouane-dziri, quick question about the TODO list above: what do you mean by "check all computed maps at each step"? Is it like a tensor shape check? And do you want to implement some functions to test the feature engineering, or should we just manually check them before moving on to the training phase?
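If it is a shape check, it could look something like this minimal sketch (the shapes and the helper name `check_maps` are assumptions for illustration, not the project's actual values):

```python
# Sketch of a per-step sanity check: fail fast if any intermediate
# map has an unexpected shape. Shapes here are illustrative only.
import numpy as np

def check_maps(mel_maps, glcms, n_mels=128, levels=16):
    """Assert expected tensor shapes at each pipeline stage."""
    for m in mel_maps:
        assert m.shape[0] == n_mels, f"bad mel map shape: {m.shape}"
    for g in glcms:
        assert g.shape[:2] == (levels, levels), f"bad GLCM shape: {g.shape}"

mel_maps = [np.zeros((128, 50)) for _ in range(14)]
glcms = [np.zeros((16, 16)) for _ in range(14)]
check_maps(mel_maps, glcms)  # passes silently
```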

arnaudstiegler commented 4 years ago

We might have to agree on how imports are done in the .py files. There are some inconsistencies between running them from a notebook and running the files directly, so we need to pick one running method and adapt the imports to it

arnaudstiegler commented 4 years ago

I've pushed some changes:

Also, I have assumed that the output we write to the bucket is one big JSON; correct me if I'm wrong

arnaudstiegler commented 4 years ago

Finished the Python script to preprocess the full data (feature_engineering/preprocess_full_data.py). I have tested it (with a break in the blobs for loop) and it should run fine. We need to set up an instance to run it because it is very memory-hungry. feature_engineering/preprocess_full_data.py:

Notes:

arnaudstiegler commented 4 years ago

Just finished with the preprocess_full_data.py script. It does end-to-end data preprocessing and saves the result locally and to Google Storage (in JSON format).

One file (e.g. the mel-map file for angle=45) is 2.76 GB, which is pretty heavy. The whole process takes around an hour.

Notes:

redouane-dziri commented 4 years ago

We forgot about the time-MFCC thing mentioned in the paper as well, yet another pre-processing pipeline to compare to. Added to the checklist.

redouane-dziri commented 4 years ago

The exploration confirms that it's the first bucket (very low dB) that we should drop when computing the GLCMs.

redouane-dziri commented 4 years ago

A little issue I took care of: three of the shorter tracks produced only 13 pieces instead of 14 when passed through the short-term-pieces part of the pipeline. They now produce 14, by zero-padding them up to the minimum length needed for 14 pieces (only a few hundred samples out of ~660,000, so it shouldn't be a problem, and each of the three tracks is from a different genre, so not too much leakage on that end).
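The padding fix can be sketched as follows (the exact piece length is an assumption derived from the thread's ~660,000-sample figure split into 14 pieces; the real pipeline may use different numbers):

```python
# Sketch of the zero-padding fix: ensure a track is long enough to
# yield exactly n_pieces equal pieces. Piece length is assumed.
import numpy as np

def pad_to_min_length(signal, n_pieces=14, piece_len=47250):
    """Zero-pad `signal` at the end so it yields `n_pieces` pieces."""
    min_len = n_pieces * piece_len
    if len(signal) < min_len:
        signal = np.pad(signal, (0, min_len - len(signal)))
    return signal

short = np.ones(13 * 47250 + 100)  # a track slightly too short
padded = pad_to_min_length(short)
pieces = np.split(padded[:14 * 47250], 14)
print(len(pieces))  # 14
```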

redouane-dziri commented 4 years ago

I wasn't sure about the quantization and the maps, but now I'm confident we're doing it right-ish. The paper mentions quantizing into 16 levels, which seemed arbitrary to us non-experts. After converting the maps to decibels, all values fall in the range [-80, 0] dB, which divides nicely into 16 buckets of 5 dB each for quantization, supporting the hypothesis that the maps need to be converted from the amplitude domain to dB first :) I rewrote the quantization accordingly, making sure edge cases were handled and all maps were mapped into buckets from 1 to 16.
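That quantization scheme can be sketched like this (the clamping of the 0 dB edge case into the top bucket is an assumption about how the edge cases were handled):

```python
# Sketch: quantize dB values in [-80, 0] into 16 buckets of 5 dB each,
# labelled 1..16. Values at exactly 0 dB are clamped into bucket 16.
import numpy as np

def quantize_db(mel_db, n_levels=16, db_min=-80.0, db_max=0.0):
    bucket_width = (db_max - db_min) / n_levels  # 5 dB per bucket
    buckets = np.floor((mel_db - db_min) / bucket_width).astype(int) + 1
    return np.clip(buckets, 1, n_levels)

print(quantize_db(np.array([-80.0, -77.5, -0.1, 0.0])))  # [ 1  1 16 16]
```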

arnaudstiegler commented 4 years ago

Just pushed a commit for implementing the MFCC, some notes:

redouane-dziri commented 4 years ago

I corrected generate_glcm: it was outputting (256x256) gray-level maps instead of (16x16). Also corrected generate_glcms_from_dict, as it was generating GLCMs with angle 0 only, due to some tricky variable-replacement sh*t - that took a while to figure out ^^. I also had to modify generate_MFCC_from_dict; somehow it wasn't outputting the 30x40x50 arrays we were expecting. Added drop_first_glcm_level_from_dict to the pipeline as well.

redouane-dziri commented 4 years ago

All there should be left to do is to run the pipeline on the full data and we'll be done with this feature extraction preliminary step.

arnaudstiegler commented 4 years ago

@redouane-dziri, I think there is one last preprocessing/piping step: a function that extracts the data and formats it so it can be fed to our model. It's pretty straightforward for the most part. The only thing that's gonna require more work is the i-GLCM, which combines all of the angles; that will need some data engineering because of the data format we chose. I think it might be best to have a JSON for this as well (so we don't have to go through the process every time). I'll work on that!
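The angle-combining step could look something like this sketch; averaging over the four angles is an assumption here, as the thread doesn't say how the i-GLCM combines them:

```python
# Sketch: combine the per-angle GLCMs into a single "i-GLCM" map by
# averaging over the angle axis. Averaging is an assumed combination rule.
import numpy as np

glcms = np.random.rand(16, 16, 1, 4)      # one 16x16 GLCM per angle
i_glcm = glcms.mean(axis=-1).squeeze(-1)  # average over the 4 angles
print(i_glcm.shape)  # (16, 16)
```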

arnaudstiegler commented 4 years ago

So I finished this:

A few notes: