keunwoochoi / music-auto_tagging-keras

Music auto-tagging models and trained weights in keras/theano
MIT License

How do I train my CNN with my own dataset #20

Closed: mv00147 closed this issue 6 years ago

mv00147 commented 7 years ago

Dear Scott, I wish to train the CNN presented by Keunwoochoi on my own dataset, which consists of .wav files in the path_1 specified in this code. How do I represent them in a suitable format so they can be fed to the network for training? This is my code. Unfortunately, most tutorials on the web only explain training CNNs on image data, not audio.

```python
from keras.models import Sequential
from keras.layers import Dense, Dropout, Activation, Flatten
from keras.layers.convolutional import Convolution2D, MaxPooling2D, ZeroPadding2D
from keras.optimizers import SGD, RMSprop
from keras.utils import np_utils
import numpy as np
import matplotlib.pyplot as plt
import matplotlib
import os
import theano
import wave
from numpy import *
from sklearn.utils import shuffle
from sklearn.cross_validation import train_test_split
from keras.layers.normalization import BatchNormalization
from keras.layers.advanced_activations import ELU, PReLU
from keras.utils.data_utils import get_file
from keras.models import Model
import time
from keras import backend as K
import audio_processor as ap
import pdb
import soundfile as sf

path_1 = 'C:\\Users\\Admin\\keras cnn tutorial\\input_data'
path_2 = 'C:\\Users\\Admin\\keras cnn tutorial\\input_data_resized'

listing = os.listdir(path_1)
num_samples = len(listing)
print(num_samples)

for file in listing:
    # sf.read() already returns the decoded samples and the sample rate,
    # so a separate np.fromfile() call on the raw .wav bytes is not needed
    data, sr = sf.read(os.path.join(path_1, file))
    # ndarray.resize() works in place and returns None; np.resize() returns
    # a new (200, 200) array instead
    data_f = np.resize(data, (200, 200))
```
keunwoochoi commented 7 years ago

@mv00147, I edited your comment for markdown formatting; it will make everyone happy!

keunwoochoi commented 7 years ago

A common way is to convert the audio to a mel-spectrogram. See my audio processing code, except that I made the mistake of applying **2 on top of the power mel-spectrogram, which is already a squared time-frequency representation. Then make each sample a 3-D array of height, width, and channel, where the number of channels is 1 if you're using a mono signal.
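
For a single file, a minimal sketch of that conversion might look like the following (the file name, the 12 kHz sample rate, and the 96 mel bands are illustrative assumptions, not values taken from the repo; `librosa.power_to_db` is the newer name for the `logamplitude` call used elsewhere in this thread):

```python
import numpy as np
import librosa

# Load a mono .wav file; the path and 12 kHz rate are placeholder choices
y, sr = librosa.load('example.wav', sr=12000, mono=True)

# Power mel-spectrogram on a dB scale; n_mels becomes the "height" of the input
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=96)
mel_db = librosa.power_to_db(mel, ref=1.0)

# Add a channel axis so each sample is (channels, height, width) = (1, 96, time)
x = mel_db[np.newaxis, :, :]
print(x.shape)
```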

drscotthawley commented 7 years ago

@mv00147 In the "build_datasets()" method of my training script, I do as @keunwoochoi suggested and make mel-spectrograms:

            aud, sr = librosa.load(audio_path, sr=None)
            melgram = librosa.logamplitude(librosa.feature.melspectrogram(aud, sr=sr, n_mels=96),ref_power=1.0)[np.newaxis,np.newaxis,:,:]

This melgram is basically a 2-d image. Then you can add it to the list of training data you're setting up in your "X" array:

X_train[train_count,:,:] = melgram

where train_count is just a counter you increment each time you load a new audio file.
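
To make the whole pipeline concrete, here is a hedged sketch of what such a loading loop could look like; the directory argument, the 96 mel bands, and the fixed frame count are illustrative assumptions rather than the exact contents of the training script, and `librosa.power_to_db` stands in for the older `librosa.logamplitude`:

```python
import os
import numpy as np
import librosa

def build_datasets(audio_dir, n_mels=96, n_frames=1366):
    """Sketch: load every .wav in audio_dir into one X array of mel-spectrograms."""
    files = [f for f in os.listdir(audio_dir) if f.endswith('.wav')]
    X_train = np.zeros((len(files), 1, n_mels, n_frames), dtype=np.float32)

    train_count = 0
    for fname in files:
        aud, sr = librosa.load(os.path.join(audio_dir, fname), sr=None)
        melgram = librosa.power_to_db(
            librosa.feature.melspectrogram(y=aud, sr=sr, n_mels=n_mels), ref=1.0)
        # Pad or truncate along time so every sample has the same width
        melgram = melgram[:, :n_frames]
        padded = np.zeros((n_mels, n_frames), dtype=np.float32)
        padded[:, :melgram.shape[1]] = melgram
        # One channel per sample: (1, n_mels, n_frames)
        X_train[train_count, 0, :, :] = padded
        train_count += 1

    return X_train
```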

Does that help?

keunwoochoi commented 6 years ago

(hopefully it did!)