leichtrhino / ChimeraNet

Unofficial implementation of the music separation model by Luo et al.

MIT License

Total training samples #3

Closed prashant45 closed 5 years ago

prashant45 commented 5 years ago

Hi,

I have a question about the generate_test_data function in data_generator.py.

How do you calculate the number of steps (samples) = 7200 in your training script using this function? The Keras generator API requires you to know the number of samples/steps per epoch.

How can I use this to calculate the number of samples for a different dataset? Also, how can I calculate the same for a validation set?

Any help understanding this would be appreciated.

leichtrhino commented 5 years ago

Hello,

The number of training samples per epoch is

batch_size * steps_per_epoch ( in fit_generator ),
[--batch-size] * [--steps] ( in chimeranet-train.py ).

Also, the number of validation samples per epoch is

batch_size * validation_steps ( in fit_generator ),
[--batch-size] * [--validation-steps] ( in chimeranet-train.py ).
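As a rough illustration of the two relations above, here is a minimal Python sketch. The helper names `samples_per_epoch` and `steps_for` are hypothetical and not part of ChimeraNet; the ceiling division is the usual convention for a finite dataset of known size, which is an assumption here since this repository generates samples on the fly.

```python
import math


def samples_per_epoch(batch_size: int, steps: int) -> int:
    """Samples drawn in one epoch: batch_size * steps_per_epoch."""
    return batch_size * steps


def steps_for(num_samples: int, batch_size: int) -> int:
    """Steps needed to cover num_samples once (the last batch may be partial)."""
    return math.ceil(num_samples / batch_size)


# Example values only: a batch size of 8 and 900 steps.
print(samples_per_epoch(batch_size=8, steps=900))
print(steps_for(num_samples=7200, batch_size=8))
```

The same two helpers apply to the validation set by substituting validation_steps for steps_per_epoch.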

prashant45 commented 5 years ago

Hi,

I understand the steps_per_epoch parameter of Keras's fit_generator function.

My question was about the default value [--steps] = 7200 // 8. How did you choose or calculate 7200 as the total number of samples in this case? I know 8 is your batch size.

Based on your function, https://github.com/arity-r/ChimeraNet/blob/389fb54ad9b68c77ab99875c4babc443af68904e/data_generator.py#L57-L66

you randomly select one vocal file and one melody file and load only 0.5 seconds of audio from each (your default for [--duration]), then mix them to create one sample for training/validation.

Is your while loop of train_generator or validation_generator creating a unique sample using generate_one() from the dataset, during an epoch?

How can I calculate the number of unique samples in my dataset? For instance, suppose I have 10 melody files and 10 vocal files, each 2 min long, and I set [--duration] = 1.

I am sorry if it's a dumb question, but I am new to speech data.

leichtrhino commented 5 years ago

I chose 7200 as the number of samples for no particular reason. 7200 samples of 0.5 seconds each add up to 1 hour of audio per epoch.
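The arithmetic behind that figure can be checked with a short Python sketch. The function name `audio_seconds_per_epoch` is hypothetical; the numbers (7200 samples, 0.5 s duration, batch size 8) are the defaults mentioned in this thread.

```python
def audio_seconds_per_epoch(num_samples: int, duration_s: float) -> float:
    """Total audio length, in seconds, that the network sees in one epoch."""
    return num_samples * duration_s


num_samples = 7200   # samples per epoch, chosen arbitrarily
duration_s = 0.5     # default --duration
batch_size = 8       # default --batch-size

seconds = audio_seconds_per_epoch(num_samples, duration_s)
print(seconds / 3600)             # hours of audio per epoch
print(num_samples // batch_size)  # corresponding default for --steps
```

Working backwards, a target amount of audio per epoch divided by the duration gives the number of samples, and dividing that by the batch size gives the steps.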

> Is your while loop of train_generator or validation_generator creating a unique sample using generate_one() from the dataset, during an epoch?

Yes, unless the function happens to choose the same vocal and melody files, pick the same ranges, and mix them at the same power level.

> How can I calculate the number of unique samples in my dataset, for instance if I have 10 melody, vocal files each 2 min long and I set [--duration] = 1?

Almost infinitely many. Some will be similar to each other, depending on how the function mixes the vocal and melody.

I use fit_generator because I can generate infinitely many samples. I can only choose the number of samples per epoch, not calculate it.

Actually, I don't know whether I'm doing it right. I hope this helps.
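To give a feel for "almost infinitely many", here is a rough Python count for the example in the question (10 vocal and 10 melody files, 2 minutes each, 1-second windows). The helper names are hypothetical, and the 16 kHz sample rate is an assumption made only for illustration. The count ignores the random mixing level, which multiplies the possibilities further.

```python
def nonoverlapping_windows(file_seconds: float, duration_s: float) -> int:
    """Number of back-to-back windows that fit in one file."""
    return int(file_seconds // duration_s)


def window_starts(file_seconds: float, duration_s: float, sample_rate: int) -> int:
    """Number of distinct start positions when a window may begin at any audio sample."""
    total = int(file_seconds * sample_rate)
    window = int(duration_s * sample_rate)
    return max(total - window + 1, 0)


n_files = 10          # files per source (vocal, melody)
file_seconds = 120    # 2 minutes each
duration_s = 1        # --duration
sample_rate = 16000   # assumed sample rate, for illustration only

# Pair any vocal window with any melody window.
coarse = (n_files * nonoverlapping_windows(file_seconds, duration_s)) ** 2
fine = (n_files * window_starts(file_seconds, duration_s, sample_rate)) ** 2
print(coarse)  # distinct mixtures using only non-overlapping windows
print(fine)    # distinct mixtures when windows can start at any sample
```

Even the coarse count is far larger than 7200, which is why the number of samples per epoch is a choice rather than something derived from the dataset.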

prashant45 commented 5 years ago

Hi,

Thanks for the clarification. Yes, the loop is much clearer now.

In general, though, I believe the network should see unique samples from the dataset during an epoch. It probably wouldn't matter if two samples overlap by a couple of milliseconds.

With your loop, some data may never be seen by the network even though it is present in the dataset, or two samples may overlap by more than 50%.
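One way to address the coverage concern above is to enumerate non-overlapping segments once per epoch and shuffle them, instead of sampling positions at random. The following is only a sketch of that alternative, not how ChimeraNet works; the function `epoch_segments` and its arguments are hypothetical, and it only produces (file index, start time) pairs rather than loading audio.

```python
import random


def epoch_segments(file_lengths_s, duration_s, seed=None):
    """Return a shuffled list of (file_index, start_seconds) pairs.

    Each file is cut into back-to-back windows of duration_s seconds, so
    every window appears exactly once per epoch and no two windows overlap.
    Any remainder shorter than duration_s at the end of a file is dropped.
    """
    segments = []
    for index, length in enumerate(file_lengths_s):
        n_windows = int(length // duration_s)
        segments.extend((index, k * duration_s) for k in range(n_windows))
    random.Random(seed).shuffle(segments)
    return segments


# Example: 10 files of 120 seconds each, 1-second windows.
segments = epoch_segments([120] * 10, duration_s=1, seed=0)
print(len(segments))
```

With this scheme the number of samples per epoch is fixed by the dataset, at the cost of losing the variety that random offsets and random vocal/melody pairings provide.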

Anyways, thanks for the help. :D