mdangschat / ctc-asr

End-to-end trained speech recognition system, based on RNNs and the connectionist temporal classification (CTC) cost function.
MIT License

Inference graph #12

Closed. ramrahu closed this issue 4 years ago

ramrahu commented 5 years ago

Hello, I just wanted to know where the .pbtxt file is saved. I noticed your code creates this graph file, but I am not able to locate the code snippet that creates it. Thanks in advance.

mdangschat commented 5 years ago

Hi @ramrahu, if I remember correctly, I didn't use the protobuf export because it was incompatible with the py_func function used to convert and feed the audio files to the network. Depending on which graph you need, you could just try to export it.
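For reference, a minimal TF 1.x export sketch, assuming a trained tf.estimator.Estimator and a serving input that bypasses the py_func preprocessing by accepting already extracted features (the placeholder name and the [batch, time, 80] shape are only illustrative, not necessarily this project's):

import tensorflow as tf

# Assumption: `estimator` is a trained tf.estimator.Estimator whose model_fn
# consumes pre-extracted features, so the tf.py_func audio loading is not part
# of the exported graph.
def serving_input_receiver_fn():
    # Illustrative shape [batch, time, NUM_FEATURES]; adjust to the real input.
    features = tf.placeholder(tf.float32, [None, None, 80], name='features')
    return tf.estimator.export.ServingInputReceiver(
        features={'features': features},
        receiver_tensors={'features': features})

# Writes a SavedModel (graph plus weights) that can be loaded for inference.
# Newer TF 1.x releases name this method export_saved_model.
estimator.export_savedmodel('export_dir', serving_input_receiver_fn)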

ramrahu commented 5 years ago

Hmm... but I see a .pbtxt file created in the checkpoint directory. There must have been some code to create that file. I didn't add anything to export it.

mdangschat commented 5 years ago

Hi, sorry for the late response. I would assume that the tf.estimator.Estimator does write those. In that case it's probably configurable somewhere within the tf.estimator.RunConfig; however, I don't know exactly where.
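For anyone else tracing this: in TF 1.x the Estimator writes a graph.pbtxt into its model_dir as a side effect of calling train(), without any explicit export step. A minimal, self-contained sketch (the model_fn is a trivial stand-in, not this project's network):

import tensorflow as tf

def model_fn(features, labels, mode):
    # Trivial stand-in model, just to demonstrate the side effect.
    loss = tf.reduce_mean(tf.square(features['x'] - 1.0))
    train_op = tf.train.GradientDescentOptimizer(0.1).minimize(
        loss, global_step=tf.train.get_global_step())
    return tf.estimator.EstimatorSpec(mode, loss=loss, train_op=train_op)

def input_fn():
    return {'x': tf.constant([[0.0]])}, None

config = tf.estimator.RunConfig(model_dir='checkpoints')
estimator = tf.estimator.Estimator(model_fn=model_fn, config=config)
estimator.train(input_fn, steps=1)
# Afterwards, 'checkpoints/' contains graph.pbtxt next to the checkpoint files.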

ramrahu commented 5 years ago

Hmm, OK, thank you. Do you have any paper on this work that I could cite in mine?

mdangschat commented 5 years ago

If you could cite:

@thesis{Dangschat2018,
  author = {Dangschat, Marc},
  title = {End-to-End Speech Recognition Using Connectionist Temporal Classification},
  type = {Master Thesis},
  institution = {Münster University of Applied Sciences},
  date = {2018-11-21},
  url = {https://github.com/mdangschat/ctc-asr},
}

that would be great. Could you drop me a line when your work is finished? It would be interesting to dabble a bit in TensorFlow on Android.

ramrahu commented 5 years ago

Yeah sure. Will let you know when my work is finished. Thank you for your help.

ramrahu commented 5 years ago

Hi, I have a question. I see most speech recognition applications use 16 kHz audio files for training, and from my understanding of your work, I guess you have done the same. I was wondering why 44.1 kHz and 48 kHz aren't considered for training, since they may contain important frequencies that are lost at 16 kHz?

mdangschat commented 5 years ago

Hey, some of the reasons why I used 16 kHz training material were:

  1. The number of samples per second of recording: a higher sampling rate (SR) means more datapoints, so an RNN would require additional unrolls.
  2. To be comparable with other models, which mostly use 16 kHz.
  3. I merged several smaller datasets into a larger one and didn't want to upsample, so I stayed with the lowest SR.

ramrahu commented 5 years ago

Does the number of samples per second affect feature extraction? I read somewhere that it might. For example, at 44.1 kHz, if we use a 25 ms frame to extract MFCCs, as is generally done, the number of samples per frame isn't a whole number. That may affect MFCC extraction; I'm not sure in what way, though.
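To illustrate the arithmetic (the 25 ms window and 10 ms step below are just the commonly used values, not necessarily this project's settings):

# Samples per analysis window and step at different sampling rates.
for sr in (16000, 44100, 48000):
    window = sr * 0.025  # 25 ms frame
    step = sr * 0.010    # 10 ms hop
    print(sr, window, step)
# 16000 -> 400.0 samples per frame, 160.0 per hop (whole numbers)
# 44100 -> 1102.5 samples per frame, 441.0 per hop (frame is not a whole number)
# 48000 -> 1200.0 samples per frame, 480.0 per hop (whole numbers)

In practice, toolkits typically round the window to an integer number of samples, so the effect is a sub-sample change in the effective window length rather than anything that breaks the extraction.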

mdangschat commented 5 years ago

I'm not sure how an increased sampling rate affects windowed features like MELSCALE/MFCC. I would assume that it allows for more precise discrete datapoints, which could maybe help if you also used more of them, for example by increasing the 80 coefficients (NUM_FEATURES) that I use.

The window size (and step) should be tuned to fit common speech features, e.g. phonemes.

The increase in computation I mentioned was more in reference to raw data input, or to a misuse of the librosa API, which I used initially while prototyping this project.
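As an illustration of such windowed extraction, here is standard librosa usage, not this repository's pipeline; the 80 coefficients mirror NUM_FEATURES, and the 25 ms window with a 10 ms step are common choices:

import librosa

# Load (and resample, if necessary) the recording at 16 kHz.
y, sr = librosa.load('sample.wav', sr=16000)

# 25 ms windows with a 10 ms hop, expressed in samples.
mfcc = librosa.feature.mfcc(
    y=y, sr=sr, n_mfcc=80,
    n_fft=int(0.025 * sr),       # 400 samples per window
    hop_length=int(0.010 * sr))  # 160 samples per step

print(mfcc.shape)  # (80, number_of_frames)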

ramrahu commented 5 years ago

OK, thanks. One last question: I didn't understand what you meant by datapoints?

mdangschat commented 5 years ago

Raw speech data and extracted features (e.g. MFCCs) are time series data, which contain a datapoint at every recorded time step.
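Concretely, for a hypothetical one-second clip (shapes only, to illustrate what "one datapoint per time step" means):

import numpy as np

sr = 16000
raw_audio = np.zeros(sr)              # 1 s of raw audio: one datapoint (amplitude) per sample
features = np.zeros((sr // 160, 80))  # ~100 frames (10 ms hop) of 80-dimensional features

print(raw_audio.shape)  # (16000,) -> 16,000 time steps
print(features.shape)   # (100, 80) -> 100 time steps with 80 values each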