Closed: ramrahu closed this issue 4 years ago.
Hmm... but I see a .pbtxt file created in the checkpoint directory. There must have been some code to create that file. I didn't add anything to export it.
Hi, sorry for the late response. I would assume that the tf.estimator.Estimator does write those. In that case it's probably configurable somewhere within the tf.estimator.RunConfig; however, I don't know where exactly.
Hmm, ok, thank you. Do you have any paper on this work that I could cite for my work?
If you could cite:
@thesis{Dangschat2018,
author = {Dangschat, Marc},
title = {End-to-End Speech Recognition Using Connectionist Temporal Classification},
type = {Master Thesis},
institution = {Münster University of Applied Sciences},
date = {2018-11-21},
url = {https://github.com/mdangschat/ctc-asr},
}
that would be great. Could you drop me a line when your work is finished? It would be interesting to dabble a bit into TensorFlow on Android.
Yeah sure. Will let you know when my work is finished. Thank you for your help.
Hi, I have a question. I see most speech recognition applications use 16 kHz audio files for training, and from my understanding of your work, I guess you have done the same. I was wondering why 44.1 kHz and 48 kHz aren't considered for training, as they may contain important frequencies that are lost at 16 kHz?
Hey, some of the reasons why I used 16 kHz training material were:
Does the number of samples per second affect feature extraction? I read somewhere that it might. For example, at 44.1 kHz, if we use a 25 ms frame to extract MFCCs, as is generally done, the number of samples we get isn't a whole number. That may affect MFCC extraction; I'm not sure in what way it affects it, though.
I'm not sure how an increased sampling rate affects windowed features like mel-scale/MFCC. I would assume that it would allow for more precise discrete datapoints, which could maybe help if you also used more of them, for example by increasing the 80 coefficients (NUM_FEATURES) that I use.
The window size (and step) should be tuned to fit common speech features, such as phonemes.
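As a quick sanity check on the sample-count point, here is a plain-Python sketch (assuming the commonly used 25 ms window and 10 ms step, which are not quoted from the repository) showing which sampling rates yield a whole number of samples per frame:

```python
# Samples in a 25 ms analysis window and a 10 ms step, for the
# sampling rates discussed above. 16 kHz gives whole numbers;
# 44.1 kHz does not, so the window length would need rounding.
WINDOW_MS = 25
STEP_MS = 10

for rate_hz in (16_000, 44_100, 48_000):
    window_samples = rate_hz * WINDOW_MS / 1000
    step_samples = rate_hz * STEP_MS / 1000
    print(f"{rate_hz} Hz: window = {window_samples} samples, "
          f"step = {step_samples} samples")

# 16000 Hz: window = 400.0 samples, step = 160.0 samples
# 44100 Hz: window = 1102.5 samples, step = 441.0 samples
# 48000 Hz: window = 1200.0 samples, step = 480.0 samples
```

In practice, libraries simply round the frame length to the nearest sample, so 44.1 kHz still works; the frames are just not exactly 25 ms long.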
The increase in computation I mentioned was more in reference to raw data input, or a misuse of the librosa API, which I used initially while prototyping this project.
OK, thanks. One last question: I didn't understand what you mean by datapoints?
Raw speech data or extracted features (e.g. MFCC) are time series data, which contains a datapoint at every recorded time step.
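To illustrate (a minimal NumPy sketch with made-up values, not code from the repository): one second of raw 16 kHz audio is a time series of 16,000 datapoints, one per sample, while after windowed feature extraction each time step is one 80-dimensional datapoint (the NUM_FEATURES coefficients):

```python
import numpy as np

SAMPLE_RATE = 16_000   # 16 kHz audio
NUM_FEATURES = 80      # feature coefficients per time step
WINDOW = 400           # 25 ms window at 16 kHz
STEP = 160             # 10 ms step at 16 kHz

# One second of (dummy) raw audio: 16,000 datapoints, one per sample.
audio = np.zeros(SAMPLE_RATE, dtype=np.float32)

# After windowed feature extraction the time series is much shorter:
# one 80-dimensional datapoint per window step.
num_steps = 1 + (len(audio) - WINDOW) // STEP
features = np.zeros((num_steps, NUM_FEATURES), dtype=np.float32)

print(audio.shape)     # (16000,)
print(features.shape)  # (98, 80)
```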
Hello, I just wanted to know where you are saving the .pbtxt file? I noticed your code creates this graph file but I am not able to locate the code snippet for creating it. Thanks in advance