smeetrs / deep_avsr

A PyTorch implementation of the Deep Audio-Visual Speech Recognition paper.
MIT License

Details of visual front end model #39

Closed (gmh8000 closed 3 years ago)

gmh8000 commented 3 years ago

I noticed that the open-source model in Ref 1 is based on the TensorFlow framework. Have you converted it to PyTorch? In addition, the model in Ref 1 was trained with the visual front end and the back end together. Did you cut off the back end and retain only the weights and structure of the visual front end?

gmh8000 commented 3 years ago

Also, I notice that Ref 1 does not open-source the visual front end it uses. This visual front end may come from another article, titled "Combining Residual Networks with LSTMs for Lipreading". Did you directly use the model from that article as the visual front end?

smeetrs commented 3 years ago

Yes to both questions in the first comment.

Ref 1 provides links to its models (lip model and language model) in a shell script. I took the weights from those links.
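
For readers wondering what "cutting the back end" can look like in practice, here is a minimal sketch, not the code used in this repository. It assumes the checkpoint is a flat mapping from parameter names to weights, and that front end parameters share a common name prefix. The helper `keep_submodule_weights` and the key names `frontend.` and `backend.` are hypothetical; real checkpoints will use whatever names the model definition produced.

```python
from typing import Any, Dict, Mapping


def keep_submodule_weights(state_dict: Mapping[str, Any], prefix: str) -> Dict[str, Any]:
    """Return only the entries whose key starts with `prefix`, with the prefix stripped."""
    return {
        name[len(prefix):]: weights
        for name, weights in state_dict.items()
        if name.startswith(prefix)
    }


# Hypothetical checkpoint; strings stand in for the actual weight tensors.
full_checkpoint = {
    "frontend.conv3d.weight": "w0",
    "frontend.resnet.layer1.weight": "w1",
    "backend.encoder.weight": "w2",
}

# Keep the visual front end parameters and drop everything else.
frontend_only = keep_submodule_weights(full_checkpoint, "frontend.")
print(sorted(frontend_only))  # ['conv3d.weight', 'resnet.layer1.weight']
```

In PyTorch, a dictionary filtered this way could then be passed to `load_state_dict` on a module that contains only the front end layers, provided the remaining key names and tensor shapes match that module.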