How did you extract the vision features?

Hi, thanks for sharing your code.

I have some questions.

How did you obtain a DenseNet model trained on the FER+ dataset? Did you fine-tune it yourself? If so, could you share the model you used for feature extraction?
How do you extract vision features from video data? In the paper, you use DenseNet to extract the vision features, so I am wondering how this works for video datasets. Did you use only a single sampled frame to get the features, or did you use a time series of frames?
Is there any plan to share the code for extracting all of the features (text, vision, audio)?
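
To make the question about video concrete, here is a minimal sketch of the two options I have in mind. Everything in it is an assumption for illustration: `toy_extractor` is a hypothetical placeholder standing in for a DenseNet forward pass, and the frames are plain lists of numbers. None of it is taken from the paper or this repository.

```python
from typing import Callable, List, Sequence

Frame = Sequence[float]   # stand-in for one decoded video frame
Feature = List[float]     # stand-in for one feature vector


def toy_extractor(frame: Frame) -> Feature:
    """Hypothetical placeholder for a per-frame feature extractor (not a real DenseNet)."""
    return [sum(frame) / len(frame), max(frame)]


def single_frame_feature(frames: Sequence[Frame],
                         extract: Callable[[Frame], Feature]) -> Feature:
    """Option A: describe the whole clip by its middle frame only."""
    return extract(frames[len(frames) // 2])


def mean_pooled_feature(frames: Sequence[Frame],
                        extract: Callable[[Frame], Feature]) -> Feature:
    """Option B: extract a feature per frame, then average the features over time."""
    feats = [extract(f) for f in frames]
    return [sum(column) / len(feats) for column in zip(*feats)]


# A toy "clip" of three frames, two values per frame.
clip = [[0.0, 2.0], [1.0, 3.0], [2.0, 10.0]]
option_a = single_frame_feature(clip, toy_extractor)
option_b = mean_pooled_feature(clip, toy_extractor)
```

Was the vision feature for each utterance computed like option A, like option B, or in some other way (for example, keeping the per-frame sequence)?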

Thank you

hujingwen6666 / MMGCN