I recently read your amazing paper and am interested in exploring improved methods for intent recognition based on it. To do that, I believe it's crucial to first work through the experiment's end-to-end pipeline. I appreciate that the experiment code has been generously made available, but I would also like to reproduce the feature-extraction process for the raw audio, raw video, and raw text.
Would it be possible for you to share the pre-trained model checkpoints used for the paper and its experiments? I noticed in the paper that bert-base-uncased was used as the feature extractor for the text modality, but the details for the other modalities do not seem to be disclosed, which is why I am opening this issue.
Thank you once again for writing and sharing such a paper and its implementation code.