The paper mentions two audio processing methods: (1) extracting features directly from the entire audio segment, and (2) splitting the audio into segments and extracting features from each segment separately. However, in the UDVIA dataset, each video clip is a dialogue between two people. When extracting features from the entire audio segment, should the presence of different speakers be taken into account?
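For concreteness, here is a minimal sketch of the two strategies as I understand them. This is not the paper's code: the feature function (per-frame log energy), the sample rate, and the 1-second segment length are all my own assumptions for illustration.

```python
import numpy as np

SR = 16000  # assumed sample rate, not from the paper


def extract_features(audio: np.ndarray) -> np.ndarray:
    """Toy feature: per-frame log energy, a stand-in for real acoustic features."""
    frame_len = 512
    n_frames = len(audio) // frame_len
    frames = audio[: n_frames * frame_len].reshape(n_frames, frame_len)
    return np.log(np.sum(frames ** 2, axis=1) + 1e-8)


# Dummy 4-second clip in place of real dialogue audio.
audio = np.random.default_rng(0).standard_normal(SR * 4)

# Strategy (1): features over the entire audio segment, speakers mixed together.
whole_feats = extract_features(audio)

# Strategy (2): split into fixed-length segments, extract per segment.
SEGMENT_LEN = SR  # 1-second segments (my assumption)
segments = [audio[i : i + SEGMENT_LEN] for i in range(0, len(audio), SEGMENT_LEN)]
per_segment_feats = [extract_features(seg) for seg in segments]

print(whole_feats.shape)       # frame-level features over the full clip
print(len(per_segment_feats))  # one feature sequence per segment
```

Neither strategy, as sketched, distinguishes who is speaking in a given frame; that is exactly what prompts the question about whether speaker identity should be modeled when features come from the whole segment.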