tuanchien / asd

Active Speaker Detection

Speaking/Not speaking annotations for various frame sizes #9

Closed godatta closed 3 years ago

godatta commented 3 years ago

Hi, thanks for the code, I could finally run it. I still have a few questions about the way you annotate, and it would be great if you could kindly answer them.

1) Right now, you assign the label of the last frame in the video frame window as the ground truth. Is this the best approach in your view, even with a large window size, say 25 frames? What other annotation schemes might you suggest?

2) I see the training/test data size increases with the sequence length and the video frame count. Why is that? Aren't you iterating over the same number of frames (at a fixed fps, the total number of frames should be constant)?

3) What is the relationship between the number of video frames and the sequence length? For example, is 25 frames in 1 sequence the same as 5 frames in each of 5 sequences? If there is more than one sequence, how is the labelling currently done?

Thanks!

tuanchien commented 3 years ago

The annotations were released by Roth et al. at Google Research: https://research.google/pubs/pub49517/ Neither Jamie nor I is affiliated with them in any way.

  1. You will have to experiment yourself. You cannot get frames "in the future", so predicting on the last frame is the best you can do unless you are trying to solve a different problem.
  2. I'm not sure what you are trying to ask here. If you increase the sequence length, the amount of data you pass in as a single sample point increases.
  3. If you increase the sequence length, you should increase the number of video frames as well. I don't understand your question regarding multiple sequences.
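The last-frame labelling scheme discussed above can be sketched roughly as follows. This is a hypothetical illustration, not the repository's actual code; the helper name `make_windows` and its signature are assumptions:

```python
def make_windows(frame_labels, seq_len):
    """Build overlapping sliding-window samples from per-frame labels.

    Each sample covers seq_len consecutive frames and takes the label of
    its LAST frame, since future frames are unavailable at prediction time.
    """
    samples = []
    for end in range(seq_len - 1, len(frame_labels)):
        window = list(range(end - seq_len + 1, end + 1))  # frame indices in the window
        samples.append((window, frame_labels[end]))       # ground truth = last frame's label
    return samples

# Per-frame speaking (1) / not speaking (0) labels for a toy clip:
labels = [0, 0, 1, 1, 1, 0]
for win, y in make_windows(labels, seq_len=3):
    print(win, y)
```

Note that because the windows overlap, each source frame is copied into up to `seq_len` samples, so the total volume of data passed to the model grows with the sequence length even though the number of underlying frames is fixed, which is one way to read question 2.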