tuanchien / asd

Active Speaker Detection

Speaking/Not speaking annotations for various frame sizes #9

Closed godatta closed 3 years ago

godatta commented 3 years ago

Hi, thanks for the code, I could finally run it. I still have a few questions about the way you annotate, and it would be great if you could kindly answer them.

1) Right now, you assign the label of the last frame in the video frame window as the ground truth. Is this the best approach in your view, even with a large window size, say 25 frames? What other annotation schemes might you suggest?

2) I see the training/test data size increases with the sequence length and the video frame count. Why is that? Aren't you iterating over the same number of frames (at a fixed fps, the total number of frames should be constant)?

3) What is the relationship between the number of video frames and the sequence length? For example, is 25 frames in 1 sequence the same as 5 frames in each of 5 sequences? If there is more than one sequence, how is the labelling currently done?

Thanks!

tuanchien commented 3 years ago

The annotations were released by Roth et al. at Google Research: https://research.google/pubs/pub49517/ Neither Jamie nor I is affiliated with them in any way.

  1. You will have to experiment yourself. You cannot get frames "in the future", so predicting on the last frame is the best you can do unless you are trying to solve a different problem.
  2. I'm not sure what you are trying to ask here. If you increase the sequence length, the amount of data you pass in as a single sample point increases.
  3. If you increase the sequence length, you should increase the number of video frames as well. I don't understand your question regarding multiple sequences.
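The last-frame labelling scheme discussed above can be sketched roughly as follows. This is a hypothetical illustration, not the repository's actual code; the helper name `make_windows` and its signature are assumptions:

```python
def make_windows(frame_labels, seq_len):
    """Build overlapping sliding-window samples from per-frame labels.

    Each sample covers seq_len consecutive frames and takes the label of
    its LAST frame, since future frames are unavailable at prediction time.
    """
    samples = []
    for end in range(seq_len - 1, len(frame_labels)):
        window = list(range(end - seq_len + 1, end + 1))  # frame indices in the window
        samples.append((window, frame_labels[end]))       # ground truth = last frame's label
    return samples

# Per-frame speaking (1) / not speaking (0) labels for a toy clip:
labels = [0, 0, 1, 1, 1, 0]
for win, y in make_windows(labels, seq_len=3):
    print(win, y)
```

Note that because the windows overlap, each source frame is copied into up to `seq_len` samples, so the total volume of data passed to the model grows with the sequence length even though the number of underlying frames is fixed, which is one way to read question 2.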