Closed by dmascali 2 years ago
Sorry about the confusion in the paper. By 'real time' we mean only inference efficiency (the inference time for a single frame). I believe what you describe is online detection/learning. Some works investigate this problem, for example 'Not only Look, but also Listen: Learning Multimodal Violence Detection under Weak Supervision' by Wu et al.
Hi, in the paper you wrote that your system can achieve good real-time detection in real-world applications.
I am wondering how I could use it in real time, given that the MTN module has to process all the video snippets at once, even in test mode. Do you think it is possible to obtain a score for a single feature set (i.e., 16 frames) while still benefiting from the trained MTN module? If so, could you please provide more details?
Thank you!
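For readers with the same question, one possible workaround (not confirmed by the authors) is to keep a sliding window of the most recent snippet features, run the temporal model over that window each time a new snippet arrives, and read off only the score of the newest snippet. The sketch below is a minimal illustration under stated assumptions: `SlidingWindowScorer`, `dummy_score_fn`, and the window size of 32 are all hypothetical, and `dummy_score_fn` is only a stand-in for the trained model, which would need to accept a variable-length list of snippet features and return one score per snippet.

```python
from collections import deque
from typing import Callable, Deque, List, Sequence

Feature = Sequence[float]


class SlidingWindowScorer:
    """Keeps the last `window` snippet features and re-scores them on every push.

    Hypothetical helper, not part of the original code base.
    """

    def __init__(self, score_fn: Callable[[List[Feature]], List[float]],
                 window: int = 32) -> None:
        # score_fn maps a list of snippet features to one score per snippet.
        self.score_fn = score_fn
        # deque with maxlen drops the oldest feature automatically.
        self.buffer: Deque[Feature] = deque(maxlen=window)

    def push(self, feature: Feature) -> float:
        """Add the newest snippet feature and return its score."""
        self.buffer.append(feature)
        scores = self.score_fn(list(self.buffer))
        return scores[-1]  # only the newest snippet's score is reported


def dummy_score_fn(features: List[Feature]) -> List[float]:
    """Stand-in for a trained temporal model (assumption, for illustration only).

    Each snippet's score mixes its own mean activation with the mean over
    the whole window, mimicking a model that uses temporal context.
    """
    means = [sum(f) / len(f) for f in features]
    context = sum(means) / len(means)
    return [0.5 * (m + context) for m in means]


if __name__ == "__main__":
    scorer = SlidingWindowScorer(dummy_score_fn, window=3)
    for feat in ([0.0, 0.0], [1.0, 1.0], [1.0, 1.0], [1.0, 1.0]):
        print(scorer.push(feat))
```

The window size trades temporal context against latency and compute per snippet. Because the model was trained on whole videos, scores obtained from a truncated window may differ from the offline scores, so this would need to be validated before relying on it.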