Open filmo opened 6 years ago
I have personally worked on what you are suggesting, but for generative methods, where, given a prior detection, a 3D tracker can use low-level motion cues to focus resources on the parts of the image that most need them. However, the great thing about discriminative methods is that they don't have to rely on heuristics; they are powerful enough to detect objects with no priors! :) If you think about it, you could have a DNN that receives the bounding boxes from a past detection plus a new image and could maybe improve them (?), but I would guess that it would devolve into some sort of normalization of 2 consecutive yolo runs ..
Has any work been done to condition the current frame on the predictions from prior frames? My understanding is that, as of right now, each frame is its own inference with no prior. (Essentially, each frame is treated as i.i.d.)
Seems like greater minds than mine might be able to create a network that conditions on the prior to help stabilize the size and 'flicker' of the bounding boxes. While objects in natural images generally scale and rotate, they generally don't appear and disappear from frame to frame. I see that some work has been done with Spatiotemporal Sampling Networks and Flow-Guided Feature Aggregation to address these issues with regard to video tracking. I'm wondering if Yolo (or a derivative) has been extended in such a way?
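For reference, the simplest non-learned version of this idea is a post-processing step: match each box in the current frame to a box from the previous frame by IoU, then blend the coordinates with an exponential moving average. The sketch below is only an illustration under that assumption. It is not part of Yolo, the names `iou`, `smooth_boxes`, `alpha` and `iou_thresh` are made up for this example, and it only damps jitter in box size and position (it does nothing for boxes that vanish for a frame).

```python
from typing import List, Tuple

# A box is (x1, y1, x2, y2) in pixel coordinates (assumed convention).
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def smooth_boxes(prev: List[Box], curr: List[Box],
                 alpha: float = 0.6, iou_thresh: float = 0.3) -> List[Box]:
    """Blend each current box with its best-overlapping previous box.

    alpha is the weight given to the current frame; boxes with no
    previous match above iou_thresh are passed through unchanged.
    """
    smoothed: List[Box] = []
    used = set()  # indices of previous boxes that are already matched
    for c in curr:
        best_j, best_iou = -1, iou_thresh
        for j, p in enumerate(prev):
            if j in used:
                continue
            overlap = iou(c, p)
            if overlap >= best_iou:
                best_j, best_iou = j, overlap
        if best_j < 0:
            smoothed.append(c)  # new object: nothing to condition on
        else:
            used.add(best_j)
            p = prev[best_j]
            smoothed.append(tuple(alpha * ci + (1 - alpha) * pi
                                  for ci, pi in zip(c, p)))
    return smoothed
```

In use, `prev` would be the smoothed output of the previous frame and `curr` the raw detections for the new frame. A learned approach would presumably replace the fixed `alpha` with something the network predicts from image features.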