mikel-brostrom / boxmot

BoxMOT: pluggable SOTA tracking modules for segmentation, object detection and pose estimation models
GNU Affero General Public License v3.0
6.79k stars 1.72k forks

Possibility of reducing the acquisition of embeddings of objects. #1430

Closed ozayr closed 5 months ago

ozayr commented 6 months ago

Question

Just a thought: I wonder if it's possible to skip generating embeddings for objects on every frame when we have some degree of certainty that an object is the same one tracked in the previous frame. Not sure if what I'm saying makes sense, but this could significantly speed up the tracking part of the pipeline.

mikel-brostrom commented 6 months ago

This is a valid point @ozayr. It highly depends on the deployment environment: whether shadows are cast in the scene, the fact that appearance from one angle may look different from another (let's say somebody carries a backpack), and how objects move within the environment (they may move away from the camera), changing the amount of detail the camera can pick up due to resolution...

Let's take a specific example. If somebody enters the camera's field of view where a heavy shadow is cast, only half of their body may be visible; let's also assume this person carries a backpack. After walking a bit, this person, now fully visible, may turn around to face the camera. The first captured embedding would then not be representative of this person at all.

ozayr commented 6 months ago

I'm thinking, once an object, say the person with the backpack, has been assigned an ID...

github-actions[bot] commented 6 months ago

👋 Hello, this issue has been automatically marked as stale because it has not had recent activity. Please note it will be closed if no further activity occurs. Feel free to inform us of any other issues you discover or feature requests that come to mind in the future. Pull Requests (PRs) are also always welcomed!

mikel-brostrom commented 6 months ago

Dropping some ideas here:

For every n frames, we rely entirely on motion-based tracking algorithms (such as a Kalman filter, optical flow, or other predictive models) to estimate the object's location. After n frames, we generate a new embedding to revalidate the identity of the object. This approach can significantly reduce the computational load, as embedding generation is usually the most resource-intensive part of the tracking pipeline. The risk here is that if the object's appearance changes significantly within those n frames (due to occlusion, lighting changes, or orientation changes), the tracker might lose accuracy.
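A minimal sketch of that idea, assuming a hypothetical `Track` structure and an `extract_embedding` stand-in for the expensive ReID forward pass (neither is the actual BoxMOT API):

```python
import numpy as np

EMBED_INTERVAL = 10  # n: frames between embedding refreshes (assumed value)

class Track:
    """Illustrative track state; real trackers hold much more."""
    def __init__(self, track_id, embedding):
        self.track_id = track_id
        self.embedding = embedding
        self.frames_since_embed = 0

def extract_embedding(crop):
    # Placeholder for an expensive ReID model forward pass.
    return crop.mean(axis=(0, 1)) / 255.0

def update_track(track, crop):
    track.frames_since_embed += 1
    if track.frames_since_embed >= EMBED_INTERVAL:
        # Revalidate identity with a fresh embedding every n frames.
        track.embedding = extract_embedding(crop)
        track.frames_since_embed = 0
    # Otherwise skip the ReID model entirely and trust motion association.
    return track
```

Between refreshes, association would rely purely on the motion model, so n directly trades compute for robustness to appearance change.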

Introduce a lightweight neural network to perform quick re-identification checks. This network can be less complex than the main embedding generator but sufficient to catch obvious mismatches. As a fallback mechanism, if the lightweight model signals a potential mismatch, the system can generate a full embedding for a more thorough check.
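One way this gate-plus-fallback could look, assuming (hypothetically) that the cheap and full models emit embeddings of the same dimensionality so both can be compared against the stored track embedding:

```python
import numpy as np

def cosine_sim(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def reid_check(crop, track_embedding, cheap_model, full_model, gate=0.6):
    """Cheap ReID pass first; run the expensive model only on likely mismatches.

    `cheap_model`/`full_model` are illustrative callables, not BoxMOT API.
    Returns (identity_confirmed, embedding_to_keep).
    """
    cheap_emb = cheap_model(crop)
    if cosine_sim(cheap_emb, track_embedding) >= gate:
        return True, track_embedding       # identity confirmed cheaply
    full_emb = full_model(crop)            # fallback: thorough check
    return cosine_sim(full_emb, track_embedding) >= gate, full_emb
```

The `gate` threshold is an assumed value; in practice it would be tuned per ReID model and deployment scene.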

Maintain a buffer of recent embeddings and motion vectors. Use these historical data points to ensure that the object’s identity remains consistent over time, reducing the need for constant re-embedding.
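A possible shape for that buffer, as a sketch: match new embeddings against an averaged prototype of the last k embeddings rather than a single frame (the class name, buffer size, and threshold are all illustrative assumptions):

```python
from collections import deque

import numpy as np

class EmbeddingBuffer:
    """Keep the last k embeddings for a track; match against their mean."""
    def __init__(self, maxlen=30):
        self.buf = deque(maxlen=maxlen)  # old embeddings fall off automatically

    def add(self, emb):
        self.buf.append(emb)

    def prototype(self):
        # Averaged, re-normalized embedding is more stable than any single frame.
        proto = np.mean(self.buf, axis=0)
        return proto / (np.linalg.norm(proto) + 1e-12)

    def matches(self, emb, threshold=0.5):
        # Assumes unit-norm input embeddings, so the dot product is cosine similarity.
        return float(emb @ self.prototype()) >= threshold
```

Only when the match against the historical prototype weakens would a fresh full embedding be needed, which is where the compute saving comes from.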

Each of these approaches comes with trade-offs in terms of complexity, computational savings, and potential loss of accuracy. It’s important to benchmark these methods in the specific deployment environment to understand their impact fully. Experimenting with a combination of these strategies might yield the best results in balancing efficiency and accuracy.

ozayr commented 6 months ago

I have also seen that Nvidia has dropped SV3DT. This is something I have been thinking about for a while; occlusions are my worst enemy.

If one estimates the camera parameters the way UCMC track does, and then tracks based on projections onto the ground plane, this should also help with tracker accuracy, assuming all objects one would like to track are confined to the same ground plane, which most of the time is the case.

mikel-brostrom commented 6 months ago

Yes, it would be interesting to provide the option to feed a camera configuration file in order to convert 2D detections to 3D. Then it would be possible to do motion tracking on the ground plane, which according to UCMC is more reliable.
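The core of the ground-plane idea can be sketched with a homography: project the bottom-center ("foot point") of each bounding box into ground coordinates and run the motion model there. The matrix `H` below is a made-up placeholder; in practice it would come from the camera configuration/calibration file:

```python
import numpy as np

# Illustrative homography mapping image pixels to ground-plane coordinates
# (e.g. metres). Placeholder values; a real H comes from camera calibration.
H = np.array([[0.02, 0.0,   -5.0],
              [0.0,  0.05,  -1.0],
              [0.0,  0.001,  1.0]])

def bbox_to_ground(bbox, H):
    """Project the bottom-center of a bbox (x1, y1, x2, y2) onto the ground plane."""
    x1, y1, x2, y2 = bbox
    foot = np.array([(x1 + x2) / 2.0, y2, 1.0])  # homogeneous image point
    g = H @ foot
    return g[:2] / g[2]  # dehomogenize -> (X, Y) on the ground plane
```

With all tracked objects assumed confined to the same plane, Kalman filtering these (X, Y) points instead of pixel boxes is essentially what UCMC-style trackers rely on.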


ozayr commented 5 months ago

just leaving this here

https://github.com/VlSomers/bpbreid
