AwePhD opened this issue 1 year ago
I have updated the issue with a workable approach that I used. I will publish a repo and try a PR if having this task in mmdet is of interest.
Basically, there is a data sample abstraction to build the annotations properly. It is very similar to the detection data sample: one sample is a frame (like a detection sample), and each frame holds an InstanceData (described below). Those instances carry only labels and detection boxes as annotations (like regular detection instance data), without all the segmentation fields. We can then add another field (labels_reid) for the person IDs, analogous to the labels used for detection categories. That gives us our data samples for the Detection-ReID (D-ReID) task (those were the main lines); a minimal sketch is given below.

We must also implement its transforms (loading and formatting); once again the code is very close to the detection transforms. Moreover, if our InstanceData for D-ReID has the fields necessary for the detection task, the same samples can be used by an mmdet detector. After this, we can implement a dataset from mmengine's BaseDataset. We just have to write a script that converts the D-ReID annotation sources to the OpenMMLab format; the D-ReID dataset is then almost identical to mmengine's BaseDataset, which is great.
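To make the data sample concrete, here is a minimal sketch assuming mmengine's BaseDataElement / InstanceData API (the same property pattern mmdet's DetDataSample uses); the class name `DReIDDataSample` and the `labels_reid` field are my own naming for illustration, not existing mmdet code.

```python
import torch
from mmengine.structures import BaseDataElement, InstanceData


class DReIDDataSample(BaseDataElement):
    """One frame of D-ReID annotations (hypothetical name)."""

    @property
    def gt_instances(self) -> InstanceData:
        return self._gt_instances

    @gt_instances.setter
    def gt_instances(self, value: InstanceData):
        self.set_field(value, '_gt_instances', dtype=InstanceData)

    @gt_instances.deleter
    def gt_instances(self):
        del self._gt_instances


# Usage: the same fields a detector expects (bboxes, labels),
# plus labels_reid for the person identity.
sample = DReIDDataSample()
instances = InstanceData()
instances.bboxes = torch.tensor([[10., 20., 60., 150.]])  # x1, y1, x2, y2
instances.labels = torch.tensor([0])          # detection category ("person")
instances.labels_reid = torch.tensor([4211])  # person ID (made-up value)
sample.gt_instances = instances
```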
Then we can implement the D-ReID model. As an illustration, we can write a base class for the D-ReID model that holds a detector and a ReID module; we can then take a detector (DeformableDETR) and write our ReID model. Some work remains to gracefully handle inference and losses across the detector, the ReID model, and the overall D-ReID model. Basically, each task computes its own loss in its own module (ReID and detector), and if a loss requires both outputs (ReID output and detection output), its computation can be done in the D-ReID model, as sketched below.
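A hedged sketch of that base class, assuming mmengine's BaseModel interface and the `loss()` / `extract_feat()` / `predict()` conventions of mmdet detectors; `DReIDBaseModel` and the `reid_head` interface are assumptions for illustration, not a definitive design.

```python
from mmengine.model import BaseModel


class DReIDBaseModel(BaseModel):
    """Holds a detector plus a ReID module and merges their losses.

    Sketch only: `reid_head` is assumed to expose a
    `loss(feats, data_samples)` method returning a dict of losses.
    """

    def __init__(self, detector, reid_head):
        super().__init__()
        self.detector = detector
        self.reid_head = reid_head

    def forward(self, inputs, data_samples=None, mode='tensor'):
        if mode == 'loss':
            losses = dict()
            # Each sub-task computes its own loss in its own module ...
            losses.update(self.detector.loss(inputs, data_samples))
            # (A real implementation would share this backbone forward
            # with the detector loss instead of recomputing it.)
            feats = self.detector.extract_feat(inputs)
            losses.update(self.reid_head.loss(feats, data_samples))
            # ... and a loss that needs both detection and ReID outputs
            # would be computed here, in the D-ReID model itself.
            return losses
        if mode == 'predict':
            # Run detection, then attach ReID features per detection
            # (RoI pooling / query-feature sharing details omitted).
            return self.detector.predict(inputs, data_samples)
        return self.detector(inputs, data_samples, mode=mode)
```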
Then we just have to implement a new metric, which will depend on the dataset. For example, CUHK-SYSU has a specific query/gallery structure.
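Such a metric could follow mmengine's BaseMetric contract, as in the sketch below; the class name, the `reid_feats` field, and the stubbed protocol are all assumptions, since the actual CUHK-SYSU matching rules would live in `compute_metrics`.

```python
from mmengine.evaluator import BaseMetric


class CUHKSYSUMetric(BaseMetric):
    """Sketch of a query/gallery metric for CUHK-SYSU (hypothetical)."""

    def process(self, data_batch, data_samples):
        # Accumulate per-frame predictions; `reid_feats` is an assumed
        # field holding one ReID embedding per detected box.
        for sample in data_samples:
            pred = sample['pred_instances']
            self.results.append(dict(
                img_id=sample['img_id'],
                bboxes=pred['bboxes'],
                scores=pred['scores'],
                feats=pred['reid_feats'],
            ))

    def compute_metrics(self, results):
        # CUHK-SYSU defines a specific gallery per query: match each
        # query feature against its gallery detections, rank by
        # similarity, then compute mAP / top-1. Protocol omitted here.
        raise NotImplementedError('dataset-specific query/gallery matching')
```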
Those are the main lines of the implementation. I am almost done with a PSTR implementation and should finish in a few days/weeks. I will try a PR if mmdet is interested in this kind of task (and to validate the reusability of my code, I guess).
Note: In the initial post I wrote about using MOT. This is not a working approach. MOT models take frames as input; they only need the video abstraction for their tracker part. In the D-ReID task, we do not need to pack frames by person ID. The reason is that there might be multiple labeled detections (detections with a person ID) in a single frame, so if a "person ID" sample were built by packing frames by person ID, the same frame would appear more than once across multiple "person ID" samples. Using an alternative data sample of the detection task is more straightforward: we only add a label for the person ID in the InstanceData.
Hello,
I'm currently exploring the Detection ReID task, which is distinct from the MOT task but has strong ties to the detection task. Notably, implementations like PSTR and AlignPS have utilized earlier versions of mmdet. I'm seeking clarity on whether the tracking paradigm is suitable for training and testing a Detection ReID model or if a custom approach is more appropriate.
Brief Overview of Detection ReID Task: The Detection ReID task adopts the standard ReID query/gallery paradigm. Given a query frame with an individual of interest and a collection of gallery frames containing various people, the detection component extracts detections from each frame. Subsequently, the person of interest from the query frame is isolated, and detections from the gallery frames are filtered based on confidence scores. Features are extracted from each detection (both query and gallery) using a ReID model. The final step involves computing similarities between the query features and each gallery feature. Sorting these similarities reveals the most likely matches for the person of interest.
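To illustrate the filtering and ranking steps above, here is a minimal, self-contained sketch assuming ReID features are plain torch tensors; all names and thresholds are illustrative, not from any existing implementation.

```python
import torch
import torch.nn.functional as F


def rank_gallery(query_feat: torch.Tensor,
                 gallery_feats: torch.Tensor,
                 gallery_scores: torch.Tensor,
                 score_thr: float = 0.5):
    """Rank gallery detections by similarity to one query person.

    query_feat:     (D,)   ReID feature of the person of interest.
    gallery_feats:  (N, D) ReID features of all gallery detections.
    gallery_scores: (N,)   detection confidences, used for filtering.
    """
    # Filter gallery detections by detection confidence.
    keep = torch.nonzero(gallery_scores > score_thr).squeeze(1)
    # Cosine similarity between the query and each kept detection.
    sims = F.cosine_similarity(query_feat[None], gallery_feats[keep], dim=1)
    # Most likely matches first.
    order = sims.argsort(descending=True)
    return keep[order], sims[order]


# Example with random features (D=256, 5 gallery detections).
q = F.normalize(torch.randn(256), dim=0)
g = F.normalize(torch.randn(5, 256), dim=1)
s = torch.tensor([0.9, 0.3, 0.8, 0.95, 0.6])
indices, similarities = rank_gallery(q, g, s)
```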
Having examined the MOT task's implementation, I'm uncertain if leveraging the MOT tracker and ReIDDataSample would be an efficient approach for the Detection ReID task. One primary distinction is that while the tracking paradigm assigns labels within a single video, the gallery frames in Detection ReID originate from disparate cameras. This distinction makes me skeptical about fitting the Detection ReID paradigm under the MOT tracking task framework.
I'd greatly appreciate insights and recommendations from the development team regarding this. Is it feasible to adapt the existing tracking paradigm for this, or should a distinct approach be considered?
Thank you in advance. Mathias.