AwePhD opened this issue 1 year ago
I have updated the issue with a workable approach that I used. I will publish a repo and try a PR if having this task in mmdet is of interest.
Basically, there is a data sample abstraction to build the annotations properly. It is very similar to the detection data sample: one sample is a frame (like a detection sample), and each frame holds an InstanceData (described below). Those instances carry only labels and detection boxes as annotations (like regular detection instance data), without all the segmentation fields. We can then add another field (labels_reid) for the person IDs, analogous to the labels used for detection categories. That gives us our data samples for the Detection-ReID (D-ReID) task (those were the main lines); a minimal sketch is given below.

We must also implement its transforms (loading and formatting); once again the code is very close to the detection transforms. Moreover, if our InstanceData for D-ReID has the fields necessary for the detection task, the same samples can be used by an mmdet detector. After this, we can implement a dataset from mmengine's BaseDataset. We just have to write a script that converts the D-ReID annotation sources to the OpenMMLab format; the D-ReID dataset is then almost identical to mmengine's BaseDataset, which is great.
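To make the data sample concrete, here is a minimal sketch assuming mmengine's BaseDataElement / InstanceData API (the same property pattern mmdet's DetDataSample uses); the class name `DReIDDataSample` and the `labels_reid` field are my own naming for illustration, not existing mmdet code.

```python
import torch
from mmengine.structures import BaseDataElement, InstanceData


class DReIDDataSample(BaseDataElement):
    """One frame of D-ReID annotations (hypothetical name)."""

    @property
    def gt_instances(self) -> InstanceData:
        return self._gt_instances

    @gt_instances.setter
    def gt_instances(self, value: InstanceData):
        self.set_field(value, '_gt_instances', dtype=InstanceData)

    @gt_instances.deleter
    def gt_instances(self):
        del self._gt_instances


# Usage: the same fields a detector expects (bboxes, labels),
# plus labels_reid for the person identity.
sample = DReIDDataSample()
instances = InstanceData()
instances.bboxes = torch.tensor([[10., 20., 60., 150.]])  # x1, y1, x2, y2
instances.labels = torch.tensor([0])          # detection category ("person")
instances.labels_reid = torch.tensor([4211])  # person ID (made-up value)
sample.gt_instances = instances
```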
Then we can implement the D-ReID model. As an illustration, we can write a base class for the D-ReID model that holds a detector and a ReID module; we can then take a detector (DeformableDETR) and write our ReID model. Some work remains to gracefully handle inference and losses across the detector, the ReID model, and the overall D-ReID model. Basically, each task computes its own loss in its own module (ReID and detector), and if a loss requires both outputs (ReID output and detection output), its computation can be done in the D-ReID model, as sketched below.
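A hedged sketch of that base class, assuming mmengine's BaseModel interface and the `loss()` / `extract_feat()` / `predict()` conventions of mmdet detectors; `DReIDBaseModel` and the `reid_head` interface are assumptions for illustration, not a definitive design.

```python
from mmengine.model import BaseModel


class DReIDBaseModel(BaseModel):
    """Holds a detector plus a ReID module and merges their losses.

    Sketch only: `reid_head` is assumed to expose a
    `loss(feats, data_samples)` method returning a dict of losses.
    """

    def __init__(self, detector, reid_head):
        super().__init__()
        self.detector = detector
        self.reid_head = reid_head

    def forward(self, inputs, data_samples=None, mode='tensor'):
        if mode == 'loss':
            losses = dict()
            # Each sub-task computes its own loss in its own module ...
            losses.update(self.detector.loss(inputs, data_samples))
            # (A real implementation would share this backbone forward
            # with the detector loss instead of recomputing it.)
            feats = self.detector.extract_feat(inputs)
            losses.update(self.reid_head.loss(feats, data_samples))
            # ... and a loss that needs both detection and ReID outputs
            # would be computed here, in the D-ReID model itself.
            return losses
        if mode == 'predict':
            # Run detection, then attach ReID features per detection
            # (RoI pooling / query-feature sharing details omitted).
            return self.detector.predict(inputs, data_samples)
        return self.detector(inputs, data_samples, mode=mode)
```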
Then we just have to implement a new metric, which will depend on the dataset. For example, CUHK-SYSU has a specific query/gallery structure.
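Such a metric could follow mmengine's BaseMetric contract, as in the sketch below; the class name, the `reid_feats` field, and the stubbed protocol are all assumptions, since the actual CUHK-SYSU matching rules would live in `compute_metrics`.

```python
from mmengine.evaluator import BaseMetric


class CUHKSYSUMetric(BaseMetric):
    """Sketch of a query/gallery metric for CUHK-SYSU (hypothetical)."""

    def process(self, data_batch, data_samples):
        # Accumulate per-frame predictions; `reid_feats` is an assumed
        # field holding one ReID embedding per detected box.
        for sample in data_samples:
            pred = sample['pred_instances']
            self.results.append(dict(
                img_id=sample['img_id'],
                bboxes=pred['bboxes'],
                scores=pred['scores'],
                feats=pred['reid_feats'],
            ))

    def compute_metrics(self, results):
        # CUHK-SYSU defines a specific gallery per query: match each
        # query feature against its gallery detections, rank by
        # similarity, then compute mAP / top-1. Protocol omitted here.
        raise NotImplementedError('dataset-specific query/gallery matching')
```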
Those are the main lines of the implementation. I am almost done with a PSTR implementation and should finish in a few days/weeks. I will try a PR if mmdet is interested in this kind of task (and to validate the reusability of my code, I guess).
Note: In the initial post I wrote about using MOT. This is not a working approach. MOT models take frames as input; they only need the video abstraction for their tracker part. In the D-ReID task, we do not need to pack frames by person ID. The reason is that there might be multiple labeled detections (detections with a person ID) in a single frame, so if a "person ID" sample were built by packing frames by person ID, the same frame would appear more than once across multiple "person ID" samples. Using an alternative data sample of the detection task is more straightforward: we only add a label for the person ID in the InstanceData.
Hello,
I'm currently exploring the Detection ReID task, which is distinct from the MOT task but has strong ties to the detection task. Notably, implementations like PSTR and AlignPS have utilized earlier versions of mmdet. I'm seeking clarity on whether the tracking paradigm is suitable for training and testing a Detection ReID model or if a custom approach is more appropriate.
Brief Overview of Detection ReID Task: The Detection ReID task adopts the standard ReID query/gallery paradigm. Given a query frame with an individual of interest and a collection of gallery frames containing various people, the detection component extracts detections from each frame. Subsequently, the person of interest from the query frame is isolated, and detections from the gallery frames are filtered based on confidence scores. Features are extracted from each detection (both query and gallery) using a ReID model. The final step involves computing similarities between the query features and each gallery feature. Sorting these similarities reveals the most likely matches for the person of interest.
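To illustrate the filtering and ranking steps above, here is a minimal, self-contained sketch assuming ReID features are plain torch tensors; all names and thresholds are illustrative, not from any existing implementation.

```python
import torch
import torch.nn.functional as F


def rank_gallery(query_feat: torch.Tensor,
                 gallery_feats: torch.Tensor,
                 gallery_scores: torch.Tensor,
                 score_thr: float = 0.5):
    """Rank gallery detections by similarity to one query person.

    query_feat:     (D,)   ReID feature of the person of interest.
    gallery_feats:  (N, D) ReID features of all gallery detections.
    gallery_scores: (N,)   detection confidences, used for filtering.
    """
    # Filter gallery detections by detection confidence.
    keep = torch.nonzero(gallery_scores > score_thr).squeeze(1)
    # Cosine similarity between the query and each kept detection.
    sims = F.cosine_similarity(query_feat[None], gallery_feats[keep], dim=1)
    # Most likely matches first.
    order = sims.argsort(descending=True)
    return keep[order], sims[order]


# Example with random features (D=256, 5 gallery detections).
q = F.normalize(torch.randn(256), dim=0)
g = F.normalize(torch.randn(5, 256), dim=1)
s = torch.tensor([0.9, 0.3, 0.8, 0.95, 0.6])
indices, similarities = rank_gallery(q, g, s)
```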
Having examined the MOT task's implementation, I'm uncertain if leveraging the MOT tracker and ReIDDataSample would be an efficient approach for the Detection ReID task. One primary distinction is that while the tracking paradigm assigns labels within a single video, the gallery frames in Detection ReID originate from disparate cameras. This distinction makes me skeptical about fitting the Detection ReID paradigm under the MOT tracking task framework.
I'd greatly appreciate insights and recommendations from the development team regarding this. Is it feasible to adapt the existing tracking paradigm for this, or should a distinct approach be considered?
Thank you in advance. Mathias.