visionml / pytracking

Visual tracking library based on PyTorch.
GNU General Public License v3.0

Is it possible to Track multi-targets in a single inference? #8

Closed dongfangduoshou123 closed 5 years ago

dongfangduoshou123 commented 5 years ago

Given a reference frame with multiple targets to track, is it possible to track all of them with ATOM using only a single forward pass of the network?

goutamgmb commented 5 years ago

Although not implemented currently, it should be possible to share the feature extraction for multiple targets. With the current implementation, you will have to track each target independently, i.e. one forward pass for each target.

dongfangduoshou123 commented 5 years ago

Thank you!
If that is the case, real-time multi-target tracking can also be achieved, which is essential for many practical scenarios. That's very good, great work!

dongfangduoshou123 commented 5 years ago

I found that at the tracking stage, ATOM extracts the backbone features from a target-specific area (5 times the estimated target size), not from the whole image. So the shared feature extraction refers to sharing features between the IoU-Net target position estimation and the target classification for a single tracked target. So to track multiple targets, each target would have to do its backbone feature extraction independently. Is that right? @martin-danelljan @goutamgmb Thank you!

dongfangduoshou123 commented 5 years ago

I am most concerned about the real-time performance after a multi-target extension. Would it be possible to first extract the backbone features of the whole image once, shared by all tracked targets, then take a sub-feature for each target, with each target's sub-feature shared between the IoU-Net and the online classifier? I think only then can real-time multi-target tracking be guaranteed.

martin-danelljan commented 5 years ago

This is one way it could be implemented. First of all, everything could be done in the global image coordinate frame, instead of cropping an image patch as done now.

  1. Extract features over the whole image.
  2. To train the target classification component, construct label maps in the global image coordinates and train the model for each object individually. (Alternatively you could crop a part of the extracted feature map and do training that way.)
  3. To use the target classifier, apply the filters for all objects over the feature map for the entire frame.
  4. To do target estimation, first apply the common conv layers in the IoU predictor, then do the PrPool in global coordinates, and then predict the final IoU.

Of course, we cannot guarantee real-time performance for multi-object tracking, because we have not implemented or tried it. I think it can be done fairly efficiently however, depending on how many objects you want to track. You are very welcome to try something like this, and report on your findings.
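
For illustration, steps 1 and 3 above could be sketched roughly as follows. This is hypothetical code, not the actual pytracking API: `frame_feat` and `target_filters` are assumed inputs, and the classifier is reduced to a plain correlation filter.

    # Hypothetical sketch: one shared backbone pass, then per-target classifiers
    # applied over the full-frame feature map (steps 1 and 3 above).
    import torch.nn.functional as F

    def classify_all_targets(frame_feat, target_filters):
        """frame_feat: whole-image backbone features, shape (1, C, H, W).
        target_filters: per-target classifier filters, each of shape (1, C, h, w)."""
        scores = []
        for filt in target_filters:
            # Apply each target's learned filter over the shared feature map.
            pad = (filt.shape[-2] // 2, filt.shape[-1] // 2)
            scores.append(F.conv2d(frame_feat, filt, padding=pad))
        return scores  # one score map per target from a single feature extraction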

dongfangduoshou123 commented 5 years ago


"""4. To do target estimation. First apply the common conv layers in the iou predictor. Then do the PrPool in the global coordinates. Then predict the final IoU""" means dose not need to get sub_feature for each tracked target?

I think the modulation vector should be computed per target, so each target's sub-feature should be taken from the whole feature map to generate its own modulation vector.

martin-danelljan commented 5 years ago

The modulation vector needs to be target specific. I'm not sure what you mean by the sub-feature. Anyway, the first step of the feature extraction in the IoU predictor can be shared across objects, as I mentioned earlier.

dongfangduoshou123 commented 5 years ago


After the first step of the feature extraction in the IoU predictor, does each target do the PrPool on the feature-map region (the sub-feature) corresponding to its bbox in the original image to get its target-specific modulation vector, and then predict its final IoU independently?

martin-danelljan commented 5 years ago

Yes.
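
For intuition, the confirmed flow might look roughly like this. All names here are illustrative, and torchvision's roi_align merely stands in for the PrPool (PrRoIPooling) operation that ATOM actually uses:

    import torch
    from torchvision.ops import roi_align

    def target_modulation_vectors(shared_iou_feat, boxes, modulation_net):
        """shared_iou_feat: (1, C, H, W) output of the shared IoU-predictor conv layers.
        boxes: (N, 4) per-target boxes in feature-map coordinates as (x1, y1, x2, y2).
        modulation_net: maps pooled features to one modulation vector per target."""
        # Prepend the batch index expected by roi_align, then pool each target's region.
        rois = torch.cat([torch.zeros(boxes.shape[0], 1), boxes], dim=1)
        pooled = roi_align(shared_iou_feat, rois, output_size=(3, 3))  # (N, C, 3, 3)
        # Each target then predicts its final IoU independently from its own vector.
        return modulation_net(pooled)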

momo1986 commented 5 years ago

Hello, @martin-danelljan , Martin.

Thanks for your sharing.

I have tried your open-source algorithm.

It works smoothly with single object tracking.

However, I have some problems with multiple-object tracking (e.g., 2 targets):

1) Instantiating multiple trackers and adding them to a tracker list: at inference time each tracker is updated separately, which keeps the accuracy but slows down the speed.

2) I am not sure whether your ATOM tracker can be added to cv2.MultiTracker_create(). I have tried this flow so that the multi-object tracker runs inference only once. Here is the code:

    # excerpt; assumes "import cv2 as cv" and that tracker_list, multi_tracker,
    # frame, frame_disp and display_name are defined above
    if optional_box is not None:
        assert isinstance(optional_box, (list, tuple))
        assert len(optional_box) == 4, "valid box format is [x, y, w, h]"
        for i in range(len(tracker_list)):
            tracker_list[i].initialize(frame_disp, optional_box)
            multi_tracker.add(tracker_list[i], frame_disp, optional_box)
    else:
        while True:
            # cv.waitKey()
            frame_disp = frame.copy()

            cv.putText(frame_disp, 'Select target ROI and press ENTER, two or more', (20, 30),
                       cv.FONT_HERSHEY_COMPLEX_SMALL, 1.5, (0, 0, 0), 1)
            cv.putText(frame_disp, str(len(tracker_list)), (40, 50),
                       cv.FONT_HERSHEY_COMPLEX_SMALL, 1.5, (0, 0, 0), 1)

            for i in range(len(tracker_list)):
                x, y, w, h = cv.selectROI(display_name, frame_disp, fromCenter=False)
                init_state = [x, y, w, h]
                tracker_list[i].initialize(frame, init_state)
                multi_tracker.add(tracker_list[i], frame_disp, init_state)
            break
    state = multi_tracker.track(frame)

It reports an error showing that this is not currently supported:

    Traceback (most recent call last):
      File "run_video_multiple_object.py", line 40, in <module>
        main()
      File "run_video_multiple_object.py", line 36, in main
        run_video(args.tracker_name, args.tracker_param, args.videofile, args.optional_box, args.debug)
      File "run_video_multiple_object.py", line 24, in run_video
        trackerList.run_video(videofilepath=videofile, optional_box=optional_box, debug=debug)
      File "/fast/junyan/Tracking/pytracking/pytracking/evaluation/TrackerList.py", line 81, in run_video
        track_videofile_multiple(videofilepath, self.trackerList, optional_box)
      File "/fast/junyan/Tracking/pytracking/pytracking/tracker/base/basetracker.py", line 83, in track_videofile_multiple
        multi_tracker.add(tracker_list[i], frame_disp, init_state)
    TypeError: Expected cv::Tracker for argument 'newTracker'

Thus, my question is: how can a multi-object tracker be built with the ATOM algorithm without the speed decaying?

Thanks if you can answer the question.

Regards!

momo1986 commented 5 years ago

@martin-danelljan , thanks in advance for your reply. Also, the multi-object tracker built with a tracker list triggers an error (see the attached error screenshot). I am not sure what the root cause of this failure is. How can I resolve it? Regards!

dongfangduoshou123 commented 5 years ago

With the current ATOM implementation, if you want to track 5 targets in a frame, I think you must do 5 backbone feature extractions independently, so the time cost will grow linearly: the extracted backbone features cover only the target-specific area, not the whole image, so they cannot be shared between targets, and one inference corresponds to one target. This is my understanding of ATOM.

In addition, the post-processing of the online classification model's raw scores in ATOM (as opposed to the online classifier training, which is described in detail) seems to have no explanation in the paper, so the process in the localize_target function that turns the 1×1×18×18 raw score into the final 1×1×288×288 score is somewhat hard to understand.

martin-danelljan commented 5 years ago

Hi @momo1986 Thanks for your interest. A naive multi-object tracker should be easy to implement using the functionality in the ATOM class. I have no idea about the OpenCV MultiTracker functionality. Just make a wrapper yourself that loops over all objects in each frame and calls the initialize() and track() functions accordingly. The only thing that might require some attention is sharing the network weights across all objects, so that you don't have one network in memory for each object. You should be able to do this by sharing the params.features for all objects.
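
A minimal sketch of such a wrapper could look like this. It is illustrative only: it assumes each tracker exposes initialize(frame, box) and track(frame), and that params.features holds the shared feature extractor.

    class NaiveMultiTracker:
        """Loops over per-target ATOM instances; one track() call per object."""
        def __init__(self, trackers):
            self.trackers = trackers
            # Share the feature extractor so only one network sits in memory.
            shared = trackers[0].params.features
            for t in trackers[1:]:
                t.params.features = shared

        def initialize(self, frame, boxes):
            for tracker, box in zip(self.trackers, boxes):
                tracker.initialize(frame, box)

        def track(self, frame):
            # Still one forward pass per target with the current implementation.
            return [tracker.track(frame) for tracker in self.trackers]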

@dongfangduoshou123 You are right that the upsampling of the scores is not described in the paper, due to space limitations. It is not an important element, and the tracker does very well without it (if you e.g. want to optimize it further).

We have no current plans of extending pytracking to multi-target. But we would welcome any contribution that neatly adds simple multi-target tracking functionality to the current structure.

dongfangduoshou123 commented 5 years ago

Thank you for your reply! The tracker does well without upsampling of the scores? But localize_target and localize_advanced both use the upsampled result to get the translation vector. It would be good if you could add a function to the ATOM class showing how to get the classification model's final translation vector without upsampling score_raw (the function could be named localize_target_without_score_upsampling). Thank you!

I mainly cannot understand this code (since the paper does not mention it):

    # Convert to displacements in the base scale
    disp = (max_disp + self.output_sz / 2) % self.output_sz - self.output_sz / 2

I think it should be disp = max_disp - self.output_sz / 2, but after this edit, the tracker will not work.

martin-danelljan commented 5 years ago

Hi. If you plot the scores at that stage, you will understand why that is needed.
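
For intuition, here is a small numeric illustration of that wrap (assuming the upsampled score map is treated as periodic, so a peak near the far edge corresponds to a small negative displacement; output_sz is simplified to a scalar here):

    output_sz = 288.0  # size of the upsampled score map

    def wrapped_disp(max_disp):
        # Maps a peak position into the range [-output_sz/2, output_sz/2).
        return (max_disp + output_sz / 2) % output_sz - output_sz / 2

    print(wrapped_disp(10.0))   # ->   10.0: small positive shift, unchanged
    print(wrapped_disp(280.0))  # ->   -8.0: a peak near the edge wraps around
    print(wrapped_disp(144.0))  # -> -144.0: the half-period boundary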

dongfangduoshou123 commented 5 years ago

OK, I will try. Thank you mogul!

dongfangduoshou123 commented 5 years ago

I have an idea; it may not be feasible: Kalman filtering + ATOM's online classification component. The online classification component provides a rough location, then a Kalman filter corrects the position and width/height. The classification component's input feature may not need to be a deep feature; a fast hand-crafted feature could be used instead. With this plan, real-time multi-target tracking might also be achievable.
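
A rough sketch of that idea (illustrative only): a constant-velocity Kalman filter over the box state, with the classifier's rough localization serving as the measurement.

    import numpy as np

    class BoxKalman:
        """Constant-velocity Kalman filter over (cx, cy, w, h)."""
        def __init__(self, box, q=1e-2, r=1e-1):
            self.x = np.concatenate([np.asarray(box, float), np.zeros(4)])  # state + velocities
            self.P = np.eye(8)
            self.F = np.eye(8)
            self.F[:4, 4:] = np.eye(4)   # position/size advances by its velocity each frame
            self.H = np.eye(4, 8)        # we observe only (cx, cy, w, h)
            self.Q = q * np.eye(8)
            self.R = r * np.eye(4)

        def step(self, measured_box):
            # Predict with the motion model, then correct with the rough measurement.
            self.x = self.F @ self.x
            self.P = self.F @ self.P @ self.F.T + self.Q
            y = np.asarray(measured_box, float) - self.H @ self.x
            S = self.H @ self.P @ self.H.T + self.R
            K = self.P @ self.H.T @ np.linalg.inv(S)
            self.x = self.x + K @ y
            self.P = (np.eye(8) - K @ self.H) @ self.P
            return self.x[:4]  # smoothed (cx, cy, w, h)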

momo1986 commented 5 years ago

Hello @martin-danelljan , Martin.

My current workaround is instantiating multiple ATOM trackers and running inference for each of them every frame.

My understanding is that this tracker is the state-of-the-art single-object tracker.

For multiple-object training, maybe a specific dataset with video-frame sequences would be needed.

I am not sure whether my understanding is correct.

Thanks for your guidance. Regards!

martin-danelljan commented 5 years ago

Hi. Yes that sounds like a good way of implementing it. Regards, Martin