Dear authors,
First of all, thank you for your amazing work. From my reading of the paper, the proposed method uses a CNN to extract a feature map from the input image and then predicts the top-left and bottom-right keypoints of each bounding box. This requires a grouping operation to pair the detected keypoints, which may not work well for crowds of people. I wonder whether we could instead predict the coordinates of the bounding-box center and regress the height and width, so that no subsequent grouping operation is needed.
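To make the idea concrete, here is a rough sketch of the decoding step I have in mind. It is only an illustration under my own assumptions, not anything taken from the paper: the function name `decode_center_boxes`, the three per-pixel maps (`center_heatmap`, `widths`, `heights`), the 3x3 peak test, and the 0.5 score threshold are all hypothetical.

```python
def decode_center_boxes(center_heatmap, widths, heights, score_threshold=0.5):
    """Turn a center heatmap plus per-pixel size maps into boxes.

    All inputs are hypothetical network outputs given as 2D lists of the
    same shape: a center score per pixel and a regressed width and height
    per pixel. Returns a list of (x1, y1, x2, y2, score) tuples.
    """
    rows = len(center_heatmap)
    cols = len(center_heatmap[0]) if rows else 0
    boxes = []
    for y in range(rows):
        for x in range(cols):
            score = center_heatmap[y][x]
            if score < score_threshold:
                continue
            # Keep only local maxima in the 3x3 neighborhood (clipped at borders).
            neighbors = [
                center_heatmap[ny][nx]
                for ny in range(max(0, y - 1), min(rows, y + 2))
                for nx in range(max(0, x - 1), min(cols, x + 2))
            ]
            if score < max(neighbors):
                continue
            # Each surviving peak yields exactly one box, so no pairing is needed.
            w, h = widths[y][x], heights[y][x]
            boxes.append((x - w / 2, y - h / 2, x + w / 2, y + h / 2, score))
    return boxes
```

In this sketch every peak in the center heatmap directly produces one box from the width and height regressed at that location, so there is no step that has to match a top-left corner with a bottom-right corner.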
Thanks.