rohitgirdhar / CATER

CATER: A diagnostic dataset for Compositional Actions and TEmporal Reasoning
https://rohitgirdhar.github.io/CATER/
Apache License 2.0

bounding box annotations #7

Closed roeiherz closed 4 years ago

roeiherz commented 4 years ago

Hi,

Is there any way to extract bounding box annotations per frame? I managed to extract the center (cx, cy) of each box, but how can I calculate its width and height?

Many thanks,

rohitgirdhar commented 4 years ago

Thanks for your interest! Unfortunately, there's no easy exact solution for this at the moment. However, we do have a method to extract an approximate bounding box by estimating a homography between the camera plane and the CATER ground plane (for the fixed-camera case), which we use to initialize our tracker baselines. I'd recommend taking a look here for more information.

roeiherz commented 4 years ago

Thanks for your quick response! I did use your method to extract the center of the bounding box, (cx, cy), but I'm not sure how you extract the width and height of the box.

It seems that the tracker is used solely for the snitch, right? The snitch is initialized with target_sz = np.array([30, 30]) in the first frame, and I'm not sure if that's the case for all the other objects (or did I miss something?).

Is there any approximation to estimate the width/height of each object at each frame?

rohitgirdhar commented 4 years ago

Ah yes, the tracker is only for the snitch. I guess one quick-and-dirty way would be to similarly define approximate dimensions for each object based on its size (small/medium/large, etc.) and verify them visually in a few frames.
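For concreteness, a minimal sketch of that idea. The pixel values below are made-up placeholders (only the 30×30 for the snitch-sized medium class comes from the tracker's target_sz above) and would need to be verified visually:

```python
import numpy as np

# Hypothetical per-size-class box dimensions in pixels, to be tuned by
# eyeballing a few frames. Only the medium 30x30 matches the tracker's
# snitch initialization; small/large are placeholder guesses.
APPROX_BOX_SIZE = {
    'small': np.array([20, 20]),
    'medium': np.array([30, 30]),
    'large': np.array([45, 45]),
}

def approx_target_sz(size_label):
    """Return an approximate (width, height) in pixels for a size class."""
    return APPROX_BOX_SIZE[size_label]
```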

rohitgirdhar commented 4 years ago

Updating the issue with the follow-up email conversation:

As discussed above, unfortunately there's no exact way to extract this information currently (one could potentially re-render the data and store a segmentation mask for each object at each frame, e.g. by rendering the objects separately, but that would be computationally expensive). However, here are some more details on how one could hack their way to an approximate box (similar to how I do it to initialize the tracker):

When the object is on the ground (e.g. at the beginning of the video), you can approximately get its position the same way I do when initializing the tracker (it's done for the snitch, but you can use the same homography trick to transform the position of any object from the CATER ground plane to the image plane). When it is in the air (e.g. while being picked up and placed, when the object's Z position is higher), it gets a bit trickier, since you have to compute a new homography between that higher plane and the image plane (though you can estimate it the same way I do here for the bottom plane). You can then transform the object's position from any 3D plane to the image plane and get its (cx, cy).
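To make the homography trick concrete, here's a minimal numpy sketch. The function name is mine, and how H is estimated (e.g. from known plane/image point correspondences via something like cv2.findHomography) is left to the caller; this is not code from the repo:

```python
import numpy as np

def world_to_image(H, xy):
    """Project a point lying on a 3D plane into the image.

    H is assumed to be a 3x3 homography mapping homogeneous plane
    coordinates (x, y, 1) to homogeneous pixel coordinates.
    """
    p = H @ np.array([xy[0], xy[1], 1.0])
    return p[:2] / p[2]  # de-homogenize -> (cx, cy) in pixels

# For an object in the air at height z, estimate a separate homography
# H_z for the plane at that height and reuse the same projection:
# cx, cy = world_to_image(H_z, (obj_x, obj_y))
```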

For the bounding box width/height, you could use a heuristic: e.g., define the 3D coordinate of the top of the object from a heuristic for its height (the heuristic could be based on the "sized" parameter we have for each object in the JSON; it's basically inherited from CLEVR and corresponds to the small/medium/large object types). That coordinate can be transformed to the image plane in the same way, which gives you two points on the object: (cx, cy) and the (x, y) of the top-most point. You can then use some heuristic based on those two points to compute a 2D bounding box over the object in the image plane.
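Putting the two projected points together, one possible sketch (the helper names, the width-from-aspect-ratio heuristic, and the box convention are my assumptions, not code from the repo):

```python
import numpy as np

def project(H, xy):
    # Apply a 3x3 plane->image homography and de-homogenize.
    p = H @ np.array([xy[0], xy[1], 1.0])
    return p[:2] / p[2]

def approx_box(H_ground, H_top, xy, aspect=1.0):
    """Heuristic 2D box from two projected points.

    Projects the object's plane position (x, y) through the ground-plane
    homography (bottom of the object) and through the homography of the
    plane at its heuristic height (top of the object), then spans a box
    between them. `aspect` (width/height) is a per-object guess,
    e.g. ~1.0 for roughly cubical or spherical objects.
    """
    bx, by = project(H_ground, xy)  # bottom point in pixels
    tx, ty = project(H_top, xy)     # top point in pixels
    h = abs(by - ty)                # pixel height from the two points
    w = aspect * h                  # width via the aspect-ratio heuristic
    cx = (bx + tx) / 2.0
    # Box as (x1, y1, x2, y2) in pixel coordinates.
    return np.array([cx - w / 2, min(ty, by), cx + w / 2, max(ty, by)])
```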