mks0601 / 3DMPPE_POSENET_RELEASE

Official PyTorch implementation of "Camera Distance-aware Top-down Approach for 3D Multi-person Pose Estimation from a Single RGB Image", ICCV 2019
MIT License

General question about data #105

Open ttdd11 opened 2 years ago

ttdd11 commented 2 years ago

Amazing repo and very interesting work. I have some questions about the data formatting to introduce 3D data. This may sound very simple so I apologize in advance.

When building a data batch, joint pixel-space (x, y) coordinates are taken along with each joint's root-relative depth (location - root) in camera coordinates, so for the root, z = 0. I'm assuming this z is in mm.

These data are then mapped to the output space by normalizing the depth dimension to 0-1 (with 0 corresponding to -1000 mm and 1 to +1000 mm) and then multiplying by the heatmap depth resolution, which is 64.
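If I'm reading it right, the forward mapping would look roughly like this (a minimal sketch; the constant names `DEPTH_DIM` and `BBOX_3D_DEPTH` are just my placeholders, not the repo's exact variables):

```python
import numpy as np

DEPTH_DIM = 64          # depth resolution of the 3D heatmap
BBOX_3D_DEPTH = 2000.0  # assumed depth range in mm (-1000 .. +1000)

def root_relative_z_to_heatmap(joint_cam_z, root_cam_z):
    """Map a joint's camera-space depth (mm) to a heatmap depth index."""
    z_rel = joint_cam_z - root_cam_z                       # root-relative depth in mm; the root maps to 0
    z_norm = (z_rel / (BBOX_3D_DEPTH / 2.0) + 1.0) / 2.0   # normalize -1000..+1000 mm -> 0..1
    return z_norm * DEPTH_DIM                              # scale to the heatmap depth resolution (0..64)
```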

So on the forward pass during inference, the depth values of the output heatmaps are converted back to mm using the inverse of that mapping, and then corrected according to the true root depth (calculated using RootNet).
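In pseudocode, the inverse at inference would then be something like this (again, placeholder names, with `root_depth_mm` coming from RootNet):

```python
# Same placeholder constants as in the sketch above.
DEPTH_DIM = 64
BBOX_3D_DEPTH = 2000.0  # assumed depth range in mm (-1000 .. +1000)

def heatmap_z_to_absolute_depth(z_heatmap, root_depth_mm):
    """Invert the training-time depth mapping and add the RootNet depth estimate."""
    z_norm = z_heatmap / DEPTH_DIM                         # heatmap index -> 0..1
    z_rel = (z_norm * 2.0 - 1.0) * (BBOX_3D_DEPTH / 2.0)   # 0..1 -> -1000..+1000 mm (root-relative)
    return z_rel + root_depth_mm                           # absolute camera depth in mm
```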

If this is all true (which it may not be, please correct me where wrong), how is the depth handled for smaller versus larger images/people? Is this implicit in the training?

Thank you again for your amazing contribution.

mks0601 commented 2 years ago

Your descriptions are all correct. As all persons are cropped and resized using their bounding boxes, they have almost the same scale in (x, y) pixel space. On the other hand, they have different depths, as depth is kept as a discretized millimeter value.
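In other words, each person patch goes through an affine crop and resize like the sketch below (the 256x256 input size is only an example value, not necessarily the repo's configuration):

```python
import numpy as np
import cv2

INPUT_W, INPUT_H = 256, 256  # illustrative network input size

def crop_and_resize(img, bbox):
    """bbox = (x_min, y_min, width, height) in original image pixels."""
    x, y, w, h = bbox
    src = np.float32([[x, y], [x + w, y], [x, y + h]])      # bbox corners in the original image
    dst = np.float32([[0, 0], [INPUT_W, 0], [0, INPUT_H]])  # corners of the network input patch
    trans = cv2.getAffineTransform(src, dst)                # 2x3 affine: bbox -> input space
    patch = cv2.warpAffine(img, trans, (INPUT_W, INPUT_H))  # every person ends up at the same pixel scale
    return patch, trans
```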

In the inference stage, we apply an inverse affine transformation from the cropped and resized human bounding box back to the original image space. Here, if a small and a large person have similar camera depth, they might have different (x, y) scales. Then, we back-project (x, y) pixels + z millimeters to (x, y, z) millimeters.
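The back-projection is the standard pinhole relation. A minimal sketch, assuming intrinsics f = (fx, fy) and principal point c = (cx, cy), and coordinates already mapped back to the original image space:

```python
import numpy as np

def pixel_to_camera(coords_img, f, c):
    """coords_img: (N, 3) array of (x_pixel, y_pixel, z_millimeter)."""
    x = (coords_img[:, 0] - c[0]) / f[0] * coords_img[:, 2]  # X in mm
    y = (coords_img[:, 1] - c[1]) / f[1] * coords_img[:, 2]  # Y in mm
    z = coords_img[:, 2]                                     # Z is already in mm
    return np.stack((x, y, z), axis=1)
```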