umautobots / GTAVisionExport

Code to export full segmentations from GTA
MIT License

Coordinate systems and transformation matrices #17

Closed racinmat closed 6 years ago

racinmat commented 6 years ago

In the dumpData method you calculate viewMatrix and projectionMatrix, which are then stored to the database along with other snapshot data. Both are 4x4 matrices, but when I look at the values, the projection matrix has all elements non-zero, while the view matrix has the last row [0, 0, 0, 1] in all cases. And here my confusion starts. The view matrix seems to represent rotation and translation in homogeneous coordinates, but I do not understand between which coordinate systems it performs the transformation. Also, I thought a projection matrix reduces the dimension and projects objects onto some hyperplane (a space of lower dimension), but since the projection matrix is 4x4 with all elements non-zero, it does not look like it performs a projection.

Also, I do not understand how exactly the depth values in the depth map relate to the coordinate system of the world, and how they relate to the viewer coordinate system (from the camera's perspective).

I tried to track down where these matrices come from, but I ended up here, where it seems to me that the matrix values are somehow magically filled in without calling any GTA V related function.

I would be glad for any clarification.

barcharcraz commented 6 years ago

So all the transformation matrices are "graphics style" rather than "vision style". In graphics we like to keep everything 4x4, since it (a) is better aligned in memory and (b) does not discard any information. The view matrix transforms points from "world space" to "camera space"; this transformation is rigid. The projection matrix goes from there to normalized device coordinates (NDC), which is a cube, usually with sides of 2 units. The actual transformation into image space (or fragment space) is done for us on the graphics card.
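A minimal numeric sketch of that pipeline, using made-up matrices (GTA's actual handedness, axis conventions, and clip range may differ):

```python
import numpy as np

# A hypothetical rigid view matrix: rotation R and translation t in
# homogeneous form. Note its last row is [0, 0, 0, 1], matching what
# the database stores for viewMatrix.
theta = np.radians(30)
R = np.array([[np.cos(theta), -np.sin(theta), 0],
              [np.sin(theta),  np.cos(theta), 0],
              [0,              0,             1]])
t = np.array([1.0, 2.0, -5.0])
V = np.eye(4)
V[:3, :3] = R
V[:3, 3] = t

# An OpenGL-style perspective projection (illustrative values, not GTA's).
def perspective(fov_y, aspect, near, far):
    f = 1.0 / np.tan(fov_y / 2)
    P = np.zeros((4, 4))
    P[0, 0] = f / aspect
    P[1, 1] = f
    P[2, 2] = (far + near) / (near - far)
    P[2, 3] = 2 * far * near / (near - far)
    P[3, 2] = -1.0
    return P

P = perspective(np.radians(60), 16 / 9, 0.1, 1000.0)

world_point = np.array([3.0, 1.0, -2.0, 1.0])  # homogeneous world coords
camera_point = V @ world_point                  # rigid: world -> camera
clip = P @ camera_point                         # camera -> clip space
ndc = clip[:3] / clip[3]                        # perspective divide -> NDC
```

After the divide, visible points land inside the NDC cube with sides of 2 units, i.e. each coordinate in [-1, 1].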

The reason that all elements are nonzero is that we pull these transforms out of GPU memory before a render call, and GTA sends just one combined transformation to the GPU, (P * V). We then construct a view matrix based on the camera location as reported by the scripting API and use it to extract just the projection matrix. The unexpected nonzero elements come from floating-point precision losses during this process; aside from the usual nonzero entries of a projection matrix, they should be quite close to zero.
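The extraction step can be sketched like this, with stand-in matrices (the values and the way V is built here are illustrative; the real code reconstructs V from the scripting API's camera pose):

```python
import numpy as np

# Hypothetical rigid view matrix V, as would be rebuilt from the camera pose.
theta = np.radians(20)
V = np.eye(4)
V[:3, :3] = [[np.cos(theta), -np.sin(theta), 0],
             [np.sin(theta),  np.cos(theta), 0],
             [0,              0,             1]]
V[:3, 3] = [0.5, -1.0, 3.0]

# Hypothetical projection matrix P with the usual sparsity pattern.
P = np.array([[1.0, 0,   0,       0],
              [0,   1.8, 0,       0],
              [0,   0,  -1.0002, -0.2],
              [0,   0,  -1.0,     0]])

PV = P @ V  # the single combined transform pulled from GPU memory

# Peel the projection back out by multiplying with the inverse view matrix.
P_est = PV @ np.linalg.inv(V)

# In float32 on the GPU, the entries of P_est that should be zero come out
# only approximately zero; in float64 here they recover almost exactly.
```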

The depth is in NDC space.
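Assuming an OpenGL-style convention where NDC depth lies in [-1, 1] (an assumption for illustration; a DirectX-style [0, 1] range changes the constants but not the idea), NDC depth can be mapped back to camera-space depth by inverting the projection's z row:

```python
def ndc_depth_to_camera_z(ndc_z, near, far):
    # Invert ndc_z = (p22 * z + p23) / (-z), the depth mapping of an
    # OpenGL-style projection (camera looks down -z, NDC z in [-1, 1]).
    p22 = (far + near) / (near - far)
    p23 = 2.0 * far * near / (near - far)
    return -p23 / (ndc_z + p22)
```

For example, with near = 0.1 and far = 1000, an NDC depth of -1 maps back to z = -0.1 (the near plane) and +1 maps back to z = -1000 (the far plane). Note the nonlinearity: most of the NDC depth range is spent near the camera.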

racinmat commented 6 years ago

Thanks for the answers. I am still not sure I understood all of it. What exactly are the device coordinates, and how do they differ from the "camera space"? And I did not get what "with sides of 2 units" means. So when the actual transformation into image space is done for us on the graphics card, does it mean that it's not possible to calculate it from the matrices in postprocessing, and we instead need to calculate it during the extraction, while GTA V is running, via the native call for the transformation?

barcharcraz commented 6 years ago

Device coordinates are the "image space" that a GPU uses. They differ from what we'd call image space in machine vision in that they are always the same no matter what the actual framebuffer resolution is. The graphics hardware handles mapping these NDC values to screen space ((0, 0) -> (1920, 1080), for example, on a 1080p monitor). The idea is that you can change the render resolution simply by changing how the GPU samples points and how it does the NDC -> viewport transform.
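That viewport transform is simple enough to redo in postprocessing; a sketch (assuming NDC x and y in [-1, 1] and a top-left pixel origin, which is the common image convention):

```python
def ndc_to_screen(ndc_x, ndc_y, width, height):
    # Viewport transform: map NDC [-1, 1]^2 to pixel coordinates.
    # y is flipped because image rows grow downward while NDC y grows upward.
    sx = (ndc_x + 1.0) / 2.0 * width
    sy = (1.0 - ndc_y) / 2.0 * height
    return sx, sy
```

So NDC (0, 0) lands at the center of a 1920x1080 frame, (960, 540), and changing `width`/`height` rescales the same NDC values to any render resolution.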

For a much clearer and more thought-out explanation I recommend sections 2 through 4 of "Real-Time Rendering, Third Edition" (ISBN 978-1-56881-424-7).

Charlie
