simonwsw / deep-soli

Gesture Recognition Using Neural Networks with Google's Project Soli Sensor
MIT License

Question: Dimensions of input data #14

Closed graulef closed 7 years ago

graulef commented 7 years ago

I am a bit confused by the first layer of the CNN. As far as I understand, the first layer (SpatialConvolution) takes in 4 planes of 2D image data (the RDIs), runs a 2D convolution across all four planes, and outputs something with 32 planes.

Two questions on this:

1. Why is the output so deep? (I know this question is really general and probably basic.)
2. Why are the input images only 2D? After all, they themselves are RGB, right? So they should be 3D too. Or is everything converted to B/W during the generation of the meanfile?

Thanks in advance!

graulef commented 7 years ago

I just noticed that in the paper, the resolution of the RDIs is said to be 224x224 pixels. However, in the code provided here it is 32x32. Is there any specific reason for this difference?

simonwsw commented 7 years ago

> Why is the output so deep?

Each of the 32 output planes is produced by its own learned filter, so a deeper output lets the layer extract many different features from the original input.
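A minimal Torch sketch to make this concrete (the 5x5 kernel size here is just for illustration, not necessarily what this repo uses):

```lua
require 'nn'

-- First layer as described: 4 input planes in, 32 output planes out.
-- The 5x5 kernel size is an illustrative assumption.
local conv = nn.SpatialConvolution(4, 32, 5, 5)

-- The weight tensor holds 32 independent filters, each spanning all
-- 4 input planes: one learned feature detector per output plane.
print(conv.weight:size())           -- 32 x 4 x 5 x 5

local rdi = torch.randn(4, 32, 32)  -- 4 planes of 32x32 RDI data
print(conv:forward(rdi):size())     -- 32 x 28 x 28
```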

> Why are the input images only 2D?

The Range-Doppler Images are 2D: each pixel is a single intensity value for one range-velocity bin. They are not taken from an RGB camera, so there are no color channels.

> 224x224 or 32x32?

The native resolution of the RDIs is 32x32. The 224x224 input, produced by interpolation, is only used for the performance comparison with VGG nets.
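If you want to reproduce the 224x224 input yourself, the standard Torch image package can do the interpolation; a sketch, assuming bilinear scaling (the exact mode used for the VGG comparison may differ):

```lua
require 'image'

-- Upsample a 4 x 32 x 32 RDI stack to 4 x 224 x 224.
-- Bilinear mode is an assumption on my part.
local rdi = torch.randn(4, 32, 32)
local up  = image.scale(rdi, 224, 224, 'bilinear')
print(up:size())  -- 4 x 224 x 224
```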

graulef commented 7 years ago

Thank you for your answers!