tensorflow / tfjs

A WebGL accelerated JavaScript library for training and deploying ML models.
https://js.tensorflow.org
Apache License 2.0

posenet technical questions #1137

Closed gustavz closed 4 years ago

gustavz commented 5 years ago

To get help from the community, we encourage using Stack Overflow and the tensorflow.js tag.

TensorFlow.js version: newest

Browser version: Firefox Quantum, newest

Describe the problem or feature request

I have two technical questions regarding the posenet model:

  1. What are the dimensions of the 4 parallel convolutions that are applied to the output of the mobilenet (offset, heatmap, displacementFwd, displacementBwd)? Basically, what are the number of filters and the kernel size? They all seem to be equal; is that correct?

  2. What is the loss function used to train the model end-to-end? Is there a dedicated paper or explanation available somewhere?

Thank you very much! Gustav

gustavz commented 5 years ago

Update:

  1. The convolutions are 1x1.
  2. I found something in the PersonLab paper, but would still be interested in an answer.

New questions: What are the input dimensions that posenet was trained on? When re-using your weights, setting the input dims to the same value should result in the best performance, right?

I thought it would be something common for mobilenet like 224x224, but the input dimensions of the released tflite weights (https://storage.googleapis.com/download.tensorflow.org/models/tflite/gpu/multi_person_mobilenet_v1_075_float.tflite) surprised me: they are 353x257. Why those odd numbers?

Furthermore: what is mobilenet_v1_101 supposed to be? As I understand it, it is exactly the same as mobilenet_v1_100 (1.0 MobileNet V1, to use another terminology).

dsmilkov commented 5 years ago

cc @oveddan and @tylerzhu-github for technical details regarding the posenet architecture.

tylerzhu-github commented 5 years ago

  1. What are the dimensions of the 4 parallel convolutions that are applied to the output of the mobilenet (offset, heatmap, displacementFwd, displacementBwd)?

You are absolutely right; they are all 1x1 convolutions. The number of output channels depends on the number of keypoints. More details can be found here: https://arxiv.org/abs/1701.01779 https://arxiv.org/abs/1803.08225
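
For concreteness, here is a minimal sketch of what those four parallel 1x1 heads could look like. The channel counts are inferred from the papers above (17 COCO keypoints, 16 edges in the part graph), and the 1024-channel backbone output assumes MobileNetV1 at 100% depth, so treat these numbers as assumptions rather than the released architecture:

import tensorflow as tf

K = 17  # keypoints (COCO)
E = 16  # edges in the part-based pose graph

# Backbone feature map; 1024 channels assumes MobileNetV1 at 100% depth.
features = tf.keras.Input(shape=(None, None, 1024))

# Four parallel 1x1 convolutions applied to the same feature map.
heatmaps         = tf.keras.layers.Conv2D(K, 1, activation='sigmoid')(features)  # per-keypoint scores
offsets          = tf.keras.layers.Conv2D(2 * K, 1)(features)                    # (x, y) offset per keypoint
displacement_fwd = tf.keras.layers.Conv2D(2 * E, 1)(features)                    # (x, y) per edge, forward
displacement_bwd = tf.keras.layers.Conv2D(2 * E, 1)(features)                    # (x, y) per edge, backward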

  2. What is the loss function used to train the model end-to-end? Is there a dedicated paper or explanation available somewhere?

Yes, more training details are described in the following two papers. Happy to discuss more. https://arxiv.org/abs/1701.01779 https://arxiv.org/abs/1803.08225

  3. What are the input dimensions that posenet was trained on?

We set the height and width of the crop to be a multiple of 16, plus 1, for feature alignment, hence the odd crop sizes. We set the crop height to 353, close to the 340 value used in DeepCut. We trained it (multi_person_mobilenet_v1_075_float.tflite) at an input resolution of 801. The input resolution can be changed, since the model is fully convolutional, depending on your computation budget.
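
That rule is easy to check in code; a minimal sketch (the snap-down rounding mirrors what the tfjs util appears to do and is an assumption here):

def to_valid_input_size(size, output_stride=16):
    # Snap down to the nearest multiple of the stride, then add 1.
    return (size // output_stride) * output_stride + 1

assert to_valid_input_size(352) == 353  # 22 * 16 + 1
assert to_valid_input_size(256) == 257  # 16 * 16 + 1
assert to_valid_input_size(224) == 225  # 14 * 16 + 1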

  4. When re-using your weights, setting the input dims to the same value should result in the best performance, right?

Yes, and different input resolutions can be used as well. We have an ablation study in our paper https://arxiv.org/abs/1803.08225 (see Table 4 for more details).

  5. Furthermore: what is mobilenet_v1_101 supposed to be? As I understand it, it is exactly the same as mobilenet_v1_100 (1.0 MobileNet V1, to use another terminology).

Yes, mobilenet_v1_101 is almost the same as mobilenet_v1 with a 100% depth multiplier. We did modify the last pooling and fully connected layers of the MobileNetV1 classification model to make it more suitable for the pose estimation task.

gustavz commented 5 years ago

Hi @tylerzhu-github and @oveddan,

I now have a couple more questions:

  1. Why does posenet need the +1 for a valid resolution? When I run it on 224x224 it upscales to 225x225, and when I run it with 225x225 directly it does not change. I know and understand the util function that does this scaling with Resolution = ((InputImageSize - 1) / OutputStride) + 1 (see the sketch after this list), so my question is not related to the util code but rather to why the model needs this.
  2. Why (1) confuses me so much is that if I take the model and port it to tensorflow lite (which needs fixed input dimensions), the conversion works with input dimension 224x224 but results in the following error when I set it to 225x225:
    RuntimeError: TOCO failed see console for info.
    b'2019-03-04 14:10:09.455147: I tensorflow/contrib/lite/toco/graph_transformations/graph_transformations.cc:39]
    Before Removing unused ops: 213 operators, 318 arrays (0 quantized)\n2019-03-04 14:10:09.460528: I tensorflow/contrib/lite/toco/graph_transformations/graph_transformations.cc:39]
    Before general graph transformations: 213 operators, 318 arrays (0 quantized)\n2019-03-04 14:10:09.467226: F tensorflow/contrib/lite/toco/graph_transformations/propagate_fixed_sizes.cc:991]
    Check failed: height_with_paddings % block_height == 0 (1 vs. 0)\n'
  3. Following on: how am I able to successfully run the model on tensorflow lite, which needs a fixed input dimension of something like 225x225, when toco / the tflite converter fails to convert it to this size?
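
For reference, a quick sketch of the scaling formula from question 1; the validity check is my reading of it, not code from the library:

def output_resolution(input_size, output_stride=16):
    # Resolution = ((InputImageSize - 1) / OutputStride) + 1
    return (input_size - 1) // output_stride + 1

print((225 - 1) % 16 == 0, output_resolution(225))  # True 15: 225 is already valid
print((224 - 1) % 16 == 0, output_resolution(224))  # False 14: 224 gets upscaled to 225
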
tylerzhu-github commented 5 years ago

Hi Gustav,

Thanks for posting the follow-up questions here. Very good questions! Please see my replies inline:

  1. Why does posenet need the +1 for a valid resolution? When I run it on 224x224 it upscales to 225x225, and when I run it with 225x225 directly it does not change.

It is based on the following rule: the height and width of the input crop for the model should be a multiple of 16, plus 1, for feature alignment. To the best of my knowledge, the convnet feature alignment technique was proposed and popularized by researchers @gpapan and @aquariusjay. The assumptions are that: 1) the model runs fully convolutionally on your input crop, 2) the conv + pool kernel sizes are odd, 3) "SAME" padding is used, and 4) the stride is 2 for conv + pool. If 1), 2), 3), and 4) are met, then one should use the feature alignment rule for dense prediction tasks in order to ensure the spatial output tensors align with the input crop image (a classification task's output doesn't require such strict alignment).
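
A small sketch of why the multiple-of-16-plus-1 rule gives alignment under those assumptions: with "SAME" padding and stride 2, each layer maps a size-s input to ceil(s / 2) outputs, so after four stride-2 layers the output grid lands exactly on input pixels only when the input size is n * 16 + 1 (illustrative only):

import math

def output_size(input_size, num_stride2_layers=4):
    s = input_size
    for _ in range(num_stride2_layers):
        s = math.ceil(s / 2)  # "SAME" padding: out = ceil(in / stride)
    return s

print(output_size(225))  # 15: output pixel i sits exactly on input pixel 16 * i
print(output_size(224))  # 14: the stride-16 grid no longer lands on input pixels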

  2. Why (1) confuses me so much is that if I take the model and port it to tensorflow lite (which needs fixed input dimensions), the conversion works with input dimension 224x224 but results in an error when I set it to 225x225.

Could you share the command line you used (I would be happy to take a look)? It would also be great to raise the question with the TensorFlow Lite developer support team. BTW, I think the TensorFlow Lite team will have a page here for pose model instructions.

gustavz commented 5 years ago

Hi @tylerzhu-github, thanks again for your detailed answers!

I am using TF 1.3, where lite is still under contrib/. Furthermore, I am using a script based on this one by @rwightman to convert the posenet tensorflow.js model to a Python tensorflow model.

The conversion to tflite is then done with:

import os
import tensorflow as tf  # TF 1.x, where TF Lite still lives under contrib/

converter = tf.contrib.lite.TocoConverter.from_frozen_graph(
    graph_def_file=os.path.join(model_dir, "posenet_%s.pb" % chkpoint),
    input_arrays=['image'],
    output_arrays=['heatmap', 'offset_2', 'displacement_fwd_2', 'displacement_bwd_2'],
    input_shapes={'image': [1, width, height, 3]},  # the fixed input shape is baked in here
)
converter.post_training_quantize = False
tflite_model = converter.convert()

out_path = os.path.join(model_dir, "posenet_{}_{}_{}.tflite".format(chkpoint, width, height))
with open(out_path, "wb") as f:
    f.write(tflite_model)
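
For what it's worth, a quick way to sanity-check the converted model is the TF Lite Python interpreter (tf.lite.Interpreter in newer releases; it lives under tf.contrib.lite in 1.x; the model path below is just a placeholder):

import numpy as np
import tensorflow as tf

interpreter = tf.lite.Interpreter(model_path="posenet.tflite")  # placeholder path
interpreter.allocate_tensors()

input_details = interpreter.get_input_details()
output_details = interpreter.get_output_details()

# Feed a dummy frame with the fixed shape baked in at conversion time.
dummy = np.zeros(input_details[0]['shape'], dtype=np.float32)
interpreter.set_tensor(input_details[0]['index'], dummy)
interpreter.invoke()

for out in output_details:
    print(out['name'], interpreter.get_tensor(out['index']).shape)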

Currently I am working around the displacement that occurs when exporting the model at 224x224 (without the +1 feature alignment, because I don't know how to do that on Android with a tflite model with fixed input sizes) by just adding +16 to the resulting posenet detection coordinates after the decoding algorithm. Is this a valid way to approximate the +1 feature alignment?

saras-verihelp commented 5 years ago

Hi,

I would like to know how the output array sizes are derived for the displacement forward and backward output tensors. I could find the details for the heatmap and offset vectors in this link: https://medium.com/tensorflow/real-time-human-pose-estimation-in-the-browser-with-tensorflow-js-7dd0bc881cd5. I would like to know whether the input image size has any effect on the other two output sizes.

Regards, Saraswathy.

tylerzhu-github commented 5 years ago

Hi Saraswathy,

Thanks for posting the question. The displacement vectors' unit is in the input image space. The model runs fully convolutionally on the input image. Theoretically it should work fine without explicitly rescaling the displacement vectors.
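
To make that concrete, here is a sketch of how all four output shapes follow from the input size, assuming output stride 16, 17 keypoints, and 16 edges (values taken from the papers and blog post linked above, not from the released code):

def output_shapes(height, width, output_stride=16, K=17, E=16):
    # Spatial dims follow the fully convolutional rule: (size - 1) / stride + 1.
    h = (height - 1) // output_stride + 1
    w = (width - 1) // output_stride + 1
    return {
        'heatmap':          (1, h, w, K),
        'offset':           (1, h, w, 2 * K),
        'displacement_fwd': (1, h, w, 2 * E),
        'displacement_bwd': (1, h, w, 2 * E),
    }

print(output_shapes(353, 257))  # heatmap: (1, 23, 17, 17), displacements: (1, 23, 17, 32)

So the input image size does change the spatial dimensions of all four outputs; only the channel counts are fixed.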

rthadur commented 4 years ago

Automatically closing due to lack of recent activity. Please update the issue when new information becomes available, and we will reopen the issue. Thanks!