tensorflow / tpu

Reference models and tools for Cloud TPUs.
https://cloud.google.com/tpu/
Apache License 2.0

EfficientNet architecture on Edge TPU #556

Open wuhy08 opened 5 years ago

wuhy08 commented 5 years ago

Hi @mingxingtan

By reading the code and visualizing the TFLite model of the EfficientNet EdgeTPU version, it seems the implementation for EdgeTPU is different from the non-EdgeTPU ones (B0-B7). I noticed two differences: no SE module is used, and the activation function is changed to ReLU. What is the insight behind these changes? Is it that the Edge TPU doesn't support swish? And what about the SE module?

Did I miss any other differences in the architecture?

Thank you!

mingxingtan commented 5 years ago

Hi Haoyu, we exclude both SE and swish due to some temporary EdgeTPU hardware/software stack limitations. Besides that, the architecture is also slightly different; please refer to efficientnet_edgetpu_builder.py for more details. Thanks!
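
For comparison, a block without SE and with ReLU roughly looks like the Keras sketch below. This is illustrative only (the function name and exact layer choices are my own assumptions, not the code in efficientnet_edgetpu_builder.py); the early c1 blocks of the EdgeTPU variant appear to use a fused full-kernel expansion instead of the 1x1-expand / depthwise / SE / swish MBConv of B0-B7.

    import tensorflow as tf

    def edgetpu_style_block(x, in_ch, out_ch, expand_ratio, kernel=3, stride=1):
      """Illustrative EdgeTPU-style block: fused full-kernel expansion,
      ReLU activation, and no squeeze-and-excitation.

      A sketch for discussion only, not the code in
      efficientnet_edgetpu_builder.py.
      """
      expanded = in_ch * expand_ratio
      # Fused expansion: one full conv instead of 1x1 expand + depthwise conv.
      y = tf.keras.layers.Conv2D(expanded, kernel, strides=stride,
                                 padding='same', use_bias=False)(x)
      y = tf.keras.layers.BatchNormalization()(y)
      y = tf.keras.layers.ReLU()(y)  # ReLU instead of swish
      # No SE module here.
      y = tf.keras.layers.Conv2D(out_ch, 1, padding='same', use_bias=False)(y)
      y = tf.keras.layers.BatchNormalization()(y)
      if stride == 1 and in_ch == out_ch:
        y = tf.keras.layers.Add()([y, x])  # identity skip when shapes allow
      return y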

wuhy08 commented 5 years ago

Hi @mingxingtan

Thank you so much for the quick response! It is very helpful.

Now I am reading the code in efficientnet_edgetpu_builder.py. I noticed that the architecture is described as:

  blocks_args = [
      'r1_k3_s11_e4_i24_o24_c1_noskip',
      'r2_k3_s22_e8_i24_o32_c1',
      'r4_k3_s22_e8_i32_o48_c1',
      'r5_k5_s22_e8_i48_o96',
      'r4_k5_s11_e8_i96_o144',
      'r2_k5_s22_e8_i144_o192',
  ]

I noticed that in the first MBConv block, the number of input channels is listed as 24. However, the number of output channels fed from the stem is 32. Although 24 * 4 == 96 == 32 * 3 (so the expanded width is the same either way), is there a reason to list the first MBConv block as r1_k3_s11_e4_i24_o24_c1_noskip instead of r1_k3_s11_e3_i32_o24_c1_noskip?
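
For concreteness, here is a minimal sketch of how I am reading those strings, following the r/k/s/e/i/o/c conventions visible above (the builder's own decoder handles more options, so treat this as an approximation):

    def decode_block_string(block_string):
      """Parse a block string such as 'r1_k3_s11_e4_i24_o24_c1_noskip'."""
      names = {'r': 'num_repeat', 'k': 'kernel_size', 'e': 'expand_ratio',
               'i': 'input_filters', 'o': 'output_filters', 'c': 'conv_type'}
      args = {'skip': True}
      for op in block_string.split('_'):
        if op == 'noskip':
          args['skip'] = False
        elif op.startswith('s'):
          args['strides'] = (int(op[1]), int(op[2]))
        else:
          args[names[op[0]]] = int(op[1:])
      return args

    a = decode_block_string('r1_k3_s11_e4_i24_o24_c1_noskip')
    b = decode_block_string('r1_k3_s11_e3_i32_o24_c1_noskip')
    # Both settings expand to the same width for a single repeat:
    assert a['input_filters'] * a['expand_ratio'] == 96  # 24 * 4
    assert b['input_filters'] * b['expand_ratio'] == 96  # 32 * 3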

Thank you!

mingxingtan commented 5 years ago

Good point, it is probably a historical typing error. Seems like "r1_k3_s11_e3_i32_o24_c1_noskip" makes more sense, although they are equivalent.

wuhy08 commented 5 years ago

Hi @mingxingtan

Now I have found some discrepancies between r1_k3_s11_e4_i24_o24_c1_noskip and r1_k3_s11_e3_i32_o24_c1_noskip for the 1st block, or equivalently Stage 2 (as named in Table 1 of your paper).

Everything is fine when we use Model S. But the two settings diverge when we switch to Model M or L, since compound scaling also happens in depth (the number of repeats of MBConv in each stage). For example, in Model M, the 1st block (or equivalently Stage 2) has 2 MBConvs. In your current architecture setup (r1_k3_s11_e4_i24_o24_c1_noskip), the weights of the expand_conv in the two MBConvs would have dimensions (3, 3, 32, 96) and (3, 3, 24, 96). However, if we used the setting r1_k3_s11_e3_i32_o24_c1_noskip, the second one would become (3, 3, 24, 72)...
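
A quick way to reproduce those shapes, as a sketch under the assumption that repeats after the first take output_filters as their input_filters (which matches how the builder unrolls repeats) and that the fused expand_conv uses the 3x3 kernel:

    def expand_conv_shapes(input_filters, output_filters, expand_ratio,
                           num_repeat, stem_filters=32, kernel=3):
      """Weight shapes (k, k, in, out) of the expand_conv in each repeated block."""
      shapes = []
      for i in range(num_repeat):
        actual_in = stem_filters if i == 0 else output_filters
        args_in = input_filters if i == 0 else output_filters
        shapes.append((kernel, kernel, actual_in, args_in * expand_ratio))
      return shapes

    # Model M scales Stage 2 to 2 repeats:
    print(expand_conv_shapes(24, 24, 4, 2))  # [(3, 3, 32, 96), (3, 3, 24, 96)]
    print(expand_conv_shapes(32, 24, 3, 2))  # [(3, 3, 32, 96), (3, 3, 24, 72)]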

sarahmass commented 4 years ago

I have found a discrepancy too. If you have 32 filters coming in and the block arguments are r1_k3_s11_e4_i24_o24_c1_noskip, then the expansion layer only expands to 96 (24 * 4) rather than the 128 (32 * 4) you would expect from the actual incoming filters; 96 is exactly what the arguments r1_k3_s11_e3_i32_o24_c1_noskip would give. Below is an image of my stage 1 conv block built with the first set of arguments.

[image: stage 1 conv block built with r1_k3_s11_e4_i24_o24_c1_noskip]

When using the corrected setting (i32 instead of i24), the pre-trained model does not work because the parameter shapes no longer match up. So in effect the pre-trained model has an expansion ratio of 3 instead of 4.
[image: stage 1 conv block built with the i32 setting]
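
In filter arithmetic only, and assuming the expansion is sized from the block-args input_filters rather than from the tensor that actually arrives, a hypothetical sketch of the mismatch:

    def expansion_filters(arg_input_filters, expand_ratio):
      # The builder sizes the expansion from the block args, not from the
      # tensor that actually reaches the block.
      return arg_input_filters * expand_ratio

    checkpoint = expansion_filters(24, 4)    # 96: what the pretrained weights hold
    only_i_fixed = expansion_filters(32, 4)  # 128: i32 with e4 kept -> shape mismatch
    both_fixed = expansion_filters(32, 3)    # 96: i32 with e3 matches the checkpoint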

mingxingtan commented 4 years ago

@sarahmass @wuhy08 Good finding! Looks like we have to let the mistake stay there for a while in order to remain compatible with the pretrained checkpoints. But still, thanks for pointing these out!