tensorflow / tpu

Reference models and tools for Cloud TPUs.
https://cloud.google.com/tpu/
Apache License 2.0

Mnasnet - Channels first data format issue #417

Open pidajay opened 5 years ago

pidajay commented 5 years ago

Using the channels-first data format causes the script to fail for several reasons:

  1. The pre-processing steps assume the data is always channels last, even after the data format is set to channels first.
  2. Several layers in the model (in both EfficientNet and MnasNet), for example Conv2D, are not passed the data format and therefore assume channels last.

After fixing 1 and 2 I can get the model to run somewhat, but it still fails to run in multi-GPU cases.
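As an illustration of point 2, here is a minimal sketch of deriving the layout-dependent axes once and reusing them for every layer, so that no layer silently falls back to channels last. The helper names (`channel_axis`, `spatial_axes`, `batch_shape`) are hypothetical and are not taken from the repository.

```python
# Hypothetical helpers; the names are illustrative, not from the repository.
VALID_FORMATS = ("channels_first", "channels_last")


def _check(data_format):
    """Reject anything other than the two supported layout strings."""
    if data_format not in VALID_FORMATS:
        raise ValueError(f"Unknown data_format: {data_format!r}")


def channel_axis(data_format):
    """Return the axis that holds channels in a 4-D image batch."""
    _check(data_format)
    return 1 if data_format == "channels_first" else -1


def spatial_axes(data_format):
    """Return the (height, width) axes of a 4-D image batch."""
    _check(data_format)
    return (2, 3) if data_format == "channels_first" else (1, 2)


def batch_shape(batch, height, width, channels, data_format):
    """Return the 4-D shape of an image batch in the given layout."""
    if channel_axis(data_format) == 1:
        return (batch, channels, height, width)  # NCHW
    return (batch, height, width, channels)      # NHWC
```

In a Keras-style model, values like these would typically be forwarded to each layer, for example as the `data_format` argument of a convolution and the `axis` argument of batch normalization.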

Moreover, the channels-first data format actually makes training slower on GPU than channels last. This is puzzling, since all sources indicate that channels first performs better on GPU.

This issue was originally raised in https://github.com/tensorflow/tpu/issues/410 but was closed prematurely.

saberkun commented 5 years ago

Thanks for reporting. Yes, I just fixed the bugs in the code to support the channels_first convolution layout. Since the TPU uses the XLA compiler, the memory layout will be optimized.

pidajay commented 5 years ago

Awesome, thanks! Just curious, did you get to test the code on GPUs? If so, what is the performance like?

saberkun commented 5 years ago

@pidajay, yes, I ran a CUDA smoke test on GPU, i.e. running a few steps. I am afraid we do not have the bandwidth or resources to really look into GPU performance. If you have any suggestions or findings, please let us know. What happens if you feed channels_last input? The model_fn has a transpose that changes channels_last to channels_first. I would suggest creating your own branch and experimenting with GPU XLA and fp16 training. Thanks.
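For readers unfamiliar with the transpose mentioned above, the following is a minimal sketch of converting a channels_last (NHWC) batch to channels_first (NCHW) and back. It uses NumPy rather than TensorFlow to stay self-contained, and the function names are hypothetical; it is not the repository's model_fn code.

```python
import numpy as np

# NHWC -> NCHW: move the channel axis (3) to position 1.
NHWC_TO_NCHW = (0, 3, 1, 2)
# NCHW -> NHWC: the inverse permutation.
NCHW_TO_NHWC = (0, 2, 3, 1)


def to_channels_first(images_nhwc):
    """Reorder a [batch, height, width, channels] array to NCHW."""
    return np.transpose(images_nhwc, NHWC_TO_NCHW)


def to_channels_last(images_nchw):
    """Reorder a [batch, channels, height, width] array to NHWC."""
    return np.transpose(images_nchw, NCHW_TO_NHWC)


batch = np.zeros((8, 224, 224, 3), dtype=np.float32)
print(to_channels_first(batch).shape)  # (8, 3, 224, 224)
```

In TensorFlow, `tf.transpose` accepts the same permutation through its `perm` argument, so the equivalent step inside a model function would presumably look much the same.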