krasserm / super-resolution

Tensorflow 2.x based implementation of EDSR, WDSR and SRGAN for single image super-resolution
Apache License 2.0

Training train_step freezes when using @tf.function #39

Open Cristy94 opened 4 years ago

Cristy94 commented 4 years ago

I tried training on a custom dataset, but the train function always got stuck in train_step at the return statement. After spending two hours trying to understand why the function was called twice without ever returning and why it got stuck, I realized it was because of the @tf.function decorator. As soon as I removed that decorator, training worked as expected.

Why does that decorator make the train_step function get stuck? Is it safe to remove it? Is there something wrong with the function that makes it incompatible with @tf.function?

https://github.com/krasserm/super-resolution/blob/602a490ec62045823e37c475229e3bc42c8d850c/train.py#L74-L86
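For reference, the decorated step looks roughly like this (a simplified, self-contained sketch; model, loss_fn and optimizer stand in for the actual objects used in train.py, not the exact code at the linked lines):

```python
import tensorflow as tf

# Minimal sketch of a @tf.function-decorated train step, loosely based on the
# linked code; model, loss_fn and optimizer are placeholders for illustration.
model = tf.keras.Sequential([tf.keras.layers.Conv2D(3, 3, padding='same')])
loss_fn = tf.keras.losses.MeanAbsoluteError()
optimizer = tf.keras.optimizers.Adam(learning_rate=1e-4)

@tf.function  # removing this decorator runs the step eagerly and avoids the freeze
def train_step(lr, hr):
    with tf.GradientTape() as tape:
        sr = model(lr, training=True)
        loss_value = loss_fn(hr, sr)
    gradients = tape.gradient(loss_value, model.trainable_variables)
    optimizer.apply_gradients(zip(gradients, model.trainable_variables))
    return loss_value
```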

krasserm commented 4 years ago

Hi @Cristy94, thanks for using this project and sharing your findings. This is something I noticed with the TF 2.0 release candidates, but it disappeared with the TF 2.0 final release, so I thought it had been fixed in TF. After upgrading to TF 2.1, however, it became an issue again when training WDSR models (which use TensorFlow Addons), and removing @tf.function fixes it.

Why does that decorator make the train_step function get stuck?

I don't know yet; I have to investigate it.

Is it safe to remove it?

Yes, it is safe, but it may impact training performance (speed), although I haven't measured the difference yet.
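If you want to compare the two modes, one option is to make graph compilation optional instead of deleting the decorator. A rough sketch (the make_train_step helper is made up for illustration and is not part of this repo):

```python
import tensorflow as tf

# Hypothetical helper (not from this repo): build the train step with graph
# compilation made optional, so eager vs. @tf.function speed can be compared
# by flipping a single flag.
def make_train_step(model, loss_fn, optimizer, compile_graph=True):
    def train_step(lr, hr):
        with tf.GradientTape() as tape:
            sr = model(lr, training=True)
            loss_value = loss_fn(hr, sr)
        gradients = tape.gradient(loss_value, model.trainable_variables)
        optimizer.apply_gradients(zip(gradients, model.trainable_variables))
        return loss_value
    # Wrap in tf.function only on request; plain eager execution avoids the freeze.
    return tf.function(train_step) if compile_graph else train_step
```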

Is there something wrong with the function that makes it incompatible with @tf.function?

Good question. Again, this needs to be investigated.
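One way to investigate is to keep the decorator but force eager execution globally, which isolates whether graph tracing is the culprit. For example (the exact API name depends on the TF version):

```python
import tensorflow as tf

# Debugging aid: force all @tf.function-decorated functions to run eagerly
# without touching the decorator. In TF 2.1 this lives under
# tf.config.experimental_run_functions_eagerly; newer releases expose
# tf.config.run_functions_eagerly instead.
tf.config.experimental_run_functions_eagerly(True)
```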

I'll leave this ticket open until I've fixed these issues with WDSR and SRGAN+WDSR in master, and hopefully by then I'll have answers to your questions.

pythonmobile commented 4 years ago

Has this been fixed yet for TF 2.1? Thanks.