tensorflow / tpu

Reference models and tools for Cloud TPUs.
https://cloud.google.com/tpu/
Apache License 2.0

MRCNN SpineNet-143 and SpineNet-190 checkpoints are broken #775

apls777 closed this issue 4 years ago

apls777 commented 4 years ago

The easiest way to check this is to run the inspect_checkpoint.py script:

python inspect_checkpoint.py --file_name=data/spinenet-190/model.ckpt --all_tensors

It prints the variables and then, at some point, fails with the following error:

...
tensor_name:  spinenet/sub_policy12/resample_with_alpha_resample_12_0/conv2d_1/kernel
Read less bytes than requested

The same happens for the SpineNet-143 checkpoint:

...
tensor_name:  spinenet/sub_policy7/resample_with_alpha_resample_7_1/conv2d_1/kernel
Read less bytes than requested
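
To list every affected variable instead of inferring it from where the printed output stops, something like the following minimal sketch could be used. It assumes a reader object exposing `get_variable_to_shape_map()` and `get_tensor(name)`, which is what `tf.train.load_checkpoint` returns in TensorFlow. The helper names `find_unreadable_tensors` and `main` are made up for this illustration and are not part of inspect_checkpoint.py.

```python
def find_unreadable_tensors(reader):
    """Try to read every variable; return {name: error message} for failures."""
    failures = {}
    for name in sorted(reader.get_variable_to_shape_map()):
        try:
            reader.get_tensor(name)
        except Exception as err:
            # Broad catch on purpose: the exact exception type raised for a
            # truncated data file is an assumption, so every failure is recorded.
            failures[name] = str(err)
    return failures


def main(checkpoint_prefix):
    """Print each unreadable variable of the checkpoint at checkpoint_prefix."""
    # Imported lazily so the helper above can be exercised without TensorFlow.
    import tensorflow as tf

    reader = tf.train.load_checkpoint(checkpoint_prefix)
    for name, message in find_unreadable_tensors(reader).items():
        print(f"{name}: {message}")


# Example call (path taken from the command above):
# main("data/spinenet-190/model.ckpt")
```

An empty result would suggest that all variables listed in the index can actually be read from the data file.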

Moreover, the data files in the SpineNet-143 and SpineNet-190 checkpoints have exactly the same size:

$ ls -la data/spinenet-143
total 591392
drwxr-xr-x 2 root   root       4096 May 17 14:09 .
drwxr-xr-x 8 root   root       4096 May 17 14:29 ..
-rw-rw-r-- 1 655311 89939        78 Apr  9 05:26 checkpoint
-rw-rw-r-- 1 655311 89939 536870913 Apr  9 05:25 model.ckpt.data-00000-of-00001
-rw-rw-r-- 1 655311 89939     29378 Apr  9 05:25 model.ckpt.index
-rw-rw-r-- 1 655311 89939  68659297 Apr  9 05:25 model.ckpt.meta
$ ls -la data/spinenet-190
total 619880
drwxr-xr-x 2 root   root       4096 May 17 14:21 .
drwxr-xr-x 8 root   root       4096 May 17 14:29 ..
-rw-rw-r-- 1 655311 89939        78 May 16 18:40 checkpoint
-rw-rw-r-- 1 655311 89939 536870913 May 16 18:36 model.ckpt.data-00000-of-00001
-rw-rw-r-- 1 655311 89939     37559 May 16 18:36 model.ckpt.index
-rw-rw-r-- 1 655311 89939  97822663 May 16 18:36 model.ckpt.meta

Both data files are exactly 536870913 bytes, one byte over 512 MiB, so it looks like the files were cut off at that size and the variables that didn't fit were dropped.
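
The size check can also be scripted. This is only a rough sketch: the function name `files_near_limit`, the 512 MiB default, and the one-byte slack are assumptions chosen to match the `ls` output above, not anything defined by TensorFlow.

```python
import os

MIB = 1024 * 1024


def files_near_limit(checkpoint_dirs, limit_bytes=512 * MIB, slack_bytes=1):
    """Return {path: size} for checkpoint data files sized within
    slack_bytes of limit_bytes, which hints at truncation."""
    flagged = {}
    for directory in checkpoint_dirs:
        for name in sorted(os.listdir(directory)):
            # Only look at the data shards, e.g. model.ckpt.data-00000-of-00001.
            if ".data-" not in name:
                continue
            path = os.path.join(directory, name)
            size = os.path.getsize(path)
            if abs(size - limit_bytes) <= slack_bytes:
                flagged[path] = size
    return flagged


# Example call (directories taken from the listing above):
# files_near_limit(["data/spinenet-143", "data/spinenet-190"])
```

Two different models producing data files that both land on the same round size is unlikely to be a coincidence, which is why matching sizes are worth flagging.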

Could you please re-upload those checkpoints to the GS bucket?

@xianzhidu @pengchongjin

pengchongjin commented 4 years ago

Thanks for digging into this.

We will investigate the compressed checkpoint and fix it soon.

xianzhidu commented 4 years ago

Thanks for letting us know. Checkpoints updated.

apls777 commented 4 years ago

Great! Thank you, Xianzhi!

raj-shah commented 3 years ago

@apls777 I'm still seeing a broken checkpoint for spinenet-190 in #913. Any chance you could confirm this still works?

apls777 commented 3 years ago

@raj-shah I can confirm that it worked a year ago when they re-uploaded this checkpoint. I guess they changed the codebase slightly but didn't update some checkpoints, so I would suggest you try to check out a 1-year-old version of this repo and try this checkpoint again.

raj-shah commented 3 years ago

@apls777 thanks for the tip! I suspected the same and have already tried all branches (up to and including r2.1) but sadly no luck!

apls777 commented 3 years ago

@raj-shah Try r1.15; it looks like the r1.x and r2.x branches are being updated independently.

raj-shah commented 3 years ago

@apls777 I seem to have missed r2.2.0; it works perfectly now. Thanks for looking into it!