Closed jhihn closed 5 years ago
@jhihn This means your checkpoint file got corrupted. Delete the latest version (i.e. the one with the largest global_step number) and try again and it should work.
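A minimal sketch of that cleanup step, assuming the usual TF 1.x naming scheme `model.ckpt-<global_step>.<suffix>` (the function name and directory layout here are illustrative, not part of TensorFlow's API):

```python
import os
import re
from collections import defaultdict

def delete_latest_checkpoint(ckpt_dir):
    """Group checkpoint files by global_step and delete the newest group.

    Assumes files named like model.ckpt-12345.index / .meta /
    .data-00000-of-00001. Returns the step removed, or None if no
    checkpoint files are present.
    """
    by_step = defaultdict(list)
    pattern = re.compile(r"\.ckpt-(\d+)\.")
    for name in os.listdir(ckpt_dir):
        m = pattern.search(name)
        if m:
            by_step[int(m.group(1))].append(name)
    if not by_step:
        return None
    latest = max(by_step)
    for name in by_step[latest]:
        os.remove(os.path.join(ckpt_dir, name))
    return latest
```

Note that the textual `checkpoint` state file in the same directory still points at the deleted step, so it must also be edited (or regenerated) before training will resume from the previous checkpoint.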
That's not really an acceptable fix. It happens regularly, and there's no reason it should. I think there's a problem with TF or with its checksumming. This is not a one-off. It "got corrupted" -- I question whether that's actually true.
Is there a way to validate the file? A checksum mismatch doesn't necessarily mean the data is "corrupt" -- how do I know the file wasn't generated corrupt to begin with? Why is this check included? Can we isolate the corruption further?
I've read the other DataLossError threads and they all get closed because of inactivity. I think there's a problem in TF that's not getting fixed.
@jhihn: Have you considered the possibility that there could be a problem with the disk in use?
TensorFlow checkpoint files contain checksums to guard against accidental modification of the stored data while it is being written, stored, or read back from the filesystem. These checksums are verified every time any TensorFlow program reads model weights back from a checkpoint, so the code that computes and validates them is widely exercised in practice.
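To illustrate the idea, here is a minimal sketch of checksum-on-read. (This is only an analogy: TensorFlow's record and checkpoint formats actually use a *masked CRC32C* per record, whereas this sketch uses the stdlib's plain CRC-32; the function names are hypothetical.)

```python
import zlib

def write_record(payload: bytes) -> bytes:
    """Prepend a CRC-32 of the payload, computed at write time."""
    crc = zlib.crc32(payload).to_bytes(4, "little")
    return crc + payload

def read_record(blob: bytes) -> bytes:
    """Re-compute the checksum and raise on mismatch -- the same idea
    behind TensorFlow raising DataLossError on a bad checkpoint."""
    stored = int.from_bytes(blob[:4], "little")
    payload = blob[4:]
    if zlib.crc32(payload) != stored:
        raise IOError("checksum mismatch: data changed between write and read")
    return payload
```

Flipping a single bit anywhere in the payload after it was written makes `read_record` raise, which is why a checksum failure points at the data having changed on disk (or in transit), not at the validation code itself.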
In any case, this has nothing to do with TensorFlow Hub.
I am the admin for a TensorFlow machine with dual 1080 Tis. The primary user is reporting repeated DataLossErrors. The disk this is happening on is an NVMe drive with 596 GB free. Kernel: 4.13.0-43-generic #48~16.04.1-Ubuntu SMP Thu May 17 12:56:46 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
It can run for hours, sometimes days, before this happens.
What can you recommend to prevent this, to log adequate information, or to otherwise deal with it? There is no reason I know of for the "corruption" to occur: virus scanning is disabled and there is only one active user.
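One practical mitigation is to keep several recent checkpoints rather than only the latest (in TF 1.x, `tf.train.Saver(max_to_keep=N)` does this), and to fall back to an older step when the newest one fails to restore, logging which step was unreadable. A stdlib-only sketch of that fallback logic, where `load_fn` is a hypothetical loader that raises on a corrupted file (as `Saver.restore` does via `DataLossError`):

```python
import logging

def restore_with_fallback(steps, load_fn):
    """Try checkpoints from newest to oldest until one loads.

    `steps` is a list of global_step numbers; `load_fn(step)` is a
    hypothetical loader that raises IOError on a corrupted checkpoint.
    Each failure is logged so the unreadable step can be investigated
    later, instead of silently losing that information.
    """
    for step in sorted(steps, reverse=True):
        try:
            return step, load_fn(step)
        except IOError as exc:
            logging.warning("checkpoint at step %d unreadable: %s", step, exc)
    raise IOError("no readable checkpoint found")
```

This turns an occasional bad checkpoint from a training-stopping event into a logged warning plus a restart from the previous step.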