tensorflow / models

Models and examples built with TensorFlow

model.ckpt.data-00000-of-00001: Data loss: not an sstable (bad magic number): perhaps your file is in a different file format and you need to use a different restore operator? #2676

Closed wpq3142 closed 6 years ago

wpq3142 commented 6 years ago

System information

Describe the problem

I downloaded the new checkpoint archive: faster_rcnn_inception_resnet_v2_atrous_coco_11_06_2017.tar.gz

rfcn_resnet101_coco.config:

model {
  faster_rcnn {
    num_classes: 37
    image_resizer {
      keep_aspect_ratio_resizer {
        min_dimension: 600
        max_dimension: 1024
      }
    }
    feature_extractor {
      type: 'faster_rcnn_inception_resnet_v2'
      first_stage_features_stride: 8
    }

Source code / logs

2017-11-01 15:11:40.186072: W tensorflow/core/util/tensor_slice_reader.cc:95] Could not open /home/wpq/data/potato/data/model.ckpt.data-00000-of-00001: Data loss: not an sstable (bad magic number): perhaps your file is in a different file format and you need to use a different restore operator?
Traceback (most recent call last):
  File "/home/wpq/workspace/models-master/research/object_detection/train.py", line 163, in <module>
    tf.app.run()
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/platform/app.py", line 48, in run
    _sys.exit(main(_sys.argv[:1] + flags_passthrough))
  File "/home/wpq/workspace/models-master/research/object_detection/train.py", line 159, in main
    worker_job_name, is_chief, FLAGS.train_dir)
  File "/home/wpq/workspace/models-master/research/object_detection/trainer.py", line 254, in train
    var_map, train_config.fine_tune_checkpoint))
  File "/home/wpq/workspace/models-master/research/object_detection/utils/variables_helper.py", line 122, in get_variables_available_in_checkpoint
    ckpt_reader = tf.train.NewCheckpointReader(checkpoint_path)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/pywrap_tensorflow_internal.py", line 150, in NewCheckpointReader
    return CheckpointReader(compat.as_bytes(filepattern), status)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/errors_impl.py", line 473, in __exit__
    c_api.TF_GetCode(self.status.status))
tensorflow.python.framework.errors_impl.DataLossError: Unable to open table file /home/wpq/data/potato/data/model.ckpt.data-00000-of-00001: Data loss: not an sstable (bad magic number): perhaps your file is in a different file format and you need to use a different restore operator?

Process finished with exit code 1

wpq3142 commented 6 years ago

The checkpoint file format is inconsistent. See this post: http://votec.top/2016/12/24/tensorflow-r12-tf-train-Saver/

Change slim.get_or_create_global_step() to tf.train.get_or_create_global_step().

scotthuang1989 commented 6 years ago

@wpq3142 the exception is raised here:

ckpt_reader = tf.train.NewCheckpointReader(checkpoint_path)

I haven't dug into the implementation of this API, but I suppose it is meant for the new format.

jart commented 6 years ago

I'm assuming the model code here would need to be updated to determine which format the checkpoint is written in and then use the correct API. If so, that sounds like a straightforward change, and we'd welcome contributions helping to clean up the model.
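
A minimal sketch of the kind of format check described here, assuming the usual on-disk layout: a V2 checkpoint is addressed by a prefix and has an accompanying `.index` file, while a V1 checkpoint is a single file at the given path. The helper name `guess_checkpoint_format` is hypothetical and is not part of TensorFlow or the object detection code.

```python
import os


def guess_checkpoint_format(prefix):
    """Guess the checkpoint format from the files present on disk.

    Hypothetical helper, not a TensorFlow API.
    Returns "v2", "v1" or "missing".
    """
    # V2 checkpoints are addressed by a prefix; the index file sits next to it.
    if os.path.isfile(prefix + ".index"):
        return "v2"
    # A V1 checkpoint is a single file whose name is the path itself.
    if os.path.isfile(prefix):
        return "v1"
    return "missing"
```

Note that this heuristic would report a single V2 shard such as model.ckpt.data-00000-of-00001 as "v1" if that file name is passed directly, so it only makes sense when called with the prefix.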

tombstone commented 6 years ago

@wpq3142 Can you tell us how you are configuring this particular entry in the config: fine_tune_checkpoint: "PATH_TO_BE_CONFIGURED/model.ckpt".

It should look like fine_tune_checkpoint: "/home/wpq/data/potato/data/model.ckpt"

Moreover, it looks like you are using rfcn_resnet101_coco.config with a faster_rcnn_inception_resnet_v2_atrous_coco_11_06_2017 checkpoint. These two are not compatible. You need to use rfcn_resnet101_coco_11_06_2017.tar.gz with rfcn_resnet101_coco.config.

wpq3142 commented 6 years ago

@tombstone

I downloaded the latest model and it is working now. The configuration is as follows:

--clone_on_cpu true --logtostderr --pipeline_config_path /home/wpq/data/potato/model/faster_rcnn_nas_coco.config --train_dir /home/wpq/data/potato/model/train

One reason for the earlier failure seems to be that I was missing a space between keys and values.

paulrich1234 commented 6 years ago

You just need to restore the .ckpt file, not the .ckpt.meta file. Something like this:

sess = tf.Session()
saver.restore(sess, 'mymodel/model100-500-0.998.ckpt')

pbashivan commented 5 years ago

Apparently in V2 checkpoints, you should only include the filename up to ".ckpt". For instance if the checkpoint filename is model.ckpt.data-00000-of-00001 then you should only use model.ckpt. Using the full filename leads to getting a DataLossError.
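
As an illustration of the rule above, here is a small sketch that turns any of the per-file names into the prefix that should be passed to the restore call. The helper name `checkpoint_prefix` and the regular expression are assumptions made for this example, not part of TensorFlow.

```python
import re

# Suffixes commonly found next to a V2 checkpoint prefix.
_SUFFIX = re.compile(r"\.(data-\d+-of-\d+|index|meta)$")


def checkpoint_prefix(path):
    """Strip a trailing .data-XXXXX-of-XXXXX, .index or .meta suffix.

    Hypothetical helper; paths without such a suffix are returned unchanged.
    """
    return _SUFFIX.sub("", path)


print(checkpoint_prefix("model.ckpt.data-00000-of-00001"))  # model.ckpt
print(checkpoint_prefix("model.ckpt-1000000.index"))        # model.ckpt-1000000
```

The same function leaves an already correct prefix such as model.ckpt untouched, so it is safe to apply before every restore.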

praneethpj commented 5 years ago

Quoting @pbashivan: "Apparently in V2 checkpoints, you should only include the filename up to ".ckpt". For instance if the checkpoint filename is model.ckpt.data-00000-of-00001 then you should only use model.ckpt. Using the full filename leads to getting a DataLossError."

@pbashivan thank you so much

shellyfung commented 5 years ago

I fixed the issue by replacing model.ckpt with model.ckpt-200000, where 200000 is your checkpoint number.

codexponent commented 5 years ago

Solved on #7696

Rajamohanreddyai commented 5 years ago

Hello all, just follow the video below and export your own model within 10 seconds.

https://youtu.be/w0Ebsbz7HYA

phosseini commented 5 years ago

Quoting @pbashivan: "Apparently in V2 checkpoints, you should only include the filename up to ".ckpt". For instance if the checkpoint filename is model.ckpt.data-00000-of-00001 then you should only use model.ckpt. Using the full filename leads to getting a DataLossError."

This works. In my case, I used the longest common prefix among my checkpoint-related files, which was model.ckpt-1000000, and it worked for me. I had the following three files in my folder:

model.ckpt-1000000.data-00000-of-00001
model.ckpt-1000000.index
model.ckpt-1000000.meta

I just thought this might be the case for some folks.
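
The "longest common prefix" idea can be sketched with the standard library. This is only an illustration for the three file names listed in this comment. Because `os.path.commonprefix` compares character by character, the shared dot before each extension is part of the result and has to be stripped.

```python
import os

files = [
    "model.ckpt-1000000.data-00000-of-00001",
    "model.ckpt-1000000.index",
    "model.ckpt-1000000.meta",
]

# commonprefix returns "model.ckpt-1000000." (with the trailing dot),
# so remove the dot to get the prefix expected by the restore call.
prefix = os.path.commonprefix(files).rstrip(".")
print(prefix)  # model.ckpt-1000000
```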

patspeis commented 5 years ago

I was running into this, and the fix above worked for me. All I had to do was run the following on my Windows 10 x64 machine:

python export_inference_graph.py --input_type image_tensor --pipeline_config_path ssd_mobilenet_v1_coco.config --trained_checkpoint_prefix models\model.ckpt-1000 --output_directory tuned_model

Instead of:

python export_inference_graph.py --input_type image_tensor --pipeline_config_path ssd_mobilenet_v1_coco.config --trained_checkpoint_prefix models\model.ckpt-1000.data-###-### --output_directory tuned_model

tl;dr Don't reference single files in the --trained_checkpoint_prefix flag. Just reference the prefix shared by those three files.

Hope it helps.

anjani-dhrangadhariya commented 4 years ago

@phosseini is correct. The model itself is made up of three different files with three different extensions showing what kind of model data each file stores.

For me too, using the longest shared file name prefix solved the issue.

model.ckpt-1000000.data-00000-of-00001
model.ckpt-1000000.index
model.ckpt-1000000.meta

kamrankausar commented 4 years ago

tensorflow.python.framework.errors_impl.DataLossError: Unable to open table file ./model_dir/model.ckpt-1000000.data-00000-of-00001: Data loss: not an sstable (bad magic number): perhaps your file is in a different file format and you need to use a different restore operator?

snrnsrk06 commented 4 years ago

I am trying to run an open-source project. The code saves files as model-10.data-0000-of-0001, .index and .meta. The part of the code that saves and restores the files is shown below:

saver = tf.train.Saver(max_to_keep=50)

if self.pretrained_model is not None:
    print("Start training with pretrained Model..")
    saver.restore(sess, self.pretrained_model)

if (e + 1) % self.save_every == 0:
    saver.save(sess, self.model_path + 'model', global_step=e + 1)
    print("model-%s saved." % (e + 1))

One of the solutions in this issue is to change the file name:

model.ckpt-1000000.data-00000-of-00001
model.ckpt-1000000.index
model.ckpt-1000000.meta

How should I change the code in my situation? How do I change the file name? It looks like the save method determines the file name automatically. Or should I change the file name manually?

/////////////////////////////////////////////////////////////////////////////////////////////

The save call can be changed to:

if (e + 1) % self.save_every == 0:
    saver.save(sess, self.model_path + 'model.ckpt', global_step=e + 1)
    print("model-%s saved." % (e + 1))

but that alone is not enough. The restore call also needs the right name:

saver.restore(sess, self.model_path + cur_model2)

Here cur_model is 'model.ckpt-50.data-0000-of-0001' (or the matching .index or .meta file), and cur_model2 is derived from it:

# Keep everything up to the first '.' that follows the first '-'.
cur_model2 = cur_model[0:cur_model.find('-') + cur_model[cur_model.find('-'):].find('.')]
saver.restore(sess, self.model_path + cur_model2)

Just pass this file name to restore. cur_model2 is 'model.ckpt-50'.
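
As an alternative to slicing one file name by hand, the prefix with the highest step number can be looked up in the model directory. This is a sketch under the assumption that checkpoints were saved with `global_step`, so that each one has a file named `<name>-<step>.index`. The helper `latest_prefix` is hypothetical.

```python
import os
import re

# Matches index files such as "model.ckpt-50.index" and captures the prefix
# ("model.ckpt-50") and the step number ("50").
_INDEX = re.compile(r"^(?P<prefix>.+-(?P<step>\d+))\.index$")


def latest_prefix(model_dir):
    """Return the checkpoint prefix with the highest step, or None."""
    best = None
    for name in os.listdir(model_dir):
        match = _INDEX.match(name)
        if match is None:
            continue
        step = int(match.group("step"))
        if best is None or step > best[0]:
            best = (step, match.group("prefix"))
    if best is None:
        return None
    return os.path.join(model_dir, best[1])
```

The returned value could then be passed to saver.restore. TensorFlow also provides tf.train.latest_checkpoint(model_dir), which reads the `checkpoint` state file written by the saver and returns the same kind of prefix.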

Rajput245 commented 4 years ago

None of the above worked. model.ckpt-1000000, model.ckpt-1000000.index and model.ckpt-1000000.meta solved this problem for me.

dome272 commented 4 years ago

Quoting @pbashivan: "Apparently in V2 checkpoints, you should only include the filename up to ".ckpt". For instance if the checkpoint filename is model.ckpt.data-00000-of-00001 then you should only use model.ckpt. Using the full filename leads to getting a DataLossError."

you are a legend

mikelty commented 3 years ago

In some models, it could also be caused by a missing .meta file and/or .index file.
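
A quick way to check for this case is to list which companion files are absent for a given prefix. The following is a sketch only. The helper name `missing_checkpoint_files` is hypothetical, and it assumes the usual naming of `<prefix>.index`, `<prefix>.meta` and `<prefix>.data-XXXXX-of-XXXXX`.

```python
import glob
import os


def missing_checkpoint_files(prefix):
    """List which companion files are absent for a checkpoint prefix."""
    missing = []
    if not os.path.isfile(prefix + ".index"):
        missing.append(".index")
    if not os.path.isfile(prefix + ".meta"):
        missing.append(".meta")
    # Data shards are named <prefix>.data-XXXXX-of-XXXXX; escape the prefix
    # so that special characters in the path are not treated as wildcards.
    if not glob.glob(glob.escape(prefix) + ".data-*-of-*"):
        missing.append(".data-*")
    return missing
```

An empty list means all three kinds of files are present. Whether a missing .meta file actually matters depends on how the model is loaded, since it is typically needed only when the graph is rebuilt from the meta file.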

BassantTolba1234 commented 3 years ago

Hello all. After I trained the TensorFlow session, my files are not named like .ckpt.data:

model.ckpt-1000000.data-00000-of-00001
model.ckpt-1000000.index
model.ckpt-1000000.meta

Instead I have:

Pretrained.data-00000-of-00001
Pretrained.index
Pretrained.meta

What should I do to solve the above data loss problem with these saved files?

saramsv commented 3 years ago

Quoting @Rajput245: "none of the above worked. model.ckpt-1000000 model.ckpt-1000000.index model.ckpt-1000000.meta solved this problem for me.."

@Rajput245 I have the same problem. Were you able to fix it?

joan-yanqiong commented 2 years ago

Hi guys, I don't know if this is still a problem for you, but I had the following files:

model.ckpt-100000.data-00000-of-00001
model.ckpt-100000.index
model.ckpt-100000.meta

Loading worked when I used the following code:

import tensorflow.compat.v1 as tf
import tf_slim as slim

# Checkpoint prefix: the part of the file name shared by all three files.
checkpoint_path = "absolute_path_to/model.ckpt-100000"

# model_variables is not defined in this snippet; it is assumed to come
# from the surrounding code.
init_fn = slim.assign_from_checkpoint_fn(
    checkpoint_path, slim.get_model_variables(model_variables))
sess = tf.Session()
init_fn(sess)

I hope this helps you!

pinzhi000 commented 2 years ago

In my situation I don't have "ckpt" at all.

I just have the following 2 files: [screenshot not preserved]

What do I do?

joan-yanqiong commented 2 years ago

I would maybe try to just add the ckpt after 'variables'.

pinzhi000 commented 2 years ago

I just resolved this issue. I saved the model as a .h5 file and that worked.

yohannesSM commented 2 years ago

import tensorflow as tf
from tensorflow.python.training import checkpoint_utils as cp

print(cp.list_variables('path/model_name.ckpt'))

Use only the model name up to the .ckpt part. Do not add the other magic numbers.