thtrieu / darkflow

Translate darknet to tensorflow. Load trained weights, retrain/fine-tune using tensorflow, export constant graph def to mobile devices
GNU General Public License v3.0

Can't load last checkpoint #777

Open mohamedabdallah1996 opened 6 years ago

mohamedabdallah1996 commented 6 years ago

After training the model for many epochs, I ran a forward pass on some images to check the detections with this command:

flow --imgdir sample_img/ --model "cfg/tiny-yolo-voc-logos.cfg" --load -1 --gpu 1.0 --json --threshold 0.001

and I got this error:

Parsing cfg/tiny-yolo-voc-logos.cfg
Loading None ...
Finished in 9.918212890625e-05s

Building net ...
Source | Train? | Layer description                | Output size
-------+--------+----------------------------------+---------------
       |        | input                            | (?, 416, 416, 3)
 Init  |  Yep!  | conv 3x3p1_1  +bnorm  leaky      | (?, 416, 416, 16)
 Load  |  Yep!  | maxp 2x2p0_2                     | (?, 208, 208, 16)
 Init  |  Yep!  | conv 3x3p1_1  +bnorm  leaky      | (?, 208, 208, 32)
 Load  |  Yep!  | maxp 2x2p0_2                     | (?, 104, 104, 32)
 Init  |  Yep!  | conv 3x3p1_1  +bnorm  leaky      | (?, 104, 104, 64)
 Load  |  Yep!  | maxp 2x2p0_2                     | (?, 52, 52, 64)
 Init  |  Yep!  | conv 3x3p1_1  +bnorm  leaky      | (?, 52, 52, 128)
 Load  |  Yep!  | maxp 2x2p0_2                     | (?, 26, 26, 128)
 Init  |  Yep!  | conv 3x3p1_1  +bnorm  leaky      | (?, 26, 26, 256)
 Load  |  Yep!  | maxp 2x2p0_2                     | (?, 13, 13, 256)
 Init  |  Yep!  | conv 3x3p1_1  +bnorm  leaky      | (?, 13, 13, 512)
 Load  |  Yep!  | maxp 2x2p0_1                     | (?, 13, 13, 512)
 Init  |  Yep!  | conv 3x3p1_1  +bnorm  leaky      | (?, 13, 13, 1024)
 Init  |  Yep!  | conv 3x3p1_1  +bnorm  leaky      | (?, 13, 13, 1024)
 Init  |  Yep!  | conv 1x1p0_1    linear           | (?, 13, 13, 185)
-------+--------+----------------------------------+---------------
GPU mode with 1.0 usage
2018-05-25 22:03:37.945042: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:898] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2018-05-25 22:03:37.945750: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1356] Found device 0 with properties: 
name: Tesla K80 major: 3 minor: 7 memoryClockRate(GHz): 0.8235
pciBusID: 0000:00:04.0
totalMemory: 11.17GiB freeMemory: 11.10GiB
2018-05-25 22:03:37.945804: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1435] Adding visible gpu devices: 0
2018-05-25 22:03:38.209208: I tensorflow/core/common_runtime/gpu/gpu_device.cc:923] Device interconnect StreamExecutor with strength 1 edge matrix:
2018-05-25 22:03:38.209299: I tensorflow/core/common_runtime/gpu/gpu_device.cc:929]      0 
2018-05-25 22:03:38.209343: I tensorflow/core/common_runtime/gpu/gpu_device.cc:942] 0:   N 
2018-05-25 22:03:38.209776: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1053] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 11439 MB memory) -> physical GPU (device: 0, name: Tesla K80, pci bus id: 0000:00:04.0, compute capability: 3.7)
2018-05-25 22:03:38.247796: E tensorflow/stream_executor/cuda/cuda_driver.cc:936] failed to allocate 11.17G (11995578368 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY
4
Traceback (most recent call last):
  File "/usr/local/bin/flow", line 6, in <module>
    exec(compile(open(__file__).read(), __file__, 'exec'))
  File "/content/drive/CoLab/YOLO/darkflow/flow", line 6, in <module>
    cliHandler(sys.argv)
  File "/content/drive/CoLab/YOLO/darkflow/darkflow/cli.py", line 26, in cliHandler
    tfnet = TFNet(FLAGS)
  File "/content/drive/CoLab/YOLO/darkflow/darkflow/net/build.py", line 76, in __init__
    self.setup_meta_ops()
  File "/content/drive/CoLab/YOLO/darkflow/darkflow/net/build.py", line 151, in setup_meta_ops
    if self.FLAGS.load != 0: self.load_from_ckpt()
  File "/content/drive/CoLab/YOLO/darkflow/darkflow/net/help.py", line 25, in load_from_ckpt
    last = f.readlines()[-1].strip()
IndexError: list index out of range

How can I load the last checkpoint? Any suggestions, please!

youyuge34 commented 6 years ago

Please check the checkpoint file in the ./ckpt/ directory; you can edit its contents so that the last line points to a checkpoint that actually exists. Or just pass --load 1500 on the command line instead of -1, where 1500 is the step number from the checkpoint file names under ./ckpt/.
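For reference, the file that --load -1 reads is the standard TensorFlow Saver index at ./ckpt/checkpoint; darkflow takes its last line and parses the step number out of it, so an empty file produces exactly the IndexError shown above. With this model it would look roughly like the sketch below (the step numbers are only illustrative, not taken from this thread):

model_checkpoint_path: "tiny-yolo-voc-logos-1500"
all_model_checkpoint_paths: "tiny-yolo-voc-logos-1375"
all_model_checkpoint_paths: "tiny-yolo-voc-logos-1500"

If the last line names a checkpoint that is no longer on disk, edit or remove that line, or pass --load with a step whose files are still present.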

mohamedabdallah1996 commented 6 years ago

I tried --load 1500, but it turns out the ckpt directory does not keep all of the checkpoints. I got this error:

Parsing cfg/tiny-yolo-voc-logos.cfg
Loading None ...
Finished in 7.343292236328125e-05s

Building net ...
Source | Train? | Layer description                | Output size
-------+--------+----------------------------------+---------------
       |        | input                            | (?, 416, 416, 3)
 Init  |  Nope  | conv 3x3p1_1  +bnorm  leaky      | (?, 416, 416, 16)
 Load  |  Nope  | maxp 2x2p0_2                     | (?, 208, 208, 16)
 Init  |  Nope  | conv 3x3p1_1  +bnorm  leaky      | (?, 208, 208, 32)
 Load  |  Nope  | maxp 2x2p0_2                     | (?, 104, 104, 32)
 Init  |  Nope  | conv 3x3p1_1  +bnorm  leaky      | (?, 104, 104, 64)
 Load  |  Nope  | maxp 2x2p0_2                     | (?, 52, 52, 64)
 Init  |  Nope  | conv 3x3p1_1  +bnorm  leaky      | (?, 52, 52, 128)
 Load  |  Nope  | maxp 2x2p0_2                     | (?, 26, 26, 128)
 Init  |  Nope  | conv 3x3p1_1  +bnorm  leaky      | (?, 26, 26, 256)
 Load  |  Nope  | maxp 2x2p0_2                     | (?, 13, 13, 256)
 Init  |  Nope  | conv 3x3p1_1  +bnorm  leaky      | (?, 13, 13, 512)
 Load  |  Nope  | maxp 2x2p0_1                     | (?, 13, 13, 512)
 Init  |  Nope  | conv 3x3p1_1  +bnorm  leaky      | (?, 13, 13, 1024)
 Init  |  Yep!  | conv 3x3p1_1  +bnorm  leaky      | (?, 13, 13, 1024)
 Init  |  Yep!  | conv 1x1p0_1    linear           | (?, 13, 13, 185)
-------+--------+----------------------------------+---------------
GPU mode with 1.0 usage
2018-05-29 15:11:05.646865: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:898] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2018-05-29 15:11:05.647465: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1356] Found device 0 with properties: 
name: Tesla K80 major: 3 minor: 7 memoryClockRate(GHz): 0.8235
pciBusID: 0000:00:04.0
totalMemory: 11.17GiB freeMemory: 11.10GiB
2018-05-29 15:11:05.647549: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1435] Adding visible gpu devices: 0
2018-05-29 15:11:05.910917: I tensorflow/core/common_runtime/gpu/gpu_device.cc:923] Device interconnect StreamExecutor with strength 1 edge matrix:
2018-05-29 15:11:05.910995: I tensorflow/core/common_runtime/gpu/gpu_device.cc:929]      0 
2018-05-29 15:11:05.911046: I tensorflow/core/common_runtime/gpu/gpu_device.cc:942] 0:   N 
2018-05-29 15:11:05.911403: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1053] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 11439 MB memory) -> physical GPU (device: 0, name: Tesla K80, pci bus id: 0000:00:04.0, compute capability: 3.7)
2018-05-29 15:11:05.948364: E tensorflow/stream_executor/cuda/cuda_driver.cc:936] failed to allocate 11.17G (11995578368 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY
./ckpt/
Loading from ./ckpt/tiny-yolo-voc-logos-150
2018-05-29 15:11:06.458227: W tensorflow/core/framework/op_kernel.cc:1318] OP_REQUIRES failed at save_restore_tensor.cc:170 : Not found: Unsuccessful TensorSliceReader constructor: Failed to find any matching files for ./ckpt/tiny-yolo-voc-logos-150
2018-05-29 15:11:06.458929: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1435] Adding visible gpu devices: 0
2018-05-29 15:11:06.459025: I tensorflow/core/common_runtime/gpu/gpu_device.cc:923] Device interconnect StreamExecutor with strength 1 edge matrix:
2018-05-29 15:11:06.459056: I tensorflow/core/common_runtime/gpu/gpu_device.cc:929]      0 
2018-05-29 15:11:06.459078: I tensorflow/core/common_runtime/gpu/gpu_device.cc:942] 0:   N 
2018-05-29 15:11:06.459267: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1053] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 11439 MB memory) -> physical GPU (device: 0, name: Tesla K80, pci bus id: 0000:00:04.0, compute capability: 3.7)
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 1322, in _do_call
    return fn(*args)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 1307, in _run_fn
    options, feed_dict, fetch_list, target_list, run_metadata)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 1409, in _call_tf_sessionrun
    run_metadata)
tensorflow.python.framework.errors_impl.NotFoundError: Unsuccessful TensorSliceReader constructor: Failed to find any matching files for ./ckpt/tiny-yolo-voc-logos-150
     [[Node: save/RestoreV2 = RestoreV2[dtypes=[DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT], _device="/job:localhost/replica:0/task:0/device:CPU:0"](_arg_save/Const_0_0, save/RestoreV2/tensor_names, save/RestoreV2/shape_and_slices)]]

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/content/drive/CoLab/YOLO/darkflow/darkflow/net/help.py", line 37, in load_from_ckpt
    try: self.saver.restore(self.sess, load_point)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/training/saver.py", line 1802, in restore
    {self.saver_def.filename_tensor_name: save_path})
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 900, in run
    run_metadata_ptr)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 1135, in _run
    feed_dict_tensor, options, run_metadata)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 1316, in _do_run
    run_metadata)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 1335, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.NotFoundError: Unsuccessful TensorSliceReader constructor: Failed to find any matching files for ./ckpt/tiny-yolo-voc-logos-150
     [[Node: save/RestoreV2 = RestoreV2[dtypes=[DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT], _device="/job:localhost/replica:0/task:0/device:CPU:0"](_arg_save/Const_0_0, save/RestoreV2/tensor_names, save/RestoreV2/shape_and_slices)]]

Caused by op 'save/RestoreV2', defined at:
  File "/usr/local/bin/flow", line 6, in <module>
    exec(compile(open(__file__).read(), __file__, 'exec'))
  File "/content/drive/CoLab/YOLO/darkflow/flow", line 6, in <module>
    cliHandler(sys.argv)
  File "/content/drive/CoLab/YOLO/darkflow/darkflow/cli.py", line 26, in cliHandler
    tfnet = TFNet(FLAGS)
  File "/content/drive/CoLab/YOLO/darkflow/darkflow/net/build.py", line 76, in __init__
    self.setup_meta_ops()
  File "/content/drive/CoLab/YOLO/darkflow/darkflow/net/build.py", line 150, in setup_meta_ops
    max_to_keep = self.FLAGS.keep)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/training/saver.py", line 1338, in __init__
    self.build()
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/training/saver.py", line 1347, in build
    self._build(self._filename, build_save=True, build_restore=True)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/training/saver.py", line 1384, in _build
    build_save=build_save, build_restore=build_restore)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/training/saver.py", line 835, in _build_internal
    restore_sequentially, reshape)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/training/saver.py", line 472, in _AddRestoreOps
    restore_sequentially)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/training/saver.py", line 886, in bulk_restore
    return io_ops.restore_v2(filename_tensor, names, slices, dtypes)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/gen_io_ops.py", line 1463, in restore_v2
    shape_and_slices=shape_and_slices, dtypes=dtypes, name=name)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
    op_def=op_def)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/ops.py", line 3392, in create_op
    op_def=op_def)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/ops.py", line 1718, in __init__
    self._traceback = self._graph._extract_stack()  # pylint: disable=protected-access

NotFoundError (see above for traceback): Unsuccessful TensorSliceReader constructor: Failed to find any matching files for ./ckpt/tiny-yolo-voc-logos-150
     [[Node: save/RestoreV2 = RestoreV2[dtypes=[DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT], _device="/job:localhost/replica:0/task:0/device:CPU:0"](_arg_save/Const_0_0, save/RestoreV2/tensor_names, save/RestoreV2/shape_and_slices)]]

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/bin/flow", line 6, in <module>
    exec(compile(open(__file__).read(), __file__, 'exec'))
  File "/content/drive/CoLab/YOLO/darkflow/flow", line 6, in <module>
    cliHandler(sys.argv)
  File "/content/drive/CoLab/YOLO/darkflow/darkflow/cli.py", line 26, in cliHandler
    tfnet = TFNet(FLAGS)
  File "/content/drive/CoLab/YOLO/darkflow/darkflow/net/build.py", line 76, in __init__
    self.setup_meta_ops()
  File "/content/drive/CoLab/YOLO/darkflow/darkflow/net/build.py", line 151, in setup_meta_ops
    if self.FLAGS.load != 0: self.load_from_ckpt()
  File "/content/drive/CoLab/YOLO/darkflow/darkflow/net/help.py", line 38, in load_from_ckpt
    except: load_old_graph(self, load_point)
  File "/content/drive/CoLab/YOLO/darkflow/darkflow/net/help.py", line 49, in load_old_graph
    ckpt_loader = create_loader(ckpt)
  File "/content/drive/CoLab/YOLO/darkflow/darkflow/utils/loader.py", line 105, in create_loader
    return load_type(path, cfg)
  File "/content/drive/CoLab/YOLO/darkflow/darkflow/utils/loader.py", line 19, in __init__
    self.load(*args)
  File "/content/drive/CoLab/YOLO/darkflow/darkflow/utils/loader.py", line 89, in load
    saver = tf.train.import_meta_graph(meta)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/training/saver.py", line 1947, in import_meta_graph
    meta_graph_def = meta_graph.read_meta_graph_file(meta_graph_or_file)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/meta_graph.py", line 632, in read_meta_graph_file
    raise IOError("File %s does not exist." % filename)
OSError: File ./ckpt/tiny-yolo-voc-logos-150.meta does not exist.

youyuge34 commented 6 years ago

@mohamedabdallah1996 I used --load 1500 just as an example; you must change 1500 to your own step number. Check the ./ckpt/ directory to see the saved checkpoint files and read your own step number out of the checkpoint file names. And I strongly recommend the tutorial here 👉 youtube
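As a concrete sketch (the step number 1500 and the file names below are placeholders, not taken from this thread), you list the backup directory and then pass a step that actually appears there:

ls ./ckpt/
checkpoint
tiny-yolo-voc-logos-1500.data-00000-of-00001
tiny-yolo-voc-logos-1500.index
tiny-yolo-voc-logos-1500.meta

flow --imgdir sample_img/ --model cfg/tiny-yolo-voc-logos.cfg --load 1500 --gpu 1.0 --json --threshold 0.001

The NotFoundError above simply means there are no tiny-yolo-voc-logos-150.* files in ./ckpt/, so whatever step is passed to --load has to match files that exist.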

sharoseali commented 6 years ago

@youyuge34 I am also getting this error and I don't know why. I have already followed the Mark Jay YOLO series and trained on 2000 images, but when I test my model from the command prompt (cmd) I get extremely bad results even with threshold = 0.00001; with any threshold higher than that, the model doesn't detect anything.

flow --pbLoad built_graph/tiny-yolo-voc-1c.pb --metaLoad built_graph/tiny-yolo-voc-1c.meta --threshold 0.00001 --imgdir imgess/

I got this result (on image 000021). Please help!

youyuge34 commented 6 years ago

@sharoseali It seems that your training failed. Search for your problem in the issues; there are already lots of failure cases like yours. The author recommends first training on just a few (5 or 6) photos until the model overfits, then checking the detections on those same photos to make sure your model can work at all.
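Purely as an illustration of that overfitting sanity check (the cfg and weights names and the overfit_ann/ and overfit_imgs/ folders are placeholders, not from this thread), training on a handful of images and then running detection on those same images would look like:

flow --model cfg/tiny-yolo-voc-1c.cfg --load bin/tiny-yolo-voc.weights --train --annotation overfit_ann/ --dataset overfit_imgs/ --gpu 1.0 --epoch 300

flow --imgdir overfit_imgs/ --model cfg/tiny-yolo-voc-1c.cfg --load -1 --threshold 0.1

If boxes do not appear even on the training images, the problem is in the training setup (labels file, classes and filters in the cfg, annotation paths) rather than in the threshold.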

Ata1362 commented 5 years ago

For this loss, there is a high chance that your training failed. BTW, make sure you are loading the correct weights as well; sometimes the wrong meta or weights file produces this result.