gustavz / yolo_for_tf_od_api

Files added or updated to be able to use yolo-darknet in TensorFlow's Object Detection API
https://github.com/rky0930/models/blob/object_detection_yolo/research/object_detection/README.md

yolo_grid_anchor_generator ->object has no len() #1

Open ilkarman opened 6 years ago

ilkarman commented 6 years ago

I was curious to give this a go and wasn't sure if the below error was due to this being python2 only or something else?

iliauk@ikobjtest:/data/models/research$ python object_detection/train.py     --logtostderr     --pipeline_config_path=/data/tf_od_api/models/yolov2_with_darknet19/pipeline.config      --train_dir=/mnt/trainyolo     --num_clones=2
WARNING:tensorflow:From /data/models/research/object_detection/trainer.py:210: create_global_step (from tensorflow.contrib.framework.python.ops.variables) is deprecated and will be removed in a future version.
Instructions for updating:
Please switch to tf.train.create_global_step
Traceback (most recent call last):
  File "object_detection/train.py", line 163, in <module>
    tf.app.run()
  File "/anaconda/envs/py35/lib/python3.5/site-packages/tensorflow/python/platform/app.py", line 48, in run
    _sys.exit(main(_sys.argv[:1] + flags_passthrough))
  File "object_detection/train.py", line 159, in main
    worker_job_name, is_chief, FLAGS.train_dir)
  File "/data/models/research/object_detection/trainer.py", line 228, in train
    clones = model_deploy.create_clones(deploy_config, model_fn, [input_queue])
  File "/data/models/research/slim/deployment/model_deploy.py", line 193, in create_clones
    outputs = model_fn(*args, **kwargs)
  File "/data/models/research/object_detection/trainer.py", line 165, in _create_losses
    prediction_dict = detection_model.predict(images)
  File "/data/models/research/object_detection/meta_architectures/ssd_meta_arch.py", line 270, in predict
    im_width=image_shape[2])
  File "/data/models/research/object_detection/core/anchor_generator.py", line 97, in generate
    len(feature_map_shape_list) != len(self.num_anchors_per_location())):
  File "/data/models/research/object_detection/anchor_generators/yolo_grid_anchor_generator.py", line 114, in num_anchors_per_location
    return [len(self._anchors)]
TypeError: object of type 'zip' has no len()
iliauk@ikobjtest:/data/models/research$

From the function:

  def num_anchors_per_location(self):
    """Returns the number of anchors per spatial location.

    Returns:
      a integer, one for expected feature map to be passed to
      the Generate function.
    """
    return [len(self._anchors)]

I changed to this to make it run:

return [len(list(self._anchors))]
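
For context, a minimal sketch of why `len()` fails here on Python 3. It assumes the generator builds its anchor pairs with the usual `zip(*[iter(x)] * 2)` pairing idiom; the variable names are illustrative, and the values are the first four anchors from the config further down. On Python 2 `zip` returns a list, so the original code would work there.

```python
# First four anchor values from the pipeline config in this thread.
flat_anchors = [1.3221, 1.73145, 3.19275, 4.00944]

# Assumed pairing idiom: groups consecutive values two at a time.
pairs = zip(*[iter(flat_anchors)] * 2)

try:
    len(pairs)
    zip_has_len = True
except TypeError:
    # Python 3: zip returns a lazy iterator, which has no length.
    zip_has_len = False

# Materialising the pairs once gives a real list that supports len().
pair_list = list(zip(*[iter(flat_anchors)] * 2))
num_anchors = len(pair_list)

print(zip_has_len, num_anchors)
```

Converting to a list where the pairs are first created, rather than inside `num_anchors_per_location`, keeps the pairs available for later use.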

But then ran into another issue:

 File "object_detection/train.py", line 159, in main
    worker_job_name, is_chief, FLAGS.train_dir)
  File "/data/models/research/object_detection/trainer.py", line 228, in train
    clones = model_deploy.create_clones(deploy_config, model_fn, [input_queue])
  File "/data/models/research/slim/deployment/model_deploy.py", line 193, in create_clones
    outputs = model_fn(*args, **kwargs)
  File "/data/models/research/object_detection/trainer.py", line 167, in _create_losses
    losses_dict = detection_model.loss(prediction_dict)
  File "/data/models/research/object_detection/meta_architectures/ssd_meta_arch.py", line 456, in loss
    keypoints)
  File "/data/models/research/object_detection/meta_architectures/ssd_meta_arch.py", line 561, in _assign_targets
    groundtruth_classes_with_background_list)
  File "/data/models/research/object_detection/core/target_assigner.py", line 444, in batch_assign_targets
    anchors, gt_boxes, gt_class_targets)
  File "/data/models/research/object_detection/core/target_assigner.py", line 155, in assign
    match = self._matcher.match(match_quality_matrix, **params)
  File "/data/models/research/object_detection/core/matcher.py", line 194, in match
    return Match(self._match(similarity_matrix, **params))
  File "/data/models/research/object_detection/matchers/argmax_matcher.py", line 175, in _match
    _match_when_rows_are_non_empty, _match_when_rows_are_empty)
  File "/anaconda/envs/py35/lib/python3.5/site-packages/tensorflow/python/util/deprecation.py", line 316, in new_func
    return func(*args, **kwargs)
  File "/anaconda/envs/py35/lib/python3.5/site-packages/tensorflow/python/ops/control_flow_ops.py", line 1855, in cond
    orig_res_t, res_t = context_t.BuildCondBranch(true_fn)
  File "/anaconda/envs/py35/lib/python3.5/site-packages/tensorflow/python/ops/control_flow_ops.py", line 1725, in BuildCondBranch
    original_result = fn()
  File "/data/models/research/object_detection/matchers/argmax_matcher.py", line 159, in _match_when_rows_are_non_empty
    forced_matches_ids = tf.cast(tf.argmax(similarity_matrix, 1), tf.int32)
  File "/anaconda/envs/py35/lib/python3.5/site-packages/tensorflow/python/util/deprecation.py", line 316, in new_func
    return func(*args, **kwargs)
  File "/anaconda/envs/py35/lib/python3.5/site-packages/tensorflow/python/ops/math_ops.py", line 205, in argmax
    return gen_math_ops.arg_max(input, axis, name=name, output_type=output_type)
  File "/anaconda/envs/py35/lib/python3.5/site-packages/tensorflow/python/ops/gen_math_ops.py", line 441, in arg_max
    name=name)
  File "/anaconda/envs/py35/lib/python3.5/site-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
    op_def=op_def)
  File "/anaconda/envs/py35/lib/python3.5/site-packages/tensorflow/python/framework/ops.py", line 2956, in create_op
    op_def=op_def)
  File "/anaconda/envs/py35/lib/python3.5/site-packages/tensorflow/python/framework/ops.py", line 1470, in __init__
    self._traceback = self._graph._extract_stack()  # pylint: disable=protected-access

InvalidArgumentError (see above for traceback): Reduction axis 1 is empty in shape [64,0]
         [[Node: Loss/Match_3/cond/ArgMax_1 = ArgMax[T=DT_FLOAT, Tidx=DT_INT32, output_type=DT_INT64, _device="/job:localhost/replica:0/task:0/device:GPU:0"](Loss/Match_3/cond/Shape_2/Switch:1, Loss/Match_3/cond/range/delta)]]
         [[Node: Loss/unstack/_1957 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_4982_Loss/unstack", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"]()]]
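
One plausible reading of the empty axis in `[64,0]`, not confirmed by the repo author at this point in the thread: on Python 3 a `zip` object can be consumed only once, so calling `list(self._anchors)` inside `num_anchors_per_location` would leave nothing for any later code that iterates over the anchors. The sketch below shows only the iterator behaviour, with illustrative names.

```python
# First four anchor values from the pipeline config in this thread.
flat_anchors = [1.3221, 1.73145, 3.19275, 4.00944]

# Lazy iterator on Python 3 (assumed to mirror how self._anchors is built).
anchors = zip(*[iter(flat_anchors)] * 2)

first_pass = list(anchors)   # consumes the iterator
second_pass = list(anchors)  # nothing left to yield

print(len(first_pass), len(second_pass))
```

If that is what happens, the anchor generator would produce zero anchors after the first call, which would match a similarity matrix with an empty second dimension.
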
rky0930 commented 6 years ago

Hi @ilkarman. I am using python2.7 as well. I want to reproduce this error. Can you post your config file?

ilkarman commented 6 years ago

Sorry for not being clear. I was using this with py35 and tensorflow-gpu==1.4, and was trying to figure out whether the error was due to the script being written for py2 (since it assumes zip has a length) or something else. After restarting everything, my error changed to OOM, which cannot be a genuine memory shortage (I reduced the batch size to 2, and around 64 should fit):

I don't know how it comes up with this shape:

[2,1,3968,2976,3]

Unless it forgets to resize my image? (BTW I've run this dataset on SSD, Faster-RCNN, etc many times on the OD-API)

 File "/anaconda/envs/py35/lib/python3.5/site-packages/tensorflow/contrib/slim/python/slim/learning.py", line 775, in train
    sv.stop(threads, close_summary_writer=True)
  File "/anaconda/envs/py35/lib/python3.5/contextlib.py", line 77, in __exit__
    self.gen.throw(type, value, traceback)
  File "/anaconda/envs/py35/lib/python3.5/site-packages/tensorflow/python/training/supervisor.py", line 964, in managed_session
    self.stop(close_summary_writer=close_summary_writer)
  File "/anaconda/envs/py35/lib/python3.5/site-packages/tensorflow/python/training/supervisor.py", line 792, in stop
    stop_grace_period_secs=self._stop_grace_secs)
  File "/anaconda/envs/py35/lib/python3.5/site-packages/tensorflow/python/training/coordinator.py", line 389, in join
    six.reraise(*self._exc_info_to_raise)
  File "/anaconda/envs/py35/lib/python3.5/site-packages/six.py", line 693, in reraise
    raise value
  File "/anaconda/envs/py35/lib/python3.5/site-packages/tensorflow/python/training/queue_runner_impl.py", line 238, in _run
    enqueue_callable()
  File "/anaconda/envs/py35/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1231, in _single_operation_run
    target_list_as_strings, status, None)
  File "/anaconda/envs/py35/lib/python3.5/site-packages/tensorflow/python/framework/errors_impl.py", line 473, in __exit__
    c_api.TF_GetCode(self.status.status))
tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape[2,1,3968,2976,3]
         [[Node: batch = QueueDequeueManyV2[component_types=[DT_STRING, DT_INT32, DT_FLOAT, DT_INT32, DT_FLOAT, DT_INT32, DT_INT64, DT_INT32, DT_INT64, DT_INT32, DT_INT64, DT_INT32, DT_BOOL, DT_INT32, DT_FLOAT, DT_INT32, DT_STRING, DT_INT32, DT_STRING, DT_INT32], timeout_ms=-1, _device="/job:localhost/replica:0/task:0/device:CPU:0"](batch/padding_fifo_queue, batch/n)]]
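
As a rough size check on that shape, here is a small worked calculation. It assumes the tensor holds float32 values (4 bytes each), which the log does not state explicitly, and compares it with the same batch at the 416x416 size set in the config.

```python
def tensor_bytes(shape, bytes_per_element=4):
    """Size in bytes of a dense tensor; 4 bytes per element assumes float32."""
    count = 1
    for dim in shape:
        count *= dim
    return count * bytes_per_element

MIB = 1024 * 1024

# Shape reported in the OOM message: 2 images of 3968x2976 with 3 channels.
full_res = tensor_bytes([2, 1, 3968, 2976, 3])
# The same batch if the images were already 416x416.
resized = tensor_bytes([2, 1, 416, 416, 3])

print(full_res, full_res / MIB)
print(resized, resized / MIB)
```

Under that assumption the reported tensor is about 270 MiB, against roughly 4 MiB at 416x416. The failing node is the batch dequeue on `device:CPU:0`, so the shape appears to describe full-resolution images in the input queue rather than the resized model input.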

My config is not really altered in a big way at all:

# YOLO with Darknet19 configuration for MSCOCO Dataset.
# Users should configure the fine_tune_checkpoint field in the train config as
# well as the label_map_path and input_path fields in the train_input_reader and
# eval_input_reader. Search for "PATH_TO_BE_CONFIGURED" to find the fields that
# should be configured.

model {
  ssd {
    num_classes: 1
    box_coder {
      faster_rcnn_box_coder {
        y_scale: 10.0
        x_scale: 10.0
        height_scale: 5.0
        width_scale: 5.0
      }
    }
    matcher {
      argmax_matcher {
        matched_threshold: 0.5
        unmatched_threshold: 0.5
        ignore_thresholds: false
        negatives_lower_than_unmatched: true
        force_match_for_each_row: true
      }
    }
    similarity_calculator {
      iou_similarity {
      }
    }
    anchor_generator {
      yolo_anchor_generator { 
        anchors: [1.3221, 1.73145, 3.19275, 4.00944, 5.05587, 8.09892, 9.47112, 4.84053, 11.2364, 10.0071]
      }
    }
    image_resizer {
      fixed_shape_resizer {
        height: 416
        width: 416
      }
    }
    box_predictor {
      convolutional_box_predictor {
        min_depth: 0
        max_depth: 0
        num_layers_before_predictor: 0
        use_dropout: false
        dropout_keep_probability: 0.8
        kernel_size: 1
        box_code_size: 4
        apply_sigmoid_to_scores: false
        conv_hyperparams {
          activation: NONE,
          regularizer {
            l2_regularizer {
              weight: 0.00004
            }
          }
          initializer {
            truncated_normal_initializer {
              stddev: 0.03
              mean: 0.0
            }
          }
        }
      }
    }
    feature_extractor {
      type: 'yolo_v2_darknet_19'
      conv_hyperparams {
        activation: NONE,
        regularizer {
          l2_regularizer {
            weight: 0.00004
          }
        }
        initializer {
          truncated_normal_initializer {
            stddev: 0.03
            mean: 0.0
          }
        }
        batch_norm {
          train: true,
          scale: true,
          center: true,
          decay: 0.9997,
          epsilon: 0.001,
        }
      }
    }
    loss {
      classification_loss {
        confidence_weighted_sigmoid {
          anchorwise_output: true
          object_scale: 5.0,
          noobject_scale: 1.0
        }
      }
      localization_loss {
        weighted_l2 {
          anchorwise_output: true
        }
      }
      hard_example_miner {
        num_hard_examples: 3000
        iou_threshold: 0.99
        loss_type: CLASSIFICATION
        max_negatives_per_positive: 3
        min_negatives_per_image: 0
      }
      classification_weight: 1.0
      localization_weight: 1.0
    }
    normalize_loss_by_num_matches: true
    post_processing {
      batch_non_max_suppression {
        score_threshold: 1e-8
        iou_threshold: 0.6
        max_detections_per_class: 100
        max_total_detections: 100
      }
      score_converter: SIGMOID
    }
  }
}

train_config: {
  batch_size: 2
  optimizer {
    rms_prop_optimizer: {
      learning_rate: {
        exponential_decay_learning_rate {
          initial_learning_rate: 0.004
          decay_steps: 80072
          decay_factor: 0.95
        }
      }
      momentum_optimizer_value: 0.9
      decay: 0.9
      epsilon: 1.0
    }
  }
  fine_tune_checkpoint: "/data/tf_od_api/models/yolov2_with_darknet19/model.ckpt"
  from_detection_checkpoint: true
  # Note: The below line limits the training process to 200K steps, which we
  # empirically found to be sufficient enough to train the pets dataset. This
  # effectively bypasses the learning rate schedule (the learning rate will
  # never decay). Remove the below line to train indefinitely.
  #num_steps: 800000
  data_augmentation_options {
    random_horizontal_flip {}}

  data_augmentation_options {
    ssd_random_crop {}}
}

train_input_reader: {
  tf_record_input_reader {
    input_path: "/data/tf_od_api/data/all_data_final.record"
  }
  label_map_path: "/data/tf_od_api/data/pipes_map.pbtxt"
}

eval_config: {
  num_examples: 100
  num_visualizations: 100
}

eval_input_reader: {
 tf_record_input_reader {
    input_path: "/data/tf_od_api/data/test_big.record"
  }
  label_map_path: "/data/tf_od_api/data/pipes_map.pbtxt"
  shuffle: false
  num_readers: 1
  num_epochs: 1
}

And I used the checkpoint from rky

rky0930 commented 6 years ago

I think "ResourceExhaustedError" means out of memory.

The image_resizer config resizes your input to 416x416, so the input size is not the problem.

image_resizer { fixed_shape_resizer { height: 416 width: 416 } }

A batch_size of 2 is relatively small. Did you check the GPU memory usage before running this model (e.g. $ nvidia-smi)? I can run this model with batch_size 8 on my laptop, which has a GTX 1050 (4 GB memory).

ilkarman commented 6 years ago

@rky0930 thanks for the reply! My GPUs are free and available (11GB on 2xK80), the small batch was just to see if it would work at all.

Can you run this model on py35? I've tried re-uploading the files again from this repo, and changing line 114 in yolo_grid_anchor_generator.py to:

return [len(list(self._anchors))]

And get a different error:

 InvalidArgumentError (see above for traceback): Reduction axis 1 is empty in shape [8,0]
         [[Node: Loss/Match_1/cond/ArgMax_1 = ArgMax[T=DT_FLOAT, Tidx=DT_INT32, output_type=DT_INT64, _device="/job:localhost/replica:0/task:0/device:GPU:0"](Loss/Match_1/cond/Shape_2/Switch:1, Loss/Match_1/cond/range/delta)]]
         [[Node: Loss/ToInt32_2/_375 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_1884_Loss/ToInt32_2", tensor_type=DT_INT32, _device="/job:localhost/replica:0/task:0/device:CPU:0"]()]]

If I try py27 I get:

    from tensorflow.contrib.learn.python.learn import *
  File "/anaconda/lib/python2.7/site-packages/tensorflow/contrib/learn/python/learn/__init__.py", line 25, in <module>
    from tensorflow.contrib.learn.python.learn import estimators
  File "/anaconda/lib/python2.7/site-packages/tensorflow/contrib/learn/python/learn/estimators/__init__.py", line 297, in <module>
    from tensorflow.contrib.learn.python.learn.estimators.dnn import DNNClassifier
  File "/anaconda/lib/python2.7/site-packages/tensorflow/contrib/learn/python/learn/estimators/dnn.py", line 30, in <module>
    from tensorflow.contrib.learn.python.learn.estimators import dnn_linear_combined
  File "/anaconda/lib/python2.7/site-packages/tensorflow/contrib/learn/python/learn/estimators/dnn_linear_combined.py", line 31, in <module>
    from tensorflow.contrib.learn.python.learn.estimators import estimator
  File "/anaconda/lib/python2.7/site-packages/tensorflow/contrib/learn/python/learn/estimators/estimator.py", line 49, in <module>
    from tensorflow.contrib.learn.python.learn.learn_io import data_feeder
  File "/anaconda/lib/python2.7/site-packages/tensorflow/contrib/learn/python/learn/learn_io/__init__.py", line 21, in <module>
    from tensorflow.contrib.learn.python.learn.learn_io.dask_io import extract_dask_data
  File "/anaconda/lib/python2.7/site-packages/tensorflow/contrib/learn/python/learn/learn_io/dask_io.py", line 26, in <module>
    import dask.dataframe as dd
  File "/anaconda/lib/python2.7/site-packages/dask/dataframe/__init__.py", line 3, in <module>
    from .core import (DataFrame, Series, Index, _Frame, map_partitions,
  File "/anaconda/lib/python2.7/site-packages/dask/dataframe/core.py", line 40, in <module>
    pd.core.computation.expressions.set_use_numexpr(False)
AttributeError: 'module' object has no attribute 'expressions'

In both cases tf==1.4.0

rky0930 commented 6 years ago

I changed line 114 the same way you did and ran it using python2.7. There was no such error.

As far as I know, the "Reduction axis 1 is empty in shape [8,0]" error occurs when the tf.arg_max op receives a tensor that is empty along the specified axis. I suggest you check your record and pbtxt files again.
Or please share the files with me via email so I can try to reproduce the error.

I guess the python2.7 error is related to this post: https://stackoverflow.com/questions/43833081/attributeerror-module-object-has-no-attribute-computation

I hope your model works well. If the problem occurs again, please let me know :smiley:

iyeszin commented 6 years ago

I get the same error as @ilkarman, and I changed the function to return [len(list(self._anchors))]. However, another error then occurs:

Traceback (most recent call last):
  File "train.py", line 167, in <module>
    tf.app.run()
  File "C:\Users\Lenovo\AppData\Local\Programs\Python\Python36\lib\site-packages\tensorflow\python\platform\app.py", line 126, in run
    _sys.exit(main(argv))
  File "train.py", line 163, in main
    worker_job_name, is_chief, FLAGS.train_dir)
  File "C:\Users\Lenovo\Desktop\testTest\research\object_detection\trainer.py", line 246, in train
    clones = model_deploy.create_clones(deploy_config, model_fn, [input_queue])
  File "C:\Users\Lenovo\Desktop\testTest\research\slim\deployment\model_deploy.py", line 193, in create_clones
    outputs = model_fn(*args, **kwargs)
  File "C:\Users\Lenovo\Desktop\testTest\research\object_detection\trainer.py", line 179, in _create_losses
    prediction_dict = detection_model.predict(images, true_image_shapes)
  File "C:\Users\Lenovo\Desktop\testTest\research\object_detection\meta_architectures\ssd_meta_arch.py", line 341, in predict
    im_width=image_shape[2])
  File "C:\Users\Lenovo\Desktop\testTest\research\object_detection\core\anchor_generator.py", line 107, in generate
    anchors_list, feature_map_shape_list)]):
  File "C:\Users\Lenovo\Desktop\testTest\research\object_detection\core\anchor_generator.py", line 144, in _assert_correct_number_of_anchors
    self.num_anchors_per_location(), feature_map_shape_list, anchors_list):
TypeError: zip argument #3 must support iteration

I'm using py36 and running on GPU. I'm using custom data with 2 classes.

I have no idea how to solve this error. Please help. Many thanks!

rky0930 commented 6 years ago

Hi @iyeszin. This happens when you use the yolo code with the newest object detection api code. I fixed the bug by simply making the return value a list. Please try again: you just need to re-download "yolo_grid_anchor_generator.py" and run it again. If other bugs occur, please let me know.
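
A minimal sketch of that failure mode, using hypothetical stand-ins rather than the real object detection classes: `zip` raises `TypeError` when one of its arguments is a single non-iterable object, and wrapping that object in a list avoids it. `FakeBoxList`, the `[5]` anchor count, and the 13x13 feature map shape are illustrative assumptions only.

```python
class FakeBoxList(object):
    """Hypothetical stand-in for a single, non-iterable anchors object."""


num_anchors_per_location = [5]       # illustrative value
feature_map_shape_list = [(13, 13)]  # illustrative value
single_anchors = FakeBoxList()

try:
    # Third argument is not iterable, so zip refuses it.
    list(zip(num_anchors_per_location, feature_map_shape_list, single_anchors))
    raised = False
except TypeError:
    raised = True

# Returning the anchors as a one-element list makes all three arguments iterable.
rows = list(zip(num_anchors_per_location, feature_map_shape_list, [single_anchors]))

print(raised, len(rows))
```

The exact wording of the `TypeError` message differs between Python versions, so the sketch only checks the exception type.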

xiaowenhe commented 5 years ago

@iyeszin @rky0930 I also got the "Reduction axis 1 is empty in shape [2,0]" error! I fixed it like this:

Comment out this line in yolo_grid_anchor_generator.py (line 86):

  self._anchors = zip(*[iter(anchors)]*2)

and use the following instead:

  self._anchors = []
  for i in range(0, len(anchors), 2):
    self._anchors.append([anchors[i], anchors[i+1]])
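
The same idea as a standalone sketch that behaves identically on Python 2 and 3, because it returns a real list rather than a lazy `zip` object. The helper name `pair_anchors` is hypothetical and not part of the repo; the values are the anchors from the config posted earlier in this thread.

```python
def pair_anchors(flat_anchors):
    """Group a flat [a0, b0, a1, b1, ...] list into a list of [a, b] pairs."""
    if len(flat_anchors) % 2 != 0:
        raise ValueError("expected an even number of anchor values")
    # Slicing pairs even-indexed values with the odd-indexed values after them.
    return [list(pair) for pair in zip(flat_anchors[0::2], flat_anchors[1::2])]


# Anchor values from the pipeline config in this thread.
anchors = [1.3221, 1.73145, 3.19275, 4.00944, 5.05587,
           8.09892, 9.47112, 4.84053, 11.2364, 10.0071]

paired = pair_anchors(anchors)
print(len(paired))
print(paired[0], paired[-1])
```

Because the result is a list, `len(paired)` works and the pairs can be iterated any number of times without being exhausted.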