tensorflow / models

Models and examples built with TensorFlow

LSTM Object Detection Model Does Not Run #6253

Open wrkgm opened 5 years ago

wrkgm commented 5 years ago

System information

Describe the problem

Training the LSTM object detection model does not work. After making a tfrecord, modifying the config as necessary, creating a training dir, and running the command, I get this error:

tensorflow.python.framework.errors_impl.InvalidArgumentError: Tried to explicitly squeeze dimension 0 but dimension was not 1: 0
         [[Node: Squeeze_1 = Squeeze[T=DT_INT64, squeeze_dims=[0], _device="/job:localhost/replica:0/task:0/device:CPU:0"](split_2)]]

More documentation, including a simple example file showing how to make a tfrecord and train, would be very helpful. I have tried two ways to create a tfrecord, both of which are shown below. I thought maybe the record structure was wrong, but if I put a typo in the record keys I get a different error complaining about that, so perhaps I structured the records correctly. I tried looking at the model in TensorBoard and modifying the training code in slim/learning.py to fetch values from individual nodes near Squeeze_1, printing each node, its output, and the shape of the output. Here are the results from these attempts:

try run:  split_1:0
Tensor("split_1:0", shape=(?, ?, 4), dtype=float32, device=/device:CPU:0)
value: []
size: (0, 0, 4)

try run:  ParseSingleSequenceExample/ParseSingleSequenceExample:0
Tensor("ParseSingleSequenceExample/ParseSingleSequenceExample:0", shape=(), dtype=string, device=/device:CPU:0)
value: b''
size: ()

try run:  ResizeImage/resize_images/ResizeBilinear:0
Tensor("ResizeImage/resize_images/ResizeBilinear:0", shape=(4, 256, 256, 3), dtype=float32, device=/device:CPU:0)
value: (big numpy array)
size: (4, 256, 256, 3)
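
For reference, each fetch was done roughly like this (a sketch of the debug edit to slim/learning.py; sess is the training session inside train_step, and the exact edit isn't shown here):

# Hypothetical debug hook: fetch a graph tensor by the name shown in
# TensorBoard and inspect its value and shape.
import numpy as np

name = 'split_1:0'
tensor = sess.graph.get_tensor_by_name(name)
print('try run: ', name)
print(tensor)
value = sess.run(tensor)
print('value:', value)
print('size:', np.shape(value))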

It seems that split_1 and ParseSingleSequenceExample are not actually receiving any data, which would cause this squeeze error, since there is nothing to squeeze. But ResizeImage still gets data. Additionally, if I fetch ONLY ResizeImage/resize_images/ResizeBilinear:0, I can fetch it a couple of times (repeatedly fetching in a loop) before it fails. Perhaps the pipeline fails after one batch?

I'm not sure if this counts as a duplicate, but here are some related threads: https://github.com/tensorflow/models/issues/6027 https://github.com/tensorflow/models/issues/5869 https://stackoverflow.com/questions/54093931/lstm-object-detection-tensorflow

I've also emailed the authors and heard nothing back.

EDIT:

I should mention that I removed ssd_random_crop from the data augmentation options in the config because it was giving me the error "the function ssd_random_crop requires argument groundtruth_weights". I'm not sure if this matters at all.

Source code / logs

I tried two ways of creating tfrecords. The first was taken from tf_sequence_example_decoder_test.py in this repo; the only change was switching to sequences of length 4 to match the config file.

import numpy as np
import tensorflow as tf
from tensorflow.core.example import example_pb2
from tensorflow.core.example import feature_pb2

# 'path' is the output tfrecord path, e.g. 'train.tfrecord'
writer = tf.python_io.TFRecordWriter(path)
with tf.Session() as sess:
    for _ in range(2000):
        image_tensor = np.random.randint(255, size=(16, 16, 3)).astype(np.uint8)
        print(image_tensor)

        encoded_jpeg = tf.image.encode_jpeg(tf.constant(image_tensor)).eval()

        sequence_example = example_pb2.SequenceExample(
            context=feature_pb2.Features(
                feature={
                    'image/format':
                        feature_pb2.Feature(
                            bytes_list=feature_pb2.BytesList(
                                value=['jpeg'.encode('utf-8')])),
                    'image/height':
                        feature_pb2.Feature(
                            int64_list=feature_pb2.Int64List(value=[16])),
                    'image/width':
                        feature_pb2.Feature(
                            int64_list=feature_pb2.Int64List(value=[16])),
                }),
            feature_lists=feature_pb2.FeatureLists(
                feature_list={
                    'image/encoded':
                        feature_pb2.FeatureList(feature=[
                            feature_pb2.Feature(
                                bytes_list=feature_pb2.BytesList(
                                    value=[encoded_jpeg])), feature_pb2.Feature(
                                bytes_list=feature_pb2.BytesList(
                                    value=[encoded_jpeg])), feature_pb2.Feature(
                                bytes_list=feature_pb2.BytesList(
                                    value=[encoded_jpeg])), feature_pb2.Feature(
                                bytes_list=feature_pb2.BytesList(
                                    value=[encoded_jpeg]))
                        ]),
                    'image/object/bbox/xmin':
                        feature_pb2.FeatureList(feature=[
                            feature_pb2.Feature(
                                float_list=feature_pb2.FloatList(value=[0.0])),
                            feature_pb2.Feature(
                                float_list=feature_pb2.FloatList(value=[0.0])),
                            feature_pb2.Feature(
                                float_list=feature_pb2.FloatList(value=[0.0])),
                            feature_pb2.Feature(
                                float_list=feature_pb2.FloatList(value=[0.0]))
                        ]),
                    'image/object/bbox/xmax':
                        feature_pb2.FeatureList(feature=[
                            feature_pb2.Feature(
                                float_list=feature_pb2.FloatList(value=[1.0])),
                            feature_pb2.Feature(
                                float_list=feature_pb2.FloatList(value=[1.0])),
                            feature_pb2.Feature(
                                float_list=feature_pb2.FloatList(value=[1.0])),
                            feature_pb2.Feature(
                                float_list=feature_pb2.FloatList(value=[1.0]))
                        ]),
                    'image/object/bbox/ymin':
                        feature_pb2.FeatureList(feature=[
                            feature_pb2.Feature(
                                float_list=feature_pb2.FloatList(value=[0.0])),
                            feature_pb2.Feature(
                                float_list=feature_pb2.FloatList(value=[0.0])),
                            feature_pb2.Feature(
                                float_list=feature_pb2.FloatList(value=[0.0])),
                            feature_pb2.Feature(
                                float_list=feature_pb2.FloatList(value=[0.0]))
                        ]),
                    'image/object/bbox/ymax':
                        feature_pb2.FeatureList(feature=[
                            feature_pb2.Feature(
                                float_list=feature_pb2.FloatList(value=[1.0])),
                            feature_pb2.Feature(
                                float_list=feature_pb2.FloatList(value=[1.0])),
                            feature_pb2.Feature(
                                float_list=feature_pb2.FloatList(value=[1.0])),
                            feature_pb2.Feature(
                                float_list=feature_pb2.FloatList(value=[1.0]))
                        ]),
                    'image/object/class/label':
                        feature_pb2.FeatureList(feature=[
                            feature_pb2.Feature(
                                int64_list=feature_pb2.Int64List(value=[1])),
                            feature_pb2.Feature(
                                int64_list=feature_pb2.Int64List(value=[1])),
                            feature_pb2.Feature(
                                int64_list=feature_pb2.Int64List(value=[1])),
                            feature_pb2.Feature(
                                int64_list=feature_pb2.Int64List(value=[1]))
                        ]),
                }))

        writer.write(sequence_example.SerializeToString())
writer.close()

I also tried adapting a method I found here: https://github.com/wakanda-ai/tf-detectors. For this I used a couple of sample XML files in PASCAL VOC format from a training set I have for one of the normal object_detection models.

    # Snippet from a larger conversion function; assumed imports:
    #   import io, hashlib
    #   import tensorflow as tf
    #   from PIL import Image
    #   from object_detection.utils import dataset_util
    # Assumes folder, height, width and the per-frame accumulator lists
    # (filenames, encodeds, sources, keys, formats, xmins, ymins, xmaxs,
    # ymaxs, names, occludeds, generateds, class_labels) are defined earlier.
    # Iterate frames
    for data, img_path in zip(dicts, imgs_path):
        ## open single frame
        with tf.gfile.FastGFile(img_path, 'rb') as fid:
            encoded_jpg = fid.read()
        encoded_jpg_io = io.BytesIO(encoded_jpg)
        image = Image.open(encoded_jpg_io)
        if image.format != 'JPEG':
            raise ValueError('Image format not JPEG')
        key = hashlib.sha256(encoded_jpg).hexdigest()

        ## validation
        assert int(data['size']['height']) == height
        assert int(data['size']['width']) == width

        ## iterate objects
        xmin, ymin = [], []
        xmax, ymax = [], []
        name = []
        classval =  []
        occluded = []
        generated = []
        if 'object' in data:
            for obj in data['object']:
                xmin.append(float(obj['bndbox']['xmin']) / width)
                ymin.append(float(obj['bndbox']['ymin']) / height)
                xmax.append(float(obj['bndbox']['xmax']) / width)
                ymax.append(float(obj['bndbox']['ymax']) / height)
                name.append(obj['name'].encode('utf8'))
                classval.append(1)
                occluded.append(0)
                generated.append(0)
        else:
            xmin.append(float(-1))
            ymin.append(float(-1))
            xmax.append(float(-1))
            ymax.append(float(-1))
            name.append('NoObject'.encode('utf8'))
            classval.append(1)
            occluded.append(0)
            generated.append(0)

        ## append tf_feature to list
        filenames.append(dataset_util.bytes_feature(data['filename'].encode('utf8')))
        encodeds.append(dataset_util.bytes_feature(encoded_jpg))
        sources.append(dataset_util.bytes_feature(data['source']['database'].encode('utf8')))
        keys.append(dataset_util.bytes_feature(key.encode('utf8')))
        formats.append(dataset_util.bytes_feature('jpeg'.encode('utf8')))
        xmins.append(dataset_util.float_list_feature(xmin))
        ymins.append(dataset_util.float_list_feature(ymin))
        xmaxs.append(dataset_util.float_list_feature(xmax))
        ymaxs.append(dataset_util.float_list_feature(ymax))
        names.append(dataset_util.bytes_list_feature(name))
        occludeds.append(dataset_util.int64_list_feature(occluded))
        generateds.append(dataset_util.int64_list_feature(generated))
        class_labels.append(dataset_util.int64_list_feature(classval))

    # Non sequential features
    context = tf.train.Features(feature={
        'video/folder': dataset_util.bytes_feature(folder.encode('utf8')),
        'video/frame_number': dataset_util.int64_feature(len(imgs_path)),
        'video/height': dataset_util.int64_feature(height),
        'video/width': dataset_util.int64_feature(width),
        })
    # Sequential features
    tf_feature_lists = {
        'image/filename': tf.train.FeatureList(feature=filenames),
        'image/encoded': tf.train.FeatureList(feature=encodeds),
        'image/sources': tf.train.FeatureList(feature=sources),
        'image/key/sha256': tf.train.FeatureList(feature=keys),
        'image/format': tf.train.FeatureList(feature=formats),
        'image/object/bbox/xmin': tf.train.FeatureList(feature=xmins),
        'image/object/bbox/xmax': tf.train.FeatureList(feature=xmaxs),
        'image/object/bbox/ymin': tf.train.FeatureList(feature=ymins),
        'image/object/bbox/ymax': tf.train.FeatureList(feature=ymaxs),
        'image/object/class/text': tf.train.FeatureList(feature=names),
        'image/object/class/label': tf.train.FeatureList(feature=class_labels),
        'image/object/occluded': tf.train.FeatureList(feature=occludeds),
        'image/object/generated': tf.train.FeatureList(feature=generateds),
        }
    feature_lists = tf.train.FeatureLists(feature_list=tf_feature_lists)
    # Make single sequence example
    tf_example = tf.train.SequenceExample(context=context, feature_lists=feature_lists)
    return tf_example
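
The snippet returns the SequenceExample rather than writing it; the elided driver would look roughly like this (a sketch; output_path, snippets, and make_sequence_example are assumed names, not from the linked repo):

# Assumed driver: write one SequenceExample per multi-frame snippet.
writer = tf.python_io.TFRecordWriter(output_path)
for dicts, imgs_path in snippets:
    tf_example = make_sequence_example(dicts, imgs_path)
    writer.write(tf_example.SerializeToString())
writer.close()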

Tfrecords created with both of these approaches yielded identical errors.

whasyt commented 5 years ago

I've hit the same "squeeze" issue as @wrkgm.

I tried commenting out the data augmentation options in the config:

#  data_augmentation_options {
#    random_horizontal_flip {
#    }
#  }
#  data_augmentation_options {
#    ssd_random_crop {
#      groundtruth_weights: 1.0
#    }
#  }

and then hit this error:

root@747-Super-Server:~/tensorflow/models/research# python lstm_object_detection/train.py --train_dir=/home1/lstmDetection/model --pipeline_config_path=lstm_object_detection/configs/lstm_ssd_mobilenet_v1_imagenet.config
.....
tensorflow.python.framework.errors_impl.InvalidArgumentError: assertion failed: [All sequence lengths must match, but received lengths: 0 All sequence lengths must match, but received lengths: 0 All sequence lengths must match, but received lengths: 1]
         [[Node: batch_sequences_with_states/Assert_2/Assert = Assert[T=[DT_STRING], summarize=3, _device="/job:localhost/replica:0/task:0/device:CPU:0"](batch_sequences_with_states/Equal_2, batch_sequences_with_states/StringJoin)]]
ashkanee commented 5 years ago

System information

  • What is the top-level directory of the model you are using: lstm_object_detection
  • Have I written custom code (as opposed to using a stock example script provided in TensorFlow): Trying to
  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Windows 10
  • TensorFlow installed from (source or binary): source
  • TensorFlow version (use command below): 1.12.0
  • Bazel version (if compiling from source):
  • CUDA/cuDNN version: 9.0
  • GPU model and memory: GTX 1070 ti
  • Exact command to reproduce: python train.py --train_dir=training --pipeline_config_path=configs/lstm_ssd_mobilenet_v1_imagenet.config

[The body of this comment repeats @wrkgm's original report verbatim, including both tfrecord-creation scripts; see above.]

@wrkgm I used this code to convert ImageNet VID 2015 to tfrecord, but got a different error:

tensorflow.python.framework.errors_impl.InvalidArgumentError: Can not squeeze dim[0], expected a dimension of 1, got 0 [[{{node Squeeze_1}} = Squeeze[T=DT_INT64, squeeze_dims=[0], _device="/job:localhost/replica:0/task:0/device:CPU:0"](split_2)]]

ashkanee commented 5 years ago

One important point: I also faced an error saying:

"the function ssd_random_crop requires argument groundtruth_weights"

I fixed it by adding the "groundtruth_weights" argument and setting it to "None".

Did you face a similar error? I suspect the current error might be related to it.

wrkgm commented 5 years ago

@ashkanee Good point! I forgot that I faced this error as well. I simply removed ssd_random_crop entirely from the data augmentation options. I'll edit the main post to mention this, thanks.

ashkanee commented 5 years ago

@ashkanee Good point! I forgot that I faced this error as well. I simply removed ssd_random_crop entirely from the data augmentation options. I'll edit the main post to mention this, thanks.

I added the groundtruth_weights for ssd_random_crop, which fixed that bug, but I got the same error as you. Based on this, I guess that may not be the source of the problem.

ashkanee commented 5 years ago

@dreamdragon Also, one more observation:

1. If I use the checkpoint, I get the following error:

tensorflow.python.framework.errors_impl.InvalidArgumentError: Number of ways to split should evenly divide the split dimension, but got split_dim 0 (size = 10) and num_split 4 [[{{node split}} = Split[T=DT_UINT8, num_split=4, _device="/job:localhost/replica:0/task:0/device:CPU:0"](batch_sequences_with_states/InputQueueingStateSaver/ExpandDims_1/dim, map/TensorArrayStack/TensorArrayGatherV3)]]

which seems related to GPU issues.

2. If I do not use the checkpoint, I get a similar error to @wrkgm's:

tensorflow.python.framework.errors_impl.InvalidArgumentError: Can not squeeze dim[0], expected a dimension of 1, got 0 [[{{node Squeeze_1}} = Squeeze[T=DT_INT64, squeeze_dims=[0], _device="/job:localhost/replica:0/task:0/device:CPU:0"](split_2)]]

Based on the above, it seems your issue may relate to the missing checkpoint.

Important point: do you get warnings about deprecated functions? I do. Maybe the error is caused by deprecated functions in TensorFlow; this is probably worth looking into.

@wrkgm Can I ask you the following:

1. Can you please check what happens if you add the checkpoint? I use the checkpoint weights here

2. Could you explain how you used tf_sequence_example_decoder_test.py to generate tfrecord files? Does it produce separate training and validation files? As written, it mixes the training and validation data.

Thanks!

Edit: I am training on one GPU (Tesla V100-SXM2-16GB); it seemed this issue might relate to training with more than one GPU. There is more information here: #266

Update: the errors most probably are not related to the GPU.

Edit: Last observation: the errors are not related to the checkpoint; based on my experiments they come from the preprocessing/data augmentation part, since empty tensors are passed on. This may be caused by the way the data are converted to tfrecord.

ashkanee commented 5 years ago

@wrkgm Do you get the warning:

WARNING:tensorflow:From /home/ashkan/models/research/object_detection/core/preprocessor.py:1218: calling squeeze (from tensorflow.python.ops.array_ops) with squeeze_dims is deprecated and will be removed in a future version. Instructions for updating: Use the axis argument instead

I guess it may help.

Edit: Fixing this still results in the error.

ashkanee commented 5 years ago

I looked into the tensors in the file seq_dataset_builder.py and noticed that the following tensors are empty:

tensor_dict['groundtruth_boxes']
tensor_dict['groundtruth_classes']

In addition, I get one of the following errors each time (they appear in random order, which seems related to the randomness of the data augmentation part, since they go away if you comment out the data augmentation in the config file):

tensorflow.python.framework.errors_impl.InvalidArgumentError: Can not squeeze dim[0], expected a dimension of 1, got 0 [[{{node Squeeze_10}} = Squeeze[T=DT_INT64, squeeze_dims=[0], _device="/job:localhost/replica:0/task:0/device:CPU:0"](strided_slice_7)]]

tensorflow.python.framework.errors_impl.InvalidArgumentError: Number of ways to split should evenly divide the split dimension, but got split_dim 0 (size = 10) and num_split 4 [[{{node split}} = Split[T=DT_UINT8, num_split=4, _device="/job:localhost/replica:0/task:0/device:CPU:0"](batch_sequences_with_states/InputQueueingStateSaver/ExpandDims_1/dim, Print)]]

There are subsequent ops such as tf.split and tf.squeeze, and my understanding is that the errors are caused because empty tensors are being split or squeezed.

wrkgm commented 5 years ago

@ashkanee I just modified the code from that file to generate a tfrecord full of randomly generated numpy arrays and pointed the train path in the config at it. I haven't worried about splitting train and test yet; I'm just trying to get the code to run at all. I tried with and without a checkpoint and got similar errors.

It's almost certainly not related to the GPU; I've got a GTX 1070 Ti. And I get that same warning. I agree it likely has something to do with the input pipeline or preprocessing steps.

Aaronreb commented 5 years ago

While running the train.py file from lstm_object_detection, I get the following error:

Traceback (most recent call last):
  File "lstm_object_detection/train.py", line 185, in <module>
    tf.app.run()
  File "/home/kt-ml1/.local/lib/python3.6/site-packages/tensorflow/python/platform/app.py", line 125, in run
    _sys.exit(main(argv))
  File "lstm_object_detection/train.py", line 94, in main
    FLAGS.pipeline_config_path)
  File "/home/kt-ml1/models-master/models-master/research/lstm_object_detection/utils/config_util.py", line 46, in get_configs_from_pipeline_file
    text_format.Merge(proto_str, pipeline_config)
  File "/home/kt-ml1/.local/lib/python3.6/site-packages/google/protobuf/text_format.py", line 574, in Merge
    descriptor_pool=descriptor_pool)
  File "/home/kt-ml1/.local/lib/python3.6/site-packages/google/protobuf/text_format.py", line 631, in MergeLines
    return parser.MergeLines(lines, message)
  File "/home/kt-ml1/.local/lib/python3.6/site-packages/google/protobuf/text_format.py", line 654, in MergeLines
    self._ParseOrMerge(lines, message)
  File "/home/kt-ml1/.local/lib/python3.6/site-packages/google/protobuf/text_format.py", line 676, in _ParseOrMerge
    self._MergeField(tokenizer, message)
  File "/home/kt-ml1/.local/lib/python3.6/site-packages/google/protobuf/text_format.py", line 735, in _MergeField
    'that message\'s _pb2 module must be imported as well' % name)
google.protobuf.text_format.ParseError: 18:26 : Extension "object_detection.protos.lstm_model" not registered. Did you import the _pb2 module which defines it? If you are trying to place the extension in the MessageSet field of another message that is in an Any or MessageSet field, that message's _pb2 module must be imported as well

I figured out that there is something wrong with the config file in lstm_object_detection. Can someone please help me understand what changes we need to make in the config file to run it successfully?

dreamdragon commented 5 years ago

Replace object_detection.protos.lstm_model with lstm_object_detection.protos.lstm_model in the config.

We will fix this issue in the codebase shortly.
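
For example, the extension stanza in the pipeline config changes roughly like this (a sketch; the unroll-length fields are assumed from the shipped lstm_ssd_mobilenet_v1_imagenet.config):

# before (fails to parse with the "not registered" error):
[object_detection.protos.lstm_model] {
  train_unroll_length: 4
  eval_unroll_length: 4
}
# after:
[lstm_object_detection.protos.lstm_model] {
  train_unroll_length: 4
  eval_unroll_length: 4
}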

Aaronreb commented 5 years ago

Replace object_detection.protos.lstm_model with lstm_object_detection.protos.lstm_model in the config.

We will fix this issue in the codebase shortly.

Done, thanks. But this got me another error:

TypeError: Expected binary or unicode string, got <object_detection.core.matcher.Match object at 0x7f5b089379b0>

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "lstm_object_detection/train.py", line 185, in <module>
    tf.app.run()
  File "/home/kt-ml1/.local/lib/python3.6/site-packages/tensorflow/python/platform/app.py", line 125, in run
    _sys.exit(main(argv))
  File "lstm_object_detection/train.py", line 181, in main
    worker_job_name, is_chief, FLAGS.train_dir)
  File "/home/kt-ml1/models-master/models-master/research/lstm_object_detection/trainer.py", line 293, in train
    clones = model_deploy.create_clones(deploy_config, model_fn, [input_queue])
  File "/home/kt-ml1/models-master/models-master/research/slim/deployment/model_deploy.py", line 193, in create_clones
    outputs = model_fn(*args, **kwargs)
  File "/home/kt-ml1/models-master/models-master/research/lstm_object_detection/trainer.py", line 174, in _create_losses
    losses_dict = detection_model.loss(prediction_dict, true_image_shapes)
  File "/home/kt-ml1/models-master/models-master/research/lstm_object_detection/meta_architectures/lstm_ssd_meta_arch.py", line 165, in loss
    match_list = [matcher.Match(match) for match in tf.unstack(batch_match)]
  File "/home/kt-ml1/.local/lib/python3.6/site-packages/tensorflow/python/ops/array_ops.py", line 1149, in unstack
    value = ops.convert_to_tensor(value)
  File "/home/kt-ml1/.local/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 1039, in convert_to_tensor
    return convert_to_tensor_v2(value, dtype, preferred_dtype, name)
  File "/home/kt-ml1/.local/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 1097, in convert_to_tensor_v2
    as_ref=False)
  File "/home/kt-ml1/.local/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 1175, in internal_convert_to_tensor
    ret = conversion_func(value, dtype=dtype, name=name, as_ref=as_ref)
  File "/home/kt-ml1/.local/lib/python3.6/site-packages/tensorflow/python/framework/constant_op.py", line 304, in _constant_tensor_conversion_function
    return constant(v, dtype=dtype, name=name)
  File "/home/kt-ml1/.local/lib/python3.6/site-packages/tensorflow/python/framework/constant_op.py", line 245, in constant
    allow_broadcast=True)
  File "/home/kt-ml1/.local/lib/python3.6/site-packages/tensorflow/python/framework/constant_op.py", line 283, in _constant_impl
    allow_broadcast=allow_broadcast))
  File "/home/kt-ml1/.local/lib/python3.6/site-packages/tensorflow/python/framework/tensor_util.py", line 562, in make_tensor_proto
    "supported type." % (type(values), values))
TypeError: Failed to convert object of type <class 'list'> to Tensor.
Contents: [<object_detection.core.matcher.Match object at 0x7f5b089379b0>, <object_detection.core.matcher.Match object at 0x7f5b087b75c0>, <object_detection.core.matcher.Match object at 0x7f5b08626fd0>, <object_detection.core.matcher.Match object at 0x7f5b084a9b00>, <object_detection.core.matcher.Match object at 0x7f5b083302b0>, <object_detection.core.matcher.Match object at 0x7f5b0819a6a0>, <object_detection.core.matcher.Match object at 0x7f5b03fe2710>, <object_detection.core.matcher.Match object at 0x7f5b03e672b0>, <object_detection.core.matcher.Match object at 0x7f5b03cd36a0>, <object_detection.core.matcher.Match object at 0x7f5b03b53710>, <object_detection.core.matcher.Match object at 0x7f5b039d92b0>, <object_detection.core.matcher.Match object at 0x7f5b038456a0>, <object_detection.core.matcher.Match object at 0x7f5b036ca710>, <object_detection.core.matcher.Match object at 0x7f5b0354e2b0>, <object_detection.core.matcher.Match object at 0x7f5b034396a0>, <object_detection.core.matcher.Match object at 0x7f5b032bb710>, <object_detection.core.matcher.Match object at 0x7f5b0313f2b0>, <object_detection.core.matcher.Match object at 0x7f5b02faf6a0>, <object_detection.core.matcher.Match object at 0x7f5b02e315f8>, <object_detection.core.matcher.Match object at 0x7f5b02cb32b0>, <object_detection.core.matcher.Match object at 0x7f5b02b226a0>, <object_detection.core.matcher.Match object at 0x7f5b029a5710>, <object_detection.core.matcher.Match object at 0x7f5b028282b0>, <object_detection.core.matcher.Match object at 0x7f5b026936a0>, <object_detection.core.matcher.Match object at 0x7f5b02518710>, <object_detection.core.matcher.Match object at 0x7f5b0239c2b0>, <object_detection.core.matcher.Match object at 0x7f5b022096a0>, <object_detection.core.matcher.Match object at 0x7f5b0208f710>, <object_detection.core.matcher.Match object at 0x7f5b01f0f2b0>, <object_detection.core.matcher.Match object at 0x7f5b01dfc6a0>, <object_detection.core.matcher.Match object at 0x7f5b01c03710>, <object_detection.core.matcher.Match object at 0x7f5b01a842b0>]. Consider casting elements to a supported type.

PS: I commented out the data augmentation options in the config file because they were giving me the groundtruth_weights error.

As @ashkanee said, the problem seems to come from the tensors in seq_dataset_builder.py, as the tensors are empty.

Any help would be appreciated.

mswarrow commented 5 years ago

I think I've found a solution. When trying to understand the scripts, I noticed that keys to features were specified in TFSequenceExampleDecoder and they were different from those in DatasetBuilderTest (seq_dataset_builder_test.py). So, I used a script for creating tf records similar to @wrkgm's first version, but replaced image/object/bbox/... with bbox/... and image/object/class/label with bbox/label/index.
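
For illustration, a sketch of the renamed feature lists, adapted from the first script above (encoded_jpeg as in that script; the small helper functions are added here for brevity and are not part of the original):

# Per-frame feature lists with the keys TFSequenceExampleDecoder expects:
# 'bbox/...' instead of 'image/object/bbox/...', and 'bbox/label/index'
# instead of 'image/object/class/label'.
def float_feature(v):
    return feature_pb2.Feature(float_list=feature_pb2.FloatList(value=[v]))

def int64_feature(v):
    return feature_pb2.Feature(int64_list=feature_pb2.Int64List(value=[v]))

unroll_length = 4  # sequence length from the config
feature_lists = feature_pb2.FeatureLists(feature_list={
    'image/encoded': feature_pb2.FeatureList(feature=[
        feature_pb2.Feature(bytes_list=feature_pb2.BytesList(
            value=[encoded_jpeg])) for _ in range(unroll_length)]),
    'bbox/xmin': feature_pb2.FeatureList(
        feature=[float_feature(0.0)] * unroll_length),
    'bbox/xmax': feature_pb2.FeatureList(
        feature=[float_feature(1.0)] * unroll_length),
    'bbox/ymin': feature_pb2.FeatureList(
        feature=[float_feature(0.0)] * unroll_length),
    'bbox/ymax': feature_pb2.FeatureList(
        feature=[float_feature(1.0)] * unroll_length),
    'bbox/label/index': feature_pb2.FeatureList(
        feature=[int64_feature(1)] * unroll_length),
})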

Aaronreb commented 5 years ago

I think I've found a solution. When trying to understand the scripts, I noticed that keys to features were specified in TFSequenceExampleDecoder and they were different from those in DatasetBuilderTest (seq_dataset_builder_test.py). So, I used a script for creating tf records similar to @wrkgm's first version, but replaced image/object/bbox/... with bbox/... and image/object/class/label with bbox/label/index.

Did you successfully train the model?

mswarrow commented 5 years ago

Well, not yet. I've just solved this tensorflow.python.framework.errors_impl.InvalidArgumentError and verified that it started to train (iterations shown in tensorboard)

mswarrow commented 5 years ago

I do plan to train a model on my dataset - shall inform you here of the results

Aaronreb commented 5 years ago

I do plan to train a model on my dataset - shall inform you here of the results

On what dataset did you make the tfrecords?

mswarrow commented 5 years ago

It's a specific dataset I use for my project. I can't tell the details, but in general, it is a set of annotated (boxes + labels) images organised in folders - one folder for one sequence. I just take each sequence, split it into snippets of length 4 (as specified in lstm_ssd_mobilenet_v1_imagenet.config) and convert each snippet into TF SequenceExample.
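
To make the chunking concrete, here is a hypothetical sketch of such a conversion script (make_sequence_example stands in for a SequenceExample builder like the ones shown above):

import glob
import os

import tensorflow as tf

unroll_length = 4  # matches lstm_ssd_mobilenet_v1_imagenet.config
writer = tf.python_io.TFRecordWriter('train.tfrecord')
for seq_dir in sorted(glob.glob('sequences/*')):
    frames = sorted(glob.glob(os.path.join(seq_dir, '*.jpg')))
    # Split the sequence into consecutive snippets of unroll_length frames
    # and write one SequenceExample per snippet.
    for i in range(0, len(frames) - unroll_length + 1, unroll_length):
        writer.write(make_sequence_example(
            frames[i:i + unroll_length]).SerializeToString())
writer.close()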

Aaronreb commented 5 years ago

It's a specific dataset I use for my project. I can't tell the details, but in general, it is a set of annotated (boxes + labels) images organised in folders - one folder for one sequence. I just take each sequence, split it into snippets of length 4 (as specified in lstm_ssd_mobilenet_v1_imagenet.config) and convert each snippet into TF SequenceExample.

Can we use the same tfrecords we used for object detection? When using those tfrecords, I'm getting the following issue:

tensorflow.python.framework.errors_impl.InvalidArgumentError: Name: , Feature list 'image/encoded' is required but could not be found. Did you mean to include it in feature_list_dense_missing_assumed_empty or feature_list_dense_defaults? [[{{node ParseSingleSequenceExample/ParseSingleSequenceExample}}]]

mswarrow commented 5 years ago

I haven't worked with object detection tfrecords, but I assume they are not SequenceExamples, but just Examples, and they don't have the required feature lists like image/encoded, bbox/xmin, etc.

Aaronreb commented 5 years ago

I haven't worked with object detection tfrecords, but I assume they are not SequenceExamples, but just Examples, and they don't have the required feature lists like image/encoded, bbox/xmin, etc.

So what do we have to edit in the LSTM config file? What do we have to put in the input path?

mswarrow commented 5 years ago

I'm not fully sure we can use object detection tfrecords for training lstm object detection. If I'm right, the seq_dataset_builder.py script wants SequenceExamples for training. Their length can be configured in the lstm config file, but you can't replace them with Examples (and I assume images in object detection datasets are stored independently as Examples) unless you modify the dataset builder itself. I may be wrong, of course, as I don't know the exact object detection tfrecord format :)

Aaronreb commented 5 years ago

I'm not fully sure we can use object detection tfrecords for training lstm object detection. If I'm right, the seq_dataset_builder.py script wants SequenceExamples for training. Their length can be configured in the lstm config file, but you can't replace them with Examples (and I assume images in object detection datasets are stored independently as Examples) unless you modify the dataset builder itself. I may be wrong, of course, as I don't know the exact object detection tfrecord format :)

Yeah, I feel I went for the wrong approach. So what exactly should "path/to/sequence_example/data" be in the config file?

mswarrow commented 5 years ago

I just have a single train.tfrecord file (path/to/train.tfrecord), which is a collection of tf.train.SequenceExamples written one after another. Each tf.train.SequenceExample has a context with the features image/format, image/height and image/width, and feature lists with features named image/encoded, bbox/xmin, ..., bbox/ymax and bbox/label/index. See the first tfrecord-creating script from @wrkgm's post; the only difference is in the feature names.
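
A quick way to sanity-check such a file is to parse one record back and list its keys (a sketch using standard TF 1.x APIs):

import tensorflow as tf

record = next(tf.python_io.tf_record_iterator('path/to/train.tfrecord'))
seq = tf.train.SequenceExample()
seq.ParseFromString(record)
print(sorted(seq.context.feature.keys()))             # image/format, image/height, image/width
print(sorted(seq.feature_lists.feature_list.keys()))  # bbox/..., image/encoded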

Aaronreb commented 5 years ago

I just have a single train.tfrecord file (path/to/train.tfrecord), which is a collection of tf.train.SequenceExamples written one after another. Each tf.train.SequenceExample has a context with the features image/format, image/height and image/width, and feature lists with features named image/encoded, bbox/xmin, ..., bbox/ymax and bbox/label/index. See the first tfrecord-creating script from @wrkgm's post; the only difference is in the feature names.

I am trying to use tfrecords made from the OID dataset. Do we have to make tfrecords from the VID dataset?

spaul13 commented 5 years ago

I have used the pets_example.record from object_detection/test_data/ and am also getting the same error. @mswarrow, @Aaronreb, can you please send/attach the tfrecord file you are using for training and evaluation of this lstm model (path/to/sequence_example/data)? It would be great if you could show how to generate tfrecord files for the lstm model.

I have tried @wrkgm's code to build the sequential dataset tfrecords in order to train the model, but I am getting the following error:

tensorflow.python.framework.errors_impl.InvalidArgumentError: assertion failed: [All sequence lengths must match, but received lengths: 0 All sequence lengths must match, but received lengths: 0 All sequence lengths must match, but received lengths: 4] [[{{node batch_sequences_with_states/Assert_2/Assert}}]]

It seems like an issue with a length mismatch. Can anyone please tell me what parameter to modify to resolve this issue?

@mswarrow @Aaronreb @ashkanee @dreamdragon @whasyt can you please tell me what to use as the fine_tune_checkpoint file? I tried adding "object_detection/test_ckpt/ssd_inception_v2.pb" but it throws an error, so I have disabled fine-tuning. Can anyone please tell me what should be put as the fine_tune_checkpoint file?

Any help will be highly appreciated.

yuchen2580 commented 5 years ago

I think I've found a solution. When trying to understand the scripts, I noticed that keys to features were specified in TFSequenceExampleDecoder and they were different from those in DatasetBuilderTest (seq_dataset_builder_test.py). So, I used a script for creating tf records similar to @wrkgm's first version, but replaced image/object/bbox/... with bbox/... and image/object/class/label with bbox/label/index.

Hi, could you please share more details on how to resolve the <object_detection.core.matcher.Match object at 0x7f5b01dfc6a0> problem?

Aaronreb commented 5 years ago

I think I've found a solution. When trying to understand the scripts, I noticed that keys to features were specified in TFSequenceExampleDecoder and they were different from those in DatasetBuilderTest (seq_dataset_builder_test.py). So, I used a script for creating tf records similar to @wrkgm's first version, but replaced image/object/bbox/... with bbox/... and image/object/class/label with bbox/label/index.

On what kind of dataset did you make the tfrecords? Images or video?

mswarrow commented 5 years ago

It's a specific dataset I use for my project. I can't tell the details, but in general, it is a set of annotated (boxes + labels) images organised in folders - one folder for one sequence. I just take each sequence, split it into snippets of length 4 (as specified in lstm_ssd_mobilenet_v1_imagenet.config) and convert each snippet into TF SequenceExample.

Images

mswarrow commented 5 years ago

I think I've found a solution. When trying to understand the scripts, I noticed that keys to features were specified in TFSequenceExampleDecoder and they were different from those in DatasetBuilderTest (seq_dataset_builder_test.py). So, I used a script for creating tf records similar to @wrkgm's first version, but replaced image/object/bbox/... with bbox/... and image/object/class/label with bbox/label/index.

Hi, could you please share more details on how to resolve the <object_detection.core.matcher.Match object at 0x7f5b01dfc6a0> problem?

Not sure I've encountered this...

mswarrow commented 5 years ago

I have used the pets_example.record from object_detection/test_data/ and am also getting the same error. @mswarrow, @Aaronreb, can you please send/attach the tfrecord file you are using for training and evaluation of this lstm model (path/to/sequence_example/data)? It would be great if you could show how to generate tfrecord files for the lstm model.

I have tried @wrkgm's code to build the sequential dataset tfrecords in order to train the model, but I am getting the following error:

tensorflow.python.framework.errors_impl.InvalidArgumentError: assertion failed: [All sequence lengths must match, but received lengths: 0 All sequence lengths must match, but received lengths: 0 All sequence lengths must match, but received lengths: 4] [[{{node batch_sequences_with_states/Assert_2/Assert}}]]

It seems like an issue with a length mismatch. Can anyone please tell me what parameter to modify to resolve this issue?

@mswarrow @Aaronreb @ashkanee @dreamdragon @whasyt can you please tell me what to use as the fine_tune_checkpoint file? I tried adding "object_detection/test_ckpt/ssd_inception_v2.pb" but it throws an error, so I have disabled fine-tuning. Can anyone please tell me what should be put as the fine_tune_checkpoint file?

Any help will be highly appreciated.

I suppose an lstm object detection checkpoint is expected in the config, and you cannot use checkpoints from other models.

spaul13 commented 5 years ago

@mswarrow What file name should I give as the fine-tune checkpoint, then? Can you please tell me how you generated the fine-tune checkpoint file?

@mswarrow Can you please tell me what image dataset you used, and how you converted that image dataset into sequential dataset tfrecords?

Shruthi-Sampathkumar commented 5 years ago

Replace object_detection.protos.lstm_model with lstm_object_detection.protos.lstm_model in the config. We will fix this issue in the codebase shortly.

Done, thanks. But this got me another error:

TypeError: Expected binary or unicode string, got <object_detection.core.matcher.Match object at 0x7f5b089379b0> ... [full traceback quoted above]

PS: I commented out the data augmentation options in the config file because they were giving me the groundtruth_weights error.

As @ashkanee said, the problem seems to come from the tensors in seq_dataset_builder.py, as the tensors are empty.

Any help would be appreciated.

Hi, I am facing the same issue (<object_detection.core.matcher.Match object at 0x7f8a3749c320>). Were you able to solve it, @Aaronreb? I tried the solution by @mswarrow of changing the field names in seq_dataset_builder_test, but it still threw the same error.

Any help would be appreciated. Thanks.

Aaronreb commented 5 years ago

Well, not yet. I've just solved this tensorflow.python.framework.errors_impl.InvalidArgumentError and verified that it started to train (iterations shown in tensorboard)

What changes did you make in the config file? Did you comment out the checkpoint part?

Shruthi-Sampathkumar commented 5 years ago

I think I've found a solution. When trying to understand the scripts, I noticed that keys to features were specified in TFSequenceExampleDecoder and they were different from those in DatasetBuilderTest (seq_dataset_builder_test.py). So, I used a script for creating tf records similar to @wrkgm's first version, but replaced image/object/bbox/... with bbox/... and image/object/class/label with bbox/label/index.

Hi, could you please share more details on how to resolve the <object_detection.core.matcher.Match object at 0x7f5b01dfc6a0> problem?

Not sure I've encountered this...

Would you mind sharing the versions of the packages that you used, @mswarrow? I might be encountering this issue due to some version conflict.

mswarrow commented 5 years ago

@mswarrow What file name should I give as the fine-tune checkpoint, then? Can you please tell me how you generated the fine-tune checkpoint file?

First, I haven't used any checkpoints. But when you start training the lstm object detector, you pass a training directory path (--train_dir) to the script, and during training some data is written there, including model checkpoints (model.ckpt...). I guess you may use them to continue training your model. But I have no idea where to get pre-trained checkpoints :(

@mswarrow Can you please tell me what image dataset you used, and how you converted that image dataset into sequential dataset tfrecords?

I use a private dataset. It was collected manually for one of the projects I participate in. As I said before, all the images I have are divided into sequences, and each sequence is stored in a separate folder, say:

sequences/
   seq001/
   seq002/
   ...

Then I use a script which reads images sequence by sequence, splits each sequence into chunks of 4 images, and creates a TFSequenceExample for each chunk - just as described in the first script by @wrkgm (a sketch of such a script follows the quote below), see

I tried two ways of creating tfrecords. The first was taken from tf_sequence_example_decoder_test.py, in this repo. The only change...
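For illustration, here is a minimal sketch of such a conversion script. It is a sketch under assumptions, not the authors' actual script: it assumes the 'bbox/...' and 'bbox/label/index' field names discussed in this thread, JPEG frames laid out in the folder structure shown above, and a hypothetical load_annotations() placeholder that you would replace with your own annotation parsing:

import glob
import os

import tensorflow as tf

UNROLL_LENGTH = 4  # chunks of 4 frames, matching the config used in this thread

def load_annotations(frame_paths):
    # Placeholder (hypothetical): return, per frame, a list of normalized
    # boxes [ymin, xmin, ymax, xmax] and a list of class indices.
    boxes = [[[0.1, 0.1, 0.9, 0.9]] for _ in frame_paths]
    labels = [[1] for _ in frame_paths]
    return boxes, labels

def bytes_feature(value):
    return tf.train.Feature(bytes_list=tf.train.BytesList(value=[value]))

def float_list(values):
    return tf.train.Feature(float_list=tf.train.FloatList(value=values))

def int64_list(values):
    return tf.train.Feature(int64_list=tf.train.Int64List(value=values))

def make_sequence_example(encoded_frames, boxes, labels):
    # One FeatureList entry per frame; each entry holds all boxes of that frame.
    feature_lists = tf.train.FeatureLists(feature_list={
        'image/encoded': tf.train.FeatureList(
            feature=[bytes_feature(f) for f in encoded_frames]),
        'bbox/ymin': tf.train.FeatureList(
            feature=[float_list([b[0] for b in fb]) for fb in boxes]),
        'bbox/xmin': tf.train.FeatureList(
            feature=[float_list([b[1] for b in fb]) for fb in boxes]),
        'bbox/ymax': tf.train.FeatureList(
            feature=[float_list([b[2] for b in fb]) for fb in boxes]),
        'bbox/xmax': tf.train.FeatureList(
            feature=[float_list([b[3] for b in fb]) for fb in boxes]),
        'bbox/label/index': tf.train.FeatureList(
            feature=[int64_list(fl) for fl in labels]),
    })
    # Depending on the decoder version, additional context features such as
    # image/height and image/width may also be required.
    context = tf.train.Features(
        feature={'image/format': bytes_feature(b'jpeg')})
    return tf.train.SequenceExample(context=context, feature_lists=feature_lists)

writer = tf.python_io.TFRecordWriter('train.tfrecord')
for seq_dir in sorted(glob.glob('sequences/seq*')):
    frames = sorted(glob.glob(os.path.join(seq_dir, '*.jpg')))
    # Drop any remainder so every chunk holds exactly UNROLL_LENGTH frames.
    for i in range(0, len(frames) - UNROLL_LENGTH + 1, UNROLL_LENGTH):
        chunk = frames[i:i + UNROLL_LENGTH]
        encoded = [tf.gfile.GFile(f, 'rb').read() for f in chunk]
        frame_boxes, frame_labels = load_annotations(chunk)
        writer.write(make_sequence_example(
            encoded, frame_boxes, frame_labels).SerializeToString())
writer.close()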

Did it make it clearer?

mswarrow commented 5 years ago

Well, not yet. I've just solved this tensorflow.python.framework.errors_impl.InvalidArgumentError and verified that it started to train (iterations shown in tensorboard)

What changes did you make in the config file? Did you comment out the checkpoint part?

I just left it blank:

  fine_tune_checkpoint: ""

mswarrow commented 5 years ago

Replace object_detection.protos.lstm_model with lstm_object_detection.protos.lstm_model in the config. We will fix this issue in the codebase shortly.

Done, thanks. But this got me another error as follows:

TypeError: Expected binary or unicode string, got <object_detection.core.matcher.Match object at 0x7f5b089379b0>

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "lstm_object_detection/train.py", line 185, in <module>
    tf.app.run()
  File "/home/kt-ml1/.local/lib/python3.6/site-packages/tensorflow/python/platform/app.py", line 125, in run
    _sys.exit(main(argv))
  File "lstm_object_detection/train.py", line 181, in main
    worker_job_name, is_chief, FLAGS.train_dir)
  File "/home/kt-ml1/models-master/models-master/research/lstm_object_detection/trainer.py", line 293, in train
    clones = model_deploy.create_clones(deploy_config, model_fn, [input_queue])
  File "/home/kt-ml1/models-master/models-master/research/slim/deployment/model_deploy.py", line 193, in create_clones
    outputs = model_fn(*args, **kwargs)
  File "/home/kt-ml1/models-master/models-master/research/lstm_object_detection/trainer.py", line 174, in _create_losses
    losses_dict = detection_model.loss(prediction_dict, true_image_shapes)
  File "/home/kt-ml1/models-master/models-master/research/lstm_object_detection/meta_architectures/lstm_ssd_meta_arch.py", line 165, in loss
    match_list = [matcher.Match(match) for match in tf.unstack(batch_match)]
  File "/home/kt-ml1/.local/lib/python3.6/site-packages/tensorflow/python/ops/array_ops.py", line 1149, in unstack
    value = ops.convert_to_tensor(value)
  File "/home/kt-ml1/.local/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 1039, in convert_to_tensor
    return convert_to_tensor_v2(value, dtype, preferred_dtype, name)
  File "/home/kt-ml1/.local/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 1097, in convert_to_tensor_v2
    as_ref=False)
  File "/home/kt-ml1/.local/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 1175, in internal_convert_to_tensor
    ret = conversion_func(value, dtype=dtype, name=name, as_ref=as_ref)
  File "/home/kt-ml1/.local/lib/python3.6/site-packages/tensorflow/python/framework/constant_op.py", line 304, in _constant_tensor_conversion_function
    return constant(v, dtype=dtype, name=name)
  File "/home/kt-ml1/.local/lib/python3.6/site-packages/tensorflow/python/framework/constant_op.py", line 245, in constant
    allow_broadcast=True)
  File "/home/kt-ml1/.local/lib/python3.6/site-packages/tensorflow/python/framework/constant_op.py", line 283, in _constant_impl
    allow_broadcast=allow_broadcast))
  File "/home/kt-ml1/.local/lib/python3.6/site-packages/tensorflow/python/framework/tensor_util.py", line 562, in make_tensor_proto
    "supported type." % (type(values), values))
TypeError: Failed to convert object of type <class 'list'> to Tensor.
Contents: [<object_detection.core.matcher.Match object at 0x7f5b089379b0>, ... (32 Match objects, same list as quoted above) ...]. Consider casting elements to a supported type.

PS - I have commented out the data augmentation option in the config file because it was giving me the groundtruth_weights error.

As @ashkanee said, the problem seems to be coming from tensors in the file seq_dataset_builder.py, as the tensors are empty.

Any help would be appreciated.

Hi. I am facing the same issue (<object_detection.core.matcher.Match object at 0x7f8a3749c320>). Were you able to solve it, @Aaronreb? I tried the solution by @mswarrow of changing the field names in seq_dataset_builder_test.py, but it still threw the same error.

Any help would be appreciated. Thanks.

seq_dataset_builder_test.py has nothing to do with the train.py script. I mentioned it because I used it to understand the structure of the dataset used by train.py. Then I found that seq_dataset_builder.py, which is actually used by train.py, has field names different from those in the dataset builder test script. That's what I meant in my very first post here. The idea is to adapt @wrkgm's script for converting your raw data into TFSequenceExamples and update some field names, like 'image/object/bbox/ymin' to 'bbox/ymin', etc.
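For reference, the full renaming implied above might look like the sketch below; the exact target keys are assumed from this thread, so verify them against seq_dataset_builder.py in your checkout:

# Hypothetical mapping from the keys in @wrkgm's original script to the
# keys that seq_dataset_builder.py appears to expect:
FIELD_RENAMES = {
    'image/object/bbox/ymin': 'bbox/ymin',
    'image/object/bbox/xmin': 'bbox/xmin',
    'image/object/bbox/ymax': 'bbox/ymax',
    'image/object/bbox/xmax': 'bbox/xmax',
    'image/object/class/label': 'bbox/label/index',
}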

mswarrow commented 5 years ago

Disclaimer

Though I have managed to create a dataset and start training, i.e. I could see different metrics in the TensorBoard dashboard (loss, number of negatives, etc.), I'm not fully sure that the model I got is a truly trained model - I haven't tested it yet.

Aaronreb commented 5 years ago

Disclaimer

Though I have managed to create a dataset and start training, i.e. I could see different metrics in the TensorBoard dashboard (loss, number of negatives, etc.), I'm not fully sure that the model I got is a truly trained model - I haven't tested it yet.

Well explained, mate. Just had a doubt: how do you divide the image dataset into sequences of images? I have an open image dataset (a dataset with some random images and an annotation file); how do I divide it into sequences of images?

hoonkai commented 5 years ago

@mswarrow

First, I haven't used any checkpoints.

I use a private dataset.

Is your dataset huge? Wouldn't it be a better idea to train on something like ImageNet VID (as per the paper) and fine-tune using your own custom dataset?

Shruthi-Sampathkumar commented 5 years ago

Disclaimer

Though I have managed to create a dataset and start training, i.e. I could see different metrics in the TensorBoard dashboard (loss, number of negatives, etc.), I'm not fully sure that the model I got is a truly trained model - I haven't tested it yet.

I am now getting the following error

tensorflow.python.framework.errors_impl.InvalidArgumentError: Number of ways to split should evenly divide the split dimension, but got split_dim 0 (size = 21) and num_split 20 [[{{node split_2}}]]

My input images are in PNG format. When I encode them as-is, I get the above-mentioned error. I tried converting the PNG images to JPEG, but I still get the same error. Any help would be appreciated.

KanaSukita commented 5 years ago

Disclaimer

Though I have managed to create a dataset and start training, i.e. I could see different metrics in the TensorBoard dashboard (loss, number of negatives, etc.), I'm not fully sure that the model I got is a truly trained model - I haven't tested it yet.

I am now getting the following error

tensorflow.python.framework.errors_impl.InvalidArgumentError: Number of ways to split should evenly divide the split dimension, but got split_dim 0 (size = 21) and num_split 20 [[{{node split_2}}]]

My input images are in PNG format. When I encode them as-is, I get the above-mentioned error. I tried converting the PNG images to JPEG, but I still get the same error. Any help would be appreciated.

I got a similar error when I didn't split the dataset into snippets of the length set in the config. Maybe you can look into this.

Shruthi-Sampathkumar commented 5 years ago

Disclaimer

Though I have managed to create a dataset and start training, i.e. I could see different metrics in the TensorBoard dashboard (loss, number of negatives, etc.), I'm not fully sure that the model I got is a truly trained model - I haven't tested it yet.

I am now getting the following error:

tensorflow.python.framework.errors_impl.InvalidArgumentError: Number of ways to split should evenly divide the split dimension, but got split_dim 0 (size = 21) and num_split 20 [[{{node split_2}}]]

My input images are in PNG format. When I encode them as-is, I get the above-mentioned error. I tried converting the PNG images to JPEG, but I still get the same error. Any help would be appreciated.

I got a similar error when I didn't split the dataset into snippets of the length set in the config. Maybe you can look into this.

Thank you @KanaSukita. The issue was caused by the fact that I had the unroll length set to 20 while my dataset actually consisted of 21 frames from each video, so the 0th dimension had size 21 while the number of splits was 20. Fixed it by setting the unroll length to 21.
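A cheap guard along these lines (a sketch; check_chunk is a hypothetical helper, and UNROLL_LENGTH must mirror whatever your config says) would catch the mismatch when writing records rather than at training time:

UNROLL_LENGTH = 21  # must equal the unroll length in the training config

def check_chunk(encoded_frames):
    # The split op in the graph divides the frame dimension by the unroll
    # length, so every SequenceExample must hold exactly UNROLL_LENGTH frames.
    assert len(encoded_frames) == UNROLL_LENGTH, (
        'got %d frames, config expects %d'
        % (len(encoded_frames), UNROLL_LENGTH))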

Shruthi-Sampathkumar commented 5 years ago

train.py is getting killed with the error "/var/spool/slurmd/job2947915/slurm_script: line 21: 53501 Bus error (core dumped) python train.py". Any idea why this is happening? No other errors are thrown, so I am not sure what is wrong. Thanks in advance.

yuchen2580 commented 5 years ago

@ashkanee @wrkgm Hi, would it be possible to share your solution to the 'groundtruth_weights' issue? Where exactly do you add groundtruth_weights as None?

KanaSukita commented 5 years ago

@ashkanee @wrkgm Hi, would it be possible to share your solution to the 'groundtruth_weights' issue? Where exactly do you add groundtruth_weights as None?

@yuchen2580 I solved it by commenting out lines 3306-3308 in object_detection/core/preprocessor.py, which are:

if include_label_weights:
  groundtruth_label_weights = (
      fields.InputDataFields.groundtruth_weights)

yuchen2580 commented 5 years ago

@ashkanee @wrkgm Hi, would it be possible to share your solution to the 'groundtruth_weights' issue? Where exactly do you add groundtruth_weights as None?

@yuchen2580 I solved it by commenting out lines 3306-3308 in object_detection/core/preprocessor.py, which are:

if include_label_weights:
  groundtruth_label_weights = (
      fields.InputDataFields.groundtruth_weights)

Thanks for answering~ I wonder if there is another way to get around it, since ssd_random_crop works fine in the Object Detection API, and changing the core code doesn't seem like a suitable idea... I wonder what the difference is.
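One alternative that avoids editing preprocessor.py might be to make sure a groundtruth_weights tensor is present in the tensor dict before ssd_random_crop runs. A minimal sketch, assuming you can hook into wherever seq_dataset_builder.py assembles the tensor dict (add_default_weights is a hypothetical helper):

import tensorflow as tf

from object_detection.core import standard_fields as fields

def add_default_weights(tensor_dict):
    # Give every groundtruth box a weight of 1.0 so that preprocessing
    # functions requiring 'groundtruth_weights' find the argument present.
    boxes = tensor_dict[fields.InputDataFields.groundtruth_boxes]
    tensor_dict[fields.InputDataFields.groundtruth_weights] = tf.ones(
        tf.shape(boxes)[:1], dtype=tf.float32)
    return tensor_dict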

KanaSukita commented 5 years ago

Hi guys, I've trained with VID 2015 for a while and tried to evaluate the model, but I got very weird results when running eval.py. Below is the output when I test on the training tfrecord:

Average Precision (AP) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.005
Average Precision (AP) @[ IoU=0.50 | area= all | maxDets=100 ] = 0.019
Average Precision (AP) @[ IoU=0.75 | area= all | maxDets=100 ] = 0.002
Average Precision (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.000
Average Precision (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.000
Average Precision (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.013
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 1 ] = 0.024
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 10 ] = 0.024
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.024
Average Recall (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.000
Average Recall (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.000
Average Recall (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.069
I0705 11:04:54.779381 139871785002752 eval_util.py:75] Writing metrics to tf summary.
I0705 11:04:54.779692 139871785002752 eval_util.py:82] DetectionBoxes_Precision/mAP: 0.005298
I0705 11:04:54.779949 139871785002752 eval_util.py:82] DetectionBoxes_Precision/mAP (large): 0.012879
I0705 11:04:54.780024 139871785002752 eval_util.py:82] DetectionBoxes_Precision/mAP (medium): 0.000000
I0705 11:04:54.780087 139871785002752 eval_util.py:82] DetectionBoxes_Precision/mAP (small): 0.000000
I0705 11:04:54.780147 139871785002752 eval_util.py:82] DetectionBoxes_Precision/mAP@.50IOU: 0.018689
I0705 11:04:54.780205 139871785002752 eval_util.py:82] DetectionBoxes_Precision/mAP@.75IOU: 0.002270
I0705 11:04:54.780354 139871785002752 eval_util.py:82] DetectionBoxes_Recall/AR@1: 0.024408
I0705 11:04:54.780430 139871785002752 eval_util.py:82] DetectionBoxes_Recall/AR@10: 0.024408
I0705 11:04:54.780501 139871785002752 eval_util.py:82] DetectionBoxes_Recall/AR@100: 0.024408
I0705 11:04:54.780809 139871785002752 eval_util.py:82] DetectionBoxes_Recall/AR@100 (large): 0.068924
I0705 11:04:54.781460 139871785002752 eval_util.py:82] DetectionBoxes_Recall/AR@100 (medium): 0.000000
I0705 11:04:54.781524 139871785002752 eval_util.py:82] DetectionBoxes_Recall/AR@100 (small): 0.000000

When looking into the summary in TensorBoard, I found the loss fluctuating dramatically.

I am not sure if it's related to the tfrecord or my config. I convert VID 2015 to tfrecord based on tf-detectors, set the class indices from 1-30, and enable encode_background_as_zeros in the config. When there is no object, I leave the label index list empty.

Has anyone successfully trained the model at all? Any help is appreciated.

yuchen2580 commented 5 years ago

@KanaSukita Hi, I trained the model on VID 2015 as well, for num_steps = 10000 with video_length = 4. The total loss reaches 0.3-0.8, and the evaluation result is similar to yours.

Another thing I noticed is that if I resume training in the middle, the loss climbs back up. It feels like it cannot restore the previously saved state.

Perhaps there is something wrong with loading the parameters. How many steps did you use for training? What was the loss when you finished training?