keras-team / keras-cv

Industry-strength Computer Vision workflows with Keras

https://keras.io/examples/vision/yolov8/ training issue in Google Colab #2025

Closed. Paryavi closed this issue 1 year ago.

Paryavi commented 1 year ago

My images are small; the dataset is ~6k images with the following sizes and counts: {(400, 296): 3484, (480, 320): 2763, (640, 480): 108}

I train in Google Colab. The loss initially decreases, but at the end of the first training epoch it crashes with the following error:

Epoch 1/3
1271/1271 [==============================] - ETA: 0s - loss: 21.5977 - box_loss: 2.6112 - class_loss: 18.9865

UnknownError                              Traceback (most recent call last)
in <cell line: 1>()
----> 1 yolo.fit(
      2     train_ds,
      3     validation_data=val_ds,
      4     epochs=3,
      5     callbacks=[EvaluateCOCOMetricsCallback(val_ds, "model.h5")],

1 frames
in on_epoch_end(self, epoch, logs)
     18         self.metrics.update_state(y_true, y_pred)
     19
---> 20         metrics = self.metrics.result(force=True)
     21         logs.update(metrics)
     22

UnknownError: {{function_node _wrapped__EagerPyFunc_Tin_1_Tout_1_device/job:localhost/replica:0/task:0/device:CPU:0}} InvalidArgumentError: {{function_node _wrapped__ConcatV2_N_317_device/job:localhost/replica:0/task:0/device:CPU:0}} ConcatOp : Dimension 1 in both shapes must be equal: shape[0] = [4,2,4] vs. shape[1] = [4,1,4] [Op:ConcatV2] name: concat Traceback (most recent call last):

File "/usr/local/lib/python3.10/dist-packages/tensorflow/python/ops/script_ops.py", line 265, in call return func(device, token, args)

File "/usr/local/lib/python3.10/dist-packages/tensorflow/python/ops/script_ops.py", line 143, in call outputs = self._call(device, args)

File "/usr/local/lib/python3.10/dist-packages/tensorflow/python/ops/script_ops.py", line 150, in _call ret = self._func(*args)

File "/usr/local/lib/python3.10/dist-packages/tensorflow/python/autograph/impl/api.py", line 642, in wrapper return func(*args, **kwargs)

File "/usr/local/lib/python3.10/dist-packages/keras_cv/metrics/object_detection/box_coco_metrics.py", line 203, in result_on_host_cpu return tf.constant(obj_result(force), obj.dtype)

File "/usr/local/lib/python3.10/dist-packages/keras_cv/metrics/object_detection/box_coco_metrics.py", line 254, in result self._cached_result = self._compute_result()

File "/usr/local/lib/python3.10/dist-packages/keras_cv/metrics/object_detection/box_coco_metrics.py", line 262, in _compute_result _box_concat(self.ground_truths),

File "/usr/local/lib/python3.10/dist-packages/keras_cv/metrics/object_detection/box_coco_metrics.py", line 44, in _box_concat result[key] = tf.concat([b[key] for b in boxes], axis=0)

File "/usr/local/lib/python3.10/dist-packages/tensorflow/python/util/traceback_utils.py", line 153, in error_handler raise e.with_traceback(filtered_tb) from None

File "/usr/local/lib/python3.10/dist-packages/tensorflow/python/framework/ops.py", line 7262, in raise_from_not_ok_status raise core._status_to_exception(e) from None # pylint: disable=protected-access

tensorflow.python.framework.errors_impl.InvalidArgumentError: {{function_node _wrapped__ConcatV2_N_317_device/job:localhost/replica:0/task:0/device:CPU:0}} ConcatOp : Dimension 1 in both shapes must be equal: shape[0] = [4,2,4] vs. shape[1] = [4,1,4] [Op:ConcatV2] name: concat

[Op:EagerPyFunc]

Paryavi commented 1 year ago

What Colab AI suggests:

The error is coming from the keras_cv.metrics.object_detection.box_coco_metrics module. The _box_concat function is expecting two tensors with the same shape, but the shapes of the tensors in your input are not the same. The first tensor has a shape of [4, 2, 4], while the second tensor has a shape of [4, 1, 4].

To fix this error, you need to make sure that the two tensors have the same shape. One way to do this is to use the tf.expand_dims function to add a new dimension to the second tensor. For example, you could use the following code:

tf.expand_dims(val_ds, axis=1)

This will add a new dimension to the second tensor, making it have the same shape as the first tensor. Once you have done this, you should be able to run yolo.fit without any errors.
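For context, the mismatch is easy to reproduce in isolation. A minimal sketch with made-up tensors shaped like the ones in the error above: two batches padded to different per-image box counts cannot be concatenated along the batch axis.

import tensorflow as tf

# Two batches of padded ground-truth boxes: 4 images each, but one batch is
# padded to 2 boxes per image and the other to 1, so dimension 1 disagrees.
batch_a = tf.zeros([4, 2, 4])
batch_b = tf.zeros([4, 1, 4])

# Raises InvalidArgumentError: Dimension 1 in both shapes must be equal:
# shape[0] = [4,2,4] vs. shape[1] = [4,1,4] -- the same failure as above.
tf.concat([batch_a, batch_b], axis=0)

Padding every batch to one shared max_boxes (as the to_dense fix later in this thread does) makes dimension 1 agree; expanding dimensions on the dataset, as suggested above, does not address that.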

jbischof commented 1 year ago

This is the same error I got in our main OD tutorial: #2017. Good to know it also affects the YOLOv8 example.

ianstenbit commented 1 year ago

I'm taking a look -- thanks for the issue report!

jbischof commented 1 year ago

This is the offending line for reference: https://github.com/keras-team/keras-cv/blob/2ff8e3fd764bc67342778894cc984daac95c4813/keras_cv/metrics/object_detection/box_coco_metrics.py#L44

ianstenbit commented 1 year ago

This is the offending line for reference:

https://github.com/keras-team/keras-cv/blob/2ff8e3fd764bc67342778894cc984daac95c4813/keras_cv/metrics/object_detection/box_coco_metrics.py#L44

Thanks! It looks like this just expects padded boxes and the tutorial is not padding them correctly. Probably what happened is that something in KerasCV used to convert them into dense, padded tensors but for some reason no longer does.
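For readers hitting the same error, here is a minimal sketch of what "padded" means, using KerasCV's own utility on made-up ragged boxes. The assumption (consistent with the keras.io object detection guide) is that bounding_box.to_dense pads every image to max_boxes entries, filling unused slots with -1:

import tensorflow as tf
import keras_cv

# A ragged batch: the first image has 2 boxes, the second has 1.
bounding_boxes = {
    "boxes": tf.ragged.constant(
        [
            [[0.0, 0.0, 10.0, 10.0], [5.0, 5.0, 8.0, 8.0]],
            [[1.0, 1.0, 4.0, 4.0]],
        ],
        ragged_rank=1,
        inner_shape=(4,),
    ),
    "classes": tf.ragged.constant([[0.0, 1.0], [0.0]]),
}

# Pad both images to the same number of box slots so every batch
# ends up with an identical boxes dimension.
dense = keras_cv.bounding_box.to_dense(bounding_boxes, max_boxes=32)
print(dense["boxes"].shape)  # (2, 32, 4)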

Paryavi commented 1 year ago

Great, let's add @LukeWood and @IMvision12 to the loop! I kind of gave up on YOLO yesterday. Today I am trying to train based on the KerasCV RetinaNet example: https://lukewood.xyz/blog/marine-animal-detection. I was able to split my Labelbox JSON into 3 folders as in Luke Wood's example, and I got the generator working (I was using the CPU in Colab at first; with a GPU both the generator and the visualization function work!). My bbox format is xywh. When calling model.fit I get this error:

history = model.fit(
    train_ds.take(1),
    validation_data=eval_ds.take(1),
    epochs=1,  # EPOCHS
)

Btw, this is how I modified the generator:

import json
import os

import keras_cv
import tensorflow as tf

# `splits` (mapping split names to directories) and `load_image` are
# defined earlier in the notebook.
def load(*, split, bounding_box_format):
    if split not in splits:
        raise ValueError(
            f"Invalid split provided, `split={split}`. "
            f"Expected one of {list(splits.keys())}"
        )

    path = splits[split]
    with open(os.path.join(path, 'annotations.json'), 'r') as f:
        file_annotations = json.load(f)

    # Create a dictionary to map image_ids to image file paths for quick lookup
    image_id_to_file_path = {
        img['id']: img['file_name'] for img in file_annotations['images']
    }

    def generator():
        for image_entry in file_annotations['images']:
            image_id = image_entry['id']
            image_path = image_id_to_file_path.get(image_id, None)
            if not image_path:
                continue

            annotations_for_image = [
                anno for anno in file_annotations['annotations']
                if anno['image_id'] == image_id
            ]

            box_labels = []
            class_labels = []
            for annotation in annotations_for_image:
                box = annotation['bbox']
                box = tf.constant([float(coord) for coord in box], tf.float32)
                box_labels.append(box)
                class_labels.append(
                    tf.constant(float(annotation['category_id']), tf.float32)
                )

            if not box_labels:
                continue

            bounding_boxes = {
                'boxes': tf.stack(box_labels),
                'classes': tf.stack(class_labels)
            }

            image = load_image(os.path.join(path, image_path))
            bounding_boxes = keras_cv.bounding_box.convert_format(
                bounding_boxes, source='xywh', target=bounding_box_format
            )

            yield {
                'images': image,
                'bounding_boxes': bounding_boxes
            }

    output_spec = {
        'images': tf.TensorSpec(shape=(None, None, 3)),
        'bounding_boxes': {
            'boxes': tf.TensorSpec(shape=(None, 4)),
            'classes': tf.TensorSpec(shape=(None,))
        }
    }
    return tf.data.Dataset.from_generator(generator, output_signature=output_spec)

Error:

ValueError                                Traceback (most recent call last)
in <cell line: 1>()
----> 1 history = model.fit(
      2     train_ds.take(1),
      3     validation_data=eval_ds.take(1),
      4     epochs=1  # EPOCHS
      5 )

4 frames
/usr/local/lib/python3.10/dist-packages/keras_cv/models/object_detection/retinanet/retinanet_label_encoder.py in tf___encode_sample(self, box_labels, anchor_boxes, image_shape)
     53     batch_size = ag__.ld(box_shape)[0]
     54     n_boxes = ag__.ld(box_shape)[1]
---> 55     box_ids = ag__.converted_call(ag__.ld(ops).arange, (ag__.ld(gt_boxes).shape[1],), dict(dtype=ag__.ld(matched_gt_idx).dtype), fscope)
     56     matched_ids = ag__.converted_call(ag__.ld(ops).expand_dims, (ag__.ld(matched_gt_idx),), dict(axis=-1), fscope)
     57     matches = ag__.ld(box_ids) == ag__.ld(matched_ids)

ValueError: in user code:

File "/usr/local/lib/python3.10/dist-packages/keras/src/engine/training.py", line 1338, in train_function  *
    return step_function(self, iterator)
File "/usr/local/lib/python3.10/dist-packages/keras/src/engine/training.py", line 1322, in step_function  **
    outputs = model.distribute_strategy.run(run_step, args=(data,))
File "/usr/local/lib/python3.10/dist-packages/keras/src/engine/training.py", line 1303, in run_step  **
    outputs = model.train_step(data)
File "/usr/local/lib/python3.10/dist-packages/keras_cv/models/object_detection/retinanet/retinanet.py", line 465, in train_step
    boxes, classes = self.label_encoder(x, y_for_label_encoder)
File "/usr/local/lib/python3.10/dist-packages/keras/src/utils/traceback_utils.py", line 70, in error_handler
    raise e.with_traceback(filtered_tb) from None
File "/tmp/__autograph_generated_filel16ndk94.py", line 48, in tf__call
    result = ag__.converted_call(ag__.ld(self)._encode_sample, (ag__.ld(box_labels), ag__.ld(anchor_boxes), ag__.ld(image_shape)), None, fscope)
File "/tmp/__autograph_generated_filez4i80vy6.py", line 55, in tf___encode_sample
    box_ids = ag__.converted_call(ag__.ld(ops).arange, (ag__.ld(gt_boxes).shape[1],), dict(dtype=ag__.ld(matched_gt_idx).dtype), fscope)

ValueError: Exception encountered when calling layer 'retina_net_label_encoder_3' (type RetinaNetLabelEncoder).

in user code:

    File "/usr/local/lib/python3.10/dist-packages/keras_cv/models/object_detection/retinanet/retinanet_label_encoder.py", line 215, in call  *
        result = self._encode_sample(box_labels, anchor_boxes, image_shape)
    File "/usr/local/lib/python3.10/dist-packages/keras_cv/models/object_detection/retinanet/retinanet_label_encoder.py", line 169, in _encode_sample  *
        box_ids = ops.arange(gt_boxes.shape[1], dtype=matched_gt_idx.dtype)

    ValueError: None values not supported.

Call arguments received by layer 'retina_net_label_encoder_3' (type RetinaNetLabelEncoder):
  • images=tf.Tensor(shape=(None, 640, 640, 3), dtype=float32)
  • box_labels={'boxes': 'tf.RaggedTensor(values=Tensor("RaggedFromVariant/RaggedTensorFromVariant:1", shape=(None, None), dtype=float32), row_splits=Tensor("RaggedFromVariant/RaggedTensorFromVariant:0", shape=(None,), dtype=int64))', 'classes': 'tf.RaggedTensor(values=Tensor("RaggedFromVariant_1/RaggedTensorFromVariant:1", shape=(None,), dtype=float32), row_splits=Tensor("RaggedFromVariant_1/RaggedTensorFromVariant:0", shape=(None,), dtype=int64))'}
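The box_labels above arrive as tf.RaggedTensors, so gt_boxes.shape[1] is None inside the label encoder, which is exactly what ops.arange rejects ("None values not supported"). As a hypothetical sketch (the "train" split name is an assumption), the loader above would be consumed roughly like this, with ragged batching producing those tensors:

train_ds = load(split="train", bounding_box_format="xywh")
# Each image yields a different number of boxes, so a plain .batch() would fail;
# ragged batching (tf.data, TF 2.11+) produces the RaggedTensors seen in the error.
train_ds = train_ds.ragged_batch(4)

Densifying the boxes before fitting, as in the to_dense fix further down this thread, gives the boxes a static second dimension and avoids this ValueError.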

Paryavi commented 1 year ago

If needed, I can email you my Colab notebooks for YOLO and RetinaNet along with the data: the images (they are very small, ~20KB each) and the annotations (JSON files, plus XMLs in the YOLO case), which you can mount via Google Drive. But I also think it is likely the padding issue.

Paryavi commented 1 year ago

Could it also be that, because my images are small, the padding is causing the issue?

My dataset has around 6k images with the three sizes listed above.

Meanwhile, I will try to write a Python script that resizes the images, and their xywh bounding boxes, to 640 by 640 pixels. If there is existing code for that, let me know, since I have 3 image sizes as mentioned above.
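A minimal sketch of that resizing step, assuming absolute-pixel xywh boxes; resize_with_boxes and TARGET are made-up names, and this is a plain, non-aspect-preserving resize (keras_cv.layers.Resizing, which accepts a bounding_box_format argument, may also cover this):

import tensorflow as tf

TARGET = 640  # target square size, in pixels

def resize_with_boxes(image, boxes):
    # image: (h, w, 3); boxes: (num_boxes, 4) in absolute-pixel [x, y, w, h].
    height = tf.cast(tf.shape(image)[0], tf.float32)
    width = tf.cast(tf.shape(image)[1], tf.float32)
    image = tf.image.resize(image, (TARGET, TARGET))
    # x and box-width scale with image width; y and box-height with image height.
    scale = tf.stack([TARGET / width, TARGET / height, TARGET / width, TARGET / height])
    return image, boxes * scale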

jbischof commented 1 year ago

@Paryavi I would recommend filing a new issue for each problem. We want each issue to have a clear deliverable and scope.

ianstenbit commented 1 year ago

@Paryavi my recommendation would be to use the PyCOCOCallback for metric evaluation, as BoxCOCOMetrics were created when we were TF-only and won't support Torch+JAX, so they are likely to be deprecated.

Paryavi commented 1 year ago

For the marine_animal blog example, the solution to the ragged tensor problem was the to_dense() function, as follows.

First I imported:

from keras_cv import bounding_box

Then, before compile, I added:

def dict_to_tuple(inputs):
    return inputs["images"], bounding_box.to_dense(
        inputs["bounding_boxes"], max_boxes=32
    )

Reference: https://keras.io/guides/keras_cv/object_detection_keras_cv/ Thanks to @LukeWood

IMvision12 commented 1 year ago

But I think both models should work with ragged tensors. Can you try with ragged tensors? @Paryavi

Paryavi commented 1 year ago

model.fit does not work with my dataset so far. @ianstenbit I am not using Torch or JAX; I use this YOLO example with the TensorFlow backend. Is there a PyCOCOCallback implementation sample? I will search the Keras API docs for it.

Paryavi commented 1 year ago

I found the PyCOCOCallback tests: https://github.com/keras-team/keras-cv/blob/master/keras_cv/callbacks/pycoco_callback_test.py and an example of how to use it: https://github.com/keras-team/keras-cv/blob/master/examples/training/object_detection/pascal_voc/retinanet.py I tried modifying the YOLO code to see if I could fix it, and here is how it worked:

from keras_cv.callbacks import PyCOCOCallback

def dict_to_tuple(inputs):
    return inputs["images"], bounding_box.to_dense(
        inputs["bounding_boxes"], max_boxes=32
    )

train_ds = train_ds.map(dict_to_tuple, num_parallel_calls=tf.data.AUTOTUNE)
val_ds = val_ds.map(dict_to_tuple, num_parallel_calls=tf.data.AUTOTUNE)

train_ds = train_ds.prefetch(tf.data.AUTOTUNE)
val_ds = val_ds.prefetch(tf.data.AUTOTUNE)

callback = PyCOCOCallback(
    validation_data=val_ds,
    bounding_box_format="xywh",
)

yolo.fit(
    train_ds,
    validation_data=val_ds,
    epochs=2,
    callbacks=[callback],
)

IMvision12 commented 1 year ago

@ianstenbit are RaggedTensors supported by keras-core?

ianstenbit commented 1 year ago

@ianstenbit are RaggedTensors supported by keras-core?

Ragged tensors will work with KerasCV when using Keras Core with the TF backend. They will not work with other backends.

Paryavi commented 1 year ago

Thanks @ianstenbit. So @IMvision12, in your YOLO example did you use Keras Core with the TF backend, or with one of the other backends? I guess only the imports should differ between the two choices.

IMvision12 commented 1 year ago

Yeah, I am updating the YOLOv8 example here: https://github.com/keras-team/keras-io/pull/1514

Paryavi commented 1 year ago

That's cool, keep hammering! A different question: for YOLOv8, what would you do if the images were very small, say 400 by 600 pixels? Should I modify the filters (kernels), or freeze more or fewer layers, and how? Where would you look (for concrete code) to learn how to fine-tune the YOLOv8 model if the mAP is not good?

jbischof commented 1 year ago

@Paryavi please file another issue ;)

AsadujjamanTuhin commented 11 months ago

import tensorflow as tf
import random
import pandas as pd  # Add this import
from tensorflow.keras.optimizers import Adam

# `create_episode`, `create_data_generator`, and `few_shot_model` are
# assumed to be defined earlier in the notebook.

# Define the number of episodes and other hyperparameters
num_episodes = 100
num_support_samples_per_class = 5
num_query_samples_per_class = 5
few_shot_learning_rate = 0.001

# Define the input shape based on your model
input_shape = (224, 224)

# Create an optimizer for few-shot learning.
few_shot_optimizer = Adam(learning_rate=few_shot_learning_rate)

# Training loop for episodes
for episode in range(num_episodes):
    # Create an episode with a support set and a query set.
    support_set, query_set = create_episode(
        num_support_samples_per_class, num_query_samples_per_class
    )

    # Create data generators for the support and query sets.
    batch_size = 5  # Set your desired batch size
    support_data_generator = create_data_generator(support_set, batch_size=batch_size)
    query_data_generator = create_data_generator(query_set, batch_size=batch_size)

    # Train the few-shot model on the support set.
    few_shot_model.compile(
        optimizer=few_shot_optimizer,
        loss='categorical_crossentropy',
        metrics=['accuracy'],
    )
    few_shot_model.fit(support_data_generator, epochs=1, verbose=0)

    # Evaluate the few-shot model on the query set.
    evaluation_metrics = few_shot_model.evaluate(query_data_generator, verbose=0)
    accuracy = evaluation_metrics[1]  # Assuming accuracy is the second metric in the list.

    print(f"Episode {episode + 1}: Accuracy = {accuracy:.2%}")

    # Update the few-shot model's weights (you can implement your own update logic here)

Found 50 validated image filenames belonging to 10 classes.
Found 50 validated image filenames belonging to 10 classes.


ValueError Traceback (most recent call last)

in <cell line: 20>()
     38     # Train the few-shot model on the support set.
     39     few_shot_model.compile(optimizer=few_shot_optimizer, loss='categorical_crossentropy', metrics=['accuracy'])
---> 40     few_shot_model.fit(support_data_generator, epochs=1, verbose=0)
     41
     42     # Evaluate the few-shot model on the query set.

1 frames

/usr/local/lib/python3.10/dist-packages/keras/src/utils/traceback_utils.py in error_handler(*args, **kwargs)
     68     # To get the full stack trace, call:
     69     # tf.debugging.disable_traceback_filtering()
---> 70     raise e.with_traceback(filtered_tb) from None
     71 finally:
     72     del filtered_tb

/usr/local/lib/python3.10/dist-packages/keras/src/engine/training.py in tf__train_function(iterator)
     13 try:
     14     do_return = True
---> 15     retval_ = ag__.converted_call(ag__.ld(step_function), (ag__.ld(self), ag__.ld(iterator)), None, fscope)
     16 except:
     17     do_return = False

ValueError: in user code:

File "/usr/local/lib/python3.10/dist-packages/keras/src/engine/training.py", line 1338, in train_function  *
    return step_function(self, iterator)
File "/usr/local/lib/python3.10/dist-packages/keras/src/engine/training.py", line 1322, in step_function  **
    outputs = model.distribute_strategy.run(run_step, args=(data,))
File "/usr/local/lib/python3.10/dist-packages/keras/src/engine/training.py", line 1303, in run_step  **
    outputs = model.train_step(data)
File "/usr/local/lib/python3.10/dist-packages/keras/src/engine/training.py", line 1081, in train_step
    loss = self.compute_loss(x, y, y_pred, sample_weight)
File "/usr/local/lib/python3.10/dist-packages/keras/src/engine/training.py", line 1139, in compute_loss
    return self.compiled_loss(
File "/usr/local/lib/python3.10/dist-packages/keras/src/engine/compile_utils.py", line 265, in __call__
    loss_value = loss_obj(y_t, y_p, sample_weight=sw)
File "/usr/local/lib/python3.10/dist-packages/keras/src/losses.py", line 142, in __call__
    losses = call_fn(y_true, y_pred)
File "/usr/local/lib/python3.10/dist-packages/keras/src/losses.py", line 268, in call  **
    return ag_fn(y_true, y_pred, **self._fn_kwargs)
File "/usr/local/lib/python3.10/dist-packages/keras/src/losses.py", line 2122, in categorical_crossentropy
    return backend.categorical_crossentropy(
File "/usr/local/lib/python3.10/dist-packages/keras/src/backend.py", line 5560, in categorical_crossentropy
    target.shape.assert_is_compatible_with(output.shape)

ValueError: Shapes (None, None) and (None, None, None, 5) are incompatible