voxel51 / fiftyone

Refine high-quality datasets and visual AI models
https://fiftyone.ai
Apache License 2.0

[BUG] UnicodeDecodeError when running add_yolo_labels() #1497

Closed valentindbdg closed 2 years ago

valentindbdg commented 2 years ago

System information

Commands to reproduce

import fiftyone as fo
import fiftyone.utils.yolo as fouy

dataset = fo.Dataset.from_dir(
    dataset_dir="/content/drive/MyDrive/yolodataset",
    dataset_type=fo.types.YOLOv4Dataset,
    label_field="ground_truth",
)

fouy.add_yolo_labels(
    sample_collection=dataset,
    label_field="predictions",
    labels_path="/content/DATASET/validation/data",
)

Describe the problem

---------------------------------------------------------------------------
UnicodeDecodeError                        Traceback (most recent call last)
<ipython-input-56-0017c2b745f4> in <module>()
      3     sample_collection=dataset,
      4     label_field="predictions",
----> 5     labels_path= "/content/DATASET/validation/data",
      6 )

4 frames
/usr/lib/python3.7/codecs.py in decode(self, input, final)
    320         # decode input (taking the buffer into account)
    321         data = self.buffer + input
--> 322         (result, consumed) = self._buffer_decode(data, self.errors, final)
    323         # keep undecoded input until the next call
    324         self.buffer = data[consumed:]

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte

I get this error when trying to add YOLO detections. Is there a known solution?

benjaminpkane commented 2 years ago

Your TXT files can't be decoded. I would double-check what is in them. They should only contain numbers, per the format.

valentindbdg commented 2 years ago

Thank you @benjaminpkane. All my files contain numbers, per the format, and it works when I load and visualize the dataset using the following command:

import fiftyone as fo

name = "pred-dataset"
dataset_dir = "/content/DATASET/validation"

dataset = fo.Dataset.from_dir(
    dataset_dir=dataset_dir,
    dataset_type=fo.types.YOLOv4Dataset,
    name=name,
    label_field="predictions",
)

session = fo.launch_app(dataset)

However, I'd like to add this dataset (predictions) to another one (ground truth). I get this error when I try to do so with add_yolo_labels(). Is there another way to do that using the code above?

benjaminpkane commented 2 years ago

Interesting. We will try to reproduce.

valentindbdg commented 2 years ago

Thank you @benjaminpkane. I can add the ground_truth to the predictions, but not the other way around. As a result, I cannot visualize the images in my dataset that are part of ground_truth but have no predictions.

If there is another solution to display the images while loading predictions:


dataset = fo.Dataset.from_dir(
    dataset_dir= "/content/DATASET/validation",
    dataset_type=fo.types.YOLOv4Dataset,
    label_field = "predictions",
)

then adding the ground truth, but with ALL the images of ground_truth (not only the ones with predictions):

fouy.add_yolo_labels(
    sample_collection=dataset,
    label_field="ground_truth", 
    labels_path= "/content/yolodataset/data",
)

session = fo.launch_app(dataset)

Then it could solve the problem I have. (Both file paths contain all the images in the dataset.)

benjaminpkane commented 2 years ago

Using merge_samples() might be an alternative solution for you.

import fiftyone as fo

pred = fo.Dataset.from_dir(...)
gt = fo.Dataset.from_dir(...)

both = fo.Dataset("both")

# By default, samples are matched on their filepath
both.merge_samples(pred)
both.merge_samples(gt)

ehofesmann commented 2 years ago

@valentindbdg There may be an issue with the images in /content/DATASET/validation/data being read instead of the TXT files. To test this, could you try to create a new directory and only copy over the TXT files from /content/DATASET/validation/data, then call add_yolo_labels() and set labels_path to the new directory?

Out of curiosity, what extension do the images have in your dataset?
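
The test described above could be set up with a few lines of standard-library Python. This is only a sketch: the helper name `copy_txt_labels` and the destination directory in the usage comment are hypothetical, and it assumes the TXT files sit directly in the source directory rather than in subfolders.

```python
import shutil
from pathlib import Path


def copy_txt_labels(src_dir, dst_dir):
    """Copy only the .txt files from src_dir into dst_dir.

    Returns the sorted list of copied filenames. Images and any other
    files in src_dir are left untouched and are not copied.
    """
    src, dst = Path(src_dir), Path(dst_dir)
    dst.mkdir(parents=True, exist_ok=True)

    copied = []
    for txt_path in sorted(src.glob("*.txt")):
        shutil.copy2(txt_path, dst / txt_path.name)
        copied.append(txt_path.name)

    return copied


# Hypothetical usage (the destination path is an assumption):
# copy_txt_labels(
#     "/content/DATASET/validation/data",
#     "/content/DATASET/validation/labels_only",
# )
```

The new directory can then be passed as labels_path to add_yolo_labels().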

brimoor commented 2 years ago

@valentindbdg I see the problem. In this syntax:

fouy.add_yolo_labels(
    sample_collection=dataset,
    label_field="predictions", 
    labels_path="/content/yolodataset/data",
)

the labels_path argument of add_yolo_labels() assumes that every file is a TXT file, but you have both images and TXT files in that directory.
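
This would also explain the exact error message, assuming the images are JPEGs: a JPEG file begins with the bytes 0xFF 0xD8, and 0xFF is never a valid first byte in UTF-8. A minimal illustration, independent of FiftyOne:

```python
# First bytes of a JPEG file: the start-of-image marker 0xFF 0xD8,
# followed here by a typical APP0 marker (0xFF 0xE0).
jpeg_header = b"\xff\xd8\xff\xe0"

# A YOLO label line is plain ASCII, so it decodes without trouble.
label_line = b"0 0.5 0.5 0.2 0.3\n".decode("utf-8")

try:
    jpeg_header.decode("utf-8")
    message = None
except UnicodeDecodeError as err:
    # Same failure as in the traceback: byte 0xff at position 0
    message = str(err)

print(message)
```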

There are a variety of ways to resolve this.

  1. Use the alternate add_yolo_labels() syntax where labels_path is a dict mapping image filenames to TXT filepaths:

import os
import eta.core.utils as etau
import fiftyone.utils.yolo as fouy

labels_path = "/content/yolodataset/data"
labels_dict = {
    os.path.splitext(os.path.basename(p))[0] + ".jpg": p  # assumes your images are JPG
    for p in etau.list_files(labels_path, abs_paths=True, recursive=True)
    if p.endswith(".txt")
}

fouy.add_yolo_labels(
    sample_collection=dataset,
    label_field="predictions", 
    labels_path=labels_dict,
)

  2. Re-organize your files like this:

/path/to/images
    image1.ext
    image2.ext
    ...

/path/to/ground_truth
    image1.txt
    image2.txt
    ...

/path/to/predictions
    image1.txt
    image2.txt
    ...

and then load everything like this:

import fiftyone as fo
import fiftyone.utils.yolo as fouy

dataset = fo.Dataset.from_dir(
    data_path="/path/to/images",
    labels_path="/path/to/ground_truth",
    dataset_type=fo.types.YOLOv4Dataset,
    label_field="ground_truth",
)

fouy.add_yolo_labels(dataset, "predictions", "/path/to/predictions")

  3. Load the two datasets separately and use merge_samples() per Ben's suggestion.

valentindbdg commented 2 years ago

Thank you @benjaminpkane and @brimoor! I tried solution 2 and it worked well.

I also have confidence values stored for my predictions. How can I add them to my dataset in FiftyOne so I can visualize them too? Should I open a new issue for this?

Note: I previously had them stored in a .csv file next to each prediction, before conversion to YOLO format:

frame | prediction_class | confidence | left_x | top_y | width | height
000000086755.jpg | person | 0.7 | 320 | 211 | 76 | 98
000000441468.jpg | person | 0.54 | 240 | 388 | 122 | 198
000000441468.jpg | person | 0.57 | 373 | 124 | 11 | 45

brimoor commented 2 years ago

Support for loading confidence from YOLO TXT files was just added in https://github.com/voxel51/fiftyone/pull/1465. It hasn't been released yet but you could use it via a source install.

However, since your data isn't natively stored in YOLO format but in a CSV format you devised, I would instead recommend one of these approaches:

  1. Just write a simple Python loop that constructs your dataset from your CSV format
  2. Formalize 1 by writing a custom DatasetImporter that directly loads your CSV format. Here's an example of that
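
As a rough sketch of approach 1, under several assumptions: the helper names `parse_predictions` and `add_predictions` are hypothetical, the CSV delimiter is passed in explicitly because the file's actual separator is not shown, and `image_sizes` is an assumed dict mapping each frame name to its (width, height) in pixels (for example, gathered from `sample.metadata` after `dataset.compute_metadata()`). FiftyOne expects bounding boxes as relative `[top-left-x, top-left-y, width, height]` values in `[0, 1]`, so the pixel values are divided by the image size.

```python
import csv
import io
import os
from collections import defaultdict


def parse_predictions(csv_text, image_sizes, delimiter=","):
    """Group CSV rows by frame, converting pixel boxes to relative ones.

    image_sizes maps frame name -> (width, height) in pixels (assumed input).
    """
    reader = csv.reader(io.StringIO(csv_text), delimiter=delimiter)
    header = [name.strip() for name in next(reader)]

    per_frame = defaultdict(list)
    for raw in reader:
        if not raw:
            continue  # skip blank lines
        row = dict(zip(header, (value.strip() for value in raw)))
        img_w, img_h = image_sizes[row["frame"]]
        per_frame[row["frame"]].append(
            {
                "label": row["prediction_class"],
                "bounding_box": [
                    float(row["left_x"]) / img_w,
                    float(row["top_y"]) / img_h,
                    float(row["width"]) / img_w,
                    float(row["height"]) / img_h,
                ],
                "confidence": float(row["confidence"]),
            }
        )
    return dict(per_frame)


def add_predictions(dataset, per_frame, field="predictions"):
    """Attach the parsed detections to samples, matched by image filename."""
    import fiftyone as fo  # imported lazily so parsing works without FiftyOne

    for sample in dataset:
        dets = per_frame.get(os.path.basename(sample.filepath), [])
        sample[field] = fo.Detections(
            detections=[fo.Detection(**det) for det in dets]
        )
        sample.save()
```

Samples with no rows in the CSV simply receive an empty Detections object, so every image still shows up in the App.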

valentindbdg commented 2 years ago

Thank you @brimoor. I did a source install in Google Colab: https://github.com/voxel51/fiftyone#source-installs-in-google-colab

I then added a confidence column to the TXT files, following this format: <target> <x-center> <y-center> <width> <height> <confidence>

It worked well.
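
For reference, a minimal sketch of how one row of the CSV above could be turned into a line in that format. The function name `to_yolo_line`, the class index, and the image dimensions are all illustrative assumptions; YOLO coordinates are box centers and sizes normalized by the image width and height.

```python
def to_yolo_line(class_index, left_x, top_y, width, height, confidence,
                 img_w, img_h):
    """Format a pixel-space box as '<target> <x-center> <y-center> <width> <height> <confidence>'."""
    # YOLO stores the box center, not the top-left corner
    x_center = (left_x + width / 2) / img_w
    y_center = (top_y + height / 2) / img_h
    return (
        f"{class_index} {x_center:.6f} {y_center:.6f} "
        f"{width / img_w:.6f} {height / img_h:.6f} {confidence:.6f}"
    )


# Example with the first CSV row, assuming class index 0 for "person"
# and a 640x480 image (both values are assumptions):
line = to_yolo_line(0, 320, 211, 76, 98, 0.7, img_w=640, img_h=480)
print(line)
```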