AlexandreKempf commented 7 months ago

Add bounding boxes

Context and motivations

Other repos PR related:

How to use

from dvclive import Live
import numpy as np

with Live() as live:
    numpy_img = np.zeros((240, 240, 3), dtype=np.uint8)
    bounding_boxes = {
        "boxes": [[1, 10, 30, 40], [200, 50, 240, 160]],
        "labels": ["person", "dog"],
        "scores": [0.8, 0.3],
        "format": "tlbr",
    }
    live.log_image("path.png", numpy_img, annotations=bounding_boxes)

The format field is used to specify the coordinates system for the bounding box.

"tlbr" means top, left, bottom right
"ltrb" means left, top, right, bottom
"tlhw" means top, left, height, width
"xywh" means center-x(horizontal), center-y(vertical), width, height They are the 4 more supported types.

:warning: The current PR doesn't support the relative coordinates (coordinates between 0 and 1). But I believe it is easy to add in a future PR.

How it works

The example above will save an image in "dvclive/plots/images/path.png" with numpy_img content. It will also create a JSON file alongside the image "dvclive/plots/images/path.json". The JSON file content will look like this:

{
    "annotations": {
        "person": [
            {
                "box": {
                    "top": 1,
                    "left": 10,
                    "bottom": 30,
                    "right": 40
                },
                "score": 0.8
            }
        ],
        "dog": [
            {
                "box": {
                    "top": 200,
                    "left": 50,
                    "bottom": 240,
                    "right": 160
                },
                "score": 0.3
            }
        ]
    }
}

Other than that the PR is ready to review :+1:

[x] ❗ I have followed the Contributing to DVCLive guide.
[x] 📖 If this PR requires documentation updates, I have created a separate PR (or issue, at least) in dvc.org and linked it here. https://github.com/iterative/dvc.org/pull/5147

dberenbaum commented 7 months ago

Thanks @AlexandreKempf! Can you show an example of how to use it?

Why use a separate method rather than make it part of log_image? How do we ensure it's connected to the image?

AlexandreKempf commented 7 months ago

@dberenbaum and @shcheklein, I'll give you an example in the Yolo example, as you asked, and post some of it here for discussion when it is done.

@dberenbaum Concerning the log_image extension, I asked myself the same question initially.

pros of using `log_bounding_boxes`

I'm afraid we will flood the load_image with arguments. First, the bounding boxes, then the polygons/mask representation, then the segmentation... each takes a large json like structure and additional information on how to parse it (the format="tlbr" for instance).
In many cases, you want to see object detection on the validation set (a fixed set of images that are not augmented). This means that you want to see the same image, but you want to see how the bounding boxes evolved as the training goes. With the slider interface provided by our front end, I wanted us to be able to show the different epochs for the same image. That would be a nice feature for object detection users. If we use log_image for that, we'll have to save the same image several times. Using a second method, we are free of this problem. It is not a strong argument, as we could save the image every time also and it will still work.
It will simplify documentation for the bounding box logging. It will be on another page, so it will be easier to read.

pros of using `log_image`

wandb is doing it, and even if it is not a big argument in itself, following their lead may avoid going into problems we cannot see at our stage of development.
A user using log_image might discover the feature by looking at the log_image signature. I personally know that I don't watch the docs of the tools I was using frequently. So, this was my favorite way to discover new features ^^.
User will usually use log_bounding_boxes just after log_image, so why not merge them?

I have no hard feelings about each of these implementations, but I really wanted to start the task. Let me know if you see any additional arguments and if we @dberenbaum and @shcheklein come to an agreement.

dberenbaum commented 7 months ago

@AlexandreKempf Without having looked deeply through the PR yet, I'm still fuzzy on how we associate the bounding boxes with the images. An example (whether yolo or something else) will go a long way here and help show the pros and cons and we can decide what works.

AlexandreKempf commented 6 months ago

@dberenbaum Here is what I had in mind:

using `log_bounding_boxes`

from ultralytics import YOLO
from dvclive import Live

model = YOLO("yolov8n.pt") 
with live as Live():
    image_path = "https://ultralytics.com/images/bus.jpg"
    image_name = "image_bus"
    live.log_image(image_name, image)

    results = model(image)

    format = "tlbr"
    bboxes = results[0].boxes.xyxy.numpy()
    classes = results[0].boxes.cls.numpy()
    class_names = [results[0].names[class_index] for class_index in classes]
    scores = results[0].boxes.conf.numpy()
    live.log_bounding_boxes(image_name, boxes, class_names, scores, format=format)
    # or some dict processing then `live.log_bounding_boxes(image_name, image_path, boxes)`

using `log_image`

from ultralytics import YOLO
from dvclive import Live

model = YOLO("yolov8n.pt") 
with live as Live():
    image_path = "https://ultralytics.com/images/bus.jpg"
    image_name = "image_bus"

    results = model(image)

    format = "tlbr"
    bboxes = results[0].boxes.xyxy.numpy()
    classes = results[0].boxes.cls.numpy()
    class_names = [results[0].names[class_index] for class_index in classes]
    scores = results[0].boxes.conf.numpy()
    live.log_image(image_name, image_path, boxes, class_names, scores, format=format)
    # or some dict processing then `live.log_image(image_name, image_path, boxes)`

Image and bounding boxes can be matched by image_name in the Live object. Once the Python session is over, we can still match them by their path because they should have the same path but different suffixes (like you described here)

AlexandreKempf commented 6 months ago

Sidenote to @shcheklein. From what I understood of the Yolo W&B logger, they never log the bboxes and the image. They are just saving images with already bounding boxes on top of them (in the pixels I mean), and ultralitycs construct these images. So technically, we could be using this technique already :) Also, the ultralytics documentation on how W&B can display/hide bounding boxes based on their labels is probably not accurate since they don't have the bounding boxes information with ultralytics.

shcheklein commented 6 months ago

@AlexandreKempf sorry, I meant the Comet ML in this case https://github.com/ultralytics/ultralytics/blob/main/ultralytics/utils/callbacks/comet.py#L220 . It's the most complete logger for YOLO atm AFAIR.

nd ultralitycs construct these images. So technically, we could be using this technique already :)

we do this already, yep

shcheklein commented 6 months ago

They are using this call:

experiment.log_image(image_path, name=image_path.stem, step=curr_step, annotations=annotation)

And i like it tbh. It simple, it's clear what is happening. I like also that they are using annotations - there is a path to expand it beyond just bounding boxes.

dberenbaum commented 6 months ago

I like also that they are using annotations - there is a path to expand it beyond just bounding boxes.

Not a strong opinion, but discussed yesterday with @AlexandreKempf that this approach also has its downsides:

There is no way to know what format to include in annotations without going to their docs
You will have to write additional code to structure your data in that format

AlexandreKempf commented 6 months ago

I'm working on that but I won't push until I have something satisfying on the VScode plots. To keep you updated, I went for a solution that should satisfy all of us:

log_image(name, img, bboxes)

The format expected for bboxes is

{
    "boxes": [[1,2,3,4], [5,6,7,8],[10,11,12,13]],
    "labels": ["cat", "dog", "boat"]
    "scores": [0.1, 0.3, 0.8]
    "format": "tlbr"
}

For the argument name (bboxes or annotations) we need to pick. It won't change the current PR but it will affect the following ones on segmentation masks. I have a little preference for annotations because to add the segmentation mask we won't need to duplicate the labels information and it follows ultralytics and torchvision API more closely.

example using bboxes: log_image(name, img, bboxes: {"boxes": ..., "labels": ...}, masks: {"masks": ..., "labels": ...})

example using annotations: log_image(name, img, annotations: {"boxes": ..., "labels": ..., "masks": ...})

shcheklein commented 6 months ago

There is no way to know what format to include in annotations without going to their docs

true, but it is the same for bbox - I would have to go to docs to see what is expected

You will have to write additional code to structure your data in that format

yep, here I have no idea how complicated it is compared to that approach

so, no opinion on my end, just a thing to consider ...

shcheklein commented 6 months ago

@AlexandreKempf hey, is it ready to be reviewed? or is it still a draft ? (can we update the title and description please when you think it's ready to be reviewed)

AlexandreKempf commented 6 months ago

@shcheklein we are working with @julieg18 to have a working version from DVClive to VScode. But I believe the DVClive side of things is ready to review.

codecov-commenter commented 6 months ago

Codecov Report

Attention: Patch coverage is 94.28571% with 8 lines in your changes are missing coverage. Please review.

Project coverage is 95.31%. Comparing base (2c7c378) to head (901ab78).

Files	Patch %	Lines
src/dvclive/plots/annotations.py	88.88%	7 Missing and 1 partial :warning:

Additional details and impacted files

```diff @@ Coverage Diff @@ ## main #776 +/- ## ========================================== - Coverage 95.35% 95.31% -0.05% ========================================== Files 57 59 +2 Lines 3853 3989 +136 Branches 350 364 +14 ========================================== + Hits 3674 3802 +128 - Misses 126 133 +7 - Partials 53 54 +1 ```

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.

dberenbaum commented 6 months ago

Some questions I'm still unclear on:

Is there a working version in VS Code? What about handling this in DVC? How can experiments can be compared in VS Code and Studio with this info?
What does a yolo or torchvision example look like with this method?
Should we require all fields or should some be optional?

AlexandreKempf commented 6 months ago

Is there a working version in VS Code? What about handling this in DVC? How can experiments can be compared in VS Code and Studio with this info?

We are currently debugging one with @julieg18 but are close to getting something working perfectly well. I'll update a video of the final result by the end of day to demo how it works. Also, I'm going to open the PR in DVC to add the feature code.

What does a yolo or torchvision example look like with this method?

torchvision & lightning integration looks like this for the user:

class LightningModule(pl.LightningModule):

    # ... define `__init__` and `training_step`

    def validation_step(self, batch, batch_idxs):
        imgs, targets = batch

        # inference on validation images
        preds = self.forward(imgs)

        # log images with bounding boxes
        if batch_idxs == 0:
            live = self.logger.experiment
            for index, img in enumerate(imgs[:15]):
                prediction = preds[index]
                live.log_image(
                    f"val_images/{index}/{self.current_epoch}.png",
                    convert_image_to_np_array(img),
                    annotations={
                        "boxes": prediction["boxes"].cpu().numpy().astype(int),
                        "labels": [
                            self.class_names[i]
                            for i in prediction["labels"].cpu().numpy()
                        ],
                        "scores": np.around(prediction["scores"].cpu().numpy(), 3),
                        "format": "ltrb",
                    },
                )

        return

Note that this saves each image "A", "B", "C" into a structure that looks like this:

images/
    A/
        0.png
        0.json
        1.png
        1.json
        ...
    B/
    C/

Where 0, 1, ... are the validation number (=epoch number if we do validation at every epoch). It allows the step slider in VSCode and helps to see how the model learns.

For yolo integration, for the user it should looks like this:

from ultralytics import YOLO
from dvclive import Live

model = YOLO("yolov8n.pt")
with live as Live():
    image_path = "https://ultralytics.com/images/bus.jpg"
    image_name = "image_bus"

    results = model(image)

    # log image with bounding boxes
    format = "tlbr"
    boxes = results[0].boxes.xyxy.numpy()
    labels_idx = results[0].boxes.cls.numpy()
    labels = [results[0].names[idx] for idx in labels_idx]
    scores = results[0].boxes.conf.numpy()
    live.log_image(
        image_name,
        image_path,
        annotations={
            "boxes": boxes,
            "labels": labels,
            "scores": scores,
            "format": format,
        },
    )

Should we require all fields or should some be optional?

I wondered the same thing. I guess the first iteration should have all the fields needed, and we can always remove some hard constraints as we move on. It is way easier to do it this way than the other way around. If we start with optional fields and realize it is a mistake, it will be harder to revert and force the field. Options I could see, from useful to less useful (IMHO):

"format" could have a default and be optional
"score" could have a default to 1.0 for ground truth (target boxes in the dataset provided by human annotations)
"labels" could be optional for use case where there is only one label and the user doesn't want to see the label on each boxes displayed.

dberenbaum commented 6 months ago

I'm fine to move forward with this approach, but let's document the pros and cons once more so we can easily review the thought process in the future:

pros of using `log_bounding_boxes`

* I'm afraid we will flood the `load_image` with arguments. First, the bounding boxes, then the polygons/mask representation, then the segmentation... each takes a large `json` like structure and additional information on how to parse it (the `format="tlbr"` for instance).

This is mitigated by using a catchall annotations kwarg.

* In many cases, you want to see object detection on the validation set (a fixed set of images that are not augmented). This means that you want to see the same image, but you want to see how the bounding boxes evolved as the training goes. With the slider interface provided by our front end, I wanted us to be able to show the different epochs for the same image. That would be a nice feature for object detection users. If we use `log_image`  for that, we'll have to save the same image several times. Using a second method, we are free of this problem. It is not a strong argument, as we could save the image every time also and it will still work.

I don't see an easy way to do this regardless of which method we use, but maybe I'm missing something. We would need some way to capture bounding boxes per step.

* It will simplify documentation for the bounding box logging. It will be on another page, so it will be easier to read.

This is still a concern that we can revisit once @AlexandreKempf has drafted a docs PR.

There is no way to know what format to include in annotations without going to their docs

I think this would be easier in log_bbox() since the IDE could show the individual kwargs for boxes, labels, etc. We can partially mitigate this with types and docstrings (see here).

You will have to write additional code to structure your data in that format

Doesn't look like it makes much difference.

AlexandreKempf commented 6 months ago

@dberenbaum @shcheklein @skshetry

I used Pydantic at the end to the validation of user inputs. I would love to have this behavior:

class BBox(BaseModel):
    boxes: ...
    labels: ...
    scores: ...
    box_format: ... 

def my_sexy_function(bbox: BBox):
    ...

and call it with a dict

my_sexy_function({"boxes":..., "labels":..., "scores":...., "box_format": ...})

But I'm not sure it is possible with mypy.

So to stay consistent with what we said @dberenbaum, I created a typeddict so that users can see what fields are needed and their type. If they are still making a mistakes, pydantic errors should be enough to guide them through the perfect input. I realize it is a bit ugly to have both the TypedDict and the pydantic model. I strongly believe that DVClive should improve the user feeling (more understandable errors and warning, better types and docstrings ...) even if it comes with a maintenance cost on our end.

AlexandreKempf commented 6 months ago

Note: on the latest implementation Annotations.could_log is not called. I'm thinking of a better way to integrate it to the code.

mattseddon commented 6 months ago

Come back to this later

iterative / dvclive