voxel51 / fiftyone

The open-source tool for building high-quality datasets and computer vision models
https://fiftyone.ai
Apache License 2.0

fo.core.video.make_frames_dataset is sneakily considered a frame view #4397

Closed · evatt-harvey-salinger closed 2 months ago

evatt-harvey-salinger commented 2 months ago

Describe the problem

fo.core.video.make_frames_dataset seems like it should create a basic Dataset, as opposed to Dataset.to_frames, which makes a FramesView. However, this line (https://github.com/voxel51/fiftyone/blob/86125ab7851a1656fa46e0719edde8dd94f3c3eb/fiftyone/core/video.py#L635) results in Dataset._is_frames being True.

So even though the object is actually of type Dataset, functions that operate on its view(), like annotate, treat it as a FramesView.

Is this expected behavior? I would assume that make_frames_dataset was specifically intended to create something distinct from a frames view that would be treated as a normal dataset. (I'll note that even dataset.clone() returns a dataset where _is_frames is False.)

Code to reproduce issue

import fiftyone as fo

vid_dataset = ...  # contains video samples
new_dataset = fo.core.video.make_frames_dataset(vid_dataset)
print(new_dataset._is_frames)  # True

new_dataset.annotate(...)
# ValueError: Annotating frames views is not supported

clone = new_dataset.clone()
print(clone._is_frames)  # False

benjaminpkane commented 2 months ago

Hi @evatt-harvey-salinger. Under the hood, fo.core.video.make_frames_dataset is the function used to create a Dataset.to_frames() view. The term "dataset" is likely overloaded in this context, but the other to_* stages use the same nomenclature, e.g., fo.core.patches.make_patches_dataset and Dataset.to_patches().

Zooming out a bit, perhaps adding support (or a best practice) for annotating frame collections is the main goal?

brimoor commented 2 months ago

@evatt-harvey-salinger thanks for calling this out!

I think it is a valid use case to directly call methods like make_frames_dataset() and that, indeed, you should get a "regular" dataset when you do that. This will be supported as of https://github.com/voxel51/fiftyone/pull/4416.

In the meantime, it is slightly less efficient, but you can achieve the same end result via clone() like this:

# Materializing a view via clone() yields a standalone, regular dataset
patches_dataset = sample_collection.to_patches(...).clone()
frames_dataset = sample_collection.to_frames(...).clone()
clips_dataset = sample_collection.to_clips(...).clone()
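
As a quick sanity check (a sketch using the private _is_frames flag from the original report; "anno_run" is an illustrative annotation key):

print(frames_dataset._is_frames)  # False
frames_dataset.annotate("anno_run")  # no longer rejected as a frames view
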
evatt-harvey-salinger commented 2 months ago

Thanks @brimoor and @benjaminpkane!

Great, looks like #4416 will address the suggestion that make_frames_dataset() should return a "regular" dataset.

In general, I agree that adding support for annotating a FrameView of a video dataset would be an amazing feature. I can envision a few good use cases...

Currently, it seems that the workflow would be to use to_frames(...).clone() to sample and annotate a subset of the video, and then maintain the video dataset alongside the "to_frames.clone" dataset. I could either (1) store the annotations in the "to_frames.clone" dataset, and progressively sample more frames of the video, merging and labeling them into the "to_frames.clone" dataset in batches (see the sketch below), or (2) store the annotations in the video dataset, by annotating the "to_frames.clone" dataset and then merging the annotations into the video frames by associating the frame_number values.
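
A minimal sketch of option (1), assuming to_frames() accepts the fps frame-sampling parameter and using illustrative dataset names:

video_dataset = ...  # the source media_type == "video" dataset
frames_clone = ...   # the existing "to_frames.clone" dataset

# Sample a denser batch of frames, materialize it, and fold it into the
# standalone frames dataset; merge_samples() matches on filepath by default
new_batch = video_dataset.to_frames(sample_frames=True, fps=4).clone()
frames_clone.merge_samples(new_batch)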

This is certainly doable. But if FrameViews could be annotated directly, and the annotations could be imported straight into the video dataset, it would avoid the need to flow back and forth between two datasets (and mitigate the risk of accidentally tweaking one dataset out of alignment with the other).

evatt-harvey-salinger commented 2 months ago

I'll close the issue, since #4416 addresses the original request. But I'd love to hear what you think about the more general idea of annotating FrameViews directly, so I'll stay tuned on the thread!

brimoor commented 2 months ago

Out of curiosity, is there a reason you specifically want to annotate your videos as individual frames rather than directly calling annotate() on your media_type == "video" dataset?

evatt-harvey-salinger commented 1 month ago

Hi Brian,

I've tried to answer this a few different times, but then I get new ideas and try to hack together a solution. But I haven't really found one yet.

Basically, I have many hours' worth of 15 fps videos to label. Each video sample has wayyy too many frames to label all at once. I'd like to be able to downsample and iteratively label portions of video datasets, while retaining the integrity of the video samples as videos (rather than just converting them into image datasets). That would enable me to annotate the videos at 1 fps, then come back and annotate at 4 fps. Or, use the 1 fps frames to train a model that can help me auto-label a portion of the unlabeled frames.
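
For illustration, the two-pass idea as a sketch (assuming to_frames() accepts the fps frame-sampling parameter; names are illustrative):

video_dataset = ...  # media_type == "video", 15 fps source videos

first_pass = video_dataset.to_frames(sample_frames=True, fps=1)   # sparse pass
second_pass = video_dataset.to_frames(sample_frames=True, fps=4)  # denser follow-up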

For example, I have a workflow with image datasets that looks like this:

1) request annotation for a view first_pass
2) retrieve annotations, and use the anno_results.frame_id_map to select the frame_ids to reconstruct the first_pass view (a capability we should add btw :) )
3) programmatically exchange label_requested tags for labeled tags (see the sketch after this list)
4) train a model on the labeled samples
5) run inference on the unlabeled samples and label them as auto-labeled
6) form a new view second_pass with a portion of the auto-labeled samples, where I correct the auto-labeled predictions
7) retrieve those annotations, and iterate
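
For instance, step 3 can be done with the tagging API (a sketch using the tag names from the list above):

requested = dataset.match_tags("label_requested")
requested.tag_samples("labeled")            # mark the samples as labeled first...
requested.untag_samples("label_requested")  # ...then drop the request tag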

I would like to develop an analogous workflow for video datasets. Sending FrameViews for annotation, then retrieving the annotations and pulling them directly into my video dataset, would be the cleanest way to enable this kind of workflow.

As I said, I've been trying to find a workaround, but haven't yet achieved a solution that isn't terribly convoluted. I know that I can just abandon the video datasets altogether and convert everything to image datasets, but it would be a shame not to make use of the other video dataset capabilities. I would also like to keep the source files as videos, which are cleaner to store, version, view, etc.

I've gotten close to a solution where I maintain a video dataset and a corresponding image dataset as a pair. I can use the workflow above to add annotations to the images, then use the frame_number field (automatically populated by make_frames_dataset()) to merge the annotations back into the video dataset. However, this has proven to be quite tricky.
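
A sketch of that merge step, assuming a "detections" label field and the sample_id/frame_number fields that make_frames_dataset() populates on each frame sample:

video_dataset = ...   # the source videos
frames_dataset = ...  # the annotated image dataset

for frame in frames_dataset.iter_samples(progress=True):
    video_sample = video_dataset[frame.sample_id]
    # Copy the label onto the matching frame of the source video
    video_sample.frames[frame.frame_number]["detections"] = frame["detections"]
    video_sample.save()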

evatt-harvey-salinger commented 1 month ago

I know that I can use the frame_step parameter in annotate with the CVAT backend. But if I use the tracks feature in CVAT, then the detections actually get interpolated once they are imported back into the FO dataset anyway. For example, if I use frame_step=8 for a 32-frame video, I would only label ~4 frames in CVAT. But after importing back into FO, all 32 frames are labeled.

frame_step also can't be used for datasets that already have tracks anyway.
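
For reference, the kind of call in question (a sketch; "anno_run" and the label field are illustrative):

video_dataset.annotate(
    "anno_run",
    backend="cvat",
    label_field="frames.detections",
    frame_step=8,  # CVAT shows every 8th frame; tracks are interpolated on import
)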

Because of these two things, I'm going to just live with labeling full-fps videos (with whatever downsampling I want on the front end), and achieve "partial" annotation by just sending different clips within the video at a time.
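
One way to sketch that clip-at-a-time approach, assuming a temporal field (here called "events") to clip on, and using the clone() tip from earlier in the thread:

clips_dataset = video_dataset.to_clips("events").clone()
clips_dataset.annotate("clip_run", backend="cvat", label_field="frames.detections")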

evatt-harvey-salinger commented 1 month ago

Anyways, I hope this description gives you an idea of the workflow I was trying to achieve by annotating FrameViews directly!

brimoor commented 1 month ago

@evatt-harvey-salinger I added support for passing frame views directly to annotate() in https://github.com/voxel51/fiftyone/pull/4477! 🎉
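
With that change, a frames view can be sent straight to annotation, e.g. (a sketch; the key and field names are illustrative):

frames_view = video_dataset.to_frames(sample_frames=True, fps=1)
frames_view.annotate("first_pass", backend="cvat", label_field="detections")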

evatt-harvey-salinger commented 1 month ago

Wonderful. Thanks Brian!