kedro-org / kedro

Kedro is a toolbox for production-ready data science. It uses software engineering best practices to help you create data engineering and data science pipelines that are reproducible, maintainable, and modular.
https://kedro.org
Apache License 2.0

`GeneratorVideo` does not work #4301

Open BielStela opened 1 week ago

BielStela commented 1 week ago

Description

VideoDataset using a GeneratorVideo does not work

Context

I'm trying to create a video using GeneratorVideo to see if I can free up some memory. I already did this successfully with SequenceVideo (which works quite well, by the way) and then refactored the node to use a generator that yields frames, wrapped in a GeneratorVideo.

Steps to Reproduce

# catalog.yml
test_video:
  type: video.VideoDataset
  filepath: data/03_primary/test.mp4
# nodes.py
from collections.abc import Generator

from PIL import Image
from kedro_datasets.video.video_dataset import GeneratorVideo

def make_video() -> GeneratorVideo:
    """Makes a video with three frames: one red, one green and one blue at 1 fps"""
    def frames() -> Generator[Image.Image, None, None]:
        w, h = 256, 256
        red_frame = Image.new("RGB", (w, h), (255, 0, 0))
        green_frame = Image.new("RGB", (w, h), (0, 255, 0))
        blue_frame = Image.new("RGB", (w, h), (0, 0, 255))
        frames = [red_frame, green_frame, blue_frame]
        yield from frames

    return GeneratorVideo(frames(), length=None, fps=1)
# pipeline.py
from kedro.pipeline import Pipeline, pipeline, node

from .nodes import make_video

def create_pipeline(**kwargs) -> Pipeline:
    return pipeline([node(make_video, inputs=None, outputs="test_video")])

Expected Result

A colorful video similar to this one (it does not play in the preview; I hope it does once published):

https://github.com/user-attachments/assets/9b0d1654-b2d3-454a-aef7-cb82ccfa68eb

Actual Result

This error!

kedro.io.core.DatasetError: Failed while saving data to dataset VideoDataset(filepath=<removed>, protocol=file).
'Image' object has no attribute 'fps'

If one changes the node to use a SequenceVideo like so:

from PIL import Image
from kedro_datasets.video.video_dataset import SequenceVideo

def make_video() -> SequenceVideo:
    """Makes a video with three frames
        one red, one green and one blue at 1 fps"""
    def frames() -> list:
        w, h = 256, 256
        red_frame = Image.new("RGB", (w, h), (254, 0, 0))
        green_frame = Image.new("RGB", (w, h), (0, 254, 0))
        blue_frame = Image.new("RGB", (w, h), (0, 0, 254))
        frames = [red_frame, green_frame, blue_frame, blue_frame]
        return frames

    return SequenceVideo(frames(), fps=1)

It works well.

Now for my debugging report. While the pipeline is running, when the program reaches kedro.runner._run_node_sequential:528, the code does

        items = zip(it.cycle(keys), interleave(*streams))

where streams is a list containing my GeneratorVideo, which gets iterated in the chaining. The problem is that the GeneratorVideo is itself an Iterator, so it is treated as a stream of outputs: it is unrolled into an iterator of Image.Image, and each frame is passed separately to catalog.save(name, data). VideoDataset then takes control and fails instantly, because its input is no longer a GeneratorVideo or a SequenceVideo; it is now a single Image.
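The mechanism described above can be reproduced in miniature without Kedro. This is only a sketch under assumptions: `FakeGeneratorVideo`, `fake_save` and `run_node_outputs` are hypothetical stand-ins (not Kedro or kedro-datasets code), plain strings stand in for PIL images, and `itertools.chain` replaces `interleave` because there is only one output stream.

```python
import itertools as it
from collections.abc import Iterator


class FakeGeneratorVideo:
    """Hypothetical stand-in for GeneratorVideo: an iterator that also carries fps."""

    def __init__(self, frames, fps):
        self._gen = iter(frames)
        self.fps = fps

    def __iter__(self):
        return self

    def __next__(self):
        return next(self._gen)


def fake_save(name, data):
    """Hypothetical stand-in for catalog.save: a video dataset needs data.fps."""
    return f"saved {name} at {data.fps} fps"


def run_node_outputs(outputs):
    """Simplified model of the runner step quoted above (assumed, not Kedro's code)."""
    items = outputs.items()
    # Assumption: outputs that are all iterators are treated as streams of chunks.
    if all(isinstance(d, Iterator) for d in outputs.values()):
        keys = list(outputs.keys())
        streams = list(outputs.values())
        # With a single stream, interleaving reduces to plain chaining.
        items = zip(it.cycle(keys), it.chain(*streams))
    return [fake_save(name, data) for name, data in items]


video = FakeGeneratorVideo(["red", "green", "blue"], fps=1)
error_message = None
try:
    run_node_outputs({"test_video": video})
except AttributeError as err:
    # The save step received a single frame, which has no fps attribute.
    error_message = str(err)
    print(error_message)
```

Because the video object passes the `Iterator` check, the save step receives the first frame instead of the whole video, which mirrors the "object has no attribute 'fps'" failure reported above.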

From here I have no clue how this can be fixed, though :_)

Your Environment

DimedS commented 5 days ago

Thanks for raising this issue, @BielStela. It appears that there are inconsistencies between how GeneratorVideo handles iteration and the VideoDataset save method. We may need to modify GeneratorVideo to support iteration in a way that aligns with VideoDataset. Would you be interested in proposing a PR to address this?
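One possible direction, given here only as a rough sketch and not as the agreed fix: if the runner treats any output that is an `Iterator` as a stream of chunks, then a video object that is merely iterable (it defines `__iter__` but not `__next__`) would reach the dataset as a whole. `LazyVideo` below is a hypothetical class invented for illustration, strings stand in for frames, and whether `VideoDataset` could consume such an object is an open assumption.

```python
from collections.abc import Iterable, Iterator


class LazyVideo:
    """Hypothetical video container: iterable, but deliberately not an iterator."""

    def __init__(self, frames, fps):
        self._frames = frames
        self.fps = fps

    def __iter__(self):
        # Hand out the underlying frame iterator; no __next__ is defined here.
        return iter(self._frames)


def frames():
    """Lazily yields placeholder frames (strings stand in for images)."""
    yield from ["red", "green", "blue"]


video = LazyVideo(frames(), fps=1)
is_iterator = isinstance(video, Iterator)  # an Iterator-based check would skip it
is_iterable = isinstance(video, Iterable)  # frames can still be consumed lazily
collected = list(video)
```

The trade-off in this sketch is that the frames can only be consumed once, since the wrapped generator is exhausted after the first pass.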

dundermain commented 21 hours ago

Hey @DimedS, let us wait for @BielStela's response. If they are unable to raise a PR, I would like to work on this issue, if that is okay with both of you.