kedro-org / kedro-plugins

First-party plugins maintained by the Kedro team.
Apache License 2.0
93 stars 89 forks source link

VideoDataSet with GeneratorVideo save is broken #232

Open daniel-falk opened 1 year ago

daniel-falk commented 1 year ago

Description

I wasn't sure if I should put this issue here or in the Kedro core repo since the VideoDataSet has been broken after a change in Kedro core.

Shortly, the video dataset has multiple backends that can be saved. One of them is the GeneratorVideo which is an Iterable.

In commit fcf3ab4a9 "Enable the usage of generator functions in nodes (#2161)" by @idanov the runner functionality was changed to handle generator datasets.

Snippet from _run_node_sequential in kedro/runner/runner.py:

    items: Iterable = outputs.items()
    # if all outputs are iterators, then the node is a generator node
    if all(isinstance(d, Iterator) for d in outputs.values()):
        # Python dictionaries are ordered so we are sure
        # the keys and the chunk streams are in the same order
        # [a, b, c]
        keys = list(outputs.keys())
        # [Iterator[chunk_a], Iterator[chunk_b], Iterator[chunk_c]]
        streams = list(outputs.values())
        # zip an endless cycle of the keys
        # with an interleaved iterator of the streams
        # [(a, chunk_a), (b, chunk_b), ...] until all outputs complete
        items = zip(it.cycle(keys), interleave(*streams))

    for name, data in items:
        hook_manager.hook.before_dataset_saved(dataset_name=name, data=data, node=node)
        catalog.save(name, data)
        hook_manager.hook.after_dataset_saved(dataset_name=name, data=data, node=node)
    return node

Context

I have not had time yet to dive into the code and see what is actually happening, or how we should treat it in these situations. The effect now is that the runner will yield the first frame from the video and try to save that frame to the Video Dataset which is not possible.

The quick and dirty way to fix it would be to remove the __iter__ method from the GeneratorVideo class and implement some special logic in the VideoDataSet class to iterate it correctly. This would however not be very nice from a user perspective since iter(video) would then result in an indexed-iteration of the generator.

Steps to Reproduce

  1. Create a GeneratorVideo
  2. Create a VideoDataSet
  3. Call save on the video dataset with the generator video
daniel-falk commented 1 year ago

Current workaround:

Note the comment "if all outputs are iterators, then the node is a generator node" - the issue only happens when all outputs from the node is Iterable, therefore we can create a dummy output to the node (e.g. a MemoryDataset) and save some garbage value ( e.g. an empty string) to it.

def my_node(input_video):
    def my_generator():
        ...
    return GeneratorVideo(my_generator(), ...), ""

node = node(
    my_node,
    inputs="my_video",
    outputs=[
        "my_output_ds",
        "dummy"
    ],
    name="Create video dataset",
)
noklam commented 1 year ago

Didn't have time to check this now, just quick question is this covered in the unit tests?

daniel-falk commented 1 year ago

@noklam Probably not, and the question is where it should be covered. The video dataset works as expected with the framework design as it was when I implemented the dataset. The commit linked above changed the behaviour of Kedro, so sure, there could have been a unit test of the runner that verified that Iterable datasets are possible to save, but it is a very niche thing to test.

The change is however already in and there has been multiple releases since then. Since the change was user-facing I suspect it will not be changed. Therefore I guess the only solution to this is to update the dataset to work with the new behaviour of the runner.

I also think that the new changes to the runner looks very interesting, so it might be something we can make use of to make the video dataset even better.

merelcht commented 8 months ago

If anyone comes across this and would like to open a PR, please go ahead 🙂