mosaicml / streaming

A Data Streaming Library for Efficient Neural Network Training
https://streaming.docs.mosaicml.com
Apache License 2.0

Support for sub-sampling long videos #489

Open con-bren opened 11 months ago

con-bren commented 11 months ago

🚀 Feature Request

I'm trying to add support for sub-sampling long videos in StreamingDataset. I have two possible implementation methods in mind, and I'd like your feedback on which is most in line with the spirit of the codebase.

The goal is to be able to define samples as subsamples of a larger file dependency (in this case a video). I imagine that this would be sequential in nature, so samples 0-1000 are from video1, 1001-2300 are from video2, etc. Each sample would map to the frames of the video to use.
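In pseudocode terms, the sequential mapping might look like this (the video names and per-video subsample counts below are made up for illustration):

```python
# Hypothetical sketch of the sample -> (video, subsample) mapping described
# above. The per-video counts are illustrative, not from a real dataset.
from bisect import bisect_right

# Number of usable subsamples per video, in order.
samples_per_video = [1000, 1300, 500]          # video1, video2, video3
starts = [0]
for n in samples_per_video:
    starts.append(starts[-1] + n)              # cumulative start offsets

def locate(sample_id: int):
    """Map a global sample id to (video_index, subsample_index_within_video)."""
    video = bisect_right(starts, sample_id) - 1
    return video, sample_id - starts[video]

print(locate(0))     # (0, 0) -> first subsample of video1
print(locate(1000))  # (1, 0) -> first subsample of video2
```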

Motivation

The video datasets we're working with are very large (>10 TB), so a frame-by-frame JPEG approach is impractical. On the other hand, video decoding is quite slow, so we can't afford to waste any decoding time. My hope is that we can open a single video file (one per thread), read ALL the subsamples from it in sequence, and then move on to the next video.

[Optional] Implementation

I have two potential ideas for how to implement such a system, but I'd like feedback on which fits StreamingDataset best.

1) Each video as a sample with iterable subsamples. This approach defines a sample in StreamingDataset as a single video, then adds a bit of code to the `__getitem__` and `__iter__` methods to subsample from that video until all subsamples are exhausted.

2) Each video is a single shard in StreamingDataset, and the shuffle logic is updated to pull all samples from a single shard before moving to the next shard.
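For option 1, here is a plain-Python sketch of the shape I have in mind (`SubsamplingDataset` is a made-up stand-in, not the real StreamingDataset API):

```python
# Plain-Python sketch of option 1 (not the actual StreamingDataset API):
# each stored sample is one video, and iteration expands it into subsamples.

class SubsamplingDataset:
    def __init__(self, videos, clip_len=16):
        self.videos = videos            # list of (name, total_frames) pairs
        self.clip_len = clip_len

    def __getitem__(self, idx):
        return self.videos[idx]         # one "sample" == one whole video

    def __iter__(self):
        for name, total_frames in self.videos:
            # Decode once, then yield every clip before moving on,
            # so decoding cost is amortized across all subsamples.
            for start in range(0, total_frames - self.clip_len + 1, self.clip_len):
                yield name, start, start + self.clip_len

ds = SubsamplingDataset([("video1.mp4", 48), ("video2.mp4", 32)])
clips = list(ds)
print(len(clips))  # 5 clips: 3 from video1, 2 from video2
```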

Additional context

Would it be okay to call `self.flush_shard()` and `self._reset_cache()` manually on MDSWriter to force it to start a new shard?
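For illustration, the pattern I have in mind looks like this (a self-contained stand-in class, not the real MDSWriter; the method names just mirror the internal ones above):

```python
# Self-contained sketch of the forced-flush pattern. FakeWriter is a toy
# stand-in for MDSWriter, kept only to show the one-shard-per-video idea.

class FakeWriter:
    def __init__(self, shard_size):
        self.shard_size = shard_size
        self.cache = []
        self.shards = []

    def write(self, sample):
        self.cache.append(sample)
        if len(self.cache) >= self.shard_size:
            self.flush_shard()
            self._reset_cache()

    def flush_shard(self):
        if self.cache:
            self.shards.append(list(self.cache))

    def _reset_cache(self):
        self.cache = []

writer = FakeWriter(shard_size=1000)
for video, n_frames in [("video1", 3), ("video2", 2)]:
    for frame in range(n_frames):
        writer.write((video, frame))
    # Force a shard boundary after each video, regardless of shard_size:
    writer.flush_shard()
    writer._reset_cache()

print([len(s) for s in writer.shards])  # [3, 2] -> one shard per video
```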

knighton commented 11 months ago

Seems like option 1 is easier, but you would of course incur significant and potentially variable sample loading times. Perhaps you could smooth that out by raising the DataLoader prefetch factor, StreamingDataset's predownload, etc.

For 2, recall that from StreamingDataset's perspective, shuffling whole shards together as a unit is no different from shuffling any other arbitrary collection of variable-length spans of samples. The shard is just a unit of download, and the shuffle is just an array containing the global sample ID mapping.

Why not serialize as 1 sample = 1 frame, take the shard boundaries wherever they fall, and implement your shuffle logic in terms of those individual frames? I have heard of some quite large image datasets working this way without issue. I feel like there is an argument to be made in favor of small, evenly sized samples and shards, but I don't have much to back it up with.
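To make the "shuffle is just an array" point concrete, a span-preserving shuffle can be sketched like so (illustrative only, not streaming's actual shuffle code; `shuffle_by_span` is a made-up helper):

```python
# Sketch: shuffling whole videos (spans) as units is just a particular
# permutation array over global sample IDs. Not streaming's real code.
import random

def shuffle_by_span(span_sizes, seed=0):
    """Return a global-sample-ID permutation that keeps each span contiguous."""
    rng = random.Random(seed)
    starts, total = [], 0
    for n in span_sizes:
        starts.append(total)
        total += n
    order = list(range(len(span_sizes)))
    rng.shuffle(order)                  # shuffle spans (videos), not frames
    perm = []
    for i in order:
        perm.extend(range(starts[i], starts[i] + span_sizes[i]))
    return perm

perm = shuffle_by_span([3, 2, 4], seed=1)
# Each video's sample IDs stay contiguous inside the permutation.
```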

Please keep us posted!

knighton commented 11 months ago

Yeah, sorry, there is a bit of object-orientation consternation in that Writer code, but it's safe to force-flush shards.

We endeavor to have things crash quickly and loudly when things are broken.

con-bren commented 11 months ago

Okay, I've started work on option 1, and it seems to be going okay. I'm wondering, however, what the cleanest way is to add some predownload-based processing. Specifically, after the shard is ready I want to start decoding the next item's video right away. I see that there are currently two threads (`_prepare_thread` and `_ready_thread`) that deal with this logic, but I don't see an obvious way to extend their functionality beyond overriding the whole function.

Also, we'll need to free the decoded video as soon as the item is no longer needed, which I suppose can just be done in the `_prepare_thread`.

I see that in the webvid example there is another thread called `_download_thread`, but it's not clear to me how that thread gets initialized. If I could initialize another preprocessing thread without overriding `__iter__`, that seems like the best solution.
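The decode-ahead behavior I'm after, reduced to a toy sketch (`decode` and `decoder_worker` are made-up stand-ins, not streaming internals; a bounded queue throttles how far ahead the worker decodes):

```python
# Toy sketch of a decode-ahead thread: a worker decodes the next video
# while the main loop consumes the current one. Not streaming's internals.
import queue
import threading

def decode(video):
    # Stand-in for real (slow) video decoding.
    return [f"{video}-frame{i}" for i in range(3)]

def decoder_worker(videos, out):
    for video in videos:
        out.put((video, decode(video)))     # blocks once the queue is full
    out.put(None)                           # sentinel: no more videos

videos = ["video1", "video2", "video3"]
buffered = queue.Queue(maxsize=2)           # decode at most 2 videos ahead
threading.Thread(target=decoder_worker, args=(videos, buffered),
                 daemon=True).start()

results = []
while (item := buffered.get()) is not None:
    video, frames = item
    results.extend(frames)                  # consume; frames can then be freed
print(len(results))  # 9
```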

snarayan21 commented 4 months ago

Hey @con-bren, were you able to resolve this?