iejMac / video2dataset

Easily create large video datasets from video URLs
MIT License

Video metadata processing no longer writes temp files #293

Status: Open. Opened by MattUnderscoreZhang 9 months ago

MattUnderscoreZhang commented 9 months ago

Should be merged after #288.

Metadata-finding subsamplers (FFProbeSubsampler and CutDetectionSubsampler) no longer take byte streams, write them to a temp file, and then operate on that temp file. Instead, we pass a filepath directly to these subsamplers, and they extract metadata without performing additional I/O operations.
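For illustration, here's a minimal sketch of the interface change (the function names are made up for this example, and ffmpeg-python's `probe` stands in for however FFProbeSubsampler actually invokes ffprobe):

```python
import tempfile

import ffmpeg  # ffmpeg-python; a stand-in for the subsampler's actual ffprobe call


def probe_from_bytes(video_bytes: bytes) -> dict:
    """Old shape: take bytes, write a temp file, then probe the temp file."""
    with tempfile.NamedTemporaryFile(suffix=".mp4") as f:
        f.write(video_bytes)
        f.flush()
        return ffmpeg.probe(f.name)  # an extra write + read just to get metadata


def probe_from_filepath(filepath: str) -> dict:
    """New shape: the filepath is passed straight through, with no extra I/O."""
    return ffmpeg.probe(filepath)
```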

After this I will do the same with the video processing subsamplers in the next pull request.

This pull request has been tested with my usual workflow, reproducing expected results.

rom1504 commented 9 months ago

Can you rebase, please?

rom1504 commented 9 months ago

@MattUnderscoreZhang what speed difference do you observe? sharing wandb links would be helpful

MattUnderscoreZhang commented 9 months ago

I'm showing my noob status here, but I've never really used wandb. Could we maybe schedule a call sometime where you show me how to generate the links you're looking for?

Also, I don't think there should be much speedup at this stage anyway. I still need to perform a temporary write at the beginning of sample processing, since I haven't touched the dataloaders yet. That is, it would be nice if the dataloaders passed filepaths rather than byte streams, but that will have to be a later change. As it stands, I think I'm currently only saving a single read/write; the major savings should come in the next pull request, when I update the actual video processing subsamplers.
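To make the pipeline shape concrete, here's a rough sketch of what's being described (illustrative only, not the actual video2dataset worker code; `subsamplers` is a hypothetical list of filepath-taking callables):

```python
import tempfile


def process_sample(video_bytes: bytes, subsamplers: list) -> None:
    # The dataloader still yields byte streams, so one temp write remains
    # at the start of sample processing; removing it means changing the
    # dataloaders to pass filepaths, which is deferred to a later PR.
    with tempfile.NamedTemporaryFile(suffix=".mp4") as f:
        f.write(video_bytes)
        f.flush()
        # All metadata subsamplers now share this single filepath instead
        # of each doing its own bytes -> temp file -> read round trip.
        for subsampler in subsamplers:
            subsampler(f.name)
```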

rom1504 commented 9 months ago

I think it's quite important to check the speed for this kind of major change. It could get worse.

MattUnderscoreZhang commented 9 months ago

Here's a check of this branch vs. main on WandB: https://wandb.ai/bucketoffish/video2dataset?workspace=user-bucketoffish. good-cherry-2 is this branch, which you can see is slightly faster (11.8% more vid_per_sec). This test ran 100 webvid videos with 5 processes and 10 threads, 10 samples per shard, on my 2019 MacBook Pro.

[image: WandB vid_per_sec comparison, this branch vs. main]
MattUnderscoreZhang commented 9 months ago

Here's a comparison between this branch and the threading_fix branch, on a larger dataset of around 5k samples. The results are reversed here, with this branch being slightly slower, by about 4%. Each branch took about 2.5 hours to run, and I'm not sure how significant these results are given run-to-run variance. The difference between the last test and this one may be due to the threading fix, which has not been merged into the video_metadata_no_io branch yet.

[image: WandB vid_per_sec comparison, this branch vs. threading_fix]
rom1504 commented 9 months ago

@iejMac do you have some numbers on how many vid/s you reached on webvid?

@MattUnderscoreZhang how many workers are you using?

MattUnderscoreZhang commented 9 months ago

I'm using 5 processes with 2 threads each, and 100 samples per shard. Here's the config:

```yaml
subsampling:
    FrameSubsampler:
        args:
            frame_rate: 8
    ResolutionSubsampler:
        args:
            width: 128
            height: 224
            resize_mode: "scale,crop,pad"
    CutDetectionSubsampler:
        cuts_are_clips: True
        args:
            cut_detection_mode: "all"
            framerates: null
            threshold: 27
            min_scene_len: 15
    ClippingSubsampler:
        args:
            min_length: 2.125
            max_length: 2.125
            max_length_strategy: "all"
            precision: "exact"

reading:
    yt_args:
        download_size: 360
        download_audio_rate: 44100
        yt_metadata_args: null
    timeout: 60
    sampler: null

storage:
    number_sample_per_shard: 100
    oom_shard_count: 5
    captions_are_subtitles: False

distribution:
    processes_count: 5
    thread_count: 2
    subjob_size: 1000
    distributor: "multiprocessing"
```
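(For reference, a config like this would be launched roughly as follows. This is a sketch based on the project README; the exact call signature and the WebVid column names are assumptions, not something verified in this thread.)

```python
from video2dataset import video2dataset

video2dataset(
    url_list="webvid_urls.csv",  # hypothetical WebVid subset CSV
    input_format="csv",
    url_col="contentUrl",        # WebVid's video URL column
    caption_col="name",          # WebVid's caption column
    output_folder="dataset",
    config="config.yaml",        # the YAML above, saved to disk
)
```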
iejMac commented 9 months ago

@rom1504 it says at the bottom of this: https://github.com/iejMac/video2dataset/blob/main/dataset_examples/WebVid.md

230 video/s (14.4 videos/s/core) or 420 Mb/s
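(For scale, that works out to roughly 230 / 14.4 ≈ 16 cores, which lines up with the 16-process configuration tried later in this thread.)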

MattUnderscoreZhang commented 9 months ago

That seems to be for a download config with no video processing. My changes would not have any effect in that use case.

rom1504 commented 9 months ago

@MattUnderscoreZhang OK, let's try to run with the same settings as in that example.

Also, it would be helpful to increase the number of processes and threads: 5 and 2 are too low to catch problems.

MattUnderscoreZhang commented 9 months ago

I tried replicating a run with the exact config used in the linked example: 16 processes with 16 threads each, and 1000 samples per shard. I ran on a vast.ai instance with the webvid results_2M_val dataset.

It's good we ran this test, because unfortunately it looks like this branch is definitely buggy: something about the threading and multiprocessing is causing the run to freeze.

Looking at old commits and comparing against c6f3ed2 (the commit right before my first commit), I see that the download worker refactor commit also has a threading problem (even with the threading_fix branch applied). The commit right before it, e1b5d89, is fine. The speed comparison for that commit matches the older results:

[image: WandB speed comparison for e1b5d89]

For now I recommend rolling back the download worker refactor commit. Fixing the threading issue will take some debugging, but I don't think I have the capacity for it right now. You can close this pull request if you want, and I'll come back and review this later.
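As a general technique for a hang like this (not something from this PR), registering Python's stdlib faulthandler at worker startup makes it easy to see where a stuck worker is blocked:

```python
import faulthandler
import signal

# Call once at worker startup. On a hung run, `kill -USR1 <worker pid>`
# then prints every thread's traceback to stderr, showing which lock or
# queue the worker is blocked on. (SIGUSR1 is Unix-only.)
faulthandler.register(signal.SIGUSR1)
```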

rom1504 commented 8 months ago

This would need a rebase.