[Open] MattUnderscoreZhang opened this pull request 9 months ago
Can you rebase please?
@MattUnderscoreZhang what speed difference do you observe? Sharing wandb links would be helpful.
I'm showing my noob status here, but I've never really used wandb. Could we maybe schedule a call sometime where you show me how to generate the links you're looking for?
Also, I don't think there should be much speedup at this stage anyway. I still need to perform a temporary write at the beginning of sample processing, since I haven't touched the dataloaders yet. That is, it would be nice if the dataloaders passed filepaths rather than byte streams, but that will have to be a later change. As it stands, I think I'm currently only saving a single read/write. The major savings should come in the next pull request, when I update the actual video processing subsamplers.
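To illustrate what I mean, here's a rough sketch of the sample-processing flow I'm describing (hypothetical names, not the real video2dataset worker code): we still pay one temp write because the dataloader hands us bytes, but the metadata subsamplers can then all reuse the same path.

```python
# Rough sketch only -- hypothetical names, not the actual worker code.
import tempfile

def process_sample(video_bytes, metadata_subsamplers):
    # The dataloader still passes a byte stream, so we pay a single temp write up front...
    with tempfile.NamedTemporaryFile(suffix=".mp4") as tmp:
        tmp.write(video_bytes)
        tmp.flush()
        # ...but every metadata subsampler can now read from the same path,
        # instead of each one writing its own temp copy of the bytes.
        metadata = {}
        for subsampler in metadata_subsamplers:
            metadata.update(subsampler(tmp.name))
        return metadata
```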
I think it's quite important to check the speed for this kind of major change. It could get worse.
Here's a comparison of this branch vs. main on WandB: https://wandb.ai/bucketoffish/video2dataset?workspace=user-bucketoffish. The run good-cherry-2 is this branch, which you can see is slightly faster (11.8% more vid_per_sec). This test used 100 WebVid videos, with 5 processes and 10 threads, 10 samples per shard, on my 2019 MacBook Pro.
Here's a comparison between this branch and the threading_fix branch, this time with a larger dataset of around 5k samples. You can see the results are reversed here, with this branch being slightly slower, by about 4%. Each branch took about 2.5 hours to run, and I'm not sure how significant these results are given run-to-run variance. The difference between the last test and this one may be due to the threading fix, which has not been merged into the video_metadata_no_io branch yet.
@iejMac do you have any numbers for how many vid/s you reached on WebVid?
@MattUnderscoreZhang how many workers are you using?
I'm using 5 processes with 2 threads each, and 100 samples per shard. Here's my config:
```yaml
subsampling:
  FrameSubsampler:
    args:
      frame_rate: 8
  ResolutionSubsampler:
    args:
      width: 128
      height: 224
      resize_mode: "scale,crop,pad"
  CutDetectionSubsampler:
    cuts_are_clips: True
    args:
      cut_detection_mode: "all"
      framerates: null
      threshold: 27
      min_scene_len: 15
  ClippingSubsampler:
    args:
      min_length: 2.125
      max_length: 2.125
      max_length_strategy: "all"
      precision: "exact"

reading:
  yt_args:
    download_size: 360
    download_audio_rate: 44100
    yt_metadata_args: null
  timeout: 60
  sampler: null

storage:
  number_sample_per_shard: 100
  oom_shard_count: 5
  captions_are_subtitles: False

distribution:
  processes_count: 5
  thread_count: 2
  subjob_size: 1000
  distributor: "multiprocessing"
```
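For reference, I save the config above to a YAML file and launch the run roughly like this (the input list is a placeholder, and exact parameter names may differ between video2dataset versions):

```python
# Roughly how the run is launched -- placeholder paths, and parameter names
# may differ between video2dataset versions.
from video2dataset import video2dataset

video2dataset(
    url_list="webvid_subset.csv",    # placeholder: the 100-video WebVid sample
    input_format="csv",
    output_folder="dataset",
    config="webvid_config.yaml",     # the YAML shown above, saved to disk
)
```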
@rom1504 says at the bottom of this - https://github.com/iejMac/video2dataset/blob/main/dataset_examples/WebVid.md
230 video/s (14.4 videos/s/core) or 420 Mb/s
That seems to be for a download config with no video processing. My changes would not have any effect in that use case.
@MattUnderscoreZhang OK, let's try to run with the same settings as in that example.
Also, it would be helpful to increase the number of processes and threads.
5 and 2 are too low to catch problems.
I tried replicating a run with the exact config used in the linked example: 16 processes with 16 threads each, and 1000 samples per shard. I ran on a vast.ai instance with the WebVid results_2M_val dataset.
It's good we ran this test, because unfortunately it looks like this branch is definitely buggy. Something about the threading and multiprocessing is causing the run to freeze.
Looking at old commits and comparing to commit c6f3ed2 (the one right before my first commit), I see that the download worker refactor commit also has a threading problem (even with the threading_fix branch applied). The commit right before it, e1b5d89, is fine. The speed comparison for this commit matches the older commit.
For now I recommend rolling back the download worker refactor commit. Fixing the threading issue will take some debugging, but I don't think I have the capacity for it right now. You can close this pull request if you want, and I'll come back and review this later.
This would need a rebase
Should be merged after #288.
The metadata-extracting subsamplers (FFProbeSubsampler and CutDetectionSubsampler) no longer take a byte stream, write it to a temp file, and then operate on the temp file. Instead, we can pass a filepath directly to these subsamplers, and they will extract metadata without performing additional I/O operations.
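To make the change concrete, here is a simplified before/after sketch (not the actual subsampler signatures), using ffprobe as the example:

```python
# Simplified before/after sketch -- not the actual video2dataset subsampler API.
import json
import subprocess
import tempfile

def probe_metadata_from_path(filepath: str) -> dict:
    # New pattern: operate directly on a file that already exists on disk.
    result = subprocess.run(
        ["ffprobe", "-v", "quiet", "-print_format", "json",
         "-show_format", "-show_streams", filepath],
        capture_output=True,
        check=True,
    )
    return json.loads(result.stdout)

def probe_metadata_from_bytes(video_bytes: bytes) -> dict:
    # Old pattern: an extra write (and implicit re-read) just to hand ffprobe a path.
    with tempfile.NamedTemporaryFile(suffix=".mp4") as tmp:
        tmp.write(video_bytes)
        tmp.flush()
        return probe_metadata_from_path(tmp.name)
```

The same idea applies to cut detection: the subsampler receives a path and skips the intermediate write.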
In a follow-up pull request, I will do the same with the video processing subsamplers.
This pull request has been tested with my usual workflow, reproducing expected results.