iejMac / video2dataset

Easily create large video datasets from video URLs
MIT License

Unexpected behaviour change after download worker refactor #302

Open pabl0 opened 8 months ago

pabl0 commented 8 months ago

Previously, video2dataset would create and download distribution.process_count shards at a time. So if I had 10 processes and, say, 32 threads configured, it would create ten .tar and .parquet files, and after finishing, the _stats.json file would be created (triggering my automation to upload the downloaded data elsewhere), and video2dataset would move to processing a new shard.

After the recent changes, it seems to be processing hundreds of shards at a time. Also, with processes_count 4 and thread_count 16, I see some 300-400 threads running instead of the expected 64 or so. I don't think it works as designed, but if it does, how can I limit the number of shards processed at a time so my download pipeline works smoothly?
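For context, the behavior being described can be sketched as the usual two-level fan-out: a pool of `processes_count` worker processes, each handling one shard at a time with `thread_count` download threads, so roughly `processes_count` shards and `processes_count × thread_count` download threads are active at once. This is an illustrative sketch with hypothetical names (`download_sample`, `process_shard`), not the actual video2dataset code:

```python
from multiprocessing import Pool
from concurrent.futures import ThreadPoolExecutor

PROCESSES_COUNT = 2   # shards in flight at once
THREAD_COUNT = 4      # download threads per shard

def download_sample(url):
    # stand-in for the real per-sample download
    return len(url)

def process_shard(shard):
    # one worker process handles one shard, fanning out over THREAD_COUNT threads
    with ThreadPoolExecutor(max_workers=THREAD_COUNT) as tpool:
        return list(tpool.map(download_sample, shard))

def run(shards):
    # at most PROCESSES_COUNT shards are processed concurrently, giving
    # roughly PROCESSES_COUNT x THREAD_COUNT active download threads overall
    with Pool(PROCESSES_COUNT) as pool:
        return pool.map(process_shard, shards)
```

Under this model, 10 processes would produce exactly ten `.tar`/`.parquet` pairs in flight, which matches the pre-refactor behavior described above.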

rom1504 commented 8 months ago

@MattUnderscoreZhang could you please take a look?

Indeed this is definitely not expected

pabl0 commented 8 months ago

Reverting commit 83afef0 restored the old behavior for me. It seems that there is still a much higher number of threads than processes_count × thread_count (I hadn't checked this before), but at least it creates the correct number of shards at a time.

Oh, BTW, for anyone grepping the source tree: the setting is called processes_count, not process_count (which might sound more correct, cf. thread_count). It confused me at first, since the only occurrence of process_count was in the no_distributor() function.
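For reference, these settings live under the `distribution` section of the config. A fragment like the following (field names as I recall them from the default configs, values purely illustrative) is what to grep for:

```yaml
distribution:
    processes_count: 16
    thread_count: 32
    distributor: "multiprocessing"
```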

rom1504 commented 8 months ago

Yeah I might have merged that a bit too fast. Will look more and we can probably revert if there's no easy fix

MattUnderscoreZhang commented 8 months ago

I’ll take a look as soon as I get home. It’s likely a change I made to the semaphores.

I’ll commit a fix in a new pull request, and see if I can add a relevant unit test.

MattUnderscoreZhang commented 8 months ago

@rom1504 I created a pull request that should limit processing to the correct number of shards at a time (as tested on my local environment). However, I see the same thing that @pabl0 does, where the number of threads created is higher than expected from processes_count x thread_count, going far back in the commit history. This seems to be a bug that has existed for a long time.

Testing with a 5k-sample dataset, I set number_sample_per_shard=100, processes_count=5, thread_count=2: (screenshot)

As a result, my config produces 50 shards in total, but only 5 are downloaded at a time: (screenshot)

As you mentioned, the number of threads is higher than expected: (screenshot)
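One thing worth noting when counting threads: a well-behaved `ThreadPoolExecutor` never exceeds its `max_workers`, which can be verified directly with `threading.active_count()`. The stdlib-only sketch below demonstrates this inside a single process (it cannot see child processes or native threads spawned by subprocesses such as ffmpeg, which is one possible source of the surplus; `ps -o nlwp` or psutil would be needed for the whole worker tree):

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor

def sample(_):
    # stand-in for a per-sample download
    time.sleep(0.1)

baseline = threading.active_count()  # threads before the pool starts
with ThreadPoolExecutor(max_workers=4) as pool:
    futures = [pool.submit(sample, i) for i in range(8)]
    peak = threading.active_count()  # baseline + the 4 pool threads
    for f in futures:
        f.result()

print(peak - baseline)  # 4: the executor never exceeds max_workers
```

So if the observed count is far above `processes_count × thread_count`, the extra threads are likely coming from somewhere other than the download pools themselves.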

pabl0 commented 8 months ago

Thanks @rom1504 @MattUnderscoreZhang. Your fix does indeed seem to improve things in my quick testing. However, it looks like the current behavior is to create batches of N tar files but write to only one at a time, as if only one process were active while the rest sit idle and their tar files remain empty.

BTW, is the output ".../_tmp/X.fether" caused by this print statement added in the worker refactor PR? It looks like unnecessary debug output that should be dropped.

https://github.com/iejMac/video2dataset/blob/83afef059ba1a29eb92bb6cb922f1f8e0ffd5965/video2dataset/workers/download_worker.py#L74

MattUnderscoreZhang commented 7 months ago

@pabl0 I can't seem to replicate your observation. What environment are you running on, and what configs are you using? For reference, here's what I see. With five processes, I see 5 tar files being written simultaneously.

(screenshots)

MattUnderscoreZhang commented 7 months ago

Also, I removed the print statement on the threading_fix path. Nice catch.

pabl0 commented 7 months ago

I see. IIRC, the WebVid dataset consists of plain HTTP links, which use a different download backend than YouTube datasets. Is it possible they behave differently?