fgvfgfg564 opened this issue 8 months ago
Hi @fgvfgfg564, are there any error messages?
There is no error message. The downloading process just gets stuck. We're guessing that perhaps some of the data entries were too long and exceeded the maximum command-line length that Windows supports, resulting in the error.
Hi @fgvfgfg564, are there any error messages?
Hello, I am a co-worker of the original poster; thanks for your reply. There is no error message during the download, and the same situation occurs on Ubuntu 22.04. However, our later tests revealed that the CSV file was not the problem.
@tsaishien-chen At first I thought it was a CPU usage issue, but it wasn't (CPU usage goes up to 80~90% initially). I tried running it on an instance with 64 cores and 64 GB RAM with 16 threads (in the config file); CPU usage started dropping after about 40 to 60 minutes and eventually converged to 0% (perhaps the program got stuck). I didn't get any errors in the process.
@AliaksandrSiarohin @Secant1998 Have you guys resolved the issue?
Hi @fgvfgfg564, @Secant1998, @kuno989, sorry for the inconvenience. Which csv file were you downloading? Does downloading the testing and validation sets give you the same problem? Also, when you killed the stuck process, did you notice which line the process was stuck at?
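(For anyone trying to answer the "which line is it stuck at" question without killing the run: the sketch below is not from this thread, uses only the standard library, and assumes a Unix host. A tool like py-spy can dump the same information without modifying any code.)

```python
# Hypothetical diagnostic, not part of the Panda-70M code: register a signal handler
# so that `kill -USR1 <pid>` makes the stuck process print every thread's traceback
# to stderr, showing exactly which line each worker is blocked on.
import faulthandler
import signal

faulthandler.register(signal.SIGUSR1, all_threads=True)
```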
Hi @tsaishien-chen, I haven't tested exactly the issue I commented on above, but I did log resource usage while running video2dataset. This could be due to too much data, or to poor memory/CPU management in video2dataset.
Here is a graph of usage when downloading the panda70m_training_full.csv data through Spark. [OCPU: 64 (128), 158 GB RAM] × 10 instances
```yaml
subsampling: {}

reading:
  yt_args:
    download_size: 720  # I had the same issue at 480 resolution, so I don't think it's a quality issue.
    download_audio: True
    yt_metadata_args:
      writesubtitles: True
      subtitleslangs: ['en']
      writeautomaticsub: True
      get_info: True
  timeout: 60
  sampler: null

storage:
  number_sample_per_shard: 100
  oom_shard_count: 5
  captions_are_subtitles: False

distribution:
  processes_count: 16
  thread_count: 16
  subjob_size: 10000
  distributor: "pyspark"
```
Usage of panda70m_testing.csv in the same configuration.
Hi @kuno989, thanks for providing the elaborate investigation! I am assuming that downloading gets stuck due to hardware overloading. Does this issue also happen when you download panda70m_testing.csv? And have you tried reducing the number of parallel processes? Could that help? As this seems like a major issue that lots of people have encountered, I would like to know more about it and document the problem and solution in the readme. Thanks for letting me know about the issue and providing very useful information!
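(As a rough illustration of that suggestion, not an official fix: the sketch below assumes the YAML above is saved as config.yaml, that PyYAML is installed, and that 4 processes/threads is just a placeholder value to try.)

```python
import yaml

# Load the download config shown above and dial down the parallelism, to check
# whether the hang is related to CPU/RAM pressure rather than the data itself.
with open("config.yaml") as f:
    cfg = yaml.safe_load(f)

cfg["distribution"]["processes_count"] = 4  # placeholder, down from 16
cfg["distribution"]["thread_count"] = 4     # placeholder, down from 16

with open("config_low_parallelism.yaml", "w") as f:
    yaml.safe_dump(cfg, f, sort_keys=False)
```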
Hi @tsaishien-chen, I have not been able to run a proper test using testing.csv, but I have tested using full_train.csv. Here is the CPU usage when using full_train.

[RAM usage graph]

As you can see, CPU usage spikes up to 90% at the beginning, but after a certain time it does very little work. I'm using 16 threads in the config file.
Below is htop when CPU usage drops.
I'm still checking and it's currently working, but it seems to deadlock at a certain moment; what do you think? This is an Ubuntu 20.04 instance with 64 cores and 64 GB of RAM.
Here is the version information: ffmpeg 4.2.7-0ubuntu0.1, yt-dlp 2024.03.10.
---

I ran a total of 48 hours of testing, with the following results. I think it might be an issue with yt-dlp. I don't know the exact reason, but when I look at it in tmux, I see that it is stuck inside yt-dlp. This seems to be caused by excessive CPU or RAM usage. This issue does not occur when downloading other YouTube-based datasets.
[CPU and RAM usage graphs]
I say that because when I exit video2dataset with Ctrl+C, it momentarily starts working again.
Hi @kuno989, all the tests you showed above are downloads of the full training set, right? Have you tried downloading the test set? Does the same issue occur there? After the downloading gets stuck, how many videos have you downloaded (in terms of both the number and the total size of the downloaded videos)? You mentioned "This issue does not occur when downloading other YouTube-based datasets." Was that tested on the same machine, and which exact datasets have you tried? Again, sorry for the inconvenience; I am still investigating this.
If downloading the test set (a smaller subset) works, I think one way to fix this issue is to split the whole csv file into multiple smaller ones and download them with a bash script. But before that, I would like to check whether the same issue happens with a smaller dataset (e.g., the test set).
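(A rough sketch of that splitting idea, in Python with pandas rather than bash for illustration; the csv name, chunk size, and output naming are placeholders, not anything from the repo.)

```python
import pandas as pd

CSV_PATH = "panda70m_training_full.csv"  # placeholder: whichever csv you are downloading
ROWS_PER_CHUNK = 100_000                 # placeholder chunk size

df = pd.read_csv(CSV_PATH)
for i in range(0, len(df), ROWS_PER_CHUNK):
    part = df.iloc[i:i + ROWS_PER_CHUNK]
    # Each part can be passed to its own download run, so a hang only blocks
    # one chunk and the job can be resumed from the chunk that failed.
    part.to_csv(f"panda70m_training_part{i // ROWS_PER_CHUNK:04d}.csv", index=False)
```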
As you can see in the wandb logs, video2dataset processed roughly 5993–6100 samples in 48 hours.
---

Hi @tsaishien-chen, I have tried working with smaller sets, but the result is the same. I split the csv into 64 pieces and the same thing happened after a period of time.
Below are the results
Hi @tsaishien-chen, I have encountered the same issue. We have tried three different datasets: training_full, training_2m, and training_10m. They get stuck after downloading some content, until we manually stop them with Ctrl+C. It seems that the problem is not caused by the CPU or RAM.
Hi @tsaishien-chen, is there any update?
Hi @itorone: when you terminated the processes, did you see which line the code was stuck at, by checking the command window or htop?
Hi @kuno989: for the screenshot below, was it captured after the processes got stuck? If so, I suspect the code gets stuck when ffmpeg splits the video. As you mentioned, "This issue does not occur when downloading other YouTube-based datasets." Which exact YouTube-based datasets have you tried before? If none of those datasets run splitting, ffmpeg might be the cause of the hang.
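(One way to sanity-check the ffmpeg hypothesis, not something proposed in this thread: run a simple stream-copy clip extraction on an already-downloaded file and see whether ffmpeg itself ever hangs. The input path, clip length, and timeout below are placeholders.)

```python
import subprocess

# Cut the first 10 seconds of a locally downloaded clip with stream copy, which is
# roughly the kind of work the splitting step does, and fail instead of hanging.
subprocess.run(
    ["ffmpeg", "-y", "-i", "sample.mp4",   # placeholder input file
     "-ss", "0", "-t", "10", "-c", "copy",
     "clip_out.mp4"],
    check=True,
    timeout=60,  # raises subprocess.TimeoutExpired instead of waiting forever
)
```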
This is a screenshot from when CPU utilization dropped. As mentioned above, I'm using ffmpeg 4.2.7-0ubuntu0.1. If it's possible that ffmpeg is getting stuck, it could be a version issue; can you tell me what version you're using?
Here are the other YouTube datasets I have used: YouTube-8M and HD-VILA-100M.
I used ffmpeg-4.4.1-amd64-static. But since your machine works for HD-VILA-100M, which also splits videos with ffmpeg, I don't think ffmpeg is the problem.
Hi @fgvfgfg564 and @Secant1998, have you solved the problem? Could you please share how you fixed the issue?
I had the same problem on one of my servers and it didn't occur on another server. I found that the problem is that after the video is downloaded with yt_download, some thread still holds the file and keeps a lock on it, so the main Python thread cannot continue reading the video file; it waits for the file to become free and gets stuck. One obvious sign is that when you Ctrl+C the Python process, it throws a file-not-found error even though the file has already been completely downloaded to your tmp dir.
I don't yet have a real solution to this problem, and I cannot figure out which process is holding the downloaded video in the tmp dir. But I tried another way to work around it: since the downloading is actually done, I can just skip this read operation and continue downloading the remaining files so the program won't get stuck. After downloading is finished, I do the remaining split and subsample operations.
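(A rough illustration of that skip-and-defer idea, not the actual patch: it assumes portalocker is installed, and the function name, glob pattern, and variable names are made up for the example. The point is to try a non-blocking lock on each downloaded file and, if something else still holds it, set it aside for a second pass once downloading is finished.)

```python
from pathlib import Path

import portalocker

def try_read(path):
    """Return the file contents, or None if another thread/process still holds the file."""
    try:
        with open(path, "rb") as f:
            # Non-blocking exclusive lock: raises immediately instead of waiting.
            portalocker.lock(f, portalocker.LOCK_EX | portalocker.LOCK_NB)
            return f.read()
    except portalocker.exceptions.LockException:
        return None

deferred = []  # files to split/subsample in a later pass
for path in Path("/tmp").glob("*.mp4"):  # placeholder: wherever the temp downloads land
    data = try_read(path)
    if data is None:
        deferred.append(path)  # skip for now; process once the download pass is done
```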
Hi @Qianjx, big thanks for the helpful information! Just to clarify: you found that the video is completely downloaded, but the process gets stuck when it reads the video here: https://github.com/snap-research/Panda-70M/blob/10dd549a13e633d92c0ac3d7181594e9ab0d688c/dataset_dataloading/video2dataset/video2dataset/data_reader.py#L268 Is that correct? May I know your solution for that? Do you set a timeout so that if the video cannot be read in time, it is skipped and the next video is processed?
Hi @kuno989: Does this information help you solve the issue? And may I know your solution? Thanks!
Hi @tsaishien-chen, in my case I took the idea from the comment above and modified it in the following way. It's currently downloading: I tested it yesterday for 12 hours on a single instance and it worked fine, so I'm now doing the real download with Spark. I think it should download without problems now, but I'll see.
```python
import os            # for os.remove below (already imported in data_reader.py)
import portalocker

...
streams = {}
for modality, modality_path in modality_paths.items():
    try:
        # Wait up to 125 s for whatever still holds the file to release it, then
        # read it; if the lock cannot be acquired in time, skip instead of blocking forever.
        with portalocker.Lock(modality_path, 'rb', timeout=125) as locked_file:
            streams[modality] = locked_file.read()
        os.remove(modality_path)
    except portalocker.exceptions.LockException:
        print(f"Timeout occurred trying to lock the file: {modality_path}")
    except IOError as e:
        print(f"Failed to delete the file: {modality_path}. Error: {e}")
```
And you're right: at line 268 the file is still being held, so the process can't do any more work, which is why CPU and RAM utilization drop over time.
```
Timeout occurred trying to lock the file: /sparkdata/tmp/a208b46b-d5f1-460f-891a-4606513370e1.mp4
Traceback (most recent call last):
  File "/opt/environment/lib/python3.10/site-packages/video2dataset/workers/download_worker.py", line 233, in download_shard
    subsampled_streams, metas, error_message = broadcast_subsampler(streams, meta)
  File "/opt/environment/lib/python3.10/site-packages/video2dataset/subsamplers/clipping_subsampler.py", line 237, in __call__
    return streams_clips, metadata_clips, None
```
If you have a better solution, please share it! Thanks!
I tried to download the dataset and it got stuck after downloading about 13.1 GB of files. The command line just hangs with no updates, and network stats show that traffic has also stopped. I have no idea what happened. Perhaps some entry in the csv file causes this?
I shuffled the csv file several times. Each time the stopping point is different, ranging from 11 GB to 14 GB.
We have tried downloading on Windows and on WSL; both lead to the same error. There's no problem with the network or disk.