snap-research / Panda-70M

[CVPR 2024] Panda-70M: Captioning 70M Videos with Multiple Cross-Modality Teachers
https://snap-research.github.io/Panda-70M/

Downloading gets stuck at some particular points #21

Open fgvfgfg564 opened 3 months ago

fgvfgfg564 commented 3 months ago

I tried to download the dataset and it got stuck after I had downloaded about 13.1 GB of files. The command line just hangs with no updates, and the network stats show that traffic has also stopped. I have no idea what happened. Perhaps some entry in the csv file causes this?

I shuffled the csv file several times. Each time the stopping point is different, ranging from 11 GB to 14 GB.

We have tried downloading on Windows and WSL; both lead to the same error. There is no problem with the network or disk.

tsaishien-chen commented 3 months ago

Hi @fgvfgfg564, are there any error messages?

fgvfgfg564 commented 3 months ago

There is no error message; the downloading process simply gets stuck there. We are guessing that perhaps some of the entries were too long and exceeded the maximum command length that Windows supports, resulting in the error.

Secant1998 commented 3 months ago

> Hi @fgvfgfg564, are there any error messages?

Hello, I am a co-worker of the questioner; thanks for your reply. There is no error message during the download, and the same situation occurs on Ubuntu 22.04. However, our later tests revealed that the CSV file was not the problem.

kuno989 commented 3 months ago

@tsaishien-chen At first I thought it was a CPU usage issue, but it wasn't (CPU usage goes up to 80-90% initially). I ran it on a 64-core, 64 GB RAM instance with 16 threads (in the config file), and the CPU usage dropped after about 40 to 60 minutes before converging to 0% (the program had presumably gotten stuck). I didn't get any errors in the process.

@AliaksandrSiarohin @Secant1998 Have you guys resolved the issue?

tsaishien-chen commented 3 months ago

Hi @fgvfgfg564, @Secant1998, @kuno989, sorry for the inconvenience. Which csv file were you downloading? Does downloading the testing and validation sets give you the same problem? Also, when you killed the stuck process, did you notice which line the process was stuck at?

kuno989 commented 3 months ago

Hi @tsaishien-chen, I haven't tested exactly the issue I commented on above, but I have logged resource usage while running video2dataset. This could be due to too much data, or to poor memory/CPU management in video2dataset.

Here is a usage graph from downloading the panda70m_training_full.csv data through Spark, on 10 instances, each with 64 OCPUs (128 vCPUs) and 158 GB of RAM.

subsampling: {}

reading:
    yt_args:
        download_size: 720 # I had the same issue at 480 resolution, so I don't think it's a quality issue.
        download_audio: True
        yt_metadata_args:
            writesubtitles:  True
            subtitleslangs: ['en']
            writeautomaticsub: True
            get_info: True
    timeout: 60
    sampler: null

storage:
    number_sample_per_shard: 100
    oom_shard_count: 5
    captions_are_subtitles: False

distribution:
    processes_count: 16
    thread_count: 16
    subjob_size: 10000
    distributor: "pyspark"

[Screenshot: resource usage graph for panda70m_training_full.csv]

Usage of panda70m_testing.csv in the same configuration.

[Screenshot: resource usage graph for panda70m_testing.csv]

tsaishien-chen commented 3 months ago

Hi @kuno989, thanks for the detailed investigation! I am guessing that the downloading gets stuck due to hardware overload. Does this issue also happen when you download panda70m_testing.csv? And have you tried reducing the number of parallel processes? Does that help? As this seems to be a major issue that lots of people have encountered, I would like to understand it better and document the problem and solution in the readme. Thanks for reporting the issue and providing very useful information!

kuno989 commented 3 months ago

Hi @tsaishien-chen, I have not been able to run a proper test with testing.csv, but I have been able to test with full_train.csv. Here is the CPU usage when using full_train:

[Screenshot: CPU usage graph]

RAM

[Screenshot: RAM usage graph]

As you can see, the CPU usage spikes up to 90% at the beginning, but after a certain time it does very little work. I'm using 16 threads in the config file.

Below is the htop output when CPU usage drops.

[Screenshot: htop output]

I'm still checking and it is currently running, but it seems to deadlock at a certain moment; what do you think? This is an Ubuntu 20.04 instance with 64 cores and 64 GB of RAM.

Here is the version information: ffmpeg 4.2.7-0ubuntu0.1, yt-dlp 2024.03.10.

Update:

I ran a total of 48 hours of testing, with the following results. I think it might be an issue with yt-dlp. I don't know the exact reason, but when I look at the process in tmux, I see that it is stuck inside yt-dlp. This seems to be caused by excessive CPU or RAM usage. This issue does not occur when downloading other YouTube-based datasets.

CPU and RAM usage:

[Screenshots: CPU and RAM usage graphs]

I say this because, when I exit video2dataset with Ctrl+C, it momentarily starts working again.

[Screenshot]

tsaishien-chen commented 3 months ago

Hi @kuno989, all the tests you showed above are for downloading the full training set, right? Have you tried downloading the test set, and does the same issue occur? After the downloading gets stuck, how many videos have you downloaded (in terms of both the number and the total size of the downloaded videos)? You mentioned "This issue does not occur when downloading other YouTube-based datasets." Was that tested on the same machine, and which exact datasets have you tried? Again, sorry for the inconvenience; I am still investigating this.

tsaishien-chen commented 3 months ago

If downloading the test set (a smaller subset) works, I think one way to fix this issue is to split the whole csv file into multiple smaller ones and download them with a bash script. But before that, I would like to check whether the same issue happens with a smaller dataset (e.g., the test set).
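
For reference, here is a rough sketch of that idea in Python, since that is what the rest of this thread uses. It assumes pandas is installed, and the video2dataset flags shown are placeholders, so substitute the exact download command you normally run.

# Rough sketch: split the big csv into shards and download one shard at a time,
# so a hang only costs a single shard and the run can be resumed.
# Assumptions: pandas is installed, and the video2dataset flags below are
# placeholders for whatever download command you already use.
import subprocess
from pathlib import Path

import pandas as pd

CSV_PATH = "panda70m_training_full.csv"
SHARD_DIR = Path("csv_shards")
ROWS_PER_SHARD = 100_000  # tune as needed

SHARD_DIR.mkdir(exist_ok=True)

# 1) Split the csv into smaller shards without loading it all into memory.
for i, chunk in enumerate(pd.read_csv(CSV_PATH, chunksize=ROWS_PER_SHARD)):
    chunk.to_csv(SHARD_DIR / f"shard_{i:04d}.csv", index=False)

# 2) Download shard by shard; a .done marker makes the loop restartable.
for shard_path in sorted(SHARD_DIR.glob("shard_*.csv")):
    done_marker = shard_path.with_suffix(".done")
    if done_marker.exists():
        continue
    cmd = ["video2dataset", f"--url_list={shard_path}"]  # append your usual flags
    try:
        subprocess.run(cmd, check=False, timeout=6 * 3600)  # give up after 6 hours
    except subprocess.TimeoutExpired:
        print(f"{shard_path} timed out; moving on to the next shard")
    done_marker.touch()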

kuno989 commented 3 months ago

As you can see in the wandb logs, video2dataset processed 5993 ~ 6100 pieces of data in 48 hours.

[Screenshot: wandb log]

Update: hi @tsaishien-chen, I have tried working with smaller sets, but the result is the same. I split the csv into 64 pieces and the same thing happened after a while.

Below are the results

[Screenshots: results]

itorone commented 3 months ago

Hi @tsaishien-chen, I have encountered the same issue. We have tried three different datasets: training_full, training_2m, and training_10m. They all get stuck after downloading some content, until we manually stop them with Ctrl+C. It seems the problem is not caused by CPU or RAM.

kuno989 commented 3 months ago

Hi @tsaishien-chen, is there any update?

tsaishien-chen commented 3 months ago

Hi @itorone: When you terminated the processes, did you see which line the code was stuck at, by checking the command window or htop?

Hi @kuno989: Was the screenshot you posted captured after the processes got stuck? If so, I am wondering whether the code gets stuck when ffmpeg splits the video. You mentioned that this issue does not occur when downloading other YouTube-based datasets; which exact YouTube-based datasets have you tried before? If none of those datasets run splitting, ffmpeg might be what causes the hang.
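
By the way, if it is hard to tell where the workers are stuck, one option (just a debugging suggestion, not something video2dataset does by itself) is to register the standard-library faulthandler in your launch script and then signal the stuck process to dump every thread's Python stack:

# Debugging sketch (Unix only): dump all Python thread stacks on demand.
import faulthandler
import signal

# Print the stack of every thread to stderr when the process receives SIGUSR1,
# without killing it:
faulthandler.register(signal.SIGUSR1, all_threads=True)

# ... start the video2dataset download as usual ...

# When the download looks stuck, run from another shell:
#   kill -USR1 <pid of the stuck python process>
# and check stderr for the line each thread is blocked on.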

kuno989 commented 3 months ago

That is a screenshot from when CPU utilization dropped. As mentioned above, I'm using ffmpeg 4.2.7-0ubuntu0.1; if ffmpeg could be what is getting stuck, it might be a version issue. Can you tell me what version you're using?

The other YouTube-based datasets I'm using are YouTube-8M and HD-VILA-100M.

tsaishien-chen commented 3 months ago

I used ffmpeg-4.4.1-amd64-static. But since your machine works for HD-VILA-100M, which also splits videos with ffmpeg, I don't think ffmpeg is the problem.

tsaishien-chen commented 3 months ago

Hi @fgvfgfg564 and @Secant1998, have you solved the problem? Could you please share how you fixed the issue?

Qianjx commented 3 months ago

I had the same problem on one of my servers, but it didn't occur on another server. I found that the cause is that, after a video finishes downloading with yt_download, some thread keeps the file open and holds a lock on it, so the main Python thread cannot read the video file; it waits for the file to become free and gets stuck. One telltale sign: when you Ctrl+C the Python process, it throws a file-not-found error even though the file has already been completely downloaded to your tmp dir.

I don't have a proper solution to this problem yet, and I cannot figure out which process is holding the downloaded video in the tmp dir. But I tried another way around it: since the download itself is actually done, I can just skip this read operation and continue downloading the remaining files so the program won't get stuck, and then do the splitting and subsampling after all the downloading is finished.

The read in question is at line 267 of https://github.com/snap-research/Panda-70M/blob/main/dataset_dataloading/video2dataset/video2dataset/data_reader.py
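
Roughly what I have in mind, as a sketch only (the helper below is hypothetical, not existing video2dataset code, and the timeout value is arbitrary): wrap the blocking read in a worker thread with a timeout, and skip the video if the read does not finish in time.

# Hypothetical helper sketching the "don't block forever on the read" idea.
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout

def read_with_timeout(path, timeout_s=120):
    def _read():
        with open(path, "rb") as f:
            return f.read()

    pool = ThreadPoolExecutor(max_workers=1)
    future = pool.submit(_read)
    try:
        return future.result(timeout=timeout_s)
    except FutureTimeout:
        # Give up on this video so the main loop can continue downloading;
        # note the worker thread may stay blocked on the locked file.
        return None
    finally:
        pool.shutdown(wait=False)  # don't wait for a possibly stuck reader thread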

tsaishien-chen commented 3 months ago

Hi @Qianjx, big thanks for the helpful information! Just to clarify: you found that the video is completely downloaded, but the process gets stuck when it reads the video here: https://github.com/snap-research/Panda-70M/blob/10dd549a13e633d92c0ac3d7181594e9ab0d688c/dataset_dataloading/video2dataset/video2dataset/data_reader.py#L268 Is that correct? May I know your solution for that? Do you set a timeout so that, if the video cannot be read in time, it is skipped and the next video is processed?

Hi @kuno989: Does this information help you solve the issue? And may I know your solution? Thanks!

kuno989 commented 3 months ago

Hi @tsaishien-chen, in my case I took the idea from the comment above and modified the code in the following way. It's currently downloading: I tested it yesterday for 12 hours on a single instance and it worked fine, so I'm now doing a real download with Spark. I think it should download without problems now, but I'll see.

import portalocker
...
        # Replaces the plain read in data_reader.py: wait at most 125 s for a
        # lock on each downloaded file instead of blocking forever, and skip
        # the file if the lock never frees up.
        streams = {}
        for modality, modality_path in modality_paths.items():
            try:
                with portalocker.Lock(modality_path, 'rb', timeout=125) as locked_file:
                    streams[modality] = locked_file.read()
                os.remove(modality_path)  # clean up the tmp file once it has been read
            except portalocker.exceptions.LockException:
                print(f"Timeout occurred trying to lock the file: {modality_path}")
            except IOError as e:
                print(f"Failed to delete the file: {modality_path}. Error: {e}")

And you're right: the process is still stuck at line 268, occupying the system, so of course it can't do any more work, and that is why CPU and RAM utilization drop over time.

Timeout occurred trying to lock the file: /sparkdata/tmp/a208b46b-d5f1-460f-891a-4606513370e1.mp4
Traceback (most recent call last):
  File "/opt/environment/lib/python3.10/site-packages/video2dataset/workers/download_worker.py", line 233, in download_shard
    subsampled_streams, metas, error_message = broadcast_subsampler(streams, meta)
  File "/opt/environment/lib/python3.10/site-packages/video2dataset/subsamplers/clipping_subsampler.py", line 237, in __call__
    return streams_clips, metadata_clips, None

If you have a better solution, please share it! Thanks!