Open JingyeChen opened 7 months ago
I also encountered the same problem, have you solved it?
Hi @JingyeChen and @PF-Wu, have you solved the problem? My guess is that you are using the original video2dataset tool, which is not compatible with our csv files. Did you uninstall video2dataset and reinstall it from this repo?
I uninstalled video2dataset and ran pip install -e . in Panda-70M/dataset_dataloading/video2dataset, but it still doesn't work.
You might be using a csv file with an invalid format.
Hi @JingyeChen, @PF-Wu, @WANGCHENGQIAN0015: Could you please let me know which csv file you downloaded and which version of PyArrow you are using? That would help me debug. Thanks!
I used panda70m_training_full.csv. It is probably related to my splitting the csv file with pandas; no errors occur when I use the entire csv file.
I take back my earlier answer: after downloading the entire csv file, I found that I hit this problem as well.
same problem here. any solution? @tsaishien-chen
Hi @guangyliu, thanks for your interest. May I know which csv file you are using, and how you got it?
I tried both panda70m_training_2m.csv and panda70m_training_full.csv, which were downloaded directly from the given Google Drive. The video2dataset is from your repo. I just created a new conda environment with python=3.10 and ran pip install -e . under the video2dataset path.
Hi @guangyliu, could you try the validation or testing csv first and let me know whether it works? Thanks!
Same problem. I think you can reproduce the problem by creating a new conda env with python=3.10 and directly running pip install -e . under the video2dataset path.
And may I know which versions of python and pyarrow you are using?
@guangyliu: I am using Python 3.10.10 and pyarrow 14.0.1. May I also know yours?
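For anyone else asked to report versions, a minimal sketch that prints the interpreter version and the installed versions of the packages discussed in this thread, instead of pasting a full pip list. It uses only the standard library; the helper name report_versions and the choice of packages are assumptions for illustration.

```python
import sys
from importlib import metadata


def report_versions(packages):
    """Return {distribution name: installed version, or None if not installed}."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # The distribution is not installed in the active environment.
            versions[name] = None
    return versions


if __name__ == "__main__":
    print("python", sys.version.split()[0])
    wanted = ["pyarrow", "video2dataset", "ffmpeg-python"]
    for name, version in report_versions(wanted).items():
        print(name, version or "not installed")
```

Run it inside the same conda environment that is used for downloading, since the result depends on the active environment.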
@tsaishien-chen accelerate 0.28.0 aiohttp 3.9.3 aiosignal 1.3.1 antlr4-python3-runtime 4.9.3 appdirs 1.4.4 asttokens 2.4.1 async-timeout 4.0.3 beautifulsoup4 4.12.3 bitsandbytes 0.43.0 braceexpand 0.1.7 Brotli 1.1.0 certifi 2024.2.2 cffi 1.16.0 charset-normalizer 3.3.2 click 8.1.7 cmake 3.28.3 datasets 2.17.1 decorator 5.1.1 decord 0.6.0 dill 0.3.8 distlib 0.3.8 docker-pycreds 0.4.0 docopt 0.6.2 einops 0.7.0 evaluate 0.4.1 exceptiongroup 1.2.0 executing 2.0.1 ffmpeg-python 0.2.0 filelock 3.13.1 fire 0.4.0 frozenlist 1.4.1 fsspec 2023.10.0 future 1.0.0 gdown 5.1.0 gitdb 4.0.11 GitPython 3.1.42 huggingface-hub 0.21.4 idna 3.6 ipdb 0.13.13 ipython 8.22.2 jedi 0.19.1 Jinja2 3.1.3 langdetect 1.0.9 lit 18.1.1 MarkupSafe 2.1.5 matplotlib-inline 0.1.6 mpmath 1.3.0 multidict 6.0.5 multiprocess 0.70.16 mutagen 1.47.0 networkx 3.2.1 numpy 1.26.4 nvidia-cublas-cu11 11.10.3.66 nvidia-cuda-cupti-cu11 11.7.101 nvidia-cuda-nvrtc-cu11 11.7.99 nvidia-cuda-runtime-cu11 11.7.99 nvidia-cudnn-cu11 8.5.0.96 nvidia-cufft-cu11 10.9.0.58 nvidia-curand-cu11 10.2.10.91 nvidia-cusolver-cu11 11.4.0.1 nvidia-cusparse-cu11 11.7.4.91 nvidia-nccl-cu11 2.14.3 nvidia-nvtx-cu11 11.7.91 omegaconf 2.3.0 opencv-python 4.9.0.80 packaging 24.0 pandas 2.2.1 parso 0.8.3 pexpect 4.9.0 pillow 10.2.0 pip 23.3.1 platformdirs 4.2.0 prompt-toolkit 3.0.43 protobuf 4.25.3 psutil 5.9.8 ptyprocess 0.7.0 pure-eval 0.2.2 pyarrow 14.0.1 pyarrow-hotfix 0.6 pycparser 2.21 pycryptodomex 3.20.0 Pygments 2.17.2 PySocks 1.7.1 python-dateutil 2.9.0.post0 pytz 2024.1 PyYAML 6.0.1 regex 2023.12.25 requests 2.31.0 responses 0.18.0 safetensors 0.4.2 scenedetect 0.6 scipy 1.12.0 sentry-sdk 1.42.0 setproctitle 1.3.3 setuptools 68.2.2 six 1.16.0 smmap 5.0.1 soundfile 0.12.1 soupsieve 2.5 stack-data 0.6.3 sympy 1.12 termcolor 2.4.0 timeout-decorator 0.5.0 tokenizers 0.13.3 tomli 2.0.1 torch 2.0.0 torchaudio 2.0.0 torchdata 0.6.0 tqdm 4.66.2 traitlets 5.14.2 transformers 4.30.0 triton 2.0.0 typing_extensions 4.10.0 tzdata 2024.1 urllib3 
2.2.1 video2dataset 1.2.0 /lustre/scratch/Panda-70M/dataset_dataloading/video2dataset virtualenv 20.25.0 wandb 0.16.4 wcwidth 0.2.13 webdataset 0.2.86 websockets 12.0 webvtt-py 0.4.6 wheel 0.41.2 xxhash 3.4.1 yarl 1.9.4 yt-dlp 2024.3.10
@guangyliu: Hmm, it seems like the package version is not the problem. Could you please also paste your error message? I would like to know in which step you got that error. Thanks!
Traceback (most recent call last):
  File "/lustre/scratch/Panda-70M/dataset_dataloading/video2dataset/video2dataset/workers/download_worker.py", line 102, in __call__
    self.download_shard(row)
  File "/lustre/scratch/Panda-70M/dataset_dataloading/video2dataset/video2dataset/workers/download_worker.py", line 291, in download_shard
    sample_writer.close()
  File "/lustre/scratch/Panda-70M/dataset_dataloading/video2dataset/video2dataset/data_writer.py", line 321, in close
    self.buffered_parquet_writer.close()
  File "/lustre/scratch/Panda-70M/dataset_dataloading/video2dataset/video2dataset/data_writer.py", line 50, in close
    self.flush()
  File "/lustre/scratch/Panda-70M/dataset_dataloading/video2dataset/video2dataset/data_writer.py", line 45, in flush
    df = pa.Table.from_pydict(self.buffer, self.schema)
  File "pyarrow/table.pxi", line 1812, in pyarrow.lib._Tabular.from_pydict
  File "pyarrow/table.pxi", line 5292, in pyarrow.lib._from_pydict
  File "pyarrow/array.pxi", line 374, in pyarrow.lib.asarray
  File "pyarrow/array.pxi", line 344, in pyarrow.lib.array
  File "pyarrow/array.pxi", line 42, in pyarrow.lib._sequence_to_array
  File "pyarrow/error.pxi", line 154, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 91, in pyarrow.lib.check_status
pyarrow.lib.ArrowTypeError: Expected bytes, got a 'list' object
The line numbers may be slightly different since I added some print statements to the file. From my observation, the mp4 files are downloaded successfully, and the problem appears when the number of files reaches number_sample_per_shard and the videos are written from the buffer to disk. So if you want to reproduce the problem, you can set number_sample_per_shard to a small number like 2 or 5.
Hi @guangyliu, as you suggested, I tried reducing number_sample_per_shard, and it works. I also tried running on another machine with Python 3.10.10 and pyarrow 15.0.0, and it works there too.
thanks, I have figured it out by reinstalling ffmpeg in conda.
@guangyliu Nice! May I know which version of ffmpeg you are using now? As this seems like a common issue, I would like to document it and your solution in the readme. Thanks!
After investigation, I found there was always an exception, and the error_message is
"FileNotFoundError: [Errno 2] No such file or directory: 'ffmpeg'"
Then I found this issue and this solution works for me. Now the ffmpeg version is:
conda list | grep ffmpeg
ffmpeg 4.2.2 h20bf706_0
ffmpeg-python 0.2.0 pypi_0 pypi
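The FileNotFoundError above means the ffmpeg executable itself could not be found on PATH; the ffmpeg-python package is only a wrapper and does not ship the binary. A minimal sketch for checking this before starting a download, using only the standard library (the helper names find_ffmpeg and ffmpeg_version_line are assumptions for illustration):

```python
import shutil
import subprocess


def find_ffmpeg(executable="ffmpeg"):
    """Return the full path of the executable found on PATH, or None if absent."""
    return shutil.which(executable)


def ffmpeg_version_line(executable="ffmpeg"):
    """Return the first line printed by `<executable> -version`, or None if unavailable."""
    path = find_ffmpeg(executable)
    if path is None:
        return None
    result = subprocess.run(
        [path, "-version"], capture_output=True, text=True, check=False
    )
    lines = result.stdout.splitlines()
    return lines[0] if lines else None


if __name__ == "__main__":
    line = ffmpeg_version_line()
    if line is None:
        print("ffmpeg was not found on PATH; install it in this environment first")
    else:
        print(line)
```

If this reports that ffmpeg is missing inside the conda environment used for downloading, installing ffmpeg into that environment, as described above, is the first thing to try.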
I am having the same problem. My ffmpeg info is:
ffmpeg 4.2.2 h20bf706_0
ffmpeg-python 0.2.0 pypi_0 pypi
I have tried with the test set and the downloading raised the error half way through the csv file
This problem appears because meta['clips'] is set to an empty list when there is an exception, like here. For debugging, I suggest that you set number_sample_per_shard to a small number like 2 and investigate the error_message of the exception you hit.
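As a rough illustration of this failure mode: if a column is expected to hold strings and one failed sample leaves a list in it, converting the buffer to a table rejects the whole shard. The sketch below uses only the standard library and a hypothetical buffer layout. The column names, the expected types, and the helper find_type_mismatches are assumptions for illustration, not part of video2dataset.

```python
def find_type_mismatches(buffer, expected_types):
    """Return (column, row_index, value) for every value of an unexpected type.

    `buffer` maps column names to lists of per-sample values.
    `expected_types` maps column names to the Python type each value should have.
    None is treated as a missing value and is not reported.
    """
    mismatches = []
    for column, expected in expected_types.items():
        for index, value in enumerate(buffer.get(column, [])):
            if value is not None and not isinstance(value, expected):
                mismatches.append((column, index, value))
    return mismatches


# Hypothetical buffer: the second sample failed, so its "clips" entry is an
# empty list while the column otherwise holds strings.
buffer = {
    "url": ["https://example.com/a", "https://example.com/b"],
    "clips": ["clip-0", []],
    "error_message": [
        None,
        "FileNotFoundError: [Errno 2] No such file or directory: 'ffmpeg'",
    ],
}
expected = {"url": str, "clips": str, "error_message": str}

if __name__ == "__main__":
    for column, index, value in find_type_mismatches(buffer, expected):
        # Print the offending row together with its recorded error message.
        print(column, index, repr(value), buffer["error_message"][index])
```

Printing the recorded error message next to each offending row points at the underlying exception, which is the thing that actually needs fixing.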
Hi @MaydaY-Tsinghua, Have you fixed the problem by reinstalling ffmpeg as suggested here: https://github.com/snap-research/Panda-70M/issues/8#issuecomment-1998075778?
I am trying to debug using the idea from https://github.com/snap-research/Panda-70M/issues/8#issuecomment-1998988591, but this time I ran into an Err403 issue before I could do anything else. I didn't have this Err403 issue just a few days ago.
It seems like a problem with my own IP as I can not download any video from youtube
Hi @MaydaY-Tsinghua, error 403 indicates an IP problem. Please try downloading the dataset through a proxy.
I also encountered the same problem, have you solved it?
@guangyliu May I ask how you solved this problem?
Hi @18756164789, thanks for your interest in the Panda-70M dataset! This is a known issue. Have you tried updating the ffmpeg package with pip or conda like this?
Have the same issue.
Ubuntu: 22.04
ffmpeg: 4.4.2
ffmpeg-python: 0.2.0
I tried each solution in the thread but had no luck.
As I said here, this problem is caused by an exception raised earlier in the pipeline. The specific exception varies from case to case, so you should first find the exception you are hitting and then fix that.
Hi @18756164789 and @xiuxiu, sorry for the inconvenience. Have you solved this issue?
I am still getting this error
  File "/mnt/disk/Panda-70M/dataset_dataloading/video2dataset/video2dataset/data_writer.py", line 44, in flush
    df = pa.Table.from_pydict(self.buffer, self.schema)
  File "pyarrow/table.pxi", line 1813, in pyarrow.lib._Tabular.from_pydict
  File "pyarrow/table.pxi", line 5356, in pyarrow.lib._from_pydict
  File "pyarrow/array.pxi", line 374, in pyarrow.lib.asarray
  File "pyarrow/array.pxi", line 344, in pyarrow.lib.array
  File "pyarrow/array.pxi", line 42, in pyarrow.lib._sequence_to_array
  File "pyarrow/error.pxi", line 154, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 91, in pyarrow.lib.check_status
pyarrow.lib.ArrowTypeError: Expected bytes, got a 'list' object