snap-research / Panda-70M

[CVPR 2024] Panda-70M: Captioning 70M Videos with Multiple Cross-Modality Teachers
https://snap-research.github.io/Panda-70M/

Has anyone encountered the pyarrow.lib.ArrowTypeError when downloading? #8

Open JingyeChen opened 7 months ago

JingyeChen commented 7 months ago

  File "/mnt/disk/Panda-70M/dataset_dataloading/video2dataset/video2dataset/data_writer.py", line 44, in flush
    df = pa.Table.from_pydict(self.buffer, self.schema)
  File "pyarrow/table.pxi", line 1813, in pyarrow.lib._Tabular.from_pydict
  File "pyarrow/table.pxi", line 5356, in pyarrow.lib._from_pydict
  File "pyarrow/array.pxi", line 374, in pyarrow.lib.asarray
  File "pyarrow/array.pxi", line 344, in pyarrow.lib.array
  File "pyarrow/array.pxi", line 42, in pyarrow.lib._sequence_to_array
  File "pyarrow/error.pxi", line 154, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 91, in pyarrow.lib.check_status
pyarrow.lib.ArrowTypeError: Expected bytes, got a 'list' object

PF-Wu commented 7 months ago

I also encountered the same problem, have you solved it?

tsaishien-chen commented 6 months ago

Hi @JingyeChen and @PF-Wu, have you solved the problem? I am guessing you might be using the original video2dataset tool, which is not compatible with our csv files. Did you uninstall video2dataset and re-install it from this repo?

WANGCHENGQIAN0015 commented 6 months ago

I uninstalled video2dataset and ran pip install -e . in Panda-70M/dataset_dataloading/video2dataset, but it still doesn't work.

wowfingerlicker commented 6 months ago

You might be using a csv file in an invalid format.

tsaishien-chen commented 6 months ago

Hi @JingyeChen, @PF-Wu, @WANGCHENGQIAN0015: Could you please let me know which csv file you were downloading and which version of PyArrow you are using? That would help me debug. Thanks!

WANGCHENGQIAN0015 commented 6 months ago

I used panda70m_training_full.csv. It is probably related to my splitting the csv file with pandas; no errors occur when I use the entire csv file.

WANGCHENGQIAN0015 commented 6 months ago

> I used panda70m_training_full.csv. It is probably related to my splitting the csv file with pandas; no errors occurred when I used the entire csv file.

I take back this answer. When I downloaded with the entire file, I found that I also had this problem.

guangyliu commented 6 months ago

same problem here. any solution? @tsaishien-chen

tsaishien-chen commented 6 months ago

Hi @guangyliu, thanks for your interest. May I know which csv file you are using and how you got it?

guangyliu commented 6 months ago

> Hi @guangyliu, thanks for your interest. May I know which csv file you are using and how you got it?

I tried both panda70m_training_2m.csv and panda70m_training_full.csv, which were downloaded directly from the given Google Drive. The video2dataset is from your repo. I just created a new conda environment with python=3.10 and ran pip install -e . under the video2dataset path.

tsaishien-chen commented 6 months ago

Hi @guangyliu, could you try the validation or testing csv file first and let me know whether it works? Thanks!

guangyliu commented 6 months ago

Same problem. I think you can reproduce the problem by creating a new conda env with python=3.10 and directly running pip install -e . under the video2dataset path.

guangyliu commented 6 months ago

And may I know which versions of Python and pyarrow you are using?

tsaishien-chen commented 6 months ago

@guangyliu: I am using Python 3.10.10 and pyarrow 14.0.1. May I also know yours?
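
For anyone comparing environments as in the exchange above, here is a minimal sketch that prints the Python version and the installed versions of a few packages using only the standard library. The helper name `report_versions` and the package list are illustrative assumptions, not part of this repo. Note that this reports Python distributions only; the `ffmpeg` binary itself is not covered by it.

```python
import sys
from importlib import metadata


def report_versions(packages):
    """Return {name: version or None} for the given distribution names."""
    versions = {"python": sys.version.split()[0]}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None  # not installed in this environment
    return versions


if __name__ == "__main__":
    # Package names chosen for illustration; adjust to your environment.
    wanted = ["pyarrow", "video2dataset", "ffmpeg-python", "yt-dlp"]
    for name, version in report_versions(wanted).items():
        print(f"{name}: {version or 'not installed'}")
```

Pasting this short output is usually easier to read than a full package listing.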

guangyliu commented 6 months ago

@tsaishien-chen

accelerate 0.28.0
aiohttp 3.9.3
aiosignal 1.3.1
antlr4-python3-runtime 4.9.3
appdirs 1.4.4
asttokens 2.4.1
async-timeout 4.0.3
beautifulsoup4 4.12.3
bitsandbytes 0.43.0
braceexpand 0.1.7
Brotli 1.1.0
certifi 2024.2.2
cffi 1.16.0
charset-normalizer 3.3.2
click 8.1.7
cmake 3.28.3
datasets 2.17.1
decorator 5.1.1
decord 0.6.0
dill 0.3.8
distlib 0.3.8
docker-pycreds 0.4.0
docopt 0.6.2
einops 0.7.0
evaluate 0.4.1
exceptiongroup 1.2.0
executing 2.0.1
ffmpeg-python 0.2.0
filelock 3.13.1
fire 0.4.0
frozenlist 1.4.1
fsspec 2023.10.0
future 1.0.0
gdown 5.1.0
gitdb 4.0.11
GitPython 3.1.42
huggingface-hub 0.21.4
idna 3.6
ipdb 0.13.13
ipython 8.22.2
jedi 0.19.1
Jinja2 3.1.3
langdetect 1.0.9
lit 18.1.1
MarkupSafe 2.1.5
matplotlib-inline 0.1.6
mpmath 1.3.0
multidict 6.0.5
multiprocess 0.70.16
mutagen 1.47.0
networkx 3.2.1
numpy 1.26.4
nvidia-cublas-cu11 11.10.3.66
nvidia-cuda-cupti-cu11 11.7.101
nvidia-cuda-nvrtc-cu11 11.7.99
nvidia-cuda-runtime-cu11 11.7.99
nvidia-cudnn-cu11 8.5.0.96
nvidia-cufft-cu11 10.9.0.58
nvidia-curand-cu11 10.2.10.91
nvidia-cusolver-cu11 11.4.0.1
nvidia-cusparse-cu11 11.7.4.91
nvidia-nccl-cu11 2.14.3
nvidia-nvtx-cu11 11.7.91
omegaconf 2.3.0
opencv-python 4.9.0.80
packaging 24.0
pandas 2.2.1
parso 0.8.3
pexpect 4.9.0
pillow 10.2.0
pip 23.3.1
platformdirs 4.2.0
prompt-toolkit 3.0.43
protobuf 4.25.3
psutil 5.9.8
ptyprocess 0.7.0
pure-eval 0.2.2
pyarrow 14.0.1
pyarrow-hotfix 0.6
pycparser 2.21
pycryptodomex 3.20.0
Pygments 2.17.2
PySocks 1.7.1
python-dateutil 2.9.0.post0
pytz 2024.1
PyYAML 6.0.1
regex 2023.12.25
requests 2.31.0
responses 0.18.0
safetensors 0.4.2
scenedetect 0.6
scipy 1.12.0
sentry-sdk 1.42.0
setproctitle 1.3.3
setuptools 68.2.2
six 1.16.0
smmap 5.0.1
soundfile 0.12.1
soupsieve 2.5
stack-data 0.6.3
sympy 1.12
termcolor 2.4.0
timeout-decorator 0.5.0
tokenizers 0.13.3
tomli 2.0.1
torch 2.0.0
torchaudio 2.0.0
torchdata 0.6.0
tqdm 4.66.2
traitlets 5.14.2
transformers 4.30.0
triton 2.0.0
typing_extensions 4.10.0
tzdata 2024.1
urllib3 2.2.1
video2dataset 1.2.0 /lustre/scratch/Panda-70M/dataset_dataloading/video2dataset
virtualenv 20.25.0
wandb 0.16.4
wcwidth 0.2.13
webdataset 0.2.86
websockets 12.0
webvtt-py 0.4.6
wheel 0.41.2
xxhash 3.4.1
yarl 1.9.4
yt-dlp 2024.3.10

tsaishien-chen commented 6 months ago

@guangyliu: Hmm, it seems like the package version is not the problem. Could you please also paste your error message? I would like to know in which step you got that error. Thanks!

guangyliu commented 6 months ago

Traceback (most recent call last):
  File "/lustre/scratch/Panda-70M/dataset_dataloading/video2dataset/video2dataset/workers/download_worker.py", line 102, in __call__
    self.download_shard(row)
  File "/lustre/scratch/Panda-70M/dataset_dataloading/video2dataset/video2dataset/workers/download_worker.py", line 291, in download_shard
    sample_writer.close()
  File "/lustre/scratch/Panda-70M/dataset_dataloading/video2dataset/video2dataset/data_writer.py", line 321, in close
    self.buffered_parquet_writer.close()
  File "/lustre/scratch/Panda-70M/dataset_dataloading/video2dataset/video2dataset/data_writer.py", line 50, in close
    self.flush()
  File "/lustre/scratch/Panda-70M/dataset_dataloading/video2dataset/video2dataset/data_writer.py", line 45, in flush
    df = pa.Table.from_pydict(self.buffer, self.schema)
  File "pyarrow/table.pxi", line 1812, in pyarrow.lib._Tabular.from_pydict
  File "pyarrow/table.pxi", line 5292, in pyarrow.lib._from_pydict
  File "pyarrow/array.pxi", line 374, in pyarrow.lib.asarray
  File "pyarrow/array.pxi", line 344, in pyarrow.lib.array
  File "pyarrow/array.pxi", line 42, in pyarrow.lib._sequence_to_array
  File "pyarrow/error.pxi", line 154, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 91, in pyarrow.lib.check_status
pyarrow.lib.ArrowTypeError: Expected bytes, got a 'list' object

The line numbers may be slightly different since I added some print commands to the file. From my observation, the mp4 files are downloaded successfully, and the problem appears when the number of files reaches number_sample_per_shard and the buffered samples are saved to disk. So if you want to reproduce the problem, you can set number_sample_per_shard to a small number like 2 or 5.
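
The behaviour described above, where downloads succeed and the error only shows up at the shard boundary, can be illustrated with a toy buffered writer. This is a minimal sketch under stated assumptions: `ToyBufferedWriter` and its fields are hypothetical and are not the video2dataset classes. It only shows why a bad value accepted early fails late, at flush time, and why a small shard size makes the failure appear sooner.

```python
class ToyBufferedWriter:
    """Hypothetical stand-in for a buffered shard writer (not video2dataset code)."""

    def __init__(self, samples_per_shard):
        self.samples_per_shard = samples_per_shard
        self.buffer = []
        self.flushed_shards = 0

    def write(self, value):
        # No validation here: a value of the wrong type is accepted silently.
        self.buffer.append(value)
        if len(self.buffer) >= self.samples_per_shard:
            self.flush()

    def flush(self):
        # Type checking happens only now, for the whole shard at once.
        for value in self.buffer:
            if not isinstance(value, str):
                raise TypeError(f"Expected str, got a {type(value).__name__!r} object")
        self.buffer.clear()
        self.flushed_shards += 1


if __name__ == "__main__":
    writer = ToyBufferedWriter(samples_per_shard=2)
    writer.write("a caption")
    try:
        writer.write([])  # second sample fills the shard and triggers flush
    except TypeError as exc:
        print(f"flush failed: {exc}")
```

With a large shard size the same bad value would sit in the buffer unnoticed until many more samples had been written.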

tsaishien-chen commented 6 months ago

Hi @guangyliu, as you suggested, I tried reducing number_sample_per_shard, and it works. I also tried running on another machine with Python 3.10.10 and pyarrow 15.0.0, and it works there too.

guangyliu commented 6 months ago

thanks, I have figured it out by reinstalling ffmpeg in conda.

tsaishien-chen commented 6 months ago

@guangyliu Nice! May I know which version of ffmpeg you are using now? As this seems like a common issue, I would like to document it and your solution in the readme. Thanks!

guangyliu commented 6 months ago

After investigation, I found there was always an exception, and the error_message is

FileNotFoundError: [Errno 2] No such file or directory: 'ffmpeg'

Then I found this issue and this solution works for me. Now the ffmpeg version is:

conda list | grep ffmpeg
ffmpeg 4.2.2 h20bf706_0
ffmpeg-python 0.2.0 pypi_0 pypi
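
The `FileNotFoundError` above suggests that the `ffmpeg` binary was not on `PATH`, which is separate from having the `ffmpeg-python` package installed. A minimal sketch of how one might check this before starting a download, using only the standard library; the helper names `find_executable` and `first_version_line` are illustrative assumptions.

```python
import shutil
import subprocess


def find_executable(name="ffmpeg"):
    """Return the absolute path of `name` on PATH, or None if it is missing."""
    return shutil.which(name)


def first_version_line(path):
    """Run `<path> -version` and return the first line of its output."""
    result = subprocess.run(
        [path, "-version"], capture_output=True, text=True, check=False
    )
    lines = result.stdout.splitlines()
    return lines[0] if lines else ""


if __name__ == "__main__":
    ffmpeg_path = find_executable("ffmpeg")
    if ffmpeg_path is None:
        print("ffmpeg is not on PATH; launching it would raise FileNotFoundError")
    else:
        print(ffmpeg_path)
        print(first_version_line(ffmpeg_path))
```

If the path printed here is not inside the active conda environment, a different ffmpeg than the one listed by `conda list` is being picked up.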

MaydaY-Tsinghua commented 6 months ago

I am having the same problem. My ffmpeg info is:
ffmpeg 4.2.2 h20bf706_0
ffmpeg-python 0.2.0 pypi_0 pypi

MaydaY-Tsinghua commented 6 months ago

I tried with the test set, and the download raised the error halfway through the csv file.

guangyliu commented 6 months ago

> I have tried with the test set and the downloading raised the error half way through the csv file

This problem appears because meta['clips'] is set to an empty list when there is an exception, like here. For debugging, I suggest that you set number_sample_per_shard to a small number like 2 and investigate the error_message in the exception you hit.
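
To follow this suggestion, one could scan the per-sample metadata for entries whose `clips` is an empty list and print the recorded `error_message`. The sketch below is a guess at the shape of that metadata: the `clips` and `error_message` fields come from the comment above, while the `key` field, the helper `failed_samples`, and the example records are hypothetical.

```python
def failed_samples(metas):
    """Yield (key, error_message) for samples whose 'clips' is an empty list."""
    for meta in metas:
        if meta.get("clips") == []:
            yield meta.get("key"), meta.get("error_message")


if __name__ == "__main__":
    # Hypothetical records for illustration only.
    metas = [
        {"key": "0000", "clips": "placeholder", "error_message": None},
        {
            "key": "0001",
            "clips": [],
            "error_message": "FileNotFoundError: [Errno 2] No such file or directory: 'ffmpeg'",
        },
    ]
    for key, message in failed_samples(metas):
        print(f"{key}: {message}")
```

Each printed message points at the underlying failure, which is what needs fixing rather than the ArrowTypeError itself.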

tsaishien-chen commented 6 months ago

Hi @MaydaY-Tsinghua, Have you fixed the problem by reinstalling ffmpeg as suggested here: https://github.com/snap-research/Panda-70M/issues/8#issuecomment-1998075778?

MaydaY-Tsinghua commented 6 months ago

> Hi @MaydaY-Tsinghua, Have you fixed the problem by reinstalling ffmpeg as suggested here: #8 (comment)?

I am trying to debug using the idea in https://github.com/snap-research/Panda-70M/issues/8#issuecomment-1998988591, but this time I ran into error 403 before I could do anything else. I did not have this error 403 just a few days ago.

MaydaY-Tsinghua commented 6 months ago

> > Hi @MaydaY-Tsinghua, Have you fixed the problem by reinstalling ffmpeg as suggested here: #8 (comment)?
>
> I am trying to debug using the idea in #8 (comment), but this time I ran into error 403 before I could do anything else. I did not have this error 403 just a few days ago.

It seems to be a problem with my own IP, as I cannot download any video from YouTube.

tsaishien-chen commented 6 months ago

Hi @MaydaY-Tsinghua, error 403 indicates an IP problem. Please try downloading the dataset through a proxy.

18756164789 commented 4 months ago

I also encountered the same problem, have you solved it?

18756164789 commented 4 months ago

@guangyliu May I ask how you solved this problem?

tsaishien-chen commented 4 months ago

Hi @18756164789, thanks for your interest in the Panda-70M dataset! This is a known issue. Have you tried updating the ffmpeg package via pip or conda like this?

xiuxiu commented 4 months ago

I have the same issue.
Ubuntu: 22.04
ffmpeg: 4.4.2
ffmpeg-python: 0.2.0

I tried each solution in the thread but had no luck.

guangyliu commented 4 months ago

> @guangyliu May I ask how you solved this problem

As I said here, this problem is caused by an exception raised earlier in the download. The specific exception varies from case to case, so yours may be different from mine. Overall, you should find the specific exception in your run and then fix that.
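
One way to surface the specific exception when debugging locally is to wrap the failing step and keep the full traceback instead of discarding it. This is a minimal sketch with a hypothetical helper, `run_and_record`; it is not how video2dataset handles errors, only an illustration of the advice above.

```python
import traceback


def run_and_record(step, *args):
    """Call step(*args); return (result, error_message) instead of hiding failures."""
    try:
        return step(*args), None
    except Exception:
        # Keep the full traceback so the root cause is visible later.
        return None, traceback.format_exc()


if __name__ == "__main__":
    def broken_step():
        # Stand-in for a step that fails, e.g. launching a missing binary.
        raise FileNotFoundError(2, "No such file or directory", "ffmpeg")

    result, error_message = run_and_record(broken_step)
    print(error_message)
```

Printing or logging the returned message for the first failing sample is usually enough to tell whether the cause is a missing binary, a network error, or something else.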

tsaishien-chen commented 4 months ago

Hi @18756164789 and @xiuxiu, sorry for the inconvenience. Have you solved this issue?

linzhiqiu commented 3 months ago

I am still getting this error