pytorch / text

Models, data loaders and abstractions for language processing, powered by PyTorch
https://pytorch.org/text
BSD 3-Clause "New" or "Revised" License
3.5k stars 812 forks

t5_demo can't retrieve CNNDM from drive.google; how to use local copy? #2264

Open rbelew opened 4 months ago

rbelew commented 4 months ago

šŸ› Bug

Describe the bug

I am following the t5_demo tutorial, but it fails when it tries to download the CNN data from https://drive.google.com/uc?export=download&id=0BwmD_VLjROrfTHk4NFg2SndKcjQ

To Reproduce

Steps to reproduce the behavior:

  1. Get the t5_demo notebook.

  2. Try to run it. It gets as far as batch = next(iter(cnndm_dataloader)) (https://pytorch.org/text/stable/tutorials/t5_demo.html#generate-summaries), where cnndm_datapipe = CNNDM(split="test") (https://pytorch.org/text/stable/tutorials/t5_demo.html#datasets).

  3. Get an error like:

RuntimeError: Google drive link https://drive.google.com/uc?export=download&id=0BwmD_VLjROrfTHk4NFg2SndKcjQ&confirm=t internal error: headers don't contain content-disposition. This is usually caused by using a sharing/viewing link instead of a download link. Click 'Download' on the Google Drive page, which should redirect you to a download page, and use the link of that page.

This exception is thrown by __iter__ of GDriveReaderDataPipe(skip_on_error=False, source_datapipe=OnDiskCacheHolderIterDataPipe, timeout=None)

Expected behavior

Other reports with similar error messages make it seem like there is some timeout issue when retrieving from drive.google. So I downloaded cnn_stories.tgz and dailymail_stories.tgz myself and unpacked them:

    .
    ├── CNNDM
    │   ├── cnn
    │   │   └── stories
    │   └── dailymail
    │       └── stories

How can I modify the calls so that they read from my local copy instead of downloading from Google Drive?
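To make the question concrete, here is a minimal sketch of reading the unpacked stories directly with plain Python, bypassing the download step entirely. It is not torchtext's own loader. The names parse_story and iter_local_cnndm are hypothetical, and it assumes the usual .story layout: article text first, then one or more "@highlight" markers, each followed by a summary line. It also ignores the train/val/test split, so it yields every story it finds under the layout shown above.

```python
from pathlib import Path
from typing import Iterator, Tuple


def parse_story(text: str) -> Tuple[str, str]:
    """Split the contents of one .story file into (article, abstract).

    Assumes the article comes first and every summary line follows
    an "@highlight" marker (hypothetical helper, not a torchtext API).
    """
    article_lines, highlights = [], []
    in_highlights = False
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            continue  # skip blank separator lines
        if line == "@highlight":
            in_highlights = True  # everything after the first marker is summary
        elif in_highlights:
            highlights.append(line)
        else:
            article_lines.append(line)
    return " ".join(article_lines), ". ".join(highlights)


def iter_local_cnndm(root: str) -> Iterator[Tuple[str, str]]:
    """Yield (article, abstract) pairs from <root>/cnn/stories and
    <root>/dailymail/stories (the local layout described above)."""
    for source in ("cnn", "dailymail"):
        stories_dir = Path(root) / source / "stories"
        for path in sorted(stories_dir.glob("*.story")):
            yield parse_story(path.read_text(encoding="utf-8"))


if __name__ == "__main__":
    # "CNNDM" is the local directory from the listing above.
    for article, abstract in iter_local_cnndm("CNNDM"):
        print(abstract)
        break
```

The pairs have the same (article, abstract) shape the tutorial works with, but any prefixing, batching and tokenization the tutorial applies to its datapipe would still have to be applied to this iterator separately.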

Environment

% python collect_env.py
Collecting environment information...
PyTorch version: 2.1.0.post100
Is debug build: False
CUDA used to build PyTorch: None
ROCM used to build PyTorch: N/A

OS: macOS 14.4.1 (arm64)
GCC version: Could not collect
Clang version: 15.0.0 (clang-1500.1.0.2.5)
CMake version: Could not collect
Libc version: N/A

Python version: 3.11.7 | packaged by conda-forge | (main, Dec 23 2023, 14:38:07) [Clang 16.0.6 ] (64-bit runtime)
Python platform: macOS-14.4.1-arm64-arm-64bit
Is CUDA available: False
CUDA runtime version: No CUDA
CUDA_MODULE_LOADING set to: N/A
GPU models and configuration: No CUDA
Nvidia driver version: No CUDA
cuDNN version: No CUDA
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU: Apple M1 Pro

Versions of relevant libraries:
[pip3] mypy-extensions==1.0.0
[pip3] numpy==1.26.3
[pip3] torch==2.1.0.post100
[pip3] torchaudio==2.1.2
[pip3] torchdata==0.7.1
[pip3] torchtext==0.16.1
[pip3] torchvision==0.16.2
[conda] captum 0.7.0 0 pytorch
[conda] numpy 1.26.2 pypi_0 pypi
[conda] numpy-base 1.26.3 py311hfbfe69c_0
[conda] pytorch 2.1.0 gpu_mps_py311hf322ab5_100
[conda] torch 2.1.2 pypi_0 pypi
[conda] torchaudio 2.1.2 pypi_0 pypi
[conda] torchdata 0.7.1 pypi_0 pypi
[conda] torchtext 0.16.1 pypi_0 pypi
[conda] torchvision 0.16.2 pypi_0 pypi
