JannisZeller opened this issue 1 year ago
Hey @JannisZeller, thanks for submitting this issue! I tried reproducing this on Google Colab, but the behavior seems correct regardless of whether I provide the root directory. Unfortunately, I don't have access to a Windows environment to figure out whether this is a Windows-specific issue.
I'm linking a gist of my notebook below: https://gist.github.com/Nayef211/22f1d9b70db1814e4cc7a1d4be875acd
Hello @Nayef211, thank you for taking the time to reply. As I mentioned in the post (all the way at the end), I already noticed that it works in Google Colab. This is also why I suspect it is Windows-specific and connected to file paths...
@JannisZeller Can I ask if you are using bash.exe or cmd.exe to launch Python?
@mthrok: Thank you for the advice. The result is the same when running it from my shells (PowerShell 7.3.2, Git Bash (git version 2.34.1.windows.1), and cmd) and inside Jupyter notebooks.
I am having the same issue on Ubuntu.
from torchtext.datasets import IMDB
train_iter, test_iter = IMDB(split=('train','test'))
print(len([label for label, data in train_iter]))
The train dataset has only entries with positive labels.
Same issue in Windows 10!
I tried this on Windows from the main branch and it seems to be working fine.
(base) C:\Users\moto\Development\text>python repro\2041.py
25000
(array([1, 2]), array([12500, 12500], dtype=int64))
25000
(array([1, 2]), array([12500, 12500], dtype=int64))
Collecting environment information...
PyTorch version: 2.0.0.dev20230117
Is debug build: False
CUDA used to build PyTorch: 11.7
ROCM used to build PyTorch: N/A
OS: Microsoft Windows 10 Enterprise
GCC version: Could not collect
Clang version: Could not collect
CMake version: version 3.22.4
Libc version: N/A
Python version: 3.9.7 (default, Sep 16 2021, 16:59:28) [MSC v.1916 64 bit (AMD64)] (64-bit runtime)
Python platform: Windows-10-10.0.19045-SP0
Is CUDA available: True
CUDA runtime version: 11.7.99
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration: GPU 0: NVIDIA GeForce RTX 3080
Nvidia driver version: 517.02
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True
Versions of relevant libraries:
[pip3] numpy==1.22.3
[pip3] torch==2.0.0.dev20230117
[pip3] torchaudio==2.0.0a0+6a39b3e
[pip3] torchdata==0.5.1
[pip3] torchtext==0.15.0a0+38399ea
[conda] blas 1.0 mkl
[conda] cudatoolkit 11.3.1 h59b6b97_2
[conda] mkl 2021.4.0 haa95532_640
[conda] mkl-service 2.4.0 py39h2bbff1b_0
[conda] mkl_fft 1.3.1 py39h277e83a_0
[conda] mkl_random 1.2.2 py39hf11a4ad_0
[conda] numpy 1.22.3 py39h7a0a035_0
[conda] numpy-base 1.22.3 py39hca35cd5_0
[conda] pytorch 2.0.0.dev20230117 py3.9_cuda11.7_cudnn8_0 pytorch-nightly
[conda] pytorch-cuda 11.7 h67b0de4_2 pytorch-nightly
[conda] pytorch-mutex 1.0 cuda pytorch-nightly
[conda] torchaudio 2.0.0a0+6a39b3e dev_0 <develop>
[conda] torchdata 0.5.1 pypi_0 pypi
[conda] torchtext 0.15.0a0+38399ea dev_0 <develop>
Same. I was working with torchtext==0.15.1, but after downgrading to 0.14.0 it worked fine. Additionally, in 0.15.1 the labels are just 1 and 2, not 'neg' and 'pos'.
In torchtext==0.14.0:
import torchtext
from collections import Counter

def check_labels(_iter) -> None:
    labels = [batch[0] for batch in _iter]
    counter = Counter(labels)
    print(counter.items())

train_iter, test_iter = torchtext.datasets.IMDB(split=('train', 'test'))
check_labels(train_iter)
check_labels(test_iter)
dict_items([('neg', 12500), ('pos', 12500)])
dict_items([('neg', 12500), ('pos', 12500)])
In torchtext==0.15.1:
import torch, torchtext

def check_labels(_iter) -> None:
    labels = torch.tensor([batch[0] for batch in _iter])
    unique_labels, counts = labels.unique(return_counts=True)
    print(unique_labels.tolist(), counts.tolist())

train_iter, test_iter = torchtext.datasets.IMDB(split=('train', 'test'))
check_labels(train_iter)
check_labels(test_iter)
[1] [12500]
[1, 2] [12500, 12500]
Additionally, in 0.15.1 labels are just 1 and 2, not 'neg' and 'pos'.
Hi @SeanNobel, this behavior was actually purposefully changed in https://github.com/pytorch/text/pull/1914 to ensure that all datasets have consistent integer labels. Let me take another look to see if I can repro the behavior. @SeanNobel, can you confirm what OS you see this issue on? Previously, I wasn't able to repro the issue on a Linux or a Mac device.
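For anyone who needs the string labels back after that change, here is a minimal sketch (the 1 → 'neg', 2 → 'pos' mapping is an assumption inferred from the 0.14.0 output order above, not something I've confirmed against the PR):

from collections import Counter
from torchtext.datasets import IMDB

# Assumed mapping, inferred from the 0.14.0 output above: 1 -> 'neg', 2 -> 'pos'.
LABEL_NAMES = {1: 'neg', 2: 'pos'}

train_iter = IMDB(split='train')
# Iterate the full split once so the datapipe is exhausted.
print(Counter(LABEL_NAMES[label] for label, _ in train_iter))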
@Nayef211 Thanks, I'm creating lecture material and working in VS Code Jupyter. I ran the same code above on Colab, and it turned out that the dataset loads correctly with torchtext==0.15.1 on Colab (after pip-installing portalocker).
Although the notebook will be run on Colab in the lecture, the issue still seems to be present in my environment:
PyTorch version: 2.0.0+cu117
Is debug build: False
CUDA used to build PyTorch: 11.7
ROCM used to build PyTorch: N/A
OS: Ubuntu 20.04.5 LTS (x86_64)
GCC version: (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0
Clang version: Could not collect
CMake version: version 3.26.1
Libc version: glibc-2.31
Python version: 3.9.12 (main, Jun 1 2022, 11:38:51) [GCC 7.5.0] (64-bit runtime)
Python platform: Linux-5.4.0-139-generic-x86_64-with-glibc2.31
Is CUDA available: True
CUDA runtime version: Could not collect
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration:
GPU 0: NVIDIA RTX A6000
GPU 1: NVIDIA RTX A6000
GPU 2: NVIDIA RTX A6000
GPU 3: NVIDIA RTX A6000
GPU 4: NVIDIA RTX A6000
Versions of relevant libraries:
[pip3] mypy-extensions==0.4.3
[pip3] numpy==1.22.4
[pip3] pytorch-lightning==1.9.0
[pip3] torch==2.0.0
[pip3] torchaudio==2.0.1
[pip3] torchdata==0.6.0
[pip3] torchmetrics==0.9.2
[pip3] torchtext==0.15.1
[pip3] torchvision==0.15.1
[conda] numpy 1.22.4 pypi_0 pypi
[conda] pytorch-lightning 1.9.0 pypi_0 pypi
[conda] torch 2.0.0 pypi_0 pypi
[conda] torchaudio 2.0.1 pypi_0 pypi
[conda] torchdata 0.6.0 pypi_0 pypi
[conda] torchmetrics 0.9.2 pypi_0 pypi
[conda] torchtext 0.15.1 pypi_0 pypi
[conda] torchvision 0.15.1 pypi_0 pypi
Same on Ubuntu. But I've solved it: just remove the cache directory ~/.cache/torch/datasets/IMDB/aclImdb_v1.
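If you prefer to do this from Python rather than the shell, a minimal sketch (the path below is the default extraction directory mentioned above; adjust it if you passed a custom root):

import os
import shutil

# Default extraction directory reported above; adjust if you used a custom `root`.
cache_dir = os.path.expanduser("~/.cache/torch/datasets/IMDB/aclImdb_v1")
if os.path.isdir(cache_dir):
    shutil.rmtree(cache_dir)  # forces torchtext to re-extract the archive on next load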
I experienced this issue on Windows 11 (torch 2.0.1, torchtext 0.15.2), but in my case I had root defined as ../data/IMDB.
Based on the comment from @leitelyaya, I removed the directory /IMDB/aclImdb_v1 (but kept aclImdb_v1.tar.gz) and ran again, and it worked.
It's weird, because the same compressed file was used, so it seems the issue happened when extracting the contents the first time (skipping the pos directory).
I'm seeing the same issue in torchtext 0.16.1, and removing the aclImdb_v1 directory doesn't seem to help for me. In my case it's running on a Linux machine, not Windows.
Here is one observation that might help anyone trying to debug this. It seems like I break something in the IMDB dataset if I sample some items from the iterator before looping through it fully. This code always reproduces the problem:
from torchtext import datasets

ds = datasets.IMDB('./data', split='train')

# print one sample from the dataset
for label, text in ds:
    print(text, label)
    break

# check label counts
counts = {}
for label, text in ds:
    counts[label] = counts.get(label, 0) + 1
for key, value in counts.items():
    print(f"label: {key}, count: {value}")
This prints (I cut out some of the review text):
I rented I AM CURIOUS-YELLOW from my video store because of all the controversy that surrounded it... 1
/usr/local/lib/python3.10/dist-packages/torch/utils/data/datapipes/iter/combining.py:297: UserWarning: Some child DataPipes are not exhausted when __iter__ is called. We are resetting the buffer and each child DataPipe will read from the start again.
warnings.warn("Some child DataPipes are not exhausted when __iter__ is called. We are resetting "
label: 1, count: 12500
After this, the file train.torchdata_list in the directory data/datasets/IMDB is only 67 bytes, and any later attempt to load the dataset will always be broken, i.e., have only half the data (with label 1).
If I instead remove the data/datasets/IMDB directory and run the code without the "print one sample from the dataset" part, it works fine: the resulting train.torchdata_list file has a size of 134 bytes, and the code outputs the expected:
label: 1, count: 12500
label: 2, count: 12500
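Until the underlying datapipe caching is fixed, one workaround (a sketch, not an official fix) is to exhaust the iterator once into a list and sample from that, so the cache file is never written from a partially consumed iterator:

from torchtext import datasets

# Materialize the datapipe once so it is fully exhausted before any sampling.
train_data = list(datasets.IMDB('./data', split='train'))

# Sampling individual items from the list cannot leave the cache half-written.
label, text = train_data[0]
print(label, text[:80])

counts = {}
for label, _ in train_data:
    counts[label] = counts.get(label, 0) + 1
print(counts)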
🐛 Bug
Description
If I try to load the IMDb data via torchtext.datasets.IMDB() with no root argument, the docs say the data should be loaded to os.path.expanduser('~/.torchtext/cache'), which is 'C:\\Users\\USERNAME/.torchtext/cache'. Yet I cannot find the data there, and (what is actually the bigger problem) only half the data is pulled (the 1-labelled half). If I explicitly set root = 'C:/Users/USERNAME/.torchtext/cache', everything works fine. This might be caused by some path problems, but the torchtext.datasets.IMDB source is a little cryptic for me to read...
To Reproduce
Run:
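A minimal sketch of the loading code described above (counting labels once with the default root and once with an explicit one):

from collections import Counter
from torchtext.datasets import IMDB

def count_labels(split_iter):
    return Counter(label for label, _ in split_iter)

# With the default root, only the 1-labelled half shows up:
print(count_labels(IMDB(split='train')))

# With an explicit root, both labels show up as expected:
print(count_labels(IMDB(root='C:/Users/USERNAME/.torchtext/cache', split='train')))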
Expected Behaviour
It should not make any difference.
Environment
Running the loader with the default root in this notebook from the tutorial on Colab works, despite the packages being mostly the same. I appreciate any help; keep up the good work!
Edit: Typo and unification of both approaches.