JannisZeller opened this issue 1 year ago
Hey @JannisZeller, thanks for submitting this issue! I tried reproducing this on Google Colab, but the behavior seems correct regardless of whether I provide the root directory. Unfortunately, I don't have access to a Windows environment to figure out whether this is a Windows-specific issue.
I'm linking a gist of my notebook below: https://gist.github.com/Nayef211/22f1d9b70db1814e4cc7a1d4be875acd
Hello @Nayef211, thank you for taking the time to reply. As I mentioned in the post (all the way at the end), I already noticed that it works in Google Colab. This is also why I suspect it is Windows-specific and connected to file paths...
@JannisZeller Can I ask if you are using bash.exe or cmd.exe to launch Python?
@mthrok: Thank you for the advice. The result is the same when running it from my shells (PowerShell 7.3.2, Git Bash (git version 2.34.1.windows.1), and cmd) and inside Jupyter notebooks.
I am having the same issue on Ubuntu.
from torchtext.datasets import IMDB
train_iter, test_iter = IMDB(split=('train','test'))
print(len([label for label, data in train_iter]))
The train dataset has only entries with positive labels.
Same issue in Windows 10!
I tried this on Windows from the main branch and it seems to be working fine.
(base) C:\Users\moto\Development\text>python repro\2041.py
25000
(array([1, 2]), array([12500, 12500], dtype=int64))
25000
(array([1, 2]), array([12500, 12500], dtype=int64))
Collecting environment information...
PyTorch version: 2.0.0.dev20230117
Is debug build: False
CUDA used to build PyTorch: 11.7
ROCM used to build PyTorch: N/A
OS: Microsoft Windows 10 Enterprise
GCC version: Could not collect
Clang version: Could not collect
CMake version: version 3.22.4
Libc version: N/A
Python version: 3.9.7 (default, Sep 16 2021, 16:59:28) [MSC v.1916 64 bit (AMD64)] (64-bit runtime)
Python platform: Windows-10-10.0.19045-SP0
Is CUDA available: True
CUDA runtime version: 11.7.99
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration: GPU 0: NVIDIA GeForce RTX 3080
Nvidia driver version: 517.02
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True
Versions of relevant libraries:
[pip3] numpy==1.22.3
[pip3] torch==2.0.0.dev20230117
[pip3] torchaudio==2.0.0a0+6a39b3e
[pip3] torchdata==0.5.1
[pip3] torchtext==0.15.0a0+38399ea
[conda] blas 1.0 mkl
[conda] cudatoolkit 11.3.1 h59b6b97_2
[conda] mkl 2021.4.0 haa95532_640
[conda] mkl-service 2.4.0 py39h2bbff1b_0
[conda] mkl_fft 1.3.1 py39h277e83a_0
[conda] mkl_random 1.2.2 py39hf11a4ad_0
[conda] numpy 1.22.3 py39h7a0a035_0
[conda] numpy-base 1.22.3 py39hca35cd5_0
[conda] pytorch 2.0.0.dev20230117 py3.9_cuda11.7_cudnn8_0 pytorch-nightly
[conda] pytorch-cuda 11.7 h67b0de4_2 pytorch-nightly
[conda] pytorch-mutex 1.0 cuda pytorch-nightly
[conda] torchaudio 2.0.0a0+6a39b3e dev_0 <develop>
[conda] torchdata 0.5.1 pypi_0 pypi
[conda] torchtext 0.15.0a0+38399ea dev_0 <develop>
Same. I was working with torchtext==0.15.1, but after downgrading to 0.14.0 it worked fine. Additionally, in 0.15.1 the labels are just 1 and 2, not 'neg' and 'pos'.
In torchtext==0.14.0:
import torchtext
from collections import Counter

def check_labels(_iter) -> None:
    labels = [batch[0] for batch in _iter]
    counter = Counter(labels)
    print(counter.items())

train_iter, test_iter = torchtext.datasets.IMDB(split=('train', 'test'))
check_labels(train_iter)
check_labels(test_iter)
dict_items([('neg', 12500), ('pos', 12500)])
dict_items([('neg', 12500), ('pos', 12500)])
In torchtext==0.15.1:
import torch, torchtext

def check_labels(_iter) -> None:
    labels = torch.tensor([batch[0] for batch in _iter])
    unique_labels, counts = labels.unique(return_counts=True)
    print(unique_labels.tolist(), counts.tolist())

train_iter, test_iter = torchtext.datasets.IMDB(split=('train', 'test'))
check_labels(train_iter)
check_labels(test_iter)
[1] [12500]
[1, 2] [12500, 12500]
Additionally, in 0.15.1 labels are just 1 and 2, not 'neg' and 'pos'.
Hi @SeanNobel, this behavior was actually purposefully changed in https://github.com/pytorch/text/pull/1914 to ensure that all datasets have consistent integer labels. Let me take another look to see if I can repro the behavior. @SeanNobel, can you confirm what OS you see this issue on? Previously, I wasn't able to repro the issue on a Linux or a Mac device.
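For anyone who needs the string labels back after that change, here is a minimal sketch (the 1 → 'neg', 2 → 'pos' mapping is an assumption inferred from the 0.14.0 output order above, not something I've confirmed against the PR):

from collections import Counter
from torchtext.datasets import IMDB

# Assumed mapping, inferred from the 0.14.0 output above: 1 -> 'neg', 2 -> 'pos'.
LABEL_NAMES = {1: 'neg', 2: 'pos'}

train_iter = IMDB(split='train')
# Iterate the full split once so the datapipe is exhausted.
print(Counter(LABEL_NAMES[label] for label, _ in train_iter))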
@Nayef211 Thanks, I'm creating lecture material and working in VS Code Jupyter. I ran the same code above on Colab, and it turned out that the dataset loads correctly with torchtext==0.15.1 on Colab (after pip-installing portalocker).
Although the notebook will be run on Colab in the lecture, the issue still seems to be present in my environment:
PyTorch version: 2.0.0+cu117
Is debug build: False
CUDA used to build PyTorch: 11.7
ROCM used to build PyTorch: N/A
OS: Ubuntu 20.04.5 LTS (x86_64)
GCC version: (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0
Clang version: Could not collect
CMake version: version 3.26.1
Libc version: glibc-2.31
Python version: 3.9.12 (main, Jun 1 2022, 11:38:51) [GCC 7.5.0] (64-bit runtime)
Python platform: Linux-5.4.0-139-generic-x86_64-with-glibc2.31
Is CUDA available: True
CUDA runtime version: Could not collect
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration:
GPU 0: NVIDIA RTX A6000
GPU 1: NVIDIA RTX A6000
GPU 2: NVIDIA RTX A6000
GPU 3: NVIDIA RTX A6000
GPU 4: NVIDIA RTX A6000
Versions of relevant libraries:
[pip3] mypy-extensions==0.4.3
[pip3] numpy==1.22.4
[pip3] pytorch-lightning==1.9.0
[pip3] torch==2.0.0
[pip3] torchaudio==2.0.1
[pip3] torchdata==0.6.0
[pip3] torchmetrics==0.9.2
[pip3] torchtext==0.15.1
[pip3] torchvision==0.15.1
[conda] numpy 1.22.4 pypi_0 pypi
[conda] pytorch-lightning 1.9.0 pypi_0 pypi
[conda] torch 2.0.0 pypi_0 pypi
[conda] torchaudio 2.0.1 pypi_0 pypi
[conda] torchdata 0.6.0 pypi_0 pypi
[conda] torchmetrics 0.9.2 pypi_0 pypi
[conda] torchtext 0.15.1 pypi_0 pypi
[conda] torchvision 0.15.1 pypi_0 pypi
Same on Ubuntu. But I've solved it: just remove the cache directory ~/.cache/torch/datasets/IMDB/aclImdb_v1.
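If you prefer to do this from Python rather than the shell, a minimal sketch (the path below is the default extraction directory mentioned above; adjust it if you passed a custom root):

import os
import shutil

# Default extraction directory reported above; adjust if you used a custom `root`.
cache_dir = os.path.expanduser("~/.cache/torch/datasets/IMDB/aclImdb_v1")
if os.path.isdir(cache_dir):
    shutil.rmtree(cache_dir)  # forces torchtext to re-extract the archive on next load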
I experienced this issue on Windows 11 (torch 2.0.1, torchtext 0.15.2), but in my case I had root defined as ../data/IMDB.
Based on the comment from @leitelyaya, I removed the directory /IMDB/aclImdb_v1 (but kept aclImdb_v1.tar.gz) and ran again, and it worked.
It's weird, because the same compressed file was used, so it seems the issue happened when extracting the contents the first time (skipping the pos directory).
I'm seeing the same issue in torchtext 0.16.1, and removing the aclImdb_v1 directory doesn't seem to help for me. In my case it's running on a Linux machine, not Windows.
Here is one observation that might help anyone trying to debug this. It seems like I break something in the IMDB dataset if I sample some items from the iterator before looping through it fully. This code always reproduces the problem:
from torchtext import datasets

ds = datasets.IMDB('./data', split='train')

# print one sample from the dataset
for label, text in ds:
    print(text, label)
    break

# check label counts
counts = {}
for label, text in ds:
    counts[label] = counts.get(label, 0) + 1
for key, value in counts.items():
    print(f"label: {key}, count: {value}")
This prints (I cut out some of the review text):
I rented I AM CURIOUS-YELLOW from my video store because of all the controversy that surrounded it... 1
/usr/local/lib/python3.10/dist-packages/torch/utils/data/datapipes/iter/combining.py:297: UserWarning: Some child DataPipes are not exhausted when __iter__ is called. We are resetting the buffer and each child DataPipe will read from the start again.
warnings.warn("Some child DataPipes are not exhausted when __iter__ is called. We are resetting "
label: 1, count: 12500
After this, the file train.torchdata_list in the directory data/datasets/IMDB is only 67 bytes, and any later attempt to load the dataset will always be broken, i.e., have only half the data (with label 1).
If I instead remove the data/datasets/IMDB directory and run the code without the "print one sample from the dataset" part, it works fine: the resulting train.torchdata_list file has a size of 134 bytes, and the code outputs the expected:
label: 1, count: 12500
label: 2, count: 12500
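Until the underlying datapipe caching is fixed, one workaround (a sketch, not an official fix) is to exhaust the iterator once into a list and sample from that, so the cache file is never written from a partially consumed iterator:

from torchtext import datasets

# Materialize the datapipe once so it is fully exhausted before any sampling.
train_data = list(datasets.IMDB('./data', split='train'))

# Sampling individual items from the list cannot leave the cache half-written.
label, text = train_data[0]
print(label, text[:80])

counts = {}
for label, _ in train_data:
    counts[label] = counts.get(label, 0) + 1
print(counts)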
🐛 Bug
Description
If I try to load the IMDb data via torchtext.datasets.IMDB() with no root argument, the docs say the data should be loaded to os.path.expanduser('~/.torchtext/cache'), which is 'C:\\Users\\USERNAME/.torchtext/cache'. Yet I cannot find the data there, and (what is actually the bigger problem) only half the data is pulled (the 1-labelled half). If I explicitly set root = 'C:/Users/USERNAME/.torchtext/cache', everything works fine. This might be caused by some path problems, but the torchtext.datasets.IMDB source is a little cryptic for me to read...
To Reproduce
Run:
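A minimal sketch of the loading code described above (counting labels once with the default root and once with an explicit one):

from collections import Counter
from torchtext.datasets import IMDB

def count_labels(split_iter):
    return Counter(label for label, _ in split_iter)

# With the default root, only the 1-labelled half shows up:
print(count_labels(IMDB(split='train')))

# With an explicit root, both labels show up as expected:
print(count_labels(IMDB(root='C:/Users/USERNAME/.torchtext/cache', split='train')))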
Expected Behaviour
It should not make any difference.
Environment
Running the loader with the default root in this notebook from the tutorial on Colab works, despite the packages being mostly the same. I appreciate any help; keep up the good work!
Edit: Typo and unification of both approaches.