pytorch / text

Models, data loaders and abstractions for language processing, powered by PyTorch
https://pytorch.org/text
BSD 3-Clause "New" or "Revised" License
3.49k stars 813 forks source link

Passing a list of integers instead of a list of strings to torchtext.voca.lookup_indices CRASHES the kernel #2184

Open Arunprakash-A opened 1 year ago

Arunprakash-A commented 1 year ago

🐛 Describe the bug

I was using torchtext to construct a vocabulary (v) and tokenize the input text. I just wanted to check the indices of the tokens and therefore used the method v.vocab.lookup_indices(). Accidentally, I passed the list of integers as an argument instead of the list of string as expected by the method. Then it crashed the Jupyter kernel (I am using colab notebook).

# Assume all necessary imports
v = build_vocab_from_iterator(yield_tokens(sentences),min_freq=1,specials=['<pad>','<unk>','<mask>'])
v.set_default_index(v['<unk>'])  
# the Jupyter kernel crashed after executing the statement below
v.vocab.lookup_indices([0,1,2])

I looked at the source code at https://pytorch.org/text/stable/_modules/torchtext/vocab/vocab.html#Vocab.lookup_indices and found that the type checking (of elements in the list) is missing in the function definition.

def lookup_indices(self, tokens: List[str]) -> List[int]:
        r"""
        Args:
            tokens: the tokens used to lookup their corresponding `indices`.

        Returns:
            The 'indices` associated with `tokens`.
        """
        return self.vocab.lookup_indices(tokens)

Versions

Collecting environment information... PyTorch version: 2.0.0+cu117 Is debug build: False CUDA used to build PyTorch: 11.7 ROCM used to build PyTorch: N/A

OS: Ubuntu 20.04.6 LTS (x86_64) GCC version: (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0 Clang version: 10.0.0-4ubuntu1 CMake version: version 3.25.2 Libc version: glibc-2.31

Python version: 3.10.12 (main, Jun 7 2023, 12:45:35) [GCC 9.4.0] (64-bit runtime) Python platform: Linux-5.15.107+-x86_64-with-glibc2.31 Is CUDA available: False CUDA runtime version: 11.8.89 CUDA_MODULE_LOADING set to: N/A GPU models and configuration: Could not collect Nvidia driver version: Could not collect cuDNN version: Probably one of the following: /usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.0 /usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.0 /usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.0 /usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.0 /usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.0 /usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.0 /usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.0 HIP runtime version: N/A MIOpen runtime version: N/A Is XNNPACK available: True

CPU: Architecture: x86_64 CPU op-mode(s): 32-bit, 64-bit Byte Order: Little Endian Address sizes: 46 bits physical, 48 bits virtual CPU(s): 2 On-line CPU(s) list: 0,1 Thread(s) per core: 2 Core(s) per socket: 1 Socket(s): 1 NUMA node(s): 1 Vendor ID: GenuineIntel CPU family: 6 Model: 79 Model name: Intel(R) Xeon(R) CPU @ 2.20GHz Stepping: 0 CPU MHz: 2199.998 BogoMIPS: 4399.99 Hypervisor vendor: KVM Virtualization type: full L1d cache: 32 KiB L1i cache: 32 KiB L2 cache: 256 KiB L3 cache: 55 MiB NUMA node0 CPU(s): 0,1 Vulnerability Itlb multihit: Not affected Vulnerability L1tf: Mitigation; PTE Inversion Vulnerability Mds: Vulnerable; SMT Host state unknown Vulnerability Meltdown: Vulnerable Vulnerability Mmio stale data: Vulnerable Vulnerability Retbleed: Vulnerable Vulnerability Spec store bypass: Vulnerable Vulnerability Spectre v1: Vulnerable: __user pointer sanitization and usercopy barriers only; no swapgs barriers Vulnerability Spectre v2: Vulnerable, IBPB: disabled, STIBP: disabled, PBRSB-eIBRS: Not affected Vulnerability Srbds: Not affected Vulnerability Tsx async abort: Vulnerable Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology nonstop_tsc cpuid tsc_known_freq pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single ssbd ibrs ibpb stibp fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm rdseed adx smap xsaveopt arat md_clear arch_capabilities

Versions of relevant libraries: [pip3] numpy==1.22.4 [pip3] torch==2.0.0 [pip3] torchaudio==2.0.2+cu118 [pip3] torchdata==0.6.0 [pip3] torchsummary==1.5.1 [pip3] torchtext==0.15.2 [pip3] torchvision==0.15.2+cu118 [pip3] triton==2.0.0 [conda] Could not collect

bdhirsh commented 1 year ago

Moving this issue to torch/text, since it looks specific to some torchtext APIs.