libffcv / ffcv

FFCV: Fast Forward Computer Vision (and other ML workloads!)
https://ffcv.io
Apache License 2.0

Error in */ffcv/pipeline/graph.py: "AttributeError: 'str' object has no attribute 'type'" #380

Open paullintilhac opened 3 days ago

paullintilhac commented 3 days ago

I am trying to run code from the following repo, which uses ffcv: https://github.com/MadryLab/datamodels. I have tried to run the example in this repo both on my university's Slurm cluster and on Google Colab, and I keep ending up with the same error:

AttributeError: 'str' object has no attribute 'type'

I was able to get rid of this error by editing one of the ffcv package source files directly, but that predictably led to a different error on one of my systems. Specifically, in the file referenced below (/usr/local/lib/python3.10/dist-packages/ffcv/pipeline/graph.py), I changed line 333 from if next_state.device.type != 'cuda' to if next_state.device != 'cuda:0'. With that edit the code runs on Google Colab, but on my lab's Slurm cluster I get a different error: RuntimeError: No HIP GPUs are available.
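The edit described above can be made less brittle by normalizing the device before comparing it. The following is a minimal sketch, assuming (as the traceback suggests) that next_state.device sometimes arrives as a plain string such as 'cuda:0' rather than an object with a .type attribute. The helper name device_type and the stand-in object fake_device are hypothetical and are not part of ffcv.

```python
from types import SimpleNamespace


def device_type(device):
    """Return the device type ('cuda', 'cpu', ...) of a device spec.

    Accepts either a plain string such as 'cuda:0' or an object that
    exposes a `.type` attribute, as torch.device does.
    """
    if isinstance(device, str):
        # 'cuda:0' -> 'cuda'; 'cpu' -> 'cpu'
        return device.split(":", 1)[0]
    return device.type


# Reading `.type` directly on a string reproduces the reported error message.
try:
    "cuda:0".type
except AttributeError as exc:
    error_message = str(exc)

# Stand-in for a torch.device-like object (hypothetical, for illustration only).
fake_device = SimpleNamespace(type="cuda", index=0)

print(error_message)
print(device_type("cuda:0"), device_type("cuda:1"), device_type("cpu"))
print(device_type(fake_device))
```

With a helper like this, the check on line 333 could be written as device_type(next_state.device) != 'cuda'. Unlike a comparison against the literal 'cuda:0', that would also treat other device indices such as 'cuda:1' as CUDA devices. This has not been tested against ffcv itself.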

Is anyone else experiencing this error with ffcv when trying to run training code?

Steps to reproduce (the file system shown here is what I used for colab, but you can replace the paths with whatever download directory you use on whatever system you have):

install dependencies

    git clone https://github.com/MadryLab/datamodels.git
    cd datamodels
    pip install fastargs
    pip install terminaltables
    wget http://ftp.gnu.org/gnu/parallel/parallel-latest.tar.bz2
    tar xjf parallel-latest.tar.bz2
    cd /content/datamodels/parallel-20240622
    ./configure && make
    make install
    parallel --version
    cd /content/datamodels
    apt update && apt install -y --no-install-recommends libopencv-dev libturbojpeg-dev
    cp -f /usr/lib/x86_64-linux-gnu/pkgconfig/opencv.pc /usr/lib/x86_64-linux-gnu/pkgconfig/opencv4.pc
    pip install mosaicml ffcv numba opencv-python
    import torch
    pip install cupy-cuda12x
    from typing import List

download dataset

    import torch as ch
    import torchvision

    from ffcv.fields import IntField, RGBImageField
    from ffcv.fields.decoders import IntDecoder, SimpleRGBImageDecoder
    from ffcv.loader import Loader, OrderOption
    from ffcv.pipeline.operation import Operation
    from ffcv.transforms import RandomHorizontalFlip, Cutout, \
        RandomTranslate, Convert, ToDevice, ToTensor, ToTorchImage
    from ffcv.transforms.common import Squeeze
    from ffcv.writer import DatasetWriter

    datasets = {
        'train': torchvision.datasets.CIFAR10('/content', train=True, download=True),
        'test': torchvision.datasets.CIFAR10('/content', train=False, download=True)
    }

    for (name, ds) in datasets.items():
        writer = DatasetWriter(f'/content/cifar_{name}.beton', {
            'image': RGBImageField(),
            'label': IntField()
        })
        writer.from_indexed_dataset(ds)

run the example

    bash examples/cifar10/example.sh

I have tried many different conda environments, with Python 3.8 (as the repo suggests) and 3.9, CUDA 12.1 and 12.2, and ROCm 6 and 5.4. All of them give me one of the two errors above.
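Because the two errors depend on the environment, it may help to record which PyTorch build each environment actually contains. The sketch below assumes the "No HIP GPUs are available" message comes from a ROCm (HIP) build of PyTorch that cannot see a GPU, which is the usual source of that wording but is not confirmed here. The function name describe_torch_build is hypothetical.

```python
import importlib.util


def describe_torch_build():
    """Summarize the installed PyTorch build, or report that it is missing."""
    if importlib.util.find_spec("torch") is None:
        return {"torch_installed": False}

    import torch

    return {
        "torch_installed": True,
        "torch_version": torch.__version__,
        # Set on CUDA builds, None otherwise.
        "cuda_build": torch.version.cuda,
        # Set on ROCm builds, None otherwise.
        "hip_build": getattr(torch.version, "hip", None),
        # True only if the build can actually see a GPU.
        "gpu_available": torch.cuda.is_available(),
    }


print(describe_torch_build())
```

Running this in each conda environment shows whether a ROCm build ended up on a machine with NVIDIA GPUs (or the reverse), and whether any GPU is visible to the process at all.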

Any idea how I can get around this? Full stack trace:

    Parameter                 Value
    worker.index              1
    worker.main_import        examples.cifar10.train_cifar
    worker.logdir             /tmp/10921
    worker.do_if_complete     False
    worker.job_timeout        99999999
    training.lr               0.5
    training.epochs           24
    training.lr_peak_epoch    5
    training.batch_size       512
    training.momentum         0.9
    training.weight_decay     0.0005
    training.label_smoothing  0.1
    training.num_workers      1
    training.lr_tta           True
    data.train_dataset        /content/cifar_train.beton
    data.val_dataset          /content/cifar-ffcv/cifar_val.beton

    logging in /tmp/10921
    Traceback (most recent call last):
      File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
        return _run_code(code, main_globals, None,
      File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
        exec(code, run_globals)
      File "/content/datamodels/datamodels/training/worker.py", line 109, in <module>
        main()
      File "/usr/local/lib/python3.10/dist-packages/fastargs/decorators.py", line 63, in result
        return func(*args, **kwargs)
      File "/usr/local/lib/python3.10/dist-packages/fastargs/decorators.py", line 35, in __call__
        return self.func(*args, **filled_args)
      File "/content/datamodels/datamodels/training/worker.py", line 105, in main
        status = do_index(routine=routine)
      File "/usr/local/lib/python3.10/dist-packages/fastargs/decorators.py", line 63, in result
        return func(*args, **kwargs)
      File "/usr/local/lib/python3.10/dist-packages/fastargs/decorators.py", line 35, in __call__
        return self.func(*args, **filled_args)
      File "/content/datamodels/datamodels/training/worker.py", line 84, in do_index
        to_log = routine(index=index, logdir=str(worker_logs))
      File "/content/datamodels/examples/cifar10/train_cifar.py", line 185, in main
        loaders = make_dataloaders(mask=np.nonzero(mask)[0])
      File "/usr/local/lib/python3.10/dist-packages/fastargs/decorators.py", line 63, in result
        return func(*args, **kwargs)
      File "/usr/local/lib/python3.10/dist-packages/fastargs/decorators.py", line 35, in __call__
        return self.func(*args, **filled_args)
      File "/content/datamodels/examples/cifar10/train_cifar.py", line 76, in make_dataloaders
        loaders[name] = Loader(paths[name], indices=(mask if name == 'train' else None),
      File "/usr/local/lib/python3.10/dist-packages/ffcv/loader/loader.py", line 210, in __init__
        self.generate_code()
      File "/usr/local/lib/python3.10/dist-packages/ffcv/loader/loader.py", line 275, in generate_code
        queries, code = self.graph.collect_requirements()
      File "/usr/local/lib/python3.10/dist-packages/ffcv/pipeline/graph.py", line 352, in collect_requirements
        self.collect_requirements(next_state, node, allocations, code, source_field=source_field)
      File "/usr/local/lib/python3.10/dist-packages/ffcv/pipeline/graph.py", line 352, in collect_requirements
        self.collect_requirements(next_state, node, allocations, code, source_field=source_field)
      File "/usr/local/lib/python3.10/dist-packages/ffcv/pipeline/graph.py", line 352, in collect_requirements
        self.collect_requirements(next_state, node, allocations, code, source_field=source_field)
      [Previous line repeated 3 more times]
      File "/usr/local/lib/python3.10/dist-packages/ffcv/pipeline/graph.py", line 333, in collect_requirements
        if next_state.device.type != 'cuda' and isinstance(operation,
    AttributeError: 'str' object has no attribute 'type'

Can anyone help me understand why ffcv is throwing this error? Why are the semantics for accessing the device different on the systems I'm using, i.e. why does the device have no type attribute there, when the ffcv library expects one? And what is the correct way to handle this?

-Paul