google / trax

Trax — Deep Learning with Clear Code and Speed

unmapped_aval() missing 1 required positional argument: 'aval' #1720

Open moeenm opened 2 years ago

moeenm commented 2 years ago

Description

I trained a NN on CPU multiple times; at that time, no GPU was detected on my machine. I then installed CUDA via pip, hoping to use the GPU. After that, I ran the same training code on the GPU and received the following error:

TypeError: unmapped_aval() missing 1 required positional argument: 'aval'

I could not find any similar issue on the web. More details can be found in the error logs. ...
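Before digging into trax, a quick check like the one below (illustrative, not part of the original report) confirms whether the newly installed CUDA is actually visible to jax:

# Illustrative check, not part of the original report: see which devices the
# installed jax/jaxlib pick up before running the trax training code.
import jax

print('jax version:', jax.__version__)
# With a working CUDA install this lists GPU devices; on a CPU-only setup
# jax silently falls back to CPU and logs a warning instead.
print(jax.devices())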

Environment information

OS: Ubuntu 20.04.3 LTS

$ pip freeze | grep trax

$ pip freeze | grep tensor
mesh-tensorflow==0.1.19
tensorboard==2.6.0
tensorboard-data-server==0.6.1
tensorboard-plugin-wit==1.8.0
tensorflow==2.6.2
tensorflow-datasets==4.4.0
tensorflow-estimator==2.6.0
tensorflow-gpu==2.6.0
tensorflow-hub==0.12.0
tensorflow-metadata==1.2.0
tensorflow-text==2.6.0

$ pip freeze | grep jax
jax==0.2.25
jaxlib==0.1.73+cuda11.cudnn805

$ python -V
Python 3.8.10

For bugs: reproduction and error logs

I previously used the following line to train the NN on CPU; now I am running it after installing CUDA, hoping it will run on the GPU:

train_model(GFA(), batch_size=100).run(4000)
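train_model and GFA are defined in the reporter's notebook and are not shown in the issue. Judging from the traceback below, train_model builds a trax training.Loop roughly along these lines; the following is a hypothetical sketch with placeholder data and loss, not the reporter's actual code:

# Hypothetical sketch of a train_model helper in the shape the traceback
# suggests. GFA(), the data stream and the loss are placeholders, not the
# reporter's actual code.
import numpy as np
from trax import layers as tl
from trax import optimizers
from trax.supervised import training

def _dummy_stream(batch_size):
  # Placeholder random batches of (inputs, targets, weights), standing in for
  # the notebook's real data arrays.
  while True:
    x = np.random.rand(batch_size, 16).astype('float32')
    y = np.random.randint(0, 2, size=(batch_size,))
    w = np.ones_like(y, dtype='float32')
    yield (x, y, w)

def train_model(model, batch_size=100, output_dir='train_dir'):
  train_task = training.TrainTask(
      labeled_data=_dummy_stream(batch_size),
      loss_layer=tl.CrossEntropyLoss(),      # placeholder loss
      optimizer=optimizers.Adam(0.01),
  )
  eval_task = training.EvalTask(
      labeled_data=_dummy_stream(batch_size),
      metrics=[tl.CrossEntropyLoss(), tl.Accuracy()],
  )
  # This Loop construction is the call that raises the TypeError on GPU.
  return training.Loop(model, train_task, eval_tasks=[eval_task],
                       output_dir=output_dir)

# Example: train_model(tl.Serial(tl.Dense(2), tl.LogSoftmax())).run(10)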

These are the warnings and errors that I get:

INFO:absl:Initializing hosts and devices: host_id 0, host_count 1, is_chief 1


TypeError                                 Traceback (most recent call last)
/tmp/ipykernel_3657782/2147776908.py in <module>
----> 1 train_model(GFA(), batch_size=100).run(4000)

/tmp/ipykernel_3657782/1258604588.py in train_model(model, batch_size, I_train, B_train, M_train, I_test, B_test, M_test, n_steps, output_dir)
     50     )
     51
---> 52     training_loop = training.Loop(model,
     53                                   train_task,
     54                                   eval_tasks=eval_task,

~/.local/lib/python3.8/site-packages/trax/supervised/training.py in __init__(self, model, tasks, eval_model, eval_tasks, output_dir, checkpoint_at, permanent_checkpoint_at, eval_at, which_task, n_devices, random_seed, loss_chunk_size, use_memory_efficient_trainer, callbacks)
    231
    232     # Create the optimizer for the training loss function.
--> 233     self._trainer_per_task = tuple(self._init_trainer(task) for task in tasks)
    234     self.load_checkpoint()
    235

~/.local/lib/python3.8/site-packages/trax/supervised/training.py in <genexpr>(.0)
    231
    232     # Create the optimizer for the training loss function.
--> 233     self._trainer_per_task = tuple(self._init_trainer(task) for task in tasks)
    234     self.load_checkpoint()
    235

~/.local/lib/python3.8/site-packages/trax/supervised/training.py in _init_trainer(self, task)
    280     )
    281     task.optimizer.tree_init(model_in_training.weights)
--> 282     return optimizers.Trainer(model_in_training, task.optimizer)
    283     # In the memory-efficient path, we initialize the model here.
    284     blocks, loss_layer = optimizers.trainer.extract_reversible_blocks(

~/.local/lib/python3.8/site-packages/trax/optimizers/trainer.py in __init__(self, model_with_loss, optimizer, n_devices)
     54
     55     # optimizer slots and opt_params may need to be replicated
---> 56     self._slots, self._opt_params = tl.for_n_devices(
     57         (self._optimizer.slots, self._optimizer.opt_params), self._n_devices)
     58

~/.local/lib/python3.8/site-packages/trax/layers/acceleration.py in for_n_devices(x, n_devices)
    234     else:
    235       return x
--> 236   return fastmath.nested_map(f, x)
    237
    238

~/.local/lib/python3.8/site-packages/trax/fastmath/numpy.py in nested_map(f, obj, level, ignore_nones)
    107     return [nested_map(f, y, level=level) for y in obj]
    108   if isinstance(obj, tuple):
--> 109     return tuple([nested_map(f, y, level=level) for y in obj])
    110   if isinstance(obj, dict):
    111     return {k: nested_map(f, v, level=level) for (k, v) in obj.items()}

~/.local/lib/python3.8/site-packages/trax/fastmath/numpy.py in <listcomp>(.0)
    107     return [nested_map(f, y, level=level) for y in obj]
    108   if isinstance(obj, tuple):
--> 109     return tuple([nested_map(f, y, level=level) for y in obj])
    110   if isinstance(obj, dict):
    111     return {k: nested_map(f, v, level=level) for (k, v) in obj.items()}

~/.local/lib/python3.8/site-packages/trax/fastmath/numpy.py in nested_map(f, obj, level, ignore_nones)
    107     return [nested_map(f, y, level=level) for y in obj]
    108   if isinstance(obj, tuple):
--> 109     return tuple([nested_map(f, y, level=level) for y in obj])
    110   if isinstance(obj, dict):
    111     return {k: nested_map(f, v, level=level) for (k, v) in obj.items()}

~/.local/lib/python3.8/site-packages/trax/fastmath/numpy.py in <listcomp>(.0)
    107     return [nested_map(f, y, level=level) for y in obj]
    108   if isinstance(obj, tuple):
--> 109     return tuple([nested_map(f, y, level=level) for y in obj])
    110   if isinstance(obj, dict):
    111     return {k: nested_map(f, v, level=level) for (k, v) in obj.items()}

~/.local/lib/python3.8/site-packages/trax/fastmath/numpy.py in nested_map(f, obj, level, ignore_nones)
    105     return type(obj)(*nested_map(f, list(obj), level=level))
    106   if isinstance(obj, list):
--> 107     return [nested_map(f, y, level=level) for y in obj]
    108   if isinstance(obj, tuple):
    109     return tuple([nested_map(f, y, level=level) for y in obj])

~/.local/lib/python3.8/site-packages/trax/fastmath/numpy.py in <listcomp>(.0)
    105     return type(obj)(*nested_map(f, list(obj), level=level))
    106   if isinstance(obj, list):
--> 107     return [nested_map(f, y, level=level) for y in obj]
    108   if isinstance(obj, tuple):
    109     return tuple([nested_map(f, y, level=level) for y in obj])

~/.local/lib/python3.8/site-packages/trax/fastmath/numpy.py in nested_map(f, obj, level, ignore_nones)
    100       return None
    101     else:
--> 102       return f(obj)
    103
    104   if _is_namedtuple_instance(obj):

~/.local/lib/python3.8/site-packages/trax/layers/acceleration.py in f(x)
    229   def f(x):
    230     if n_devices > 1 and fastmath.is_backend(fastmath.Backend.JAX):
--> 231       return _multi_device_put(x)
    232     elif n_devices > 1:
    233       return jnp.broadcast_to(x, (n_devices,) + jnp.asarray(x).shape)

~/.local/lib/python3.8/site-packages/trax/layers/acceleration.py in _multi_device_put(x, devices)
    285   # but it does one PCI transfer and later uses ICI.
    286   # TODO(lukaszkaiser): remove once JAX has a core function to do the same.
--> 287   aval = jax.core.unmapped_aval(len(devices), 0,
    288                                 jax.core.raise_to_shaped(jax.core.get_aval(x)))
    289   buf, = jax.xla.device_put(x, devices[0])  # assuming single-buf repr

TypeError: unmapped_aval() missing 1 required positional argument: 'aval'
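The last frame is trax/layers/acceleration.py calling jax.core.unmapped_aval with three positional arguments. In the installed jax 0.2.25 that function appears to take an extra axis_name parameter, so the final required argument (aval) goes unbound; in other words, this looks like a trax/jax version mismatch rather than a CUDA problem. A quick way to check the installed signature (a diagnostic sketch, not part of the original report):

# Diagnostic sketch, not part of the original report: compare the signature of
# jax.core.unmapped_aval with the three-argument call trax makes at
# trax/layers/acceleration.py line 287.
import inspect

import jax
from jax import core

print('jax version:', jax.__version__)
print('unmapped_aval signature:', inspect.signature(core.unmapped_aval))
# trax calls unmapped_aval(len(devices), 0, aval). If the printed signature
# shows four parameters (e.g. an added axis_name), that call leaves 'aval'
# unbound and raises the TypeError reported above.

If the signatures disagree, pinning jax/jaxlib to the versions the installed trax release was tested against (or upgrading trax) is the usual way to resolve this class of error.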

# Steps to reproduce: Train the network on GPU
...
# Error logs:
TypeError: unmapped_aval() missing 1 required positional argument: 'aval'
...
moeenm commented 2 years ago

I tried multiple things. I am not sure, but I think the problem was resolved after I ran:

sudo apt install cuda

Nothing new was installed, but there was a message:

The following packages were automatically installed and are no longer required:
  cuda-cccl-11-4 cuda-command-line-tools-11-4 cuda-compiler-11-4 cuda-cudart-11-4
  cuda-cudart-dev-11-4 cuda-cuobjdump-11-4 cuda-cupti-11-4 cuda-cupti-dev-11-4
  cuda-cuxxfilt-11-4 cuda-documentation-11-4 cuda-driver-dev-11-4 cuda-gdb-11-4
  cuda-libraries-11-4 cuda-libraries-dev-11-4 cuda-memcheck-11-4 cuda-nsight-11-4
  cuda-nsight-compute-11-4 cuda-nsight-systems-11-4 cuda-nvcc-11-4 cuda-nvdisasm-11-4
  cuda-nvml-dev-11-4 cuda-nvprof-11-4 cuda-nvprune-11-4 cuda-nvrtc-11-4
  cuda-nvrtc-dev-11-4 cuda-nvtx-11-4 cuda-nvvp-11-4 cuda-samples-11-4
  cuda-sanitizer-11-4 cuda-toolkit-11-4 cuda-toolkit-11-4-config-common cuda-tools-11-4
  cuda-visual-tools-11-4 g++-8 gds-tools-11-4 javascript-common lib32gcc-s1
  lib32stdc++6 libaccinj64-10.1 libc6-i386 libclang-10-dev libclang-common-10-dev
  libclang-dev libclang1-10 libcublas-11-4 libcublas-dev-11-4 libcublaslt10
  libcudart10.1 libcufft-11-4 libcufft-dev-11-4 libcufft10 libcufftw10
  libcufile-11-4 libcufile-dev-11-4 libcuinj64-10.1 libcupti-dev libcupti-doc
  libcupti10.1 libcurand-11-4 libcurand-dev-11-4 libcurand10 libcusolver-11-4
  libcusolver-dev-11-4 libcusolver10 libcusolvermg10 libcusparse-11-4
  libcusparse-dev-11-4 libcusparse10 libegl-dev libgl-dev libgl1-mesa-dev
  libgles-dev libgles1 libglvnd-dev libglx-dev libjs-jquery libjs-underscore
  libllvm10 libncurses5 libnpp-11-4 libnpp-dev-11-4 libnppc10 libnppial10
  libnppicc10 libnppicom10 libnppidei10 libnppif10 libnppig10 libnppim10
  libnppist10 libnppisu10 libnppitc10 libnpps10 libnvblas10 libnvgraph10
  libnvidia-common-470 libnvidia-ml-dev libnvjpeg-11-4 libnvjpeg-dev-11-4
  libnvjpeg10 libnvrtc10.1 libnvtoolsext1 libnvvm3 libobjc-9-dev libobjc4
  libopengl-dev libpq5 libstdc++-8-dev libthrust-dev libtinfo5 libvdpau-dev
  linux-headers-5.11.0-37-generic linux-hwe-5.11-headers-5.11.0-37
  linux-image-5.11.0-34-generic linux-image-5.11.0-37-generic
  linux-modules-5.11.0-34-generic linux-modules-5.11.0-37-generic
  linux-modules-extra-5.11.0-34-generic linux-modules-extra-5.11.0-37-generic
  node-html5shiv nsight-compute nsight-compute-2021.2.2 nsight-compute-2021.3.0
  nsight-systems nsight-systems-2021.3.2 nvidia-cuda-doc nvidia-cuda-gdb
  nvidia-opencl-dev nvidia-profiler nvidia-visual-profiler ocl-icd-opencl-dev
  opencl-c-headers
Use 'sudo apt autoremove' to remove them.

Then I ran the following, and it looks like the problem is resolved:

sudo apt autoremove
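For anyone hitting the same thing, a small check like the one below (illustrative, not part of the original comment) confirms whether computations now actually land on the GPU after the reinstall:

# Illustrative post-fix check, not part of the original comment: verify that
# jax now places computations on the GPU instead of falling back to CPU.
import jax.numpy as jnp

x = jnp.ones((1024, 1024))
y = jnp.dot(x, x)
# In jax 0.2.x a DeviceArray exposes its location via device_buffer.device().
print(y.device_buffer.device())   # expect a GPU device here after the fix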