hqucms / weaver-core

Streamlined neural network training.
MIT License

A problem running the weaver on M2 Max #6

Closed Abdualazem closed 1 year ago

Abdualazem commented 1 year ago

Hi,

I'm trying to test training with ParTr on an M2 Max. I've seen that the recent version of weaver supports the M1, but I don't know whether it should work on the M2 too. Here's what I get when running the training with the GPU enabled via '--gpus 0':

` [2023-07-11 17:44:45,693] INFO: Computational complexity: 632.51 MMac
[2023-07-11 17:44:45,693] INFO: Number of parameters: 2.14 M
[2023-07-11 17:44:45,693] INFO: Using loss function CrossEntropyLoss() with options {}
[2023-07-11 17:44:45,756] INFO: Create Tensorboard summary writer with comment test_ParT_Wtag_v1_run1_20230711_174444
Traceback (most recent call last):
  File "/opt/homebrew/anaconda3/envs/weaver/bin/weaver", line 8, in <module>
    sys.exit(main())
  File "/opt/homebrew/anaconda3/envs/weaver/lib/python3.10/site-packages/weaver/train.py", line 931, in main
    _main(args)
  File "/opt/homebrew/anaconda3/envs/weaver/lib/python3.10/site-packages/weaver/train.py", line 748, in _main
    model = orig_model.to(dev)
  File "/opt/homebrew/anaconda3/envs/weaver/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1145, in to
    return self._apply(convert)
  File "/opt/homebrew/anaconda3/envs/weaver/lib/python3.10/site-packages/torch/nn/modules/module.py", line 797, in _apply
    module._apply(fn)
  File "/opt/homebrew/anaconda3/envs/weaver/lib/python3.10/site-packages/torch/nn/modules/module.py", line 797, in _apply
    module._apply(fn)
  File "/opt/homebrew/anaconda3/envs/weaver/lib/python3.10/site-packages/torch/nn/modules/module.py", line 797, in _apply
    module._apply(fn)
  File "/opt/homebrew/anaconda3/envs/weaver/lib/python3.10/site-packages/torch/nn/modules/module.py", line 820, in _apply
    param_applied = fn(param)
  File "/opt/homebrew/anaconda3/envs/weaver/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1143, in convert
    return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None, non_blocking)
  File "/opt/homebrew/anaconda3/envs/weaver/lib/python3.10/site-packages/torch/cuda/__init__.py", line 239, in _lazy_init
    raise AssertionError("Torch not compiled with CUDA enabled")
AssertionError: Torch not compiled with CUDA enabled

real 0m1.443s
user 0m2.215s
sys 0m4.826s `

But when I switch the GPU off with '--gpus ""', I get a different error:

` [2023-07-11 17:40:09,773] INFO: Computational complexity: 632.51 MMac
[2023-07-11 17:40:09,773] INFO: Number of parameters: 2.14 M
[2023-07-11 17:40:09,773] INFO: Using loss function CrossEntropyLoss() with options {}
[2023-07-11 17:40:09,820] INFO: Create Tensorboard summary writer with comment test_ParT_Wtag_v1_run1_20230711_174008
[2023-07-11 17:40:09,872] INFO: Optimizer options: {}
[2023-07-11 17:40:09,874] INFO: --------------------------------------------------
[2023-07-11 17:40:09,874] INFO: Epoch #0 training
0it [00:00, ?it/s]
Traceback (most recent call last):
  File "/opt/homebrew/anaconda3/envs/weaver/bin/weaver", line 8, in <module>
    sys.exit(main())
  File "/opt/homebrew/anaconda3/envs/weaver/lib/python3.10/site-packages/weaver/train.py", line 931, in main
    _main(args)
  File "/opt/homebrew/anaconda3/envs/weaver/lib/python3.10/site-packages/weaver/train.py", line 784, in _main
    train(model, loss_func, opt, scheduler, train_loader, dev, epoch,
  File "/opt/homebrew/anaconda3/envs/weaver/lib/python3.10/site-packages/weaver/utils/nn/tools.py", line 45, in train_classification
    for X, y, _ in tq:
  File "/opt/homebrew/anaconda3/envs/weaver/lib/python3.10/site-packages/tqdm/std.py", line 1178, in __iter__
    for obj in iterable:
  File "/opt/homebrew/anaconda3/envs/weaver/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 441, in __iter__
    return self._get_iterator()
  File "/opt/homebrew/anaconda3/envs/weaver/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 388, in _get_iterator
    return _MultiProcessingDataLoaderIter(self)
  File "/opt/homebrew/anaconda3/envs/weaver/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 1042, in __init__
    w.start()
  File "/opt/homebrew/anaconda3/envs/weaver/lib/python3.10/multiprocessing/process.py", line 121, in start
    self._popen = self._Popen(self)
  File "/opt/homebrew/anaconda3/envs/weaver/lib/python3.10/multiprocessing/context.py", line 224, in _Popen
    return _default_context.get_context().Process._Popen(process_obj)
  File "/opt/homebrew/anaconda3/envs/weaver/lib/python3.10/multiprocessing/context.py", line 288, in _Popen
    return Popen(process_obj)
  File "/opt/homebrew/anaconda3/envs/weaver/lib/python3.10/multiprocessing/popen_spawn_posix.py", line 32, in __init__
    super().__init__(process_obj)
  File "/opt/homebrew/anaconda3/envs/weaver/lib/python3.10/multiprocessing/popen_fork.py", line 19, in __init__
    self._launch(process_obj)
  File "/opt/homebrew/anaconda3/envs/weaver/lib/python3.10/multiprocessing/popen_spawn_posix.py", line 47, in _launch
    reduction.dump(process_obj, fp)
  File "/opt/homebrew/anaconda3/envs/weaver/lib/python3.10/multiprocessing/reduction.py", line 60, in dump
    ForkingPickler(file, protocol).dump(obj)
  File "/opt/homebrew/anaconda3/envs/weaver/lib/python3.10/site-packages/weaver/utils/data/config.py", line 204, in __getattr__
    return self.options[name]
KeyError: '__getstate__'

real 0m1.322s
user 0m2.492s
sys 0m4.570s `

Of course, the code works fine when tested on lxslc7 in the IHEP cluster. Has anyone come across a similar error before? Can someone help with this? Thanks.

Cheers, Abdualazem.

hqucms commented 1 year ago

Hi @Abdualazem -- It seems that --num-workers has to be set to 0 on macOS. Can you try that?
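
For context, the traceback above passes through `popen_spawn_posix.py`: on macOS, Python starts DataLoader worker processes with the `spawn` method, which pickles the objects handed to each worker. The `KeyError` is raised from a `__getattr__` in `config.py` that indexes a dict, so looking up an attribute that does not exist raises `KeyError` rather than the `AttributeError` that optional-attribute probes expect. The sketch below reproduces that pattern under stated assumptions: `DictBackedConfig`, `SaferConfig`, `probe`, and the attribute name `optional_hook` are all hypothetical and are not weaver's code. With `--num-workers 0` no worker process is spawned, so nothing needs to be pickled.

```python
class DictBackedConfig:
    """Hypothetical stand-in: forwards unknown attributes to a dict."""

    def __init__(self, **options):
        self.options = options

    def __getattr__(self, name):
        # Only called when normal lookup fails; a missing key raises KeyError.
        return self.options[name]


class SaferConfig(DictBackedConfig):
    """Same idea, but missing names surface as AttributeError."""

    def __getattr__(self, name):
        try:
            return self.options[name]
        except KeyError:
            raise AttributeError(name) from None


def probe(obj, name="optional_hook"):
    """Mimic an optional-attribute probe such as getattr(obj, name, None)."""
    try:
        # The default is only used when AttributeError is raised.
        return getattr(obj, name, None)
    except KeyError as err:
        return f"KeyError: {err}"


print(probe(DictBackedConfig(batch_size=4)))  # the KeyError escapes the probe
print(probe(SaferConfig(batch_size=4)))       # the probe falls back to None
```

Raising `AttributeError` from `__getattr__` is the conventional way to keep such probes working; whether that alone would make the dataset picklable here is not something the thread establishes.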

Abdualazem commented 1 year ago

Hi Huilin,

Thanks a lot for the prompt reply. Setting --num-workers to 0 gives another error, which I've copied below:

0it [00:00, ?it/s]
Traceback (most recent call last):
  File "/opt/homebrew/anaconda3/envs/weaver/bin/weaver", line 8, in <module>
    sys.exit(main())
  File "/opt/homebrew/anaconda3/envs/weaver/lib/python3.10/site-packages/weaver/train.py", line 931, in main
    _main(args)
  File "/opt/homebrew/anaconda3/envs/weaver/lib/python3.10/site-packages/weaver/train.py", line 784, in _main
    train(model, loss_func, opt, scheduler, train_loader, dev, epoch,
  File "/opt/homebrew/anaconda3/envs/weaver/lib/python3.10/site-packages/weaver/utils/nn/tools.py", line 58, in train_classification
    model_output = model(*inputs)
  File "/opt/homebrew/anaconda3/envs/weaver/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/Users/abdualazem/QComputing/Transformer/Develop/PartTr/networks/example_ParticleTransformer.py", line 20, in forward
    return self.mod(features, v=lorentz_vectors, mask=mask)
  File "/opt/homebrew/anaconda3/envs/weaver/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/Users/abdualazem/QComputing/Transformer/Develop/PartTr/networks/ParticleTransformer.py", line 423, in forward
    attn_mask = self.pair_embed(v).view(-1, v.size(-1), v.size(-1))  # (N*num_heads, P, P)
  File "/opt/homebrew/anaconda3/envs/weaver/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/Users/abdualazem/QComputing/Transformer/Develop/PartTr/networks/ParticleTransformer.py", line 210, in forward
    i, j = torch.tril_indices(seq_len, seq_len, device=x.device)
NotImplementedError: The operator 'aten::tril_indices' is not currently implemented for the MPS device. If you want this op to be added in priority during the prototype phase of this feature, please comment on https://github.com/pytorch/pytorch/issues/77764. As a temporary fix, you can set the environment variable "PYTORCH_ENABLE_MPS_FALLBACK=1" to use the CPU as a fallback for this op. WARNING: this will be slower than running natively on MPS.

real 0m2.218s
user 0m2.337s
sys 0m5.305s

This is with --gpus still set to "". When --gpus is 0 or 1, I get the following error:

Traceback (most recent call last):
  File "/opt/homebrew/anaconda3/envs/weaver/bin/weaver", line 8, in <module>
    sys.exit(main())
  File "/opt/homebrew/anaconda3/envs/weaver/lib/python3.10/site-packages/weaver/train.py", line 931, in main
    _main(args)
  File "/opt/homebrew/anaconda3/envs/weaver/lib/python3.10/site-packages/weaver/train.py", line 748, in _main
    model = orig_model.to(dev)
  File "/opt/homebrew/anaconda3/envs/weaver/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1145, in to
    return self._apply(convert)
  File "/opt/homebrew/anaconda3/envs/weaver/lib/python3.10/site-packages/torch/nn/modules/module.py", line 797, in _apply
    module._apply(fn)
  File "/opt/homebrew/anaconda3/envs/weaver/lib/python3.10/site-packages/torch/nn/modules/module.py", line 797, in _apply
    module._apply(fn)
  File "/opt/homebrew/anaconda3/envs/weaver/lib/python3.10/site-packages/torch/nn/modules/module.py", line 797, in _apply
    module._apply(fn)
  File "/opt/homebrew/anaconda3/envs/weaver/lib/python3.10/site-packages/torch/nn/modules/module.py", line 820, in _apply
    param_applied = fn(param)
  File "/opt/homebrew/anaconda3/envs/weaver/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1143, in convert
    return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None, non_blocking)
  File "/opt/homebrew/anaconda3/envs/weaver/lib/python3.10/site-packages/torch/cuda/__init__.py", line 239, in _lazy_init
    raise AssertionError("Torch not compiled with CUDA enabled")
AssertionError: Torch not compiled with CUDA enabled

real 0m1.212s
user 0m1.759s
sys 0m5.239s

Cheers, Abdualazem.

hqucms commented 1 year ago

OK I think this is the issue:

The operator 'aten::tril_indices' is not currently implemented for the MPS device.

As a temporary fix, you can set the environment variable "PYTORCH_ENABLE_MPS_FALLBACK=1" to use the CPU as a fallback for this op. WARNING: this will be slower than running natively on MPS.

So you can try setting the environment variable PYTORCH_ENABLE_MPS_FALLBACK=1 in your script / command line.
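
A minimal sketch of doing this from inside a Python script rather than the shell. It assumes the variable has to be in the environment before `torch` is imported, so it is set at the very top. The helper name `pick_device_name` is hypothetical, and this is not how weaver itself selects its device.

```python
import os

# Set the fallback flag before torch is imported (assumption: the flag is
# read when PyTorch initialises, so setting it later may have no effect).
os.environ.setdefault("PYTORCH_ENABLE_MPS_FALLBACK", "1")


def pick_device_name():
    """Return 'cuda', 'mps' or 'cpu' depending on what this install supports.

    Hypothetical helper for illustration; returns 'cpu' if torch is missing.
    """
    try:
        import torch
    except ImportError:
        return "cpu"
    if torch.cuda.is_available():
        return "cuda"
    # torch.backends.mps only exists in PyTorch builds with MPS support.
    mps = getattr(torch.backends, "mps", None)
    if mps is not None and mps.is_available():
        return "mps"
    return "cpu"


print(pick_device_name())
```

On the command line, the same effect comes from prefixing the training command with `PYTORCH_ENABLE_MPS_FALLBACK=1`, or exporting the variable in the shell beforehand.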

Abdualazem commented 1 year ago

Hi Huilin,

Thanks for the help. Setting the environment variable PYTORCH_ENABLE_MPS_FALLBACK=1 solves the problem. However, this means that I'm using the CPU only. Is there any way to use the MPS resources?

Cheers, Abdualazem.

hqucms commented 1 year ago

That depends on when the needed operators will be supported in PyTorch:

NotImplementedError: The operator 'aten::tril_indices' is not currently implemented for the MPS device. If you want this op to be added in priority during the prototype phase of this feature, please comment on https://github.com/pytorch/pytorch/issues/77764.

Abdualazem commented 1 year ago

I see, thanks for sharing that link.