Closed: Abdualazem closed this issue 1 year ago.
Hi @Abdualazem -- it seems that `--num-workers` has to be set to `0` on macOS. Can you try that?
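In case it helps, here is a minimal sketch of the likely reason (illustration only; `OptionsProxy` is a toy stand-in, not weaver code, and the behaviour described applies to Python 3.10 as used in this thread):

```python
import multiprocessing as mp
import pickle

# On macOS, Python defaults to the 'spawn' start method, so each DataLoader
# worker is a fresh interpreter and the dataset/config objects must be
# transferred by pickling.
print(mp.get_start_method())  # typically 'spawn' on macOS


class OptionsProxy:
    """Toy stand-in for a config object whose __getattr__ forwards to a dict."""

    def __init__(self):
        self.options = {}

    def __getattr__(self, name):
        # pickle probes for dunder hooks (e.g. __getnewargs_ex__, __getstate__)
        # via ordinary attribute lookup; when __getattr__ raises KeyError
        # instead of AttributeError for those probes, pickling fails.
        # In weaver's case this surfaced as KeyError: '__getstate__'.
        return self.options[name]


try:
    pickle.dumps(OptionsProxy())
except KeyError as err:
    print("pickling failed with KeyError:", err)
```

With `--num-workers 0` the data loading stays in the main process, so nothing has to be pickled.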
Hi Huilin,
Thanks a lot for the prompt reply. Setting `--num-workers` to `0` gives another error. I copied it below:
```
0it [00:00, ?it/s]
Traceback (most recent call last):
  File "/opt/homebrew/anaconda3/envs/weaver/bin/weaver", line 8, in <module>
    sys.exit(main())
  File "/opt/homebrew/anaconda3/envs/weaver/lib/python3.10/site-packages/weaver/train.py", line 931, in main
    _main(args)
  File "/opt/homebrew/anaconda3/envs/weaver/lib/python3.10/site-packages/weaver/train.py", line 784, in _main
    train(model, loss_func, opt, scheduler, train_loader, dev, epoch,
  File "/opt/homebrew/anaconda3/envs/weaver/lib/python3.10/site-packages/weaver/utils/nn/tools.py", line 58, in train_classification
    model_output = model(*inputs)
  File "/opt/homebrew/anaconda3/envs/weaver/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/Users/abdualazem/QComputing/Transformer/Develop/PartTr/networks/example_ParticleTransformer.py", line 20, in forward
    return self.mod(features, v=lorentz_vectors, mask=mask)
  File "/opt/homebrew/anaconda3/envs/weaver/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/Users/abdualazem/QComputing/Transformer/Develop/PartTr/networks/ParticleTransformer.py", line 423, in forward
    attn_mask = self.pair_embed(v).view(-1, v.size(-1), v.size(-1))  # (N*num_heads, P, P)
  File "/opt/homebrew/anaconda3/envs/weaver/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/Users/abdualazem/QComputing/Transformer/Develop/PartTr/networks/ParticleTransformer.py", line 210, in forward
    i, j = torch.tril_indices(seq_len, seq_len, device=x.device)
NotImplementedError: The operator 'aten::tril_indices' is not currently implemented for the MPS device. If you want this op to be added in priority during the prototype phase of this feature, please comment on https://github.com/pytorch/pytorch/issues/77764. As a temporary fix, you can set the environment variable "PYTORCH_ENABLE_MPS_FALLBACK=1" to use the CPU as a fallback for this op. WARNING: this will be slower than running natively on MPS.

real    0m2.218s
user    0m2.337s
sys     0m5.305s
```
I still keep `--gpus` set to `""`. But when `--gpus` is `0` or `1`, I get the following error:
```
Traceback (most recent call last):
  File "/opt/homebrew/anaconda3/envs/weaver/bin/weaver", line 8, in <module>
    sys.exit(main())
  File "/opt/homebrew/anaconda3/envs/weaver/lib/python3.10/site-packages/weaver/train.py", line 931, in main
    _main(args)
  File "/opt/homebrew/anaconda3/envs/weaver/lib/python3.10/site-packages/weaver/train.py", line 748, in _main
    model = orig_model.to(dev)
  File "/opt/homebrew/anaconda3/envs/weaver/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1145, in to
    return self._apply(convert)
  File "/opt/homebrew/anaconda3/envs/weaver/lib/python3.10/site-packages/torch/nn/modules/module.py", line 797, in _apply
    module._apply(fn)
  File "/opt/homebrew/anaconda3/envs/weaver/lib/python3.10/site-packages/torch/nn/modules/module.py", line 797, in _apply
    module._apply(fn)
  File "/opt/homebrew/anaconda3/envs/weaver/lib/python3.10/site-packages/torch/nn/modules/module.py", line 797, in _apply
    module._apply(fn)
  File "/opt/homebrew/anaconda3/envs/weaver/lib/python3.10/site-packages/torch/nn/modules/module.py", line 820, in _apply
    param_applied = fn(param)
  File "/opt/homebrew/anaconda3/envs/weaver/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1143, in convert
    return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None, non_blocking)
  File "/opt/homebrew/anaconda3/envs/weaver/lib/python3.10/site-packages/torch/cuda/__init__.py", line 239, in _lazy_init
    raise AssertionError("Torch not compiled with CUDA enabled")
AssertionError: Torch not compiled with CUDA enabled

real    0m1.212s
user    0m1.759s
sys     0m5.239s
```
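(For reference, the macOS PyTorch wheels are built without CUDA, so selecting a CUDA device via `--gpus 0`/`1` cannot work there; a quick check along these lines would make that visible -- illustrative snippet, not from the weaver code base:)

```python
import torch

# The macOS/arm64 PyTorch wheels ship without CUDA support, so asking for a
# CUDA device triggers "Torch not compiled with CUDA enabled"; the only
# accelerator backend on Apple Silicon is MPS.
print("CUDA available:", torch.cuda.is_available())          # expected: False
print("MPS built:     ", torch.backends.mps.is_built())      # expected: True
print("MPS available: ", torch.backends.mps.is_available())  # expected: True on M1/M2
```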
Cheers, Abdualazem.
OK I think this is the issue:
The operator 'aten::tril_indices' is not currently implemented for the MPS device.
As a temporary fix, you can set the environment variable "PYTORCH_ENABLE_MPS_FALLBACK=1" to use the CPU as a fallback for this op. WARNING: this will be slower than running natively on MPS.
So you can try setting the environment variable `PYTORCH_ENABLE_MPS_FALLBACK=1` in your script / command line.
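For example, a minimal sketch of the effect (assuming the variable is set before torch is imported so the MPS backend picks it up; setting it in the shell before launching weaver works the same way):

```python
import os

# Enable the CPU fallback for MPS ops that have no Metal kernel yet.
# This has to be in the environment before torch initializes the MPS backend.
os.environ["PYTORCH_ENABLE_MPS_FALLBACK"] = "1"

import torch

if torch.backends.mps.is_available():
    dev = torch.device("mps")
    # Without the fallback this line raises NotImplementedError on MPS,
    # because aten::tril_indices is not implemented for the MPS device;
    # with the fallback only this op runs on the CPU.
    i, j = torch.tril_indices(4, 4, device=dev)
    print(i.device, j.device)
```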
Hi Huilin,
Thanks for the help. Setting the environment variable `PYTORCH_ENABLE_MPS_FALLBACK=1` solves the problem. However, this means that I'm using the CPU only. Is there any way to use the MPS resources?
Cheers, Abdualazem.
That depends on when the needed operators will be supported in PyTorch:
NotImplementedError: The operator 'aten::tril_indices' is not currently implemented for the MPS device. If you want this op to be added in priority during the prototype phase of this feature, please comment on https://github.com/pytorch/pytorch/issues/77764.
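Until that operator lands in the MPS backend, one possible local workaround (an untested sketch, not part of weaver or the official ParT code) is to build the indices on the CPU and copy them to the model's device, so everything else can stay on MPS:

```python
import torch


def tril_indices_on(n, device):
    # Hypothetical helper: aten::tril_indices is implemented on the CPU,
    # so compute the indices there and move only the small index tensor
    # to the target device (MPS in this case).
    return torch.tril_indices(n, n, device="cpu").to(device)


# In ParticleTransformer.py (line 210 in the traceback above) one could then
# replace
#     i, j = torch.tril_indices(seq_len, seq_len, device=x.device)
# with
#     i, j = tril_indices_on(seq_len, x.device)
```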
I see, thanks for sharing that link.
Hi,
I'm trying to test training with ParTr on the M2 Max. I've seen that the recent version of weaver supports the M1, but I don't know if it should work for the M2 too. Here's what I get when running the training with the GPU enabled via `--gpus 0`:
```
[2023-07-11 17:44:45,693] INFO: Computational complexity: 632.51 MMac
[2023-07-11 17:44:45,693] INFO: Number of parameters: 2.14 M
[2023-07-11 17:44:45,693] INFO: Using loss function CrossEntropyLoss() with options {}
[2023-07-11 17:44:45,756] INFO: Create Tensorboard summary writer with comment test_ParT_Wtag_v1_run1_20230711_174444
Traceback (most recent call last):
  File "/opt/homebrew/anaconda3/envs/weaver/bin/weaver", line 8, in <module>
    sys.exit(main())
  File "/opt/homebrew/anaconda3/envs/weaver/lib/python3.10/site-packages/weaver/train.py", line 931, in main
    _main(args)
  File "/opt/homebrew/anaconda3/envs/weaver/lib/python3.10/site-packages/weaver/train.py", line 748, in _main
    model = orig_model.to(dev)
  File "/opt/homebrew/anaconda3/envs/weaver/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1145, in to
    return self._apply(convert)
  File "/opt/homebrew/anaconda3/envs/weaver/lib/python3.10/site-packages/torch/nn/modules/module.py", line 797, in _apply
    module._apply(fn)
  File "/opt/homebrew/anaconda3/envs/weaver/lib/python3.10/site-packages/torch/nn/modules/module.py", line 797, in _apply
    module._apply(fn)
  File "/opt/homebrew/anaconda3/envs/weaver/lib/python3.10/site-packages/torch/nn/modules/module.py", line 797, in _apply
    module._apply(fn)
  File "/opt/homebrew/anaconda3/envs/weaver/lib/python3.10/site-packages/torch/nn/modules/module.py", line 820, in _apply
    param_applied = fn(param)
  File "/opt/homebrew/anaconda3/envs/weaver/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1143, in convert
    return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None, non_blocking)
  File "/opt/homebrew/anaconda3/envs/weaver/lib/python3.10/site-packages/torch/cuda/__init__.py", line 239, in _lazy_init
    raise AssertionError("Torch not compiled with CUDA enabled")
AssertionError: Torch not compiled with CUDA enabled

real    0m1.443s
user    0m2.215s
sys     0m4.826s
```
But when switching the GPU off with `--gpus ""` I get a different error:
```
[2023-07-11 17:40:09,773] INFO: Computational complexity: 632.51 MMac
[2023-07-11 17:40:09,773] INFO: Number of parameters: 2.14 M
[2023-07-11 17:40:09,773] INFO: Using loss function CrossEntropyLoss() with options {}
[2023-07-11 17:40:09,820] INFO: Create Tensorboard summary writer with comment test_ParT_Wtag_v1_run1_20230711_174008
[2023-07-11 17:40:09,872] INFO: Optimizer options: {}
[2023-07-11 17:40:09,874] INFO: --------------------------------------------------
[2023-07-11 17:40:09,874] INFO: Epoch #0 training
0it [00:00, ?it/s]
Traceback (most recent call last):
  File "/opt/homebrew/anaconda3/envs/weaver/bin/weaver", line 8, in <module>
    sys.exit(main())
  File "/opt/homebrew/anaconda3/envs/weaver/lib/python3.10/site-packages/weaver/train.py", line 931, in main
    _main(args)
  File "/opt/homebrew/anaconda3/envs/weaver/lib/python3.10/site-packages/weaver/train.py", line 784, in _main
    train(model, loss_func, opt, scheduler, train_loader, dev, epoch,
  File "/opt/homebrew/anaconda3/envs/weaver/lib/python3.10/site-packages/weaver/utils/nn/tools.py", line 45, in train_classification
    for X, y, _ in tq:
  File "/opt/homebrew/anaconda3/envs/weaver/lib/python3.10/site-packages/tqdm/std.py", line 1178, in __iter__
    for obj in iterable:
  File "/opt/homebrew/anaconda3/envs/weaver/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 441, in __iter__
    return self._get_iterator()
  File "/opt/homebrew/anaconda3/envs/weaver/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 388, in _get_iterator
    return _MultiProcessingDataLoaderIter(self)
  File "/opt/homebrew/anaconda3/envs/weaver/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 1042, in __init__
    w.start()
  File "/opt/homebrew/anaconda3/envs/weaver/lib/python3.10/multiprocessing/process.py", line 121, in start
    self._popen = self._Popen(self)
  File "/opt/homebrew/anaconda3/envs/weaver/lib/python3.10/multiprocessing/context.py", line 224, in _Popen
    return _default_context.get_context().Process._Popen(process_obj)
  File "/opt/homebrew/anaconda3/envs/weaver/lib/python3.10/multiprocessing/context.py", line 288, in _Popen
    return Popen(process_obj)
  File "/opt/homebrew/anaconda3/envs/weaver/lib/python3.10/multiprocessing/popen_spawn_posix.py", line 32, in __init__
    super().__init__(process_obj)
  File "/opt/homebrew/anaconda3/envs/weaver/lib/python3.10/multiprocessing/popen_fork.py", line 19, in __init__
    self._launch(process_obj)
  File "/opt/homebrew/anaconda3/envs/weaver/lib/python3.10/multiprocessing/popen_spawn_posix.py", line 47, in _launch
    reduction.dump(process_obj, fp)
  File "/opt/homebrew/anaconda3/envs/weaver/lib/python3.10/multiprocessing/reduction.py", line 60, in dump
    ForkingPickler(file, protocol).dump(obj)
  File "/opt/homebrew/anaconda3/envs/weaver/lib/python3.10/site-packages/weaver/utils/data/config.py", line 204, in __getattr__
    return self.options[name]
KeyError: '__getstate__'

real    0m1.322s
user    0m2.492s
sys     0m4.570s
```
Of course, the code works fine when tested on lxslc7 in the IHEP cluster. Has anyone come across a similar error before? Can someone help with this? Thanks.
Cheers, Abdualazem.