yyyyy-aa closed this issue 4 months ago
Failures:
It seems that the latest MONAI has removed the "AddChanneld" API; try "EnsureChannelFirstd" instead.
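For anyone hitting the same NameError, here is a minimal sketch of what the swap might look like. It assumes the inputs have no channel dimension yet (which is what "AddChanneld" was used for), so "channel_dim" is set to "no_channel". The helper name build_channel_transform and the sample shapes are only illustrative and are not taken from train.py.

```python
def build_channel_transform(keys=("image", "label")):
    """Return a dictionary transform that prepends a channel axis.

    Hypothetical helper, not part of RCPS: it stands in for the removed
    AddChanneld entry in the transform list.
    """
    # Imported lazily so a missing or outdated MONAI fails here, with a clear error.
    from monai.transforms import EnsureChannelFirstd

    return EnsureChannelFirstd(
        keys=list(keys),
        channel_dim="no_channel",   # assumption: arrays arrive without a channel axis
        allow_missing_keys=True,    # same flag the original AddChanneld call used
    )


if __name__ == "__main__":
    import numpy as np

    # Dummy 3D volume and label, shapes chosen only for illustration.
    sample = {
        "image": np.zeros((4, 5, 6), dtype=np.float32),
        "label": np.zeros((4, 5, 6), dtype=np.int64),
    }
    out = build_channel_transform()(sample)
    print(out["image"].shape, out["label"].shape)
```

In train.py that would presumably mean replacing the AddChanneld(...) line with the EnsureChannelFirstd(...) call and updating the import to match. If the data already carries a channel axis or MONAI metadata, "channel_dim" may need a different value or can be left out.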
CUDA_VISIBLE_DEVICES=0 torchrun --nproc_per_node=1 train.py --mixed --benchmark --task la --exp_name running --wandb --entity xxx
/usr/lib/python3/dist-packages/requests/__init__.py:87: RequestsDependencyWarning: urllib3 (2.2.1) or chardet (4.0.0) doesn't match a supported version!
warnings.warn("urllib3 ({}) or chardet ({}) doesn't match a supported "
| distributed init (rank 0): env://
Semi-Supervised Medical Image Segmentation Training
Mixed Precision - True; CUDNN Benchmark - True; Num GPU - 1; Num Worker - 8
successfully loaded config file: {'MODEL': {'PROJECT_DIM': 64, 'LEAKY': True, 'NORM': 'BATCH'}, 'TRAIN': {'LR': 0.01, 'MOMENTUM': 0.9, 'DECAY': 0.0001, 'BURN_IN': 5, 'BURN': 0, 'RAMPUP': 100, 'EPOCHS': 100, 'BATCHSIZE': 1, 'SEED': 42, 'RATIO': 0.1, 'LOSS_TYPE': 1, 'SAMPLE_NUM': 400, 'BUFFER_SIZE': 1, 'CPS_RATIO': 0.1, 'CON_RATIO': 0.1}, 'TEST': {'BATCHSIZE': 4}}
Traceback (most recent call last):
File "/home/chaijingwen/RCPS-main/train.py", line 184, in <module>
main()
File "/home/chaijingwen/RCPS-main/train.py", line 74, in main
AddChanneld(keys=['image', 'label'], allow_missing_keys=True),
NameError: name 'AddChanneld' is not defined
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 1208857) of binary: /usr/bin/python3
Traceback (most recent call last):
File "/home/ccj/.local/bin/torchrun", line 8, in <module>
sys.exit(main())
File "/home/ccj/.local/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
return f(*args, **kwargs)
File "/home/ccj/.local/lib/python3.10/site-packages/torch/distributed/run.py", line 762, in main
run(args)
File "/home/ccj/.local/lib/python3.10/site-packages/torch/distributed/run.py", line 753, in run
elastic_launch(
File "/home/ccj/.local/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/home/ccj/.local/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 246, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
train.py FAILED
Failures: