train not success - GitHub Issues

xiyangyang99 commented 11 months ago

Training does not work on my machine with 8 × RTX 3090. This is my training script:

CUDA_VISIBLE_DEVICES=4,5,6,7 python3 -m torch.distributed.launch --master_port=12000 --nnodes 1 --nproc_per_node 4 train.py --config /home/quchunguang/003-large-model/SAM-Adapter-PyTorch/configs/cod-sam-vit-h.yaml --tag exp1

These are the training logs:

/home/quchunguang/anaconda3/envs/SAM-Adapter/lib/python3.8/site-packages/torch/distributed/launch.py:180: FutureWarning: The module torch.distributed.launch is deprecated and will be removed in future. Use torchrun. Note that --use_env is set by default in torchrun. If your script expects --local_rank argument to be set, please change it to read from os.environ['LOCAL_RANK'] instead. See https://pytorch.org/docs/stable/distributed.html#launch-utility for further instructions
warnings.warn(
WARNING:torch.distributed.run:
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.

/home/quchunguang/anaconda3/envs/SAM-Adapter/lib/python3.8/site-packages/mmcv/__init__.py:20: UserWarning: On January 1, 2023, MMCV will release v2.0.0, in which it will remove components related to the training process and add a data transformation module. In addition, it will rename the package names mmcv to mmcv-lite and mmcv-full to mmcv. See https://github.com/open-mmlab/mmcv/blob/master/docs/en/compatibility.md for more details.
warnings.warn(
/home/quchunguang/anaconda3/envs/SAM-Adapter/lib/python3.8/site-packages/mmcv/__init__.py:20: UserWarning: On January 1, 2023, MMCV will release v2.0.0, in which it will remove components related to the training process and add a data transformation module. In addition, it will rename the package names mmcv to mmcv-lite and mmcv-full to mmcv. See https://github.com/open-mmlab/mmcv/blob/master/docs/en/compatibility.md for more details.
warnings.warn(
/home/quchunguang/anaconda3/envs/SAM-Adapter/lib/python3.8/site-packages/mmcv/__init__.py:20: UserWarning: On January 1, 2023, MMCV will release v2.0.0, in which it will remove components related to the training process and add a data transformation module. In addition, it will rename the package names mmcv to mmcv-lite and mmcv-full to mmcv. See https://github.com/open-mmlab/mmcv/blob/master/docs/en/compatibility.md for more details.
warnings.warn(
/home/quchunguang/anaconda3/envs/SAM-Adapter/lib/python3.8/site-packages/mmcv/__init__.py:20: UserWarning: On January 1, 2023, MMCV will release v2.0.0, in which it will remove components related to the training process and add a data transformation module. In addition, it will rename the package names mmcv to mmcv-lite and mmcv-full to mmcv. See https://github.com/open-mmlab/mmcv/blob/master/docs/en/compatibility.md for more details.
warnings.warn(

After these warnings the run stays in this state indefinitely. No further training output appears.

How can I resolve this?
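The FutureWarning in the log recommends reading the local rank from os.environ['LOCAL_RANK'] rather than from a --local_rank argument. A minimal sketch of that pattern follows. The helper name resolve_local_rank is hypothetical, and whether this repository's train.py actually parses --local_rank is an assumption, not something the log confirms.

```python
import argparse
import os


def resolve_local_rank(argv=None, env=None) -> int:
    """Return the local rank, preferring the LOCAL_RANK environment variable.

    Hypothetical helper: it is not part of SAM-Adapter-PyTorch.
    """
    env = os.environ if env is None else env

    # Accept the legacy argument too, so the script still works when it is
    # started by a launcher that passes the rank on the command line.
    parser = argparse.ArgumentParser(add_help=False)
    parser.add_argument("--local_rank", "--local-rank", type=int, default=None)
    args, _unknown = parser.parse_known_args(argv)

    if "LOCAL_RANK" in env:
        # torchrun exports LOCAL_RANK for every worker process.
        return int(env["LOCAL_RANK"])
    if args.local_rank is not None:
        return args.local_rank
    # Neither source is present: assume a single-process run.
    return 0
```

With a helper like this, the same script can be started either through torch.distributed.launch or through torchrun without changing how the rank is obtained.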

tianrun-chen commented 11 months ago

Greetings! The current code uses over 30 GB of GPU memory even with batchsize=1, which is more than the 24 GB available on an RTX 3090. We suggest using graphics cards with more memory.
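One way to check this before launching is to compare each GPU's total memory with the stated requirement. The sketch below parses the output of nvidia-smi --query-gpu=memory.total --format=csv,noheader,nounits, which reports one value in MiB per GPU. The function names are hypothetical, and reading "30G" as 30 GiB is an assumption.

```python
import subprocess


def parse_total_memory_mib(nvidia_smi_output: str) -> list:
    """Parse one integer (MiB) per non-empty line of nvidia-smi output."""
    return [int(line.strip()) for line in nvidia_smi_output.splitlines() if line.strip()]


def gpus_below_requirement(totals_mib, required_gib: float) -> list:
    """Return the positions of GPUs whose total memory is below the requirement."""
    required_mib = required_gib * 1024
    return [i for i, total in enumerate(totals_mib) if total < required_mib]


def query_total_memory_mib() -> list:
    """Ask nvidia-smi for the total memory of every GPU on this machine."""
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.total", "--format=csv,noheader,nounits"],
        check=True,
        capture_output=True,
        text=True,
    )
    return parse_total_memory_mib(result.stdout)


if __name__ == "__main__":
    # Assumption: the "over 30G" figure is treated as 30 GiB.
    too_small = gpus_below_requirement(query_total_memory_mib(), required_gib=30)
    print("GPUs below the requirement:", too_small)
```

The positions returned follow the order in which nvidia-smi lists the GPUs, which is not affected by CUDA_VISIBLE_DEVICES.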

xiyangyang99 commented 11 months ago

Thank you for your reply. I am using 8 NVIDIA RTX 3090 GPUs, and the machine has 188 GB of system memory. There was no log output during training, and the GPUs showed no activity either.
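When a distributed run produces no output at all, it can help to make each worker report where it is stuck. The sketch below is one possible approach, not a fix confirmed in this thread: it uses Python's standard faulthandler module to dump every thread's stack trace periodically, and sets the NCCL_DEBUG environment variable so NCCL logs its setup. The function name enable_hang_diagnostics is hypothetical, and the use of the NCCL backend here is an assumption.

```python
import faulthandler
import os
import sys


def enable_hang_diagnostics(timeout_s: float = 300.0, stream=sys.stderr) -> dict:
    """Enable periodic stack dumps and verbose NCCL logging.

    Hypothetical helper: call it at the very top of the training script,
    before the process group is initialized.
    """
    # Keep any value the user already exported; otherwise ask NCCL for INFO logs.
    os.environ.setdefault("NCCL_DEBUG", "INFO")

    # Every timeout_s seconds, write the Python stack of all threads to the
    # stream. This only shows Python-level frames, which is usually enough
    # to tell whether a worker is stuck in model loading, data loading, or
    # a collective call.
    faulthandler.dump_traceback_later(timeout_s, repeat=True, file=stream)

    return {"NCCL_DEBUG": os.environ["NCCL_DEBUG"], "timeout_s": timeout_s}


def disable_hang_diagnostics() -> None:
    """Stop the periodic stack dumps once training is progressing."""
    faulthandler.cancel_dump_traceback_later()
```

If the traces show every worker waiting at the same point, that narrows the problem to that step. If the workers instead exit or fail while loading the model, insufficient GPU memory becomes the more likely cause.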

tianrun-chen / SAM-Adapter-PyTorch

train not success #51