mlzxy / devit

CoRL 2024
https://mlzxy.github.io/devit
MIT License
351 stars · 46 forks

NaN and Inf errors #67

Closed. ZhouHongLiang6 closed this issue 1 month ago.

ZhouHongLiang6 commented 1 month ago

Hi @mlzxy , I ran into a strange issue when training in parallel. When I run the model on a single 4060 Ti 16GB GPU, it trains fine on COCO without any NaN or Inf errors. However, when I train in parallel on four 3080X2 GPUs (80GB in total), it reports NaN and Inf errors after a few dozen iterations. How can I resolve this?

Below are my running process and error messages:

task=ovd, vit=l, dataset=coco, shot=10, split=1, num_gpus=4
Running command:
python /root/autodl-tmp/devit-main/tools/train_net.py --num-gpus 4 --config-file /root/autodl-tmp/devit-main/configs/open-vocabulary/coco/vitl.yaml MODEL.WEIGHTS /root/autodl-tmp/devit-main/weights/initial/open-vocabulary/vitl+rpn.pth DE.OFFLINE_RPN_CONFIG /root/autodl-tmp/devit-main/configs/RPN/mask_rcnn_R_50_C4_1x_ovd_FSD.yaml OUTPUT_DIR /root/autodl-tmp/devit-main/output/train/open-vocabulary/coco/vitl/

xFormers not available (printed repeatedly, once per process/import)
Command Line Args: Namespace(config_file='/root/autodl-tmp/devit-main/configs/open-vocabulary/coco/vitl.yaml', dist_url='auto', eval_only=False, machine_rank=0, num_gpus=4, num_machines=1, opts=['MODEL.WEIGHTS', '/root/autodl-tmp/devit-main/weights/initial/open-vocabulary/vitl+rpn.pth', 'DE.OFFLINE_RPN_CONFIG', '/root/autodl-tmp/devit-main/configs/RPN/mask_rcnn_R_50_C4_1x_ovd_FSD.yaml', 'OUTPUT_DIR', '/root/autodl-tmp/devit-main/output/train/open-vocabulary/coco/vitl/'], resume=False)
[10/19 00:42:46 detectron2]: Rank of current process: 0. World size: 4
[10/19 00:42:47 detectron2]: Environment info:


sys.platform            linux
Python                  3.8.10 (default, Jun 4 2021, 15:09:15) [GCC 7.5.0]
numpy                   1.21.4
detectron2              RegionCLIP @ /root/autodl-tmp/devit-main/tools/../detectron2
Compiler                GCC 9.3
CUDA compiler           CUDA 11.3
detectron2 arch flags   8.6
DETECTRON2_ENV_MODULE   <not set>
PyTorch                 1.10.0+cu113 @ /root/miniconda3/lib/python3.8/site-packages/torch
PyTorch debug build     False
GPU available           True
GPU 0,1,2,3             NVIDIA GeForce RTX 3080 (arch=8.6)
CUDA_HOME               /usr/local/cuda
Pillow                  8.4.0
torchvision             0.11.1+cu113 @ /root/miniconda3/lib/python3.8/site-packages/torchvision
torchvision arch flags  3.5, 5.0, 6.0, 7.0, 7.5, 8.0, 8.6
fvcore                  0.1.5.post20221221
iopath                  0.1.10
cv2                     Not found


PyTorch built with:

[10/19 00:42:47 detectron2]: Command line arguments: Namespace(config_file='/root/autodl-tmp/devit-main/configs/open-vocabulary/coco/vitl.yaml', dist_url='auto', eval_only=False, machine_rank=0, num_gpus=4, num_machines=1, opts=['MODEL.WEIGHTS', '/root/autodl-tmp/devit-main/weights/initial/open-vocabulary/vitl+rpn.pth', 'DE.OFFLINE_RPN_CONFIG', '/root/autodl-tmp/devit-main/configs/RPN/mask_rcnn_R_50_C4_1x_ovd_FSD.yaml', 'OUTPUT_DIR', '/root/autodl-tmp/devit-main/output/train/open-vocabulary/coco/vitl/'], resume=False)
[10/19 00:42:47 detectron2]: Contents of args.config_file=/root/autodl-tmp/devit-main/configs/open-vocabulary/coco/vitl.yaml:

_BASE_: "../../Base-RCNN-C4.yaml"
DE:
  CLASS_PROTOTYPES: "/root/autodl-tmp/devit-main/weights/initial/open-vocabulary/prototypes/coco/class_prototypes_base.vitl14.pth,/root/autodl-tmp/devit-main/weights/initial/open-vocabulary/prototypes/coco/class_prototypes_novel.vitl14.pth"
  BG_PROTOTYPES: "/root/autodl-tmp/devit-main/weights/initial/background/background_prototypes.vitl14.pth"
  BG_CLS_LOSS_WEIGHT: 0.2
  TOPK: 5

MODEL:
  META_ARCHITECTURE: "OpenSetDetectorWithExamples_refactored"
  BACKBONE:
    NAME: "build_dino_v2_vit"
    TYPE: "large"
  WEIGHTS: ""
  MASK_ON: False
  RPN:
    HEAD_NAME: StandardRPNHead
    IN_FEATURES: ["res4"]
  ROI_HEADS:
    SCORE_THRESH_TEST: 0.001
  ROI_BOX_HEAD:
    NAME: ""
    NUM_FC: 0
    POOLER_RESOLUTION: 7
    CLS_AGNOSTIC_BBOX_REG: True
  PIXEL_MEAN: [0.48145466, 0.4578275, 0.40821073]
  PIXEL_STD: [0.26862954, 0.26130258, 0.27577711]
INPUT:
  MIN_SIZE_TRAIN: (640, 672, 704, 736, 768, 800)
DATASETS:
  TRAIN: ("coco_2017_ovd_b_train",)
  TEST: ("coco_2017_ovd_all_test",)
TEST:
  EVAL_PERIOD: 5000
SOLVER:
  IMS_PER_BATCH: 16
  BASE_LR: 0.00002
  STEPS: (60000, 80000)
  MAX_ITER: 90000
  WARMUP_ITERS: 5000
  CHECKPOINT_PERIOD: 5000

INPUT:
  MIN_SIZE_TRAIN_SAMPLING: choice
  MIN_SIZE_TRAIN: (640, 672, 704, 736, 768, 800)
  MAX_SIZE_TRAIN: 1333
  MIN_SIZE_TEST: 800
  MAX_SIZE_TEST: 1333
  FORMAT: "RGB"

[10/19 00:42:48 detectron2]: Full config saved to /root/autodl-tmp/devit-main/output/train/open-vocabulary/coco/vitl/config.yaml
('coco_2017_ovd_all_test',)
[10/19 00:42:48 d2.utils.env]: Using a generated random seed 48243271
('coco_2017_ovd_all_test',)
('coco_2017_ovd_all_test',)
('coco_2017_ovd_all_test',)
[10/19 00:43:10 d2.data.datasets.coco]: Loading datasets/coco/annotations/ovd_ins_train2017_b.json takes 13.41 seconds.
[10/19 00:43:11 d2.data.datasets.coco]: Loaded 107761 images in COCO format from datasets/coco/annotations/ovd_ins_train2017_b.json
[10/19 00:43:16 d2.data.build]: Removed 0 images with no usable annotations. 107761 images left.
[10/19 00:43:20 d2.data.build]: Distribution of instances among all 48 categories:
category     #instances   category     #instances   category     #instances
person 257253 bicycle 7056 car 43533
motorcycle 8654 train 4570 truck 9970
boat 10576 bench 9820 bird 10542
horse 6567 sheep 9223 bear 1294
zebra 5269 giraffe 5128 backpack 8714
handbag 12342 suitcase 6112 frisbee 2681
skis 6623 kite 8802 surfboard 6095
bottle 24070 fork 5474 spoon 6159
bowl 14323 banana 9195 apple 5776
sandwich 4356 orange 6302 broccoli 7261
carrot 7758 pizza 5807 donut 7005
chair 38073 bed 4192 toilet 4149
tv 5803 laptop 4960 mouse 2261
remote 5700 microwave 1672 oven 3334
toaster 225 refrigerator 2634 book 24077
clock 6320 vase 6577 toothbrush 1945
total 656232

[10/19 00:43:20 d2.data.dataset_mapper]: [DatasetMapper] Augmentations used in training: [ResizeShortestEdge(short_edge_length=(640, 672, 704, 736, 768, 800), max_size=1333, sample_style='choice'), RandomFlip()]
[10/19 00:43:20 d2.data.build]: Using training sampler TrainingSampler
[10/19 00:43:20 d2.data.common]: Serializing 107761 elements to byte tensors and concatenating them all ...
[10/19 00:43:24 d2.data.common]: Serialized dataset takes 361.37 MiB
[10/19 00:43:26 fvcore.common.checkpoint]: [Checkpointer] Loading from /root/autodl-tmp/devit-main/weights/initial/open-vocabulary/vitl+rpn.pth ...
WARNING [10/19 00:43:27 fvcore.common.checkpoint]: Some model parameters or buffers are not found in the checkpoint:
  background_linears.{background, classes, feat}.{bias, weight}
  bg_tokens
  foreground_linears.{background, current_class, feat, other_classes}.{bias, weight}
  rpropnet.layers.{0,1,2}.conv1x1.0.{bias, weight}
  rpropnet.layers.{0,1,2}.conv1x1.1.{bias, running_mean, running_var, weight}
  rpropnet.layers.{0,1,2}.conv2.{bias, weight}
  rpropnet.layers.{0,1,2}.conv3x3.0.{bias, weight}
  rpropnet.layers.{0,1,2}.conv3x3.1.{bias, running_mean, running_var, weight}
  rpropnet.layers.{0,1,2}.conv5x5.0.{bias, weight}
  rpropnet.layers.{0,1,2}.conv5x5.1.{bias, running_mean, running_var, weight}
  rpropnet.layers.{0,1,2}.conv_fusion.{bias, weight}
  rpropnet.layers.{0,1,2}.linear.{bias, weight}
  rpropnet.layers.{0,1,2}.region2box.{edge_weight, pool_h, pool_w, pos_x, pos_y, scale_factors, shape_factor}
  rpropnet.layers.{0,1,2}.scale_weights
  rpropnet_bg.layers.{0,1,2}.conv1x1.0.{bias, weight}
  rpropnet_bg.layers.{0,1,2}.conv1x1.1.{bias, running_mean, running_var, weight}
  rpropnet_bg.layers.{0,1,2}.conv2.{bias, weight}
  rpropnet_bg.layers.{0,1,2}.conv3x3.0.{bias, weight}
  rpropnet_bg.layers.{0,1,2}.conv3x3.1.{bias, running_mean, running_var, weight}
  rpropnet_bg.layers.{0,1,2}.conv5x5.0.{bias, weight}
  rpropnet_bg.layers.{0,1,2}.conv5x5.1.{bias, running_mean, running_var, weight}
  rpropnet_bg.layers.{0,1,2}.conv_fusion.{bias, weight}
  rpropnet_bg.layers.{0,1,2}.linear.{bias, weight}
  rpropnet_bg.layers.{0,1,2}.scale_weights
  test_class_weight
  train_class_weight
[10/19 00:43:27 d2.engine.train_loop]: Starting training from iteration 0
xFormers not available (printed repeatedly, once per process/import)
/root/autodl-tmp/devit-main/tools/../detectron2/structures/boxes.py:158: UserWarning: Creating a tensor from a list of numpy.ndarrays is extremely slow. Please consider converting the list to a single numpy.ndarray with numpy.array() before converting to a tensor. (Triggered internally at ../torch/csrc/utils/tensor_new.cpp:201.)
  tensor = torch.as_tensor(tensor, dtype=torch.float32, device=device)
/root/miniconda3/lib/python3.8/site-packages/torch/functional.py:445: UserWarning: torch.meshgrid: in an upcoming release, it will be required to pass the indexing argument. (Triggered internally at ../aten/src/ATen/native/TensorShape.cpp:2157.)
  return _VF.meshgrid(tensors, **kwargs)  # type: ignore[attr-defined]
/root/autodl-tmp/devit-main/tools/../detectron2/structures/image_list.py:101: UserWarning: __floordiv__ is deprecated, and its behavior will change in a future version of pytorch. It currently rounds toward 0 (like the 'trunc' function NOT 'floor'). This results in incorrect rounding for negative values. To keep the current behavior, use torch.div(a, b, rounding_mode='trunc'), or for actual floor division, use torch.div(a, b, rounding_mode='floor').
  max_size = (max_size + (stride - 1)) // stride * stride
/root/miniconda3/lib/python3.8/site-packages/torch/nn/functional.py:3631: UserWarning: Default upsampling behavior when mode=bicubic is changed to align_corners=False since 0.4.0. Please specify align_corners=True if the old behavior is desired. See the documentation of nn.Upsample for details.
  warnings.warn(
/root/miniconda3/lib/python3.8/site-packages/torch/nn/functional.py:3679: UserWarning: The default behavior for interpolate/upsample with float scale_factor changed in 1.6.0 to align with other frameworks/libraries, and now uses scale_factor directly, instead of relying on the computed output size. If you wish to restore the old behavior, please set recompute_scale_factor=True. See the documentation of nn.Upsample for details.
  warnings.warn(
/root/miniconda3/lib/python3.8/site-packages/torch/nn/functional.py:3631: UserWarning: Default upsampling behavior when mode=linear is changed to align_corners=False since 0.4.0. Please specify align_corners=True if the old behavior is desired. See the documentation of nn.Upsample for details.
  warnings.warn(
(each of the warnings above was emitted multiple times, once per worker process; duplicates omitted)
[10/19 00:44:35 d2.utils.events]: eta: 2 days, 9:48:10 iter: 19 cls_acc: 0.377 fg_cls_acc: 0.1415 false_neg_ratio: 0.4454 total_loss: 14.53 region_bce_loss_0: 0.8209 region_dice_loss_0: 0.4581 rg_l1_loss_0: 0.1729 rg_giou_loss_0: 0.6444 region_bce_loss_1: 0.7708 region_dice_loss_1: 0.4803 rg_l1_loss_1: 0.1777 rg_giou_loss_1: 0.7464 region_bce_loss_2: 0.8171 region_dice_loss_2: 0.4638 rg_l1_loss_2: 0.1641 rg_giou_loss_2: 0.6602 focal_loss_0: 1.818 focal_loss_1: 2.62 focal_loss_2: 2.02 bbox_loss: 1.701 time: 2.4202 data_time: 0.9588 lr: 9.5924e-08 max_mem: 13444M
[10/19 00:45:23 d2.utils.events]: eta: 2 days, 10:22:37 iter: 39 cls_acc: 0.3555 fg_cls_acc: 0.1322 false_neg_ratio: 0.435 total_loss: 14.58 region_bce_loss_0: 0.8328 region_dice_loss_0: 0.4677 rg_l1_loss_0: 0.1735 rg_giou_loss_0: 0.6536 region_bce_loss_1: 0.7815 region_dice_loss_1: 0.4868 rg_l1_loss_1: 0.1748 rg_giou_loss_1: 0.7421 region_bce_loss_2: 0.8249 region_dice_loss_2: 0.4722 rg_l1_loss_2: 0.162 rg_giou_loss_2: 0.6518 focal_loss_0: 1.813 focal_loss_1: 2.582 focal_loss_2: 2.041 bbox_loss: 1.692 time: 2.4050 data_time: 0.0083 lr: 1.7584e-07 max_mem: 13444M
[10/19 00:46:12 d2.utils.events]: eta: 2 days, 10:38:57 iter: 59 cls_acc: 0.3584 fg_cls_acc: 0.125 false_neg_ratio: 0.4171 total_loss: 14.39 region_bce_loss_0: 0.8151 region_dice_loss_0: 0.4544 rg_l1_loss_0: 0.1679 rg_giou_loss_0: 0.6379 region_bce_loss_1: 0.7727 region_dice_loss_1: 0.4719 rg_l1_loss_1: 0.1642 rg_giou_loss_1: 0.7096 region_bce_loss_2: 0.8102 region_dice_loss_2: 0.4621 rg_l1_loss_2: 0.1567 rg_giou_loss_2: 0.641 focal_loss_0: 1.844 focal_loss_1: 2.557 focal_loss_2: 2.027 bbox_loss: 1.625 time: 2.4189 data_time: 0.0072 lr: 2.5576e-07 max_mem: 13446M
[10/19 00:47:01 d2.utils.events]: eta: 2 days, 10:38:11 iter: 79 cls_acc: 0.3486 fg_cls_acc: 0.125 false_neg_ratio: 0.4303 total_loss: 14.3 region_bce_loss_0: 0.8158 region_dice_loss_0: 0.4548 rg_l1_loss_0: 0.1692 rg_giou_loss_0: 0.6417 region_bce_loss_1: 0.7775 region_dice_loss_1: 0.4684 rg_l1_loss_1: 0.1591 rg_giou_loss_1: 0.6956 region_bce_loss_2: 0.8036 region_dice_loss_2: 0.46 rg_l1_loss_2: 0.1536 rg_giou_loss_2: 0.6177 focal_loss_0: 1.811 focal_loss_1: 2.528 focal_loss_2: 2.053 bbox_loss: 1.59 time: 2.4277 data_time: 0.0084 lr: 3.3568e-07 max_mem: 13446M
[10/19 00:47:49 d2.utils.events]: eta: 2 days, 10:27:29 iter: 99 cls_acc: 0.3652 fg_cls_acc: 0.1614 false_neg_ratio: 0.4153 total_loss: 13.95 region_bce_loss_0: 0.8207 region_dice_loss_0: 0.4569 rg_l1_loss_0: 0.1681 rg_giou_loss_0: 0.6349 region_bce_loss_1: 0.7867 region_dice_loss_1: 0.4669 rg_l1_loss_1: 0.1515 rg_giou_loss_1: 0.6631 region_bce_loss_2: 0.801 region_dice_loss_2: 0.4599 rg_l1_loss_2: 0.1468 rg_giou_loss_2: 0.604 focal_loss_0: 1.805 focal_loss_1: 2.478 focal_loss_2: 2 bbox_loss: 1.503 time: 2.4256 data_time: 0.0072 lr: 4.156e-07 max_mem: 13446M
[10/19 00:48:35 d2.utils.events]: eta: 2 days, 10:13:32 iter: 119 cls_acc: 0.3564 fg_cls_acc: 0.136 false_neg_ratio: 0.4027 total_loss: 13.86 region_bce_loss_0: 0.8142 region_dice_loss_0: 0.4532 rg_l1_loss_0: 0.1657 rg_giou_loss_0: 0.6187 region_bce_loss_1: 0.7876 region_dice_loss_1: 0.4616 rg_l1_loss_1: 0.1483 rg_giou_loss_1: 0.6428 region_bce_loss_2: 0.7902 region_dice_loss_2: 0.4556 rg_l1_loss_2: 0.1433 rg_giou_loss_2: 0.5839 focal_loss_0: 1.818 focal_loss_1: 2.422 focal_loss_2: 2.004 bbox_loss: 1.47 time: 2.4009 data_time: 0.0067 lr: 4.9552e-07 max_mem: 13446M
[10/19 00:49:21 d2.utils.events]: eta: 2 days, 9:58:13 iter: 139 cls_acc: 0.3574 fg_cls_acc: 0.1561 false_neg_ratio: 0.3934 total_loss: 13.45 region_bce_loss_0: 0.8215 region_dice_loss_0: 0.4596 rg_l1_loss_0: 0.1685 rg_giou_loss_0: 0.6262 region_bce_loss_1: 0.8 region_dice_loss_1: 0.4645 rg_l1_loss_1: 0.1469 rg_giou_loss_1: 0.6314 region_bce_loss_2: 0.7896 region_dice_loss_2: 0.4593 rg_l1_loss_2: 0.1356 rg_giou_loss_2: 0.5685 focal_loss_0: 1.76 focal_loss_1: 2.321 focal_loss_2: 1.927 bbox_loss: 1.352 time: 2.3899 data_time: 0.0090 lr: 5.7544e-07 max_mem: 13446M
[10/19 00:50:09 d2.utils.events]: eta: 2 days, 10:08:15 iter: 159 cls_acc: 0.3535 fg_cls_acc: 0.1476 false_neg_ratio: 0.395 total_loss: 13.09 region_bce_loss_0: 0.814 region_dice_loss_0: 0.455 rg_l1_loss_0: 0.1638 rg_giou_loss_0: 0.6173 region_bce_loss_1: 0.7942 region_dice_loss_1: 0.4618 rg_l1_loss_1: 0.1474 rg_giou_loss_1: 0.6222 region_bce_loss_2: 0.7747 region_dice_loss_2: 0.4547 rg_l1_loss_2: 0.1295 rg_giou_loss_2: 0.5477 focal_loss_0: 1.733 focal_loss_1: 2.236 focal_loss_2: 1.904 bbox_loss: 1.284 time: 2.3884 data_time: 0.0073 lr: 6.5536e-07 max_mem: 13446M
Warning: NaN or Inf detected in result. Clamping to valid range.
Cross entropy loss contains NaN or Inf
(the two messages above repeat many times, interleaved across the workers)
ERROR [10/19 00:50:56 d2.engine.train_loop]: Exception during training:
Traceback (most recent call last):
  File "/root/autodl-tmp/devit-main/tools/../detectron2/engine/train_loop.py", line 149, in train
    self.run_step()
  File "/root/autodl-tmp/devit-main/tools/../detectron2/engine/defaults.py", line 506, in run_step
    self._trainer.run_step()
  File "/root/autodl-tmp/devit-main/tools/../detectron2/engine/train_loop.py", line 287, in run_step
    self._write_metrics(loss_dict, data_time)
  File "/root/autodl-tmp/devit-main/tools/../detectron2/engine/train_loop.py", line 329, in _write_metrics
    raise FloatingPointError(
FloatingPointError: Loss became infinite or NaN at iteration=179!
loss_dict = {'region_bce_loss_0': nan, 'region_dice_loss_0': nan, 'rg_l1_loss_0': nan, 'region_bce_loss_1': nan, 'region_dice_loss_1': nan, 'rg_l1_loss_1': nan, 'region_bce_loss_2': nan, 'region_dice_loss_2': nan, 'rg_l1_loss_2': nan, 'focal_loss_0': 0.07109161466360092, 'focal_loss_1': 0.07109161466360092, 'focal_loss_2': 0.07109161466360092, 'bbox_loss': nan}
[10/19 00:50:56 d2.engine.hooks]: Overall training speed: 177 iterations in 0:07:04 (2.3984 s / it)
[10/19 00:50:56 d2.engine.hooks]: Total training time: 0:07:04 (0:00:00 on hooks)
[10/19 00:50:56 d2.utils.events]: eta: 2 days, 10:10:12 iter: 179 cls_acc: 0.334 fg_cls_acc: 0.1693 false_neg_ratio: 0.3984 total_loss: 13.06 region_bce_loss_0: 0.8246 region_dice_loss_0: 0.4651 rg_l1_loss_0: 0.1633 rg_giou_loss_0: 0.6177 region_bce_loss_1: 0.8023 region_dice_loss_1: 0.4675 rg_l1_loss_1: 0.145 rg_giou_loss_1: 0.6136 region_bce_loss_2: 0.7779 region_dice_loss_2: 0.4622 rg_l1_loss_2: 0.1272 rg_giou_loss_2: 0.5406 focal_loss_0: 1.743 focal_loss_1: 2.192 focal_loss_2: 1.913 bbox_loss: 1.246 time: 2.3870 data_time: 0.0075 lr: 7.3129e-07 max_mem: 13446M
Traceback (most recent call last):
  File "/root/autodl-tmp/devit-main/tools/train_net.py", line 204, in <module>
    launch(
  File "/root/autodl-tmp/devit-main/tools/../detectron2/engine/launch.py", line 67, in launch
    mp.spawn(
  File "/root/miniconda3/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 230, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/root/miniconda3/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 188, in start_processes
    while not context.join():
  File "/root/miniconda3/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 150, in join
    raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException:

-- Process 2 terminated with the following error:
Traceback (most recent call last):
  File "/root/miniconda3/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 59, in _wrap
    fn(i, *args)
  File "/root/autodl-tmp/devit-main/tools/../detectron2/engine/launch.py", line 125, in _distributed_worker
    main_func(*args)
  File "/root/autodl-tmp/devit-main/tools/train_net.py", line 197, in main
    return trainer.train()
  File "/root/autodl-tmp/devit-main/tools/../detectron2/engine/defaults.py", line 496, in train
    super().train(self.start_iter, self.max_iter)
  File "/root/autodl-tmp/devit-main/tools/../detectron2/engine/train_loop.py", line 149, in train
    self.run_step()
  File "/root/autodl-tmp/devit-main/tools/../detectron2/engine/defaults.py", line 506, in run_step
    self._trainer.run_step()
  File "/root/autodl-tmp/devit-main/tools/../detectron2/engine/train_loop.py", line 273, in run_step
    loss_dict = self.model(data)
  File "/root/miniconda3/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/root/miniconda3/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 873, in forward
    if torch.is_grad_enabled() and self.reducer._rebuild_buckets():
RuntimeError: Expected to have finished reduction in the prior iteration before starting a new one. This error indicates that your module has parameters that were not used in producing loss. You can enable unused parameter detection by passing the keyword argument `find_unused_parameters=True` to `torch.nn.parallel.DistributedDataParallel`, and by making sure all `forward` function outputs participate in calculating loss. If you already have done the above, then the distributed data parallel module wasn't able to locate the output tensors in the return value of your module's `forward` function. Please include the loss function and the structure of the return value of `forward` of your module when reporting this issue (e.g. list, dict, iterable).
Parameter indices which did not receive grad for rank 2: 8 9 10 11 12 13 31 32 55 56 79 80 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142
In addition, you can set the environment variable TORCH_DISTRIBUTED_DEBUG to either INFO or DETAIL to print out information about which particular parameters did not receive gradient on this rank as part of this error

Traceback (most recent call last):
  File "train.py", line 50, in <module>
    run_command(command)
  File "train.py", line 29, in run_command
    subprocess.run(command, shell=True, check=True)
  File "/root/miniconda3/lib/python3.8/subprocess.py", line 516, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command 'python /root/autodl-tmp/devit-main/tools/train_net.py --num-gpus 4 --config-file /root/autodl-tmp/devit-main/configs/open-vocabulary/coco/vitl.yaml MODEL.WEIGHTS /root/autodl-tmp/devit-main/weights/initial/open-vocabulary/vitl+rpn.pth DE.OFFLINE_RPN_CONFIG /root/autodl-tmp/devit-main/configs/RPN/mask_rcnn_R_50_C4_1x_ovd_FSD.yaml OUTPUT_DIR /root/autodl-tmp/devit-main/output/train/open-vocabulary/coco/vitl/' returned non-zero exit status 1.

Many thanks,
Best,
Hongliang

mlzxy commented 1 month ago

That is quite weird. I have always trained the model with 4 GPUs and have never run into NaN issues. In particular, it should not happen at the start of training, because the learning rate is still warming up.

What are your batch size and learning rate? These are the default parameters:

SOLVER:
  IMS_PER_BATCH: 16
  BASE_LR: 0.002
  STEPS: (60000, 80000)
  MAX_ITER: 90000
  WARMUP_ITERS: 5000
  CHECKPOINT_PERIOD: 5000
ZhouHongLiang6 commented 1 month ago

Hi @mlzxy , to reduce memory consumption I lowered TOPK. I also tried reducing the learning rate to avoid NaN, but NaN and Inf still occur during the warm-up phase. I'm not sure whether this is a GPU issue or a problem with my parameter settings. Below are my settings:

BASE: "../../Base-RCNN-C4.yaml" DE: CLASS_PROTOTYPES: "/root/autodl-tmp/devit-main/weights/initial/open-vocabulary/prototypes/coco/class_prototypes_base.vitl14.pth,/root/autodl-tmp/devit-main/weights/initial/open-vocabulary/prototypes/coco/class_prototypes_novel.vitl14.pth" BG_PROTOTYPES: "/root/autodl-tmp/devit-main/weights/initial/background/background_prototypes.vitl14.pth" BG_CLS_LOSS_WEIGHT: 0.2 TOPK: 5

MODEL: META_ARCHITECTURE: "OpenSetDetectorWithExamples_refactored" BACKBONE: NAME: "build_dino_v2_vit" TYPE: "large" WEIGHTS: "" MASK_ON: False RPN: HEAD_NAME: StandardRPNHead IN_FEATURES: ["res4"] ROI_HEADS: SCORE_THRESH_TEST: 0.001 ROI_BOX_HEAD: NAME: "" NUM_FC: 0 POOLER_RESOLUTION: 7 CLS_AGNOSTIC_BBOX_REG: True PIXEL_MEAN: [0.48145466, 0.4578275, 0.40821073] PIXEL_STD: [0.26862954, 0.26130258, 0.27577711] INPUT: MIN_SIZE_TRAIN: (640, 672, 704, 736, 768, 800) DATASETS: TRAIN: ("coco_2017_ovd_b_train",) TEST: ("coco_2017_ovd_all_test",) TEST: EVAL_PERIOD: 5000 SOLVER: IMS_PER_BATCH: 16 BASE_LR: 0.000002 STEPS: (60000, 80000) MAX_ITER: 90000 WARMUP_ITERS: 5000 CHECKPOINT_PERIOD: 5000

INPUT: MIN_SIZE_TRAIN_SAMPLING: choice MIN_SIZE_TRAIN: (640, 672, 704, 736, 768, 800) MAX_SIZE_TRAIN: 1333 MIN_SIZE_TEST: 800 MAX_SIZE_TEST: 1333 FORMAT: "RGB"

mlzxy commented 1 month ago

Just to confirm: it works fine with one GPU, but not with 4 GPUs?

I have sometimes found that xformers produces incorrect computations on certain GPUs. Could you uninstall xformers for the moment and then try ViT-small? (xformers saves a lot of memory, so ViT-L may not fit without it.)
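If you want to test this directly, something along these lines (a rough sketch, not code from this repo; it assumes xformers is installed and a CUDA GPU is available) compares xformers' memory_efficient_attention against plain PyTorch attention:

    import torch
    import xformers.ops as xops

    # q, k, v in xformers' layout: (batch, seq_len, num_heads, head_dim)
    q = torch.randn(1, 256, 8, 64, device="cuda", dtype=torch.float16)
    k, v = torch.randn_like(q), torch.randn_like(q)

    out = xops.memory_efficient_attention(q, k, v)

    # plain PyTorch reference, computed in (batch, num_heads, seq_len, head_dim)
    qh, kh, vh = (t.transpose(1, 2) for t in (q, k, v))
    attn = (qh @ kh.transpose(-2, -1) * qh.shape[-1] ** -0.5).softmax(dim=-1)
    ref = (attn @ vh).transpose(1, 2)

    # a large difference (or NaN) suggests a broken xformers build on this GPU
    print((out - ref).abs().max().item())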

mlzxy commented 1 month ago

Another thing you could try is to monitor the loss, step into pudb the first moment it becomes NaN, and trace back to where the NaN originally appears.
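For example (a minimal sketch; the check_losses helper and where to hook it in are hypothetical, not code from the repo):

    import math
    import pudb  # pip install pudb

    def check_losses(loss_dict):
        # Drop into the debugger the first time any loss is non-finite,
        # so tensors further up the stack can still be inspected.
        for name, value in loss_dict.items():
            v = value.item() if hasattr(value, "item") else float(value)
            if not math.isfinite(v):
                print(f"non-finite loss: {name} = {v}")
                pudb.set_trace()

    # e.g. call check_losses(loss_dict) inside run_step() in
    # detectron2/engine/train_loop.py, right after the forward pass and
    # before the losses are summed and backward() is called.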

ZhouHongLiang6 commented 1 month ago

Hi @mlzxy , I want to reproduce your AP50 of 53 on the COCO dataset, so after running on a single GPU on my own machine, I moved to a server to run on four GPUs. However, my xformers==0.0.11 is built against PyTorch 1.10, and even after installing from source it did not work in the end. The issue I'm facing now is how to avoid NaN errors so that multi-GPU training runs normally. Is the problem related to the GPUs? Do I need to switch to four V100s for training?

ZhouHongLiang6 commented 1 month ago

Another thing you could try is to monitor the loss, step into pudb the first moment it becomes NaN, and trace back to where the NaN originally appears.

I'll try your suggestion.

mlzxy commented 1 month ago

xformers 0.0.11 is quite old. I am using version 0.0.18 with pytorch 1.13.1. Installing from source does work, as long as you set MAX_JOBS to 1, e.g. MAX_JOBS=1 CUDA_HOME=... pip install -e .

ZhouHongLiang6 commented 1 month ago

xformers 0.0.11 is quite old. I am using version 0.0.18 with pytorch 1.13.1. Installing from source does work, as long as you set MAX_JOBS to 1, e.g. MAX_JOBS=1 CUDA_HOME=... pip install -e .

Thank you, I will put your suggestion into practice.

ZhouHongLiang6 commented 1 month ago

xformers 0.0.11 is quite old. I am using version 0.0.18 with pytorch 1.13.1. Installing from source does work, as long as you set MAX_JOBS to 1, e.g. MAX_JOBS=1 CUDA_HOME=... pip install -e .

Hi @mlzxy , following your advice, I resolved the NaN and Inf issues. However, after installing xformers from source, it still reports that xformers is unavailable during model training. Below are the installation steps I followed:

  !cd /root/autodl-tmp/devit-main/xformers/xformers
  !echo 'export CUDA_HOME=/usr/local/cuda' >> ~/.bashrc
  !source ~/.bashrc
  !MAX_JOBS=1 CUDA_HOME=/usr/local/cuda pip install -e .

Yneng commented 1 month ago

xformers 0.0.11 is quite old. I am using version 0.0.18 with pytorch 1.13.1. Installing from source does work, as long as you set MAX_JOBS to 1, e.g. MAX_JOBS=1 CUDA_HOME=... pip install -e .

Hi @mlzxy , following your advice, I resolved the NaN and Inf issues. However, after installing xformers from source, it still reports that xformers is unavailable during model training. Below are the installation steps I followed:

  !cd /root/autodl-tmp/devit-main/xformers/xformers
  !echo 'export CUDA_HOME=/usr/local/cuda' >> ~/.bashrc
  !source ~/.bashrc
  !MAX_JOBS=1 CUDA_HOME=/usr/local/cuda pip install -e .

You might want to try the following commands. Although I'm not familiar with xFormers, this seems to have worked:

  !pip uninstall xformers
  !pip install ninja
  !pip install -v -U git+https://github.com/facebookresearch/xformers.git@v0.0.18#egg=xformers

mlzxy commented 1 month ago

I think the DINOv2 ViT checks for xformers in multiple places, such as swiglu_ffn (https://github.com/mlzxy/devit/blob/main/lib/dinov2/layers/swiglu_ffn.py) and attention (https://github.com/mlzxy/devit/blob/main/lib/dinov2/layers/attention.py). You can check which operation is missing in your xformers build. But as long as memory_efficient_attention is available, the others do not matter that much, because attention costs the most memory in a ViT.
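A small sketch of such a check (assuming your xformers imports cleanly; the exact names DINOv2 probes for can differ between xformers versions):

    import xformers
    import xformers.ops as xops

    print(xformers.__version__)
    # the operator that matters most for ViT memory usage
    print(hasattr(xops, "memory_efficient_attention"))
    # the fused SwiGLU op used by swiglu_ffn; optional by comparison
    print(hasattr(xops, "SwiGLU"))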


ZhouHongLiang6 commented 1 month ago

xformers 0.0.11 is quite old. I am using version 0.0.18 with pytorch 1.13.1. Installing from source does work, as long as you set MAX_JOBS to 1, e.g. MAX_JOBS=1 CUDA_HOME=... pip install -e .

Hi @mlzxy , following your advice, I resolved the NaN and Inf issues. However, after installing xformers from source, it still reports that xformers is unavailable during model training. Below are the installation steps I followed:

  !cd /root/autodl-tmp/devit-main/xformers/xformers
  !echo 'export CUDA_HOME=/usr/local/cuda' >> ~/.bashrc
  !source ~/.bashrc
  !MAX_JOBS=1 CUDA_HOME=/usr/local/cuda pip install -e .

You might want to try the following commands. Although I'm not familiar with xFormers, this seems to have worked:

  !pip uninstall xformers
  !pip install ninja
  !pip install -v -U git+https://github.com/facebookresearch/xformers.git@v0.0.18#egg=xformers

Thank you for your help. I will try it soon.

ZhouHongLiang6 commented 1 month ago

I think the DINOv2 ViT checks for xformers in multiple places, such as swiglu_ffn (https://github.com/mlzxy/devit/blob/main/lib/dinov2/layers/swiglu_ffn.py) and attention (https://github.com/mlzxy/devit/blob/main/lib/dinov2/layers/attention.py). You can check which operation is missing in your xformers build. But as long as memory_efficient_attention is available, the others do not matter that much, because attention costs the most memory in a ViT.

Thank you for your help and patience. I'll modify the code and debug it to check whether the attention mechanism is working properly.