ros-industrial / easy_perception_deployment

A ROS2 package that accelerates the training and deployment of CV models for industries.
Apache License 2.0

Cannot start training #27

Closed Tangetb closed 2 years ago

Tangetb commented 2 years ago

`[p3_trainer] - FOUND. Skipping installation.
fatal: destination path 'maskrcnn-benchmark' already exists and is not an empty directory.
mkdir: cannot create directory ‘weights’: File exists
key: module.roi_heads.box.predictor.cls_score.weight is removed
key: module.roi_heads.box.predictor.cls_score.bias is removed
key: module.roi_heads.box.predictor.bbox_pred.weight is removed
key: module.roi_heads.box.predictor.bbox_pred.bias is removed
key: module.roi_heads.mask.predictor.mask_fcn_logits.weight is removed
key: module.roi_heads.mask.predictor.mask_fcn_logits.bias is removed
Also deleting optimizer, scheduler, and iteration entries
saved to: e2e_mask_rcnn_R_50_FPN_1x_trimmed.pth
mkdir: cannot create directory ‘datasets’: File exists
TrainFarm created under /home/tan/epd_ros2_ws/src/easy_perception_deployment/easy_perception_deployment/gui/trainer/P3TrainFarm
2021-10-01 18:19:27,769 maskrcnn_benchmark INFO: Using 1 GPUs
2021-10-01 18:19:27,769 maskrcnn_benchmark INFO: Namespace(config_file='configs/custom/maskrcnn_training.yaml', distributed=False, local_rank=0, opts=[], skip_test=False)
2021-10-01 18:19:27,769 maskrcnn_benchmark INFO: Collecting env info (might take some time)
2021-10-01 18:19:28,996 maskrcnn_benchmark INFO:
PyTorch version: 1.2.0
Is debug build: No
CUDA used to build PyTorch: 10.0.130

OS: Ubuntu 20.04.3 LTS
GCC version: (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0
CMake version: version 3.21.3

Python version: 3.6
Is CUDA available: Yes
CUDA runtime version: Could not collect
GPU models and configuration: GPU 0: NVIDIA GeForce RTX 3070
Nvidia driver version: 470.63.01
cuDNN version: Could not collect

Versions of relevant libraries:
[pip3] numpy==1.19.5
[pip3] torch==1.2.0
[pip3] torchvision==0.4.0a0
[conda] _pytorch_select 0.2 gpu_0
[conda] blas 1.0 mkl
[conda] mkl 2020.2 256
[conda] mkl-service 2.3.0 py36he8ac12f_0
[conda] mkl_fft 1.3.0 py36h54f3939_0
[conda] mkl_random 1.1.1 py36h0573a6f_0
[conda] pytorch 1.2.0 cuda100py36h938c94c_0
[conda] torchvision 0.4.0 cuda100py36hecfc37a_0
Pillow (8.3.1)
2021-10-01 18:19:28,996 maskrcnn_benchmark INFO: Loaded configuration file configs/custom/maskrcnn_training.yaml
2021-10-01 18:19:28,996 maskrcnn_benchmark INFO:
MODEL:
  META_ARCHITECTURE: "GeneralizedRCNN"
  WEIGHT: "weights/e2e_mask_rcnn_R_50_FPN_1x_trimmed.pth"
  BACKBONE:
    CONV_BODY: "R-50-FPN"
    FREEZE_CONV_BODY_AT: 2 #Freeze layers
  RESNETS:
    BACKBONE_OUT_CHANNELS: 256
  RPN:
    USE_FPN: True
    ANCHOR_STRIDE: (4, 8, 16, 32, 64)
    PRE_NMS_TOP_N_TRAIN: 2000
    PRE_NMS_TOP_N_TEST: 1000
    POST_NMS_TOP_N_TEST: 1000
    FPN_POST_NMS_TOP_N_TEST: 1000
  ROI_HEADS:
    USE_FPN: True
  ROI_BOX_HEAD:
    POOLER_RESOLUTION: 7
    POOLER_SCALES: (0.25, 0.125, 0.0625, 0.03125)
    POOLER_SAMPLING_RATIO: 2
    FEATURE_EXTRACTOR: "FPN2MLPFeatureExtractor"
    PREDICTOR: "FPNPredictor"
    NUM_CLASSES: 4 #Change to your number of objects +2
  ROI_MASK_HEAD:
    POOLER_SCALES: (0.25, 0.125, 0.0625, 0.03125)
    FEATURE_EXTRACTOR: "MaskRCNNFPNFeatureExtractor"
    PREDICTOR: "MaskRCNNC4Predictor"
    POOLER_RESOLUTION: 14
    POOLER_SAMPLING_RATIO: 2
    RESOLUTION: 28
    SHARE_BOX_FEATURE_EXTRACTOR: False
  MASK_ON: True
DATASETS:
  TEST: ("coco_custom_val",)
  TRAIN: ("coco_custom_train",)
DATALOADER:
  SIZE_DIVISIBILITY: 32
SOLVER:
  BASE_LR: 0.001 # learning rate after warmup
  WEIGHT_DECAY: 0.0001
  STEPS: (1000,1500,2000,2500) # where you want your learning rate to decrease
  MAX_ITER: 3000 #number of iteration
  WARMUP_ITERS: 500
  IMS_PER_BATCH: 1
  TEST_PERIOD: 200 #run validation every steps
  CHECKPOINT_PERIOD: 200 #save model every steps
TEST:
  IMS_PER_BATCH: 1
OUTPUT_DIR: "./weights/custom/" # saved weight output directory

2021-10-01 18:19:28,996 maskrcnn_benchmark INFO: Running with config: AMP_VERBOSE: False DATALOADER: ASPECT_RATIO_GROUPING: True NUM_WORKERS: 4 SIZE_DIVISIBILITY: 32 DATASETS: TEST: ('coco_custom_val',) TRAIN: ('coco_custom_train',) DTYPE: float32 INPUT: BRIGHTNESS: 0.0 CONTRAST: 0.0 HORIZONTAL_FLIP_PROB_TRAIN: 0.5 HUE: 0.0 MAX_SIZE_TEST: 1333 MAX_SIZE_TRAIN: 1333 MIN_SIZE_TEST: 800 MIN_SIZE_TRAIN: (800,) PIXEL_MEAN: [102.9801, 115.9465, 122.7717] PIXEL_STD: [1.0, 1.0, 1.0] SATURATION: 0.0 TO_BGR255: True VERTICAL_FLIP_PROB_TRAIN: 0.0 MODEL: BACKBONE: CONV_BODY: R-50-FPN FREEZE_CONV_BODY_AT: 2 CLS_AGNOSTIC_BBOX_REG: False DEVICE: cuda FBNET: ARCH: default ARCH_DEF: BN_TYPE: bn DET_HEAD_BLOCKS: [] DET_HEAD_LAST_SCALE: 1.0 DET_HEAD_STRIDE: 0 DW_CONV_SKIP_BN: True DW_CONV_SKIP_RELU: True KPTS_HEAD_BLOCKS: [] KPTS_HEAD_LAST_SCALE: 0.0 KPTS_HEAD_STRIDE: 0 MASK_HEAD_BLOCKS: [] MASK_HEAD_LAST_SCALE: 0.0 MASK_HEAD_STRIDE: 0 RPN_BN_TYPE: RPN_HEAD_BLOCKS: 0 SCALE_FACTOR: 1.0 WIDTH_DIVISOR: 1 FPN: USE_GN: False USE_RELU: False GROUP_NORM: DIM_PER_GP: -1 EPSILON: 1e-05 NUM_GROUPS: 32 KEYPOINT_ON: False MASK_ON: True META_ARCHITECTURE: GeneralizedRCNN RESNETS: BACKBONE_OUT_CHANNELS: 256 DEFORMABLE_GROUPS: 1 NUM_GROUPS: 1 RES2_OUT_CHANNELS: 256 RES5_DILATION: 1 STAGE_WITH_DCN: (False, False, False, False) STEM_FUNC: StemWithFixedBatchNorm STEM_OUT_CHANNELS: 64 STRIDE_IN_1X1: True TRANS_FUNC: BottleneckWithFixedBatchNorm WIDTH_PER_GROUP: 64 WITH_MODULATED_DCN: False RETINANET: ANCHOR_SIZES: (32, 64, 128, 256, 512) ANCHOR_STRIDES: (8, 16, 32, 64, 128) ASPECT_RATIOS: (0.5, 1.0, 2.0) BBOX_REG_BETA: 0.11 BBOX_REG_WEIGHT: 4.0 BG_IOU_THRESHOLD: 0.4 FG_IOU_THRESHOLD: 0.5 INFERENCE_TH: 0.05 LOSS_ALPHA: 0.25 LOSS_GAMMA: 2.0 NMS_TH: 0.4 NUM_CLASSES: 81 NUM_CONVS: 4 OCTAVE: 2.0 PRE_NMS_TOP_N: 1000 PRIOR_PROB: 0.01 SCALES_PER_OCTAVE: 3 STRADDLE_THRESH: 0 USE_C5: True RETINANET_ON: False ROI_BOX_HEAD: CONV_HEAD_DIM: 256 DILATION: 1 FEATURE_EXTRACTOR: FPN2MLPFeatureExtractor MLP_HEAD_DIM: 1024 
NUM_CLASSES: 4 NUM_STACKED_CONVS: 4 POOLER_RESOLUTION: 7 POOLER_SAMPLING_RATIO: 2 POOLER_SCALES: (0.25, 0.125, 0.0625, 0.03125) PREDICTOR: FPNPredictor USE_GN: False ROI_HEADS: BATCH_SIZE_PER_IMAGE: 512 BBOX_REG_WEIGHTS: (10.0, 10.0, 5.0, 5.0) BG_IOU_THRESHOLD: 0.5 DETECTIONS_PER_IMG: 100 FG_IOU_THRESHOLD: 0.5 NMS: 0.5 POSITIVE_FRACTION: 0.25 SCORE_THRESH: 0.05 USE_FPN: True ROI_KEYPOINT_HEAD: CONV_LAYERS: (512, 512, 512, 512, 512, 512, 512, 512) FEATURE_EXTRACTOR: KeypointRCNNFeatureExtractor MLP_HEAD_DIM: 1024 NUM_CLASSES: 17 POOLER_RESOLUTION: 14 POOLER_SAMPLING_RATIO: 0 POOLER_SCALES: (0.0625,) PREDICTOR: KeypointRCNNPredictor RESOLUTION: 14 SHARE_BOX_FEATURE_EXTRACTOR: True ROI_MASK_HEAD: CONV_LAYERS: (256, 256, 256, 256) DILATION: 1 FEATURE_EXTRACTOR: MaskRCNNFPNFeatureExtractor MLP_HEAD_DIM: 1024 POOLER_RESOLUTION: 14 POOLER_SAMPLING_RATIO: 2 POOLER_SCALES: (0.25, 0.125, 0.0625, 0.03125) POSTPROCESS_MASKS: False POSTPROCESS_MASKS_THRESHOLD: 0.5 PREDICTOR: MaskRCNNC4Predictor RESOLUTION: 28 SHARE_BOX_FEATURE_EXTRACTOR: False USE_GN: False RPN: ANCHOR_SIZES: (32, 64, 128, 256, 512) ANCHOR_STRIDE: (4, 8, 16, 32, 64) ASPECT_RATIOS: (0.5, 1.0, 2.0) BATCH_SIZE_PER_IMAGE: 256 BG_IOU_THRESHOLD: 0.3 FG_IOU_THRESHOLD: 0.7 FPN_POST_NMS_PER_BATCH: True FPN_POST_NMS_TOP_N_TEST: 1000 FPN_POST_NMS_TOP_N_TRAIN: 2000 MIN_SIZE: 0 NMS_THRESH: 0.7 POSITIVE_FRACTION: 0.5 POST_NMS_TOP_N_TEST: 1000 POST_NMS_TOP_N_TRAIN: 2000 PRE_NMS_TOP_N_TEST: 1000 PRE_NMS_TOP_N_TRAIN: 2000 RPN_HEAD: SingleConvRPNHead STRADDLE_THRESH: 0 USE_FPN: True RPN_ONLY: False WEIGHT: weights/e2e_mask_rcnn_R_50_FPN_1x_trimmed.pth OUTPUT_DIR: ./weights/custom/ PATHS_CATALOG: /home/tan/anaconda3/envs/p3_trainer/lib/python3.6/site-packages/maskrcnn_benchmark-0.1-py3.6-linux-x86_64.egg/maskrcnn_benchmark/config/paths_catalog.py SOLVER: BASE_LR: 0.001 BIAS_LR_FACTOR: 2 CHECKPOINT_PERIOD: 200 GAMMA: 0.1 IMS_PER_BATCH: 1 MAX_ITER: 3000 MOMENTUM: 0.9 STEPS: (1000, 1500, 2000, 2500) TEST_PERIOD: 200 WARMUP_FACTOR: 
0.3333333333333333 WARMUP_ITERS: 500 WARMUP_METHOD: linear WEIGHT_DECAY: 0.0001 WEIGHT_DECAY_BIAS: 0 TEST: BBOX_AUG: ENABLED: False H_FLIP: False MAX_SIZE: 4000 SCALES: () SCALE_H_FLIP: False DETECTIONS_PER_IMG: 100 EXPECTED_RESULTS: [] EXPECTED_RESULTS_SIGMA_TOL: 4 IMS_PER_BATCH: 1 2021-10-01 18:19:28,996 maskrcnn_benchmark INFO: Saving config into: ./weights/custom/config.yml`

The program has been stuck here for a long time after I pressed Train. I installed only the CPU dependencies for ONNX.

cardboardcode commented 2 years ago

Hi @Tangetb,

Please provide the terminal outputs after running the following commands:

nvidia-smi
nvcc --version

These should show the Nvidia GPU driver and CUDA installation details. The GPU is used for training, and a problem there could be the reason why it is stuck at that point.
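If it is easier to paste everything in one go, the two commands above can be wrapped in a short script. This is only a sketch that uses the Python standard library. The helper names `run_tool` and `collect_gpu_report` are made up for illustration and are not part of EPD.

```python
import shutil
import subprocess


def run_tool(cmd, timeout=30):
    """Return the combined stdout/stderr of cmd, or None if it cannot be run."""
    # Hypothetical helper: skip tools that are not on PATH instead of raising.
    if shutil.which(cmd[0]) is None:
        return None
    try:
        result = subprocess.run(
            cmd,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,
            universal_newlines=True,  # text mode, also works on Python 3.6
            timeout=timeout,
        )
    except (OSError, subprocess.TimeoutExpired):
        return None
    return result.stdout


def collect_gpu_report():
    """Run the two diagnostic commands requested above and collect their output."""
    commands = {
        "nvidia-smi": ["nvidia-smi"],
        "nvcc": ["nvcc", "--version"],
    }
    return {name: run_tool(cmd) for name, cmd in commands.items()}


if __name__ == "__main__":
    for name, output in collect_gpu_report().items():
        print("==== " + name + " ====")
        print(output if output is not None else "not found or failed to run")
```

A tool that is missing from PATH is reported as such rather than stopping the script, which is itself a useful hint (for example, nvcc missing usually means the CUDA toolkit is not installed or not on PATH).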

Based on the following excerpt from the output provided above:

cuDNN version: Could not collect

It seems that cuDNN is not installed properly on your workstation. Please install cuDNN 7.6.5 by following these instructions.
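It may also help to see what the PyTorch build itself reports, separately from the system-wide check in the log. The following is a rough sketch, assuming it is run inside the conda environment used for training (p3_trainer in the log above). The function name `torch_cuda_report` is hypothetical; the `torch` calls are standard PyTorch attributes and functions.

```python
def torch_cuda_report():
    """Summarise what the installed PyTorch build reports about CUDA and cuDNN.

    Returns None if PyTorch cannot be imported in the current environment.
    """
    try:
        import torch
    except ImportError:
        return None

    report = {
        "torch": torch.__version__,
        # CUDA version PyTorch was compiled against (None for CPU-only builds).
        "built_with_cuda": torch.version.cuda,
        "cuda_available": torch.cuda.is_available(),
        "cudnn_available": torch.backends.cudnn.is_available(),
        # Integer such as 7605 for cuDNN 7.6.5, or None if cuDNN is not usable.
        "cudnn_version": torch.backends.cudnn.version(),
        "device": None,
    }
    if report["cuda_available"]:
        report["device"] = torch.cuda.get_device_name(0)
    return report


if __name__ == "__main__":
    info = torch_cuda_report()
    if info is None:
        print("PyTorch is not importable in this environment")
    else:
        for key, value in info.items():
            print(key + ": " + str(value))
```

If `cudnn_version` comes back as a number here while the log still says "Could not collect", the two checks are likely looking at different cuDNN copies, which would be worth mentioning in a reply.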

Tangetb commented 2 years ago

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.91.03    Driver Version: 460.91.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  GeForce RTX 3070    Off  | 00000000:01:00.0  On |                  N/A |
|  0%   39C    P2    46W / 220W |    622MiB /  7979MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      1120      G   /usr/lib/xorg/Xorg                 35MiB |
|    0   N/A  N/A      1832      G   /usr/lib/xorg/Xorg                127MiB |
|    0   N/A  N/A      1961      G   /usr/bin/gnome-shell              110MiB |
|    0   N/A  N/A      2295      G   /usr/lib/firefox/firefox          155MiB |
|    0   N/A  N/A      2558      G   /usr/lib/firefox/firefox            3MiB |
|    0   N/A  N/A      2674      G   /usr/lib/firefox/firefox            3MiB |
|    0   N/A  N/A      2744      G   /usr/lib/firefox/firefox            3MiB |
|    0   N/A  N/A      2923      G   /usr/lib/firefox/firefox            3MiB |
|    0   N/A  N/A      3012      G   /usr/lib/firefox/firefox            3MiB |
|    0   N/A  N/A      3262      C   python3                           159MiB |
+-----------------------------------------------------------------------------+

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2019 NVIDIA Corporation
Built on Sun_Jul_28_19:07:16_PDT_2019
Cuda compilation tools, release 10.1, V10.1.243

I tried using the following commands to install CUDNN, but it seems like it still could not detect CUDNN when running EPD.

$ sudo cp cuda/include/cudnn*.h /usr/local/cuda/include
$ sudo cp -P cuda/lib64/libcudnn* /usr/local/cuda/lib64
$ sudo chmod a+r /usr/local/cuda/include/cudnn*.h /usr/local/cuda/lib64/libcudnn*

cardboardcode commented 2 years ago

Noted @Tangetb. Thanks for the info.

Note that EPD has not been tested with CUDA 10.1, and there are no plans to do so since it is an older version of CUDA.

:warning: I would recommend installing CUDA 10.2 and CUDNN 7.6.5, which are verified. They should also work with the Nvidia Driver version you currently have (460.91.03).

In the meantime, from the information you provided, there does not seem to be any issue with the Nvidia Driver, CUDA 10.1 and CUDNN you installed.

Can you provide the terminal output of the following command:

ls /usr/local/cuda/include/ | grep cudnn

I would like to check the CUDNN version you have. It is possible that the installed CUDNN is a version that is incompatible with CUDA 10.1. Please refer to the CUDA-CUDNN Support Matrix for more details.
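If the header files are present, the version number can also be read from them directly. The sketch below assumes the headers were copied to /usr/local/cuda/include and that the version is given by the `CUDNN_MAJOR`, `CUDNN_MINOR` and `CUDNN_PATCHLEVEL` defines, which live in cudnn.h for cuDNN 7 and in cudnn_version.h for cuDNN 8. The function names are hypothetical.

```python
import re
from pathlib import Path

# Matches lines such as "#define CUDNN_MAJOR 7" in the cuDNN headers.
_DEFINE = re.compile(
    r"^\s*#\s*define\s+CUDNN_(MAJOR|MINOR|PATCHLEVEL)\s+(\d+)", re.MULTILINE
)


def parse_cudnn_version(header_text):
    """Return (major, minor, patch) from header text, or None if incomplete."""
    found = {name: int(value) for name, value in _DEFINE.findall(header_text)}
    if not {"MAJOR", "MINOR", "PATCHLEVEL"} <= set(found):
        return None
    return (found["MAJOR"], found["MINOR"], found["PATCHLEVEL"])


def find_cudnn_version(include_dir="/usr/local/cuda/include"):
    """Look for the version defines in the usual cuDNN header files."""
    # Assumed install location; adjust include_dir if CUDA lives elsewhere.
    for filename in ("cudnn_version.h", "cudnn.h"):
        path = Path(include_dir) / filename
        if path.is_file():
            version = parse_cudnn_version(path.read_text(errors="ignore"))
            if version is not None:
                return version
    return None


if __name__ == "__main__":
    version = find_cudnn_version()
    if version is None:
        print("No cuDNN version defines found")
    else:
        print("cuDNN %d.%d.%d" % version)
```

A result of None would suggest that the headers did not end up in that directory, or that they belong to a layout this sketch does not cover.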

cardboardcode commented 2 years ago

Closing due to inactivity.