google-research / ssl_detection

Semi-supervised learning for object detection
Apache License 2.0

dlerror: libcublas.so.10.0: cannot open shared object file: No such file or directory #8

Closed vaslamp closed 4 years ago

vaslamp commented 4 years ago

Could you please help me interpret the following error? Is it a problem with the CUDA version? I am not very experienced, and I would like to understand the cause so that I can fix it and continue.

WARNING: NVIDIA binaries may not be bound with --writable [0706 13:49:52 @voc.py:279] Register dataset ['VOC2007/instances_trainval', 'VOC2007/instances_test', 'VOC2012/instances_trainval'] [0706 13:49:52 @coco.py:271] Register dataset ['VOC2007/instances_trainval', 'VOC2007/instances_test', 'VOC2012/instances_trainval', 'train2017', 'val2017', 'coco_train2017', 'coco_val2017', 'coco_train2014', 'coco_val2014', 'coco_valminusminival2014', 'coco_minival2014', 'coco_val2017_100'] [0706 13:49:52 @coco.py:205] Register dataset ['VOC2007/instances_trainval', 'VOC2007/instances_test', 'VOC2012/instances_trainval', 'train2017', 'val2017', 'coco_train2017', 'coco_val2017', 'coco_train2014', 'coco_val2014', 'coco_valminusminival2014', 'coco_minival2014', 'coco_val2017_100', 'coco_train2017.1@1', 'coco_train2017.1@1-unlabeled', 'coco_train2017.1@2', 'coco_train2017.1@2-unlabeled', 'coco_train2017.1@5', 'coco_train2017.1@5-unlabeled', 'coco_train2017.1@10', 'coco_train2017.1@10-unlabeled', 'coco_train2017.1@20', 'coco_train2017.1@20-unlabeled', 'coco_train2017.1@30', 'coco_train2017.1@30-unlabeled', 'coco_train2017.1@40', 'coco_train2017.1@40-unlabeled', 'coco_train2017.1@50', 'coco_train2017.1@50-unlabeled', 'coco_train2017.2@1', 'coco_train2017.2@1-unlabeled', 'coco_train2017.2@2', 'coco_train2017.2@2-unlabeled', 'coco_train2017.2@5', 'coco_train2017.2@5-unlabeled', 'coco_train2017.2@10', 'coco_train2017.2@10-unlabeled', 'coco_train2017.2@20', 'coco_train2017.2@20-unlabeled', 'coco_train2017.2@30', 'coco_train2017.2@30-unlabeled', 'coco_train2017.2@40', 'coco_train2017.2@40-unlabeled', 'coco_train2017.2@50', 'coco_train2017.2@50-unlabeled', 'coco_train2017.3@1', 'coco_train2017.3@1-unlabeled', 'coco_train2017.3@2', 'coco_train2017.3@2-unlabeled', 'coco_train2017.3@5', 'coco_train2017.3@5-unlabeled', 'coco_train2017.3@10', 'coco_train2017.3@10-unlabeled', 'coco_train2017.3@20', 'coco_train2017.3@20-unlabeled', 'coco_train2017.3@30', 'coco_train2017.3@30-unlabeled', 
'coco_train2017.3@40', 'coco_train2017.3@40-unlabeled', 'coco_train2017.3@50', 'coco_train2017.3@50-unlabeled', 'coco_train2017.4@1', 'coco_train2017.4@1-unlabeled', 'coco_train2017.4@2', 'coco_train2017.4@2-unlabeled', 'coco_train2017.4@5', 'coco_train2017.4@5-unlabeled', 'coco_train2017.4@10', 'coco_train2017.4@10-unlabeled', 'coco_train2017.4@20', 'coco_train2017.4@20-unlabeled', 'coco_train2017.4@30', 'coco_train2017.4@30-unlabeled', 'coco_train2017.4@40', 'coco_train2017.4@40-unlabeled', 'coco_train2017.4@50', 'coco_train2017.4@50-unlabeled', 'coco_train2017.5@1', 'coco_train2017.5@1-unlabeled', 'coco_train2017.5@2', 'coco_train2017.5@2-unlabeled', 'coco_train2017.5@5', 'coco_train2017.5@5-unlabeled', 'coco_train2017.5@10', 'coco_train2017.5@10-unlabeled', 'coco_train2017.5@20', 'coco_train2017.5@20-unlabeled', 'coco_train2017.5@30', 'coco_train2017.5@30-unlabeled', 'coco_train2017.5@40', 'coco_train2017.5@40-unlabeled', 'coco_train2017.5@50', 'coco_train2017.5@50-unlabeled', 'coco_train2017.0@100-extra', 'coco_train2017.0@100-extra-unlabeled', 'coco_unlabeled2017'] [0706 13:49:52 @coco.py:260] Register dataset ['VOC2007/instances_trainval', 'VOC2007/instances_test', 'VOC2012/instances_trainval', 'train2017', 'val2017', 'coco_train2017', 'coco_val2017', 'coco_train2014', 'coco_val2014', 'coco_valminusminival2014', 'coco_minival2014', 'coco_val2017_100', 'coco_train2017.1@1', 'coco_train2017.1@1-unlabeled', 'coco_train2017.1@2', 'coco_train2017.1@2-unlabeled', 'coco_train2017.1@5', 'coco_train2017.1@5-unlabeled', 'coco_train2017.1@10', 'coco_train2017.1@10-unlabeled', 'coco_train2017.1@20', 'coco_train2017.1@20-unlabeled', 'coco_train2017.1@30', 'coco_train2017.1@30-unlabeled', 'coco_train2017.1@40', 'coco_train2017.1@40-unlabeled', 'coco_train2017.1@50', 'coco_train2017.1@50-unlabeled', 'coco_train2017.2@1', 'coco_train2017.2@1-unlabeled', 'coco_train2017.2@2', 'coco_train2017.2@2-unlabeled', 'coco_train2017.2@5', 'coco_train2017.2@5-unlabeled', 
'coco_train2017.2@10', 'coco_train2017.2@10-unlabeled', 'coco_train2017.2@20', 'coco_train2017.2@20-unlabeled', 'coco_train2017.2@30', 'coco_train2017.2@30-unlabeled', 'coco_train2017.2@40', 'coco_train2017.2@40-unlabeled', 'coco_train2017.2@50', 'coco_train2017.2@50-unlabeled', 'coco_train2017.3@1', 'coco_train2017.3@1-unlabeled', 'coco_train2017.3@2', 'coco_train2017.3@2-unlabeled', 'coco_train2017.3@5', 'coco_train2017.3@5-unlabeled', 'coco_train2017.3@10', 'coco_train2017.3@10-unlabeled', 'coco_train2017.3@20', 'coco_train2017.3@20-unlabeled', 'coco_train2017.3@30', 'coco_train2017.3@30-unlabeled', 'coco_train2017.3@40', 'coco_train2017.3@40-unlabeled', 'coco_train2017.3@50', 'coco_train2017.3@50-unlabeled', 'coco_train2017.4@1', 'coco_train2017.4@1-unlabeled', 'coco_train2017.4@2', 'coco_train2017.4@2-unlabeled', 'coco_train2017.4@5', 'coco_train2017.4@5-unlabeled', 'coco_train2017.4@10', 'coco_train2017.4@10-unlabeled', 'coco_train2017.4@20', 'coco_train2017.4@20-unlabeled', 'coco_train2017.4@30', 'coco_train2017.4@30-unlabeled', 'coco_train2017.4@40', 'coco_train2017.4@40-unlabeled', 'coco_train2017.4@50', 'coco_train2017.4@50-unlabeled', 'coco_train2017.5@1', 'coco_train2017.5@1-unlabeled', 'coco_train2017.5@2', 'coco_train2017.5@2-unlabeled', 'coco_train2017.5@5', 'coco_train2017.5@5-unlabeled', 'coco_train2017.5@10', 'coco_train2017.5@10-unlabeled', 'coco_train2017.5@20', 'coco_train2017.5@20-unlabeled', 'coco_train2017.5@30', 'coco_train2017.5@30-unlabeled', 'coco_train2017.5@40', 'coco_train2017.5@40-unlabeled', 'coco_train2017.5@50', 'coco_train2017.5@50-unlabeled', 'coco_train2017.0@100-extra', 'coco_train2017.0@100-extra-unlabeled', 'coco_unlabeled2017', 'coco_unlabeledtrainval20class'] [0706 13:49:52 @logger.py:138] Directory '/home/vlamp/Documents/STAC/RESULTS' backuped to '/home/vlamp/Documents/STAC/RESULTS0706-134952' [0706 13:49:52 @logger.py:92] Argv: /home/vlamp/Documents/STAC/detection/train_stg1_bdd.py --logdir 
/home/vlamp/Documents/STAC/RESULTS/ --simple_path --config BACKBONE.WEIGHTS=/home/vlamp/Documents/STAC/DATA_STAC/coco/ImageNet-R50-AlignPadding.npz DATA.BASEDIR=/home/vlamp/Documents/STAC/DATA_STAC/coco MODE_MASK=False FRCNN.BATCH_PER_IM=64 PREPROC.TRAIN_SHORT_EDGE_SIZE=[500,800] TRAIN.EVAL_PERIOD=20 TRAIN.AUGTYPE_LAB=default [0706 13:49:54 @train_stg1_bdd.py:87] Environment Information:


sys.platform linux Python 3.6.9 (default, Apr 18 2020, 01:56:04) [GCC 8.4.0] Tensorpack v0.10.1-9-g9c1b1b7b-dirty Numpy 1.16.4 TensorFlow 1.14.0/v1.14.0-rc1-22-gaf24dc91b5 TF Compiler Version 4.8.5 TF CUDA support True TF MKL support False TF XLA support False Nvidia Driver /.singularity.d/libs/libnvidia-ml.so CUDA /usr/local/cuda-10.1/targets/x86_64-linux/lib/libcudart.so.10.1.243 CUDNN /usr/lib/x86_64-linux-gnu/libcudnn.so.7.6.4 NCCL CUDA_VISIBLE_DEVICES 0,1 GPU 0,1 Tesla T4 Free RAM 369.15/376.54 GB CPU Count 40 cv2 4.2.0 msgpack 1.0.0 python-prctl False


list(_C.DATA.TRAIN) = ['train2017'] list(_C.DATA.VAL) = ('val2017',) datasets = ['train2017', 'val2017'] _C.DATA.CLASS_NAMES = ['BG', 'car', 'pedestrian', 'big vehicle', 'bicycle', 'motorcycle'] [0706 13:49:54 @config.py:352] Config: ------------------------------------------ {'BACKBONE': {'FREEZE_AFFINE': False, 'FREEZE_AT': 2, 'NORM': 'FreezeBN', 'RESNET_NUM_BLOCKS': [3, 4, 6, 3], 'STRIDE_1X1': False, 'TF_PAD_MODE': False, 'WEIGHTS': '/home/vlamp/Documents/STAC/DATA_STAC/coco/ImageNet-R50-AlignPadding.npz'}, 'CASCADE': {'BBOX_REG_WEIGHTS': [[10.0, 10.0, 5.0, 5.0], [20.0, 20.0, 10.0, 10.0], [30.0, 30.0, 15.0, 15.0]], 'IOUS': [0.5, 0.6, 0.7]}, 'DATA': {'ABSOLUTE_COORD': True, 'BASEDIR': '/home/vlamp/Documents/STAC/DATA_STAC/coco', 'CLASS_NAMES': ['BG', 'car', 'pedestrian', 'big vehicle', 'bicycle', 'motorcycle'], 'NUM_CATEGORY': 5, 'NUM_WORKERS': 24, 'TRAIN': ('train2017',), 'UNLABEL': ('',), 'VAL': ('val2017',)}, 'EVAL': {'PSEUDO_INFERENCE': False}, 'FPN': {'ANCHOR_SIZES': (32, 64, 128, 256, 512), 'ANCHOR_STRIDES': (4, 8, 16, 32, 64), 'CASCADE': False, 'FRCNN_CONV_HEAD_DIM': 256, 'FRCNN_FC_HEAD_DIM': 1024, 'FRCNN_HEAD_FUNC': 'fastrcnn_2fc_head', 'MRCNN_HEAD_FUNC': 'maskrcnn_up4conv_head', 'NORM': 'None', 'NUM_CHANNEL': 256, 'PROPOSAL_MODE': 'Level', 'RESOLUTION_REQUIREMENT': 32}, 'FRCNN': {'BATCH_PER_IM': 64, 'BBOX_REG_WEIGHTS': [10.0, 10.0, 5.0, 5.0], 'FG_RATIO': 0.25, 'FG_THRESH': 0.5}, 'MODE_FPN': True, 'MODE_MASK': False, 'MRCNN': {'ACCURATE_PASTE': True, 'HEAD_DIM': 256}, 'PREPROC': {'MAX_SIZE': 1344.0, 'PIXEL_MEAN': [123.675, 116.28, 103.53], 'PIXEL_STD': [58.395, 57.12, 57.375], 'TEST_SHORT_EDGE_SIZE': 800, 'TRAIN_SHORT_EDGE_SIZE': [500, 800]}, 'RPN': {'ANCHOR_RATIOS': (0.5, 1.0, 2.0), 'ANCHOR_SIZES': (32, 64, 128, 256, 512), 'ANCHOR_STRIDE': 16, 'BATCH_PER_IM': 256, 'CROWD_OVERLAP_THRESH': 9.99, 'FG_RATIO': 0.5, 'HEAD_DIM': 1024, 'MIN_SIZE': 0, 'NEGATIVE_ANCHOR_THRESH': 0.3, 'NUM_ANCHOR': 15, 'POSITIVE_ANCHOR_THRESH': 0.7, 'PROPOSAL_NMS_THRESH': 0.7, 
'TEST_PER_LEVEL_NMS_TOPK': 1000, 'TEST_POST_NMS_TOPK': 1000, 'TEST_PRE_NMS_TOPK': 6000, 'TRAIN_PER_LEVEL_NMS_TOPK': 2000, 'TRAIN_POST_NMS_TOPK': 2000, 'TRAIN_PRE_NMS_TOPK': 12000}, 'TEST': {'FRCNN_NMS_THRESH': 0.5, 'RESULTS_PER_IM': 100, 'RESULT_SCORE_THRESH': 0.05, 'RESULT_SCORE_THRESH_VIS': 0.5}, 'TRAIN': {'AUGTYPE': 'strong', 'AUGTYPE_LAB': 'default', 'BASE_LR': 0.01, 'CHECKPOINT_PERIOD': 20, 'CONFIDENCE': 0.9, 'EVAL_PERIOD': 20, 'GAMMA': 0.1, 'LR_SCHEDULE': [120000, 160000, 180000], 'NO_PRN_LOSS': False, 'NUM_GPUS': 2, 'STAGE': 1, 'STARTING_EPOCH': 1, 'STEPS_PER_EPOCH': 500, 'WARMUP': 1000, 'WARMUP_INIT_LR': 0.0033000000000000004, 'WEIGHT_DECAY': 0.0001, 'WU': 2.0}, 'TRAINER': 'replicated'} [0706 13:49:54 @train_stg1_bdd.py:106] Warm Up Schedule (steps, value): [(0, 0.0033000000000000004), (1000, 0.01)] [0706 13:49:54 @train_stg1_bdd.py:107] LR Schedule (epochs, value): [(2, 0.01), (960.0, 0.001), (1280.0, 0.00010000000000000002)] loading annotations into memory... Done (t=5.18s) creating index... index created! [0706 13:49:59 @coco.py:60] Instances loaded from /home/vlamp/Documents/STAC/DATA_STAC/coco/annotations/instances_train2017.json.

0% 0/69403 [00:00<?, ?it/s] 3% 3 2090/69403 [00:00<00:03, 20895.19it/s] 6% 5 4034/69403 [00:00<00:03, 20434.79it/s] 9% 8 6073/69403 [00:00<00:03, 20416.41it/s] 12% #1 8201/69403 [00:00<00:02, 20666.09it/s] 15% #4 10336/69403 [00:00<00:02, 20866.20it/s] 18% #7 12465/69403 [00:00<00:02, 20991.31it/s] 21% ##1 14620/69403 [00:00<00:02, 21155.12it/s] 24% ##4 16775/69403 [00:00<00:02, 21271.79it/s] 27% ##7 18896/69403 [00:00<00:02, 21253.07it/s] 30% ### 21042/69403 [00:01<00:02, 21313.93it/s] 33% ###3 23115/69403 [00:01<00:02, 21052.23it/s] 36% ###6 25181/69403 [00:01<00:02, 20796.20it/s] 39% ###9 27234/69403 [00:01<00:02, 20696.98it/s] 42% ####2 29285/69403 [00:01<00:01, 20509.34it/s] 45% ####5 31323/69403 [00:01<00:01, 20425.01it/s] 48% ####8 33357/69403 [00:01<00:01, 20302.50it/s] 51% ##### 35382/69403 [00:01<00:01, 20251.87it/s] 54% #####3 37403/69403 [00:01<00:01, 20201.65it/s] 57% #####6 39488/69403 [00:01<00:01, 20390.27it/s] 60% #####9 41550/69403 [00:02<00:01, 20456.26it/s] 63% ######2 43660/69403 [00:02<00:01, 20643.18it/s] 66% ######5 45767/69403 [00:02<00:01, 20768.95it/s] 69% ######8 47887/69403 [00:02<00:01, 20894.81it/s] 72% #######2 50002/69403 [00:02<00:00, 20968.20it/s] 75% #######5 52146/69403 [00:02<00:00, 21105.63it/s] 78% #######8 54280/69403 [00:02<00:00, 21174.64it/s] 81% ########1 56406/69403 [00:02<00:00, 21198.35it/s] 84% ########4 58537/69403 [00:02<00:00, 21230.58it/s] 87% ########7 60701/69403 [00:02<00:00, 21351.07it/s] 91% ######### 62872/69403 [00:03<00:00, 21456.21it/s] 94% #########3 65018/69403 [00:03<00:00, 21151.33it/s] 97% #########6 67169/69403 [00:03<00:00, 21256.36it/s] 100% #########9 69342/69403 [00:03<00:00, 21396.14it/s] 100% ########## 69403/69403 [00:03<00:00, 20915.84it/s][0706 13:50:03 @timer.py:45] Load annotations for instances_train2017.json finished, time:3.3659 sec. [0706 13:50:05 @data.py:79] Ground-Truth category distribution:  class #box class #box class #box
car 713210 pedestrian 91349 big vehicle 41643
bicycle 7210 motorcycle 3002
total 856414 

[0706 13:50:05 @data.py:416] Filtered 0 images which contain no non-crowd groudtruth boxes. Total #images for training: 69403 [0706 13:50:05 @augmentation.py:171] ---------------------------------------------------------------------------------------------------- [0706 13:50:05 @augmentation.py:172] Augmentation type default: [] [0706 13:50:05 @augmentation.py:173] ---------------------------------------------------------------------------------------------------- [0706 13:50:05 @data.py:107] Use affine-enabled TrainingDataPreprocessor_aug [0706 13:50:05 @train_stg1_bdd.py:112] Total passes of the training set is: 20.748 [0706 13:50:05 @sessinit.py:294] Loading dictionary from /home/vlamp/Documents/STAC/DATA_STAC/coco/ImageNet-R50-AlignPadding.npz ... [0706 13:50:06 @training.py:48] [DataParallel] Training a model of 2 towers. [0706 13:50:06 @interface.py:41] Automatically applying StagingInput on the DataFlow. [0706 13:50:06 @input_source.py:221] Setting up the queue 'QueueInput/input_queue' for CPU prefetching ... [0706 13:50:06 @training.py:108] Building graph for training tower 0 on device /gpu:0 ... [0706 13:50:06 @argtools.py:138] WRN Some BatchNorm layer uses moving_mean/moving_variance in training. [0706 13:50:06 @registry.py:90] 'conv0': [1, 3, ?, ?] --> [1, 64, ?, ?] [0706 13:50:06 @registry.py:90] 'pool0': [1, 64, ?, ?] --> [1, 64, ?, ?] [0706 13:50:06 @registry.py:90] 'group0/block0/conv1': [1, 64, ?, ?] --> [1, 64, ?, ?] [0706 13:50:06 @registry.py:90] 'group0/block0/conv2': [1, 64, ?, ?] --> [1, 64, ?, ?] [0706 13:50:06 @registry.py:90] 'group0/block0/conv3': [1, 64, ?, ?] --> [1, 256, ?, ?] [0706 13:50:06 @registry.py:90] 'group0/block0/convshortcut': [1, 64, ?, ?] --> [1, 256, ?, ?] [0706 13:50:06 @registry.py:90] 'group0/block1/conv1': [1, 256, ?, ?] --> [1, 64, ?, ?] [0706 13:50:06 @registry.py:90] 'group0/block1/conv2': [1, 64, ?, ?] --> [1, 64, ?, ?] [0706 13:50:06 @registry.py:90] 'group0/block1/conv3': [1, 64, ?, ?] --> [1, 256, ?, ?] 
[0706 13:50:06 @registry.py:90] 'group0/block2/conv1': [1, 256, ?, ?] --> [1, 64, ?, ?] [0706 13:50:06 @registry.py:90] 'group0/block2/conv2': [1, 64, ?, ?] --> [1, 64, ?, ?] [0706 13:50:06 @registry.py:90] 'group0/block2/conv3': [1, 64, ?, ?] --> [1, 256, ?, ?] [0706 13:50:06 @registry.py:90] 'group1/block0/conv1': [1, 256, ?, ?] --> [1, 128, ?, ?] [0706 13:50:07 @registry.py:90] 'group1/block0/conv2': [1, 128, ?, ?] --> [1, 128, ?, ?] [0706 13:50:07 @registry.py:90] 'group1/block0/conv3': [1, 128, ?, ?] --> [1, 512, ?, ?] [0706 13:50:07 @registry.py:90] 'group1/block0/convshortcut': [1, 256, ?, ?] --> [1, 512, ?, ?] [0706 13:50:07 @registry.py:90] 'group1/block1/conv1': [1, 512, ?, ?] --> [1, 128, ?, ?] [0706 13:50:07 @registry.py:90] 'group1/block1/conv2': [1, 128, ?, ?] --> [1, 128, ?, ?] [0706 13:50:07 @registry.py:90] 'group1/block1/conv3': [1, 128, ?, ?] --> [1, 512, ?, ?] [0706 13:50:07 @registry.py:90] 'group1/block2/conv1': [1, 512, ?, ?] --> [1, 128, ?, ?] [0706 13:50:07 @registry.py:90] 'group1/block2/conv2': [1, 128, ?, ?] --> [1, 128, ?, ?] [0706 13:50:07 @registry.py:90] 'group1/block2/conv3': [1, 128, ?, ?] --> [1, 512, ?, ?] [0706 13:50:07 @registry.py:90] 'group1/block3/conv1': [1, 512, ?, ?] --> [1, 128, ?, ?] [0706 13:50:07 @registry.py:90] 'group1/block3/conv2': [1, 128, ?, ?] --> [1, 128, ?, ?] [0706 13:50:07 @registry.py:90] 'group1/block3/conv3': [1, 128, ?, ?] --> [1, 512, ?, ?] [0706 13:50:07 @registry.py:90] 'group2/block0/conv1': [1, 512, ?, ?] --> [1, 256, ?, ?] [0706 13:50:07 @registry.py:90] 'group2/block0/conv2': [1, 256, ?, ?] --> [1, 256, ?, ?] [0706 13:50:07 @registry.py:90] 'group2/block0/conv3': [1, 256, ?, ?] --> [1, 1024, ?, ?] [0706 13:50:07 @registry.py:90] 'group2/block0/convshortcut': [1, 512, ?, ?] --> [1, 1024, ?, ?] [0706 13:50:07 @registry.py:90] 'group2/block1/conv1': [1, 1024, ?, ?] --> [1, 256, ?, ?] [0706 13:50:07 @registry.py:90] 'group2/block1/conv2': [1, 256, ?, ?] --> [1, 256, ?, ?] 
[0706 13:50:07 @registry.py:90] 'group2/block1/conv3': [1, 256, ?, ?] --> [1, 1024, ?, ?] [0706 13:50:07 @registry.py:90] 'group2/block2/conv1': [1, 1024, ?, ?] --> [1, 256, ?, ?] [0706 13:50:07 @registry.py:90] 'group2/block2/conv2': [1, 256, ?, ?] --> [1, 256, ?, ?] [0706 13:50:07 @registry.py:90] 'group2/block2/conv3': [1, 256, ?, ?] --> [1, 1024, ?, ?] [0706 13:50:07 @registry.py:90] 'group2/block3/conv1': [1, 1024, ?, ?] --> [1, 256, ?, ?] [0706 13:50:07 @registry.py:90] 'group2/block3/conv2': [1, 256, ?, ?] --> [1, 256, ?, ?] [0706 13:50:07 @registry.py:90] 'group2/block3/conv3': [1, 256, ?, ?] --> [1, 1024, ?, ?] [0706 13:50:07 @registry.py:90] 'group2/block4/conv1': [1, 1024, ?, ?] --> [1, 256, ?, ?] [0706 13:50:07 @registry.py:90] 'group2/block4/conv2': [1, 256, ?, ?] --> [1, 256, ?, ?] [0706 13:50:07 @registry.py:90] 'group2/block4/conv3': [1, 256, ?, ?] --> [1, 1024, ?, ?] [0706 13:50:07 @registry.py:90] 'group2/block5/conv1': [1, 1024, ?, ?] --> [1, 256, ?, ?] [0706 13:50:07 @registry.py:90] 'group2/block5/conv2': [1, 256, ?, ?] --> [1, 256, ?, ?] [0706 13:50:07 @registry.py:90] 'group2/block5/conv3': [1, 256, ?, ?] --> [1, 1024, ?, ?] [0706 13:50:07 @registry.py:90] 'group3/block0/conv1': [1, 1024, ?, ?] --> [1, 512, ?, ?] [0706 13:50:07 @registry.py:90] 'group3/block0/conv2': [1, 512, ?, ?] --> [1, 512, ?, ?] [0706 13:50:07 @registry.py:90] 'group3/block0/conv3': [1, 512, ?, ?] --> [1, 2048, ?, ?] [0706 13:50:07 @registry.py:90] 'group3/block0/convshortcut': [1, 1024, ?, ?] --> [1, 2048, ?, ?] [0706 13:50:07 @registry.py:90] 'group3/block1/conv1': [1, 2048, ?, ?] --> [1, 512, ?, ?] [0706 13:50:07 @registry.py:90] 'group3/block1/conv2': [1, 512, ?, ?] --> [1, 512, ?, ?] [0706 13:50:07 @registry.py:90] 'group3/block1/conv3': [1, 512, ?, ?] --> [1, 2048, ?, ?] [0706 13:50:07 @registry.py:90] 'group3/block2/conv1': [1, 2048, ?, ?] --> [1, 512, ?, ?] [0706 13:50:07 @registry.py:90] 'group3/block2/conv2': [1, 512, ?, ?] --> [1, 512, ?, ?] 
[0706 13:50:07 @registry.py:90] 'group3/block2/conv3': [1, 512, ?, ?] --> [1, 2048, ?, ?] [0706 13:50:07 @registry.py:80] 'fpn' input: [1, 256, ?, ?], [1, 512, ?, ?], [1, 1024, ?, ?], [1, 2048, ?, ?] [0706 13:50:07 @registry.py:90] 'fpn/lateral_1x1_c2': [1, 256, ?, ?] --> [1, 256, ?, ?] [0706 13:50:07 @registry.py:90] 'fpn/lateral_1x1_c3': [1, 512, ?, ?] --> [1, 256, ?, ?] [0706 13:50:07 @registry.py:90] 'fpn/lateral_1x1_c4': [1, 1024, ?, ?] --> [1, 256, ?, ?] [0706 13:50:07 @registry.py:90] 'fpn/lateral_1x1_c5': [1, 2048, ?, ?] --> [1, 256, ?, ?] [0706 13:50:07 @registry.py:90] 'fpn/upsample_lat5': [1, 256, ?, ?] --> [1, 256, ?, ?] [0706 13:50:07 @registry.py:90] 'fpn/upsample_lat4': [1, 256, ?, ?] --> [1, 256, ?, ?] [0706 13:50:08 @registry.py:90] 'fpn/upsample_lat3': [1, 256, ?, ?] --> [1, 256, ?, ?] [0706 13:50:08 @registry.py:90] 'fpn/posthoc_3x3_p2': [1, 256, ?, ?] --> [1, 256, ?, ?] [0706 13:50:08 @registry.py:90] 'fpn/posthoc_3x3_p3': [1, 256, ?, ?] --> [1, 256, ?, ?] [0706 13:50:08 @registry.py:90] 'fpn/posthoc_3x3_p4': [1, 256, ?, ?] --> [1, 256, ?, ?] [0706 13:50:08 @registry.py:90] 'fpn/posthoc_3x3_p5': [1, 256, ?, ?] --> [1, 256, ?, ?] [0706 13:50:08 @registry.py:90] 'fpn/maxpool_p6': [1, 256, ?, ?] --> [1, 256, ?, ?] [0706 13:50:08 @registry.py:93] 'fpn' output: [1, 256, ?, ?], [1, 256, ?, ?], [1, 256, ?, ?], [1, 256, ?, ?], [1, 256, ?, ?] [0706 13:50:08 @registry.py:80] 'rpn' input: [1, 256, ?, ?] [0706 13:50:08 @registry.py:90] 'rpn/conv0': [1, 256, ?, ?] --> [1, 256, ?, ?] [0706 13:50:08 @registry.py:90] 'rpn/class': [1, 256, ?, ?] --> [1, 3, ?, ?] [0706 13:50:08 @registry.py:90] 'rpn/box': [1, 256, ?, ?] --> [1, 12, ?, ?] 
[0706 13:50:08 @registry.py:93] 'rpn' output: [?, ?, 3], [?, ?, 3, 4] [0706 13:50:09 @registry.py:80] 'fastrcnn' input: [?, 256, 7, 7] [0706 13:50:10 @registry.py:90] 'fastrcnn/fc6': [?, 256, 7, 7] --> [?, 1024] [0706 13:50:10 @registry.py:90] 'fastrcnn/fc7': [?, 1024] --> [?, 1024] [0706 13:50:10 @registry.py:93] 'fastrcnn' output: [?, 1024] [0706 13:50:10 @registry.py:80] 'fastrcnn/outputs' input: [?, 1024] [0706 13:50:10 @registry.py:90] 'fastrcnn/outputs/class': [?, 1024] --> [?, 6] [0706 13:50:10 @registry.py:90] 'fastrcnn/outputs/box': [?, 1024] --> [?, 24] [0706 13:50:10 @registry.py:93] 'fastrcnn/outputs' output: [?, 6], [?, 6, 4] [0706 13:50:10 @regularize.py:97] regularize_cost() found 57 variables to regularize. [0706 13:50:10 @regularize.py:21] The following tensors will be regularized: group1/block0/conv1/W:0, group1/block0/conv2/W:0, group1/block0/conv3/W:0, group1/block0/convshortcut/W:0, group1/block1/conv1/W:0, group1/block1/conv2/W:0, group1/block1/conv3/W:0, group1/block2/conv1/W:0, group1/block2/conv2/W:0, group1/block2/conv3/W:0, group1/block3/conv1/W:0, group1/block3/conv2/W:0, group1/block3/conv3/W:0, group2/block0/conv1/W:0, group2/block0/conv2/W:0, group2/block0/conv3/W:0, group2/block0/convshortcut/W:0, group2/block1/conv1/W:0, group2/block1/conv2/W:0, group2/block1/conv3/W:0, group2/block2/conv1/W:0, group2/block2/conv2/W:0, group2/block2/conv3/W:0, group2/block3/conv1/W:0, group2/block3/conv2/W:0, group2/block3/conv3/W:0, group2/block4/conv1/W:0, group2/block4/conv2/W:0, group2/block4/conv3/W:0, group2/block5/conv1/W:0, group2/block5/conv2/W:0, group2/block5/conv3/W:0, group3/block0/conv1/W:0, group3/block0/conv2/W:0, group3/block0/conv3/W:0, group3/block0/convshortcut/W:0, group3/block1/conv1/W:0, group3/block1/conv2/W:0, group3/block1/conv3/W:0, group3/block2/conv1/W:0, group3/block2/conv2/W:0, group3/block2/conv3/W:0, fpn/lateral_1x1_c2/W:0, fpn/lateral_1x1_c3/W:0, fpn/lateral_1x1_c4/W:0, fpn/lateral_1x1_c5/W:0, 
fpn/posthoc_3x3_p2/W:0, fpn/posthoc_3x3_p3/W:0, fpn/posthoc_3x3_p4/W:0, fpn/posthoc_3x3_p5/W:0, rpn/conv0/W:0, rpn/class/W:0, rpn/box/W:0, fastrcnn/fc6/W:0, fastrcnn/fc7/W:0, fastrcnn/outputs/class/W:0, fastrcnn/outputs/box/W:0 [0706 13:50:12 @training.py:108] Building graph for training tower 1 on device /gpu:1 ... [0706 13:50:14 @regularize.py:97] regularize_cost() found 57 variables to regularize. [0706 13:50:16 @collection.py:152] Size of these collections were changed in tower1: (tf.GraphKeys.MODEL_VARIABLES: 161->194) [0706 13:50:16 @collection.py:165] These collections were modified but restored in tower1: (tf.GraphKeys.SUMMARIES: 76->77) [0706 13:50:20 @training.py:350] 'sync_variables_from_main_tower' includes 607 operations. [0706 13:50:20 @model_utils.py:67] List of Trainable Variables: name shape #elements


group1/block0/conv1/W [1, 1, 256, 128] 32768 group1/block0/conv1/bn/gamma [128] 128 group1/block0/conv1/bn/beta [128] 128 group1/block0/conv2/W [3, 3, 128, 128] 147456 group1/block0/conv2/bn/gamma [128] 128 group1/block0/conv2/bn/beta [128] 128 group1/block0/conv3/W [1, 1, 128, 512] 65536 group1/block0/conv3/bn/gamma [512] 512 group1/block0/conv3/bn/beta [512] 512 group1/block0/convshortcut/W [1, 1, 256, 512] 131072 group1/block0/convshortcut/bn/gamma [512] 512 group1/block0/convshortcut/bn/beta [512] 512 group1/block1/conv1/W [1, 1, 512, 128] 65536 group1/block1/conv1/bn/gamma [128] 128 group1/block1/conv1/bn/beta [128] 128 group1/block1/conv2/W [3, 3, 128, 128] 147456 group1/block1/conv2/bn/gamma [128] 128 group1/block1/conv2/bn/beta [128] 128 group1/block1/conv3/W [1, 1, 128, 512] 65536 group1/block1/conv3/bn/gamma [512] 512 group1/block1/conv3/bn/beta [512] 512 group1/block2/conv1/W [1, 1, 512, 128] 65536 group1/block2/conv1/bn/gamma [128] 128 group1/block2/conv1/bn/beta [128] 128 group1/block2/conv2/W [3, 3, 128, 128] 147456 group1/block2/conv2/bn/gamma [128] 128 group1/block2/conv2/bn/beta [128] 128 group1/block2/conv3/W [1, 1, 128, 512] 65536 group1/block2/conv3/bn/gamma [512] 512 group1/block2/conv3/bn/beta [512] 512 group1/block3/conv1/W [1, 1, 512, 128] 65536 group1/block3/conv1/bn/gamma [128] 128 group1/block3/conv1/bn/beta [128] 128 group1/block3/conv2/W [3, 3, 128, 128] 147456 group1/block3/conv2/bn/gamma [128] 128 group1/block3/conv2/bn/beta [128] 128 group1/block3/conv3/W [1, 1, 128, 512] 65536 group1/block3/conv3/bn/gamma [512] 512 group1/block3/conv3/bn/beta [512] 512 group2/block0/conv1/W [1, 1, 512, 256] 131072 group2/block0/conv1/bn/gamma [256] 256 group2/block0/conv1/bn/beta [256] 256 group2/block0/conv2/W [3, 3, 256, 256] 589824 group2/block0/conv2/bn/gamma [256] 256 group2/block0/conv2/bn/beta [256] 256 group2/block0/conv3/W [1, 1, 256, 1024] 262144 group2/block0/conv3/bn/gamma [1024] 1024 group2/block0/conv3/bn/beta [1024] 1024 
group2/block0/convshortcut/W [1, 1, 512, 1024] 524288 group2/block0/convshortcut/bn/gamma [1024] 1024 group2/block0/convshortcut/bn/beta [1024] 1024 group2/block1/conv1/W [1, 1, 1024, 256] 262144 group2/block1/conv1/bn/gamma [256] 256 group2/block1/conv1/bn/beta [256] 256 group2/block1/conv2/W [3, 3, 256, 256] 589824 group2/block1/conv2/bn/gamma [256] 256 group2/block1/conv2/bn/beta [256] 256 group2/block1/conv3/W [1, 1, 256, 1024] 262144 group2/block1/conv3/bn/gamma [1024] 1024 group2/block1/conv3/bn/beta [1024] 1024 group2/block2/conv1/W [1, 1, 1024, 256] 262144 group2/block2/conv1/bn/gamma [256] 256 group2/block2/conv1/bn/beta [256] 256 group2/block2/conv2/W [3, 3, 256, 256] 589824 group2/block2/conv2/bn/gamma [256] 256 group2/block2/conv2/bn/beta [256] 256 group2/block2/conv3/W [1, 1, 256, 1024] 262144 group2/block2/conv3/bn/gamma [1024] 1024 group2/block2/conv3/bn/beta [1024] 1024 group2/block3/conv1/W [1, 1, 1024, 256] 262144 group2/block3/conv1/bn/gamma [256] 256 group2/block3/conv1/bn/beta [256] 256 group2/block3/conv2/W [3, 3, 256, 256] 589824 group2/block3/conv2/bn/gamma [256] 256 group2/block3/conv2/bn/beta [256] 256 group2/block3/conv3/W [1, 1, 256, 1024] 262144 group2/block3/conv3/bn/gamma [1024] 1024 group2/block3/conv3/bn/beta [1024] 1024 group2/block4/conv1/W [1, 1, 1024, 256] 262144 group2/block4/conv1/bn/gamma [256] 256 group2/block4/conv1/bn/beta [256] 256 group2/block4/conv2/W [3, 3, 256, 256] 589824 group2/block4/conv2/bn/gamma [256] 256 group2/block4/conv2/bn/beta [256] 256 group2/block4/conv3/W [1, 1, 256, 1024] 262144 group2/block4/conv3/bn/gamma [1024] 1024 group2/block4/conv3/bn/beta [1024] 1024 group2/block5/conv1/W [1, 1, 1024, 256] 262144 group2/block5/conv1/bn/gamma [256] 256 group2/block5/conv1/bn/beta [256] 256 group2/block5/conv2/W [3, 3, 256, 256] 589824 group2/block5/conv2/bn/gamma [256] 256 group2/block5/conv2/bn/beta [256] 256 group2/block5/conv3/W [1, 1, 256, 1024] 262144 group2/block5/conv3/bn/gamma [1024] 1024 
group2/block5/conv3/bn/beta [1024] 1024 group3/block0/conv1/W [1, 1, 1024, 512] 524288 group3/block0/conv1/bn/gamma [512] 512 group3/block0/conv1/bn/beta [512] 512 group3/block0/conv2/W [3, 3, 512, 512] 2359296 group3/block0/conv2/bn/gamma [512] 512 group3/block0/conv2/bn/beta [512] 512 group3/block0/conv3/W [1, 1, 512, 2048] 1048576 group3/block0/conv3/bn/gamma [2048] 2048 group3/block0/conv3/bn/beta [2048] 2048 group3/block0/convshortcut/W [1, 1, 1024, 2048] 2097152 group3/block0/convshortcut/bn/gamma [2048] 2048 group3/block0/convshortcut/bn/beta [2048] 2048 group3/block1/conv1/W [1, 1, 2048, 512] 1048576 group3/block1/conv1/bn/gamma [512] 512 group3/block1/conv1/bn/beta [512] 512 group3/block1/conv2/W [3, 3, 512, 512] 2359296 group3/block1/conv2/bn/gamma [512] 512 group3/block1/conv2/bn/beta [512] 512 group3/block1/conv3/W [1, 1, 512, 2048] 1048576 group3/block1/conv3/bn/gamma [2048] 2048 group3/block1/conv3/bn/beta [2048] 2048 group3/block2/conv1/W [1, 1, 2048, 512] 1048576 group3/block2/conv1/bn/gamma [512] 512 group3/block2/conv1/bn/beta [512] 512 group3/block2/conv2/W [3, 3, 512, 512] 2359296 group3/block2/conv2/bn/gamma [512] 512 group3/block2/conv2/bn/beta [512] 512 group3/block2/conv3/W [1, 1, 512, 2048] 1048576 group3/block2/conv3/bn/gamma [2048] 2048 group3/block2/conv3/bn/beta [2048] 2048 fpn/lateral_1x1_c2/W [1, 1, 256, 256] 65536 fpn/lateral_1x1_c2/b [256] 256 fpn/lateral_1x1_c3/W [1, 1, 512, 256] 131072 fpn/lateral_1x1_c3/b [256] 256 fpn/lateral_1x1_c4/W [1, 1, 1024, 256] 262144 fpn/lateral_1x1_c4/b [256] 256 fpn/lateral_1x1_c5/W [1, 1, 2048, 256] 524288 fpn/lateral_1x1_c5/b [256] 256 fpn/posthoc_3x3_p2/W [3, 3, 256, 256] 589824 fpn/posthoc_3x3_p2/b [256] 256 fpn/posthoc_3x3_p3/W [3, 3, 256, 256] 589824 fpn/posthoc_3x3_p3/b [256] 256 fpn/posthoc_3x3_p4/W [3, 3, 256, 256] 589824 fpn/posthoc_3x3_p4/b [256] 256 fpn/posthoc_3x3_p5/W [3, 3, 256, 256] 589824 fpn/posthoc_3x3_p5/b [256] 256 rpn/conv0/W [3, 3, 256, 256] 589824 rpn/conv0/b [256] 256 
rpn/class/W [1, 1, 256, 3] 768 rpn/class/b [3] 3 rpn/box/W [1, 1, 256, 12] 3072 rpn/box/b [12] 12 fastrcnn/fc6/W [12544, 1024] 12845056 fastrcnn/fc6/b [1024] 1024 fastrcnn/fc7/W [1024, 1024] 1048576 fastrcnn/fc7/b [1024] 1024 fastrcnn/outputs/class/W [1024, 6] 6144 fastrcnn/outputs/class/b [6] 6 fastrcnn/outputs/box/W [1024, 24] 24576 fastrcnn/outputs/box/b [24] 24 Number of trainable variables: 156 Number of parameters (elements): 41147437 Storage space needed for all trainable variables: 156.97MB [0706 13:50:20 @base.py:207] Setup callbacks graph ...

/usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/gradients_util.py:93: UserWarning: Converting sparse IndexedSlices to a dense Tensor of unknown shape. This may consume a large amount of memory. "Converting sparse IndexedSlices to a dense Tensor of unknown shape. " /usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/gradients_util.py:93: UserWarning: Converting sparse IndexedSlices to a dense Tensor of unknown shape. This may consume a large amount of memory. "Converting sparse IndexedSlices to a dense Tensor of unknown shape. " [0706 13:50:27 @argtools.py:138] WRN "import prctl" failed! Install python-prctl so that processes can be cleaned with guarantee. [0706 13:50:29 @prof.py:291] [HostMemoryTracker] Free RAM in setup_graph() is 364.27 GB. [0706 13:50:29 @tower.py:135] Building graph for predict tower 'tower-pred-0' on device /gpu:0 ... [0706 13:50:30 @collection.py:152] Size of these collections were changed in tower-pred-0: (tf.GraphKeys.MODEL_VARIABLES: 194->227) [0706 13:50:30 @collection.py:165] These collections were modified but restored in tower-pred-0: (tf.GraphKeys.SUMMARIES: 76->77) [0706 13:50:30 @tower.py:135] Building graph for predict tower 'tower-pred-1' on device /gpu:1 with variable scope 'tower1'... [0706 13:50:31 @collection.py:152] Size of these collections were changed in tower-pred-1: (tf.GraphKeys.MODEL_VARIABLES: 227->260) [0706 13:50:31 @collection.py:165] These collections were modified but restored in tower-pred-1: (tf.GraphKeys.SUMMARIES: 76->77) loading annotations into memory... Done (t=0.75s) creating index... index created! [0706 13:50:31 @coco.py:60] Instances loaded from /home/vlamp/Documents/STAC/DATA_STAC/coco/annotations/instances_val2017.json.

0%| | 0/9921 [00:00<?, ?it/s] 100%|##########| 9921/9921 [00:00<00:00, 725119.19it/s][0706 13:50:31 @timer.py:45] Load annotations for instances_val2017.json finished, time:0.0151 sec. [0706 13:50:31 @data.py:456] Found 9921 images for inference. loading annotations into memory... Done (t=0.83s) creating index... index created! [0706 13:50:32 @coco.py:60] Instances loaded from /home/vlamp/Documents/STAC/DATA_STAC/coco/annotations/instances_val2017.json.

0%| | 0/9921 [00:00<?, ?it/s] 100%|##########| 9921/9921 [00:00<00:00, 739211.43it/s][0706 13:50:32 @timer.py:45] Load annotations for instances_val2017.json finished, time:0.0150 sec. [0706 13:50:32 @data.py:456] Found 9921 images for inference. loading annotations into memory... Done (t=0.82s) creating index... index created! [0706 13:50:33 @coco.py:60] Instances loaded from /home/vlamp/Documents/STAC/DATA_STAC/coco/annotations/instances_val2017.json.

0%| | 0/9921 [00:00<?, ?it/s] 100%|##########| 9921/9921 [00:00<00:00, 744062.40it/s][0706 13:50:33 @timer.py:45] Load annotations for instances_val2017.json finished, time:0.0149 sec. [0706 13:50:33 @data.py:456] Found 9921 images for inference. loading annotations into memory... Done (t=0.77s) creating index... index created! [0706 13:50:34 @coco.py:60] Instances loaded from /home/vlamp/Documents/STAC/DATA_STAC/coco/annotations/instances_val2017.json.

0%| | 0/9921 [00:00<?, ?it/s] 100%|##########| 9921/9921 [00:00<00:00, 713481.88it/s][0706 13:50:34 @timer.py:45] Load annotations for instances_val2017.json finished, time:0.0153 sec. [0706 13:50:34 @data.py:456] Found 9921 images for inference. [0706 13:50:34 @summary.py:47] [MovingAverageSummary] 73 operations in collection 'MOVING_SUMMARY_OPS' will be run with session hooks. [0706 13:50:34 @summary.py:94] Summarizing collection 'summaries' of size 76. [0706 13:50:34 @base.py:228] Creating the session ... 2020-07-06 13:50:34.737615: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 AVX512F FMA 2020-07-06 13:50:34.743032: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcuda.so.1 2020-07-06 13:50:34.887781: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x14c78d20 executing computations on platform CUDA. Devices: 2020-07-06 13:50:34.887822: I tensorflow/compiler/xla/service/service.cc:175] StreamExecutor device (0): Tesla T4, Compute Capability 7.5 2020-07-06 13:50:34.887827: I tensorflow/compiler/xla/service/service.cc:175] StreamExecutor device (1): Tesla T4, Compute Capability 7.5 2020-07-06 13:50:34.890055: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2494125000 Hz 2020-07-06 13:50:34.893901: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x14a0c4f0 executing computations on platform Host. 
Devices: 2020-07-06 13:50:34.893919: I tensorflow/compiler/xla/service/service.cc:175] StreamExecutor device (0): , 2020-07-06 13:50:34.896069: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1640] Found device 0 with properties: name: Tesla T4 major: 7 minor: 5 memoryClockRate(GHz): 1.59 pciBusID: 0000:3b:00.0 2020-07-06 13:50:34.896771: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1640] Found device 1 with properties: name: Tesla T4 major: 7 minor: 5 memoryClockRate(GHz): 1.59 pciBusID: 0000:d8:00.0 2020-07-06 13:50:34.897783: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Could not dlopen library 'libcudart.so.10.0'; dlerror: libcudart.so.10.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: :/cm/shared/apps/slurm/current/lib64:/cm/shared/apps/slurm/current/lib64/slurm:/.singularity.d/libs:/usr/local/cuda-10.0/lib64/ 2020-07-06 13:50:34.898069: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Could not dlopen library 'libcublas.so.10.0'; dlerror: libcublas.so.10.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: :/cm/shared/apps/slurm/current/lib64:/cm/shared/apps/slurm/current/lib64/slurm:/.singularity.d/libs:/usr/local/cuda-10.0/lib64/ 2020-07-06 13:50:34.898242: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Could not dlopen library 'libcufft.so.10.0'; dlerror: libcufft.so.10.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: :/cm/shared/apps/slurm/current/lib64:/cm/shared/apps/slurm/current/lib64/slurm:/.singularity.d/libs:/usr/local/cuda-10.0/lib64/ 2020-07-06 13:50:34.898401: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Could not dlopen library 'libcurand.so.10.0'; dlerror: libcurand.so.10.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: :/cm/shared/apps/slurm/current/lib64:/cm/shared/apps/slurm/current/lib64/slurm:/.singularity.d/libs:/usr/local/cuda-10.0/lib64/ 
2020-07-06 13:50:34.898538: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Could not dlopen library 'libcusolver.so.10.0'; dlerror: libcusolver.so.10.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: :/cm/shared/apps/slurm/current/lib64:/cm/shared/apps/slurm/current/lib64/slurm:/.singularity.d/libs:/usr/local/cuda-10.0/lib64/ 2020-07-06 13:50:34.898705: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Could not dlopen library 'libcusparse.so.10.0'; dlerror: libcusparse.so.10.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: :/cm/shared/apps/slurm/current/lib64:/cm/shared/apps/slurm/current/lib64/slurm:/.singularity.d/libs:/usr/local/cuda-10.0/lib64/ 2020-07-06 13:50:34.901746: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudnn.so.7 2020-07-06 13:50:34.901764: W tensorflow/core/common_runtime/gpu/gpu_device.cc:1663] Cannot dlopen some GPU libraries. Skipping registering GPU devices... 2020-07-06 13:50:34.901834: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1181] Device interconnect StreamExecutor with strength 1 edge matrix: 2020-07-06 13:50:34.901840: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1187] 0 1 2020-07-06 13:50:34.901845: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1200] 0: N Y 2020-07-06 13:50:34.901848: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1200] 1: Y N

MultiProcessMapDataZMQ successfully cleaned-up. Traceback (most recent call last): File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 1356, in _do_call return fn(*args) File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 1339, in _run_fn self._extend_graph() File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 1374, in _extend_graph tf_session.ExtendSession(self._session) tensorflow.python.framework.errors_impl.InvalidArgumentError: No OpKernel was registered to support Op 'NcclAllReduce' used by {{node AllReduceGrads/NcclAllReduce}}with these attrs: [shared_name="c0", T=DT_FLOAT, num_devices=2, reduction="sum"] Registered devices: [CPU, XLA_CPU, XLA_GPU] Registered kernels: device='GPU'

 [[AllReduceGrads/NcclAllReduce]]

During handling of the above exception, another exception occurred:

Traceback (most recent call last): File "/home/vlamp/Documents/STAC/detection/train_stg1_bdd.py", line 180, in launch_train_with_config(traincfg, trainer) File "/usr/local/lib/python3.6/dist-packages/tensorpack/train/interface.py", line 99, in launch_train_with_config extra_callbacks=config.extra_callbacks) File "/usr/local/lib/python3.6/dist-packages/tensorpack/train/base.py", line 342, in train_with_defaults steps_per_epoch, starting_epoch, max_epoch) File "/usr/local/lib/python3.6/dist-packages/tensorpack/train/base.py", line 313, in train self.initialize(session_creator, session_init) File "/usr/local/lib/python3.6/dist-packages/tensorpack/utils/argtools.py", line 168, in wrapper return func(*args, **kwargs) File "/usr/local/lib/python3.6/dist-packages/tensorpack/train/tower.py", line 147, in initialize super(TowerTrainer, self).initialize(session_creator, session_init) File "/usr/local/lib/python3.6/dist-packages/tensorpack/utils/argtools.py", line 168, in wrapper return func(*args, **kwargs) File "/usr/local/lib/python3.6/dist-packages/tensorpack/train/base.py", line 230, in initialize self.sess = session_creator.create_session() File "/usr/local/lib/python3.6/dist-packages/tensorpack/tfutils/sesscreate.py", line 88, in create_session run(tf.global_variables_initializer()) File "/usr/local/lib/python3.6/dist-packages/tensorpack/tfutils/sesscreate.py", line 86, in run sess.run(op) File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 950, in run run_metadata_ptr) File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 1173, in _run feed_dict_tensor, options, run_metadata) File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 1350, in _do_run run_metadata) File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 1370, in _do_call raise type(e)(node_def, op, message) tensorflow.python.framework.errors_impl.InvalidArgumentError: No OpKernel was registered to support Op 'NcclAllReduce' used by node AllReduceGrads/NcclAllReduce (defined at usr/local/lib/python3.6/dist-packages/tensorpack/graph_builder/utils.py:154) with these attrs: [shared_name="c0", T=DT_FLOAT, num_devices=2, reduction="sum"] Registered devices: [CPU, XLA_CPU, XLA_GPU] Registered kernels: device='GPU'

 [[AllReduceGrads/NcclAllReduce]]

Errors may have originated from an input operation. Input Source operations connected to node AllReduceGrads/NcclAllReduce: tower0/gradients/AddN_126 (defined at usr/local/lib/python3.6/dist-packages/tensorpack/tfutils/optimizer.py:29) /cm/local/apps/slurm/var/spool/job18434303/slurm_script: line 29: t: command not found

zizhaozhang commented 4 years ago

This looks like a CUDA version problem. Please check that your TensorFlow version is 1.14 and that the installed CUDA version is compatible with it.
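One way to narrow this down is to look for the library files named in the log inside each `LD_LIBRARY_PATH` directory. The following is only a minimal sketch: `find_missing_libs` is a hypothetical helper, the file names are copied from the `dlerror` lines above, and the check covers `LD_LIBRARY_PATH` only, not the other locations the dynamic loader may search.

```python
import os

# Library file names copied from the "Could not dlopen library" lines in the log.
REQUIRED_LIBS = [
    "libcudart.so.10.0",
    "libcublas.so.10.0",
    "libcufft.so.10.0",
    "libcurand.so.10.0",
    "libcusolver.so.10.0",
    "libcusparse.so.10.0",
]


def find_missing_libs(lib_names, search_path=None):
    """Return the names in lib_names not present in any search_path directory.

    search_path is a colon-separated string; it defaults to LD_LIBRARY_PATH.
    """
    if search_path is None:
        search_path = os.environ.get("LD_LIBRARY_PATH", "")
    # Skip empty entries such as the leading ":" seen in the log.
    dirs = [d for d in search_path.split(":") if d]
    missing = []
    for name in lib_names:
        if not any(os.path.exists(os.path.join(d, name)) for d in dirs):
            missing.append(name)
    return missing


if __name__ == "__main__":
    for name in find_missing_libs(REQUIRED_LIBS):
        print("not found in LD_LIBRARY_PATH:", name)
```

If every name is reported as missing even though `/usr/local/cuda-10.0/lib64/` is on the path, that directory probably holds a different CUDA version or is not visible inside the container. The TensorFlow version itself can be printed with `tf.__version__`.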

sisrfeng commented 3 years ago

I also ran into this:

2020-12-23 10:18:39.085280: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Could not dlopen library 'libcudart.so.10.0'; dlerror: libcudart.so.10.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/cuda/lib64:/usr/local/cuda/lib64:/usr/local/cuda/lib64:/usr/local/nvidia/lib:/usr/local/nvidia/lib64

But image

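At https://www.tensorflow.org/install/gpu, the Chinese version of the official site recommends CUDA 10.1 [image], while the English version recommends CUDA 11 [image]. (Is the Chinese site lagging behind the English one?) TensorFlow > 1.13 should go with CUDA 10.1.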
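Independent of what the install page recommends, the `dlerror` line itself names the version the installed TensorFlow binary asks for: `libcudart.so.10.0` is the CUDA 10.0 runtime. A small sketch that reads this from a log; the helper name `expected_versions` and the regular expression are illustrative assumptions, not part of TensorFlow.

```python
import re

# Captures the library name and version suffix from a
# "Could not dlopen library '<name>.so.<version>'" message.
DLOPEN_RE = re.compile(r"Could not dlopen library '(lib\w+)\.so\.([\d.]+)'")


def expected_versions(log_text):
    """Map each library that failed to load to the version suffix requested."""
    return {name: version for name, version in DLOPEN_RE.findall(log_text)}


sample = (
    "Could not dlopen library 'libcudart.so.10.0'; dlerror: libcudart.so.10.0: "
    "cannot open shared object file: No such file or directory"
)
print(expected_versions(sample))  # prints {'libcudart': '10.0'}
```

If the version reported here differs from the CUDA toolkit that is actually installed, either the toolkit or the TensorFlow package needs to change so the two match.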