Closed by vaslamp 4 years ago
This looks like a CUDA version problem. Please check that your TensorFlow version is 1.14 and that the installed CUDA version is compatible with it.
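As a rough aid for the check suggested above, here is a minimal sketch that looks up which CUDA version a prebuilt `tensorflow-gpu` wheel is tested against. The table entries follow the "tested build configurations" list in the TensorFlow install guide; the names `TESTED_CUDA`, `expected_cuda` and `is_compatible` are hypothetical helpers, and a custom-built TensorFlow may differ.

```python
# Tested CUDA versions for some prebuilt tensorflow-gpu releases
# (assumption: pip wheels, not custom builds).
TESTED_CUDA = {
    (1, 13): "10.0",
    (1, 14): "10.0",
    (1, 15): "10.0",
    (2, 0): "10.0",
    (2, 1): "10.1",
    (2, 2): "10.1",
    (2, 3): "10.1",
    (2, 4): "11.0",
}


def expected_cuda(tf_version):
    """Return the CUDA version the given TensorFlow release expects, or None."""
    major, minor = (int(part) for part in tf_version.split(".")[:2])
    return TESTED_CUDA.get((major, minor))


def is_compatible(tf_version, installed_cuda):
    """Compare only major.minor of the installed CUDA runtime (e.g. '10.1.243')."""
    expected = expected_cuda(tf_version)
    if expected is None:
        return None  # release is not in the table above
    return installed_cuda.split(".")[:2] == expected.split(".")
```

For example, `is_compatible("1.14.0", "10.1.243")` compares the runtime found under `/usr/local/cuda-10.1` with what TensorFlow 1.14 looks for.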
I also ran into this:
2020-12-23 10:18:39.085280: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Could not dlopen library 'libcudart.so.10.0'; dlerror: libcudart.so.10.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/cuda/lib64:/usr/local/cuda/lib64:/usr/local/cuda/lib64:/usr/local/nvidia/lib:/usr/local/nvidia/lib64
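But at https://www.tensorflow.org/install/gpu the Chinese version of the official site recommends CUDA 10.1, while the English version recommends CUDA 11 (is the Chinese site lagging behind the English one?). TensorFlow > 1.13 should work with CUDA 10.1.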
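The `Could not dlopen library` message above prints the `LD_LIBRARY_PATH` that was searched. A small sketch like the following can show whether the exact file name TensorFlow asks for exists in any of those directories. `find_library` is a hypothetical helper; it only inspects `LD_LIBRARY_PATH` and does not consult the dynamic loader's cache or default system directories.

```python
import os


def find_library(lib_name, ld_library_path):
    """Return the directories on ld_library_path that contain lib_name."""
    hits = []
    for directory in ld_library_path.split(":"):
        if not directory or directory in hits:
            continue  # skip empty entries (e.g. a leading ':') and duplicates
        if os.path.exists(os.path.join(directory, lib_name)):
            hits.append(directory)
    return hits
```

Calling `find_library("libcudart.so.10.0", os.environ.get("LD_LIBRARY_PATH", ""))` returns an empty list when only a different runtime version (for example `libcudart.so.10.1`) is installed in those directories.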
Could you please help me interpret the following error? Is it a problem with the CUDA version? I am not very experienced, and I would like to understand it so that I can fix it and continue.
[33mWARNING:[0m NVIDIA binaries may not be bound with --writable [32m[0706 13:49:52 @voc.py:279][0m Register dataset ['VOC2007/instances_trainval', 'VOC2007/instances_test', 'VOC2012/instances_trainval'] [32m[0706 13:49:52 @coco.py:271][0m Register dataset ['VOC2007/instances_trainval', 'VOC2007/instances_test', 'VOC2012/instances_trainval', 'train2017', 'val2017', 'coco_train2017', 'coco_val2017', 'coco_train2014', 'coco_val2014', 'coco_valminusminival2014', 'coco_minival2014', 'coco_val2017_100'] [32m[0706 13:49:52 @coco.py:205][0m Register dataset ['VOC2007/instances_trainval', 'VOC2007/instances_test', 'VOC2012/instances_trainval', 'train2017', 'val2017', 'coco_train2017', 'coco_val2017', 'coco_train2014', 'coco_val2014', 'coco_valminusminival2014', 'coco_minival2014', 'coco_val2017_100', 'coco_train2017.1@1', 'coco_train2017.1@1-unlabeled', 'coco_train2017.1@2', 'coco_train2017.1@2-unlabeled', 'coco_train2017.1@5', 'coco_train2017.1@5-unlabeled', 'coco_train2017.1@10', 'coco_train2017.1@10-unlabeled', 'coco_train2017.1@20', 'coco_train2017.1@20-unlabeled', 'coco_train2017.1@30', 'coco_train2017.1@30-unlabeled', 'coco_train2017.1@40', 'coco_train2017.1@40-unlabeled', 'coco_train2017.1@50', 'coco_train2017.1@50-unlabeled', 'coco_train2017.2@1', 'coco_train2017.2@1-unlabeled', 'coco_train2017.2@2', 'coco_train2017.2@2-unlabeled', 'coco_train2017.2@5', 'coco_train2017.2@5-unlabeled', 'coco_train2017.2@10', 'coco_train2017.2@10-unlabeled', 'coco_train2017.2@20', 'coco_train2017.2@20-unlabeled', 'coco_train2017.2@30', 'coco_train2017.2@30-unlabeled', 'coco_train2017.2@40', 'coco_train2017.2@40-unlabeled', 'coco_train2017.2@50', 'coco_train2017.2@50-unlabeled', 'coco_train2017.3@1', 'coco_train2017.3@1-unlabeled', 'coco_train2017.3@2', 'coco_train2017.3@2-unlabeled', 'coco_train2017.3@5', 'coco_train2017.3@5-unlabeled', 'coco_train2017.3@10', 'coco_train2017.3@10-unlabeled', 'coco_train2017.3@20', 'coco_train2017.3@20-unlabeled', 'coco_train2017.3@30', 
'coco_train2017.3@30-unlabeled', 'coco_train2017.3@40', 'coco_train2017.3@40-unlabeled', 'coco_train2017.3@50', 'coco_train2017.3@50-unlabeled', 'coco_train2017.4@1', 'coco_train2017.4@1-unlabeled', 'coco_train2017.4@2', 'coco_train2017.4@2-unlabeled', 'coco_train2017.4@5', 'coco_train2017.4@5-unlabeled', 'coco_train2017.4@10', 'coco_train2017.4@10-unlabeled', 'coco_train2017.4@20', 'coco_train2017.4@20-unlabeled', 'coco_train2017.4@30', 'coco_train2017.4@30-unlabeled', 'coco_train2017.4@40', 'coco_train2017.4@40-unlabeled', 'coco_train2017.4@50', 'coco_train2017.4@50-unlabeled', 'coco_train2017.5@1', 'coco_train2017.5@1-unlabeled', 'coco_train2017.5@2', 'coco_train2017.5@2-unlabeled', 'coco_train2017.5@5', 'coco_train2017.5@5-unlabeled', 'coco_train2017.5@10', 'coco_train2017.5@10-unlabeled', 'coco_train2017.5@20', 'coco_train2017.5@20-unlabeled', 'coco_train2017.5@30', 'coco_train2017.5@30-unlabeled', 'coco_train2017.5@40', 'coco_train2017.5@40-unlabeled', 'coco_train2017.5@50', 'coco_train2017.5@50-unlabeled', 'coco_train2017.0@100-extra', 'coco_train2017.0@100-extra-unlabeled', 'coco_unlabeled2017'] [32m[0706 13:49:52 @coco.py:260][0m Register dataset ['VOC2007/instances_trainval', 'VOC2007/instances_test', 'VOC2012/instances_trainval', 'train2017', 'val2017', 'coco_train2017', 'coco_val2017', 'coco_train2014', 'coco_val2014', 'coco_valminusminival2014', 'coco_minival2014', 'coco_val2017_100', 'coco_train2017.1@1', 'coco_train2017.1@1-unlabeled', 'coco_train2017.1@2', 'coco_train2017.1@2-unlabeled', 'coco_train2017.1@5', 'coco_train2017.1@5-unlabeled', 'coco_train2017.1@10', 'coco_train2017.1@10-unlabeled', 'coco_train2017.1@20', 'coco_train2017.1@20-unlabeled', 'coco_train2017.1@30', 'coco_train2017.1@30-unlabeled', 'coco_train2017.1@40', 'coco_train2017.1@40-unlabeled', 'coco_train2017.1@50', 'coco_train2017.1@50-unlabeled', 'coco_train2017.2@1', 'coco_train2017.2@1-unlabeled', 'coco_train2017.2@2', 'coco_train2017.2@2-unlabeled', 'coco_train2017.2@5', 
'coco_train2017.2@5-unlabeled', 'coco_train2017.2@10', 'coco_train2017.2@10-unlabeled', 'coco_train2017.2@20', 'coco_train2017.2@20-unlabeled', 'coco_train2017.2@30', 'coco_train2017.2@30-unlabeled', 'coco_train2017.2@40', 'coco_train2017.2@40-unlabeled', 'coco_train2017.2@50', 'coco_train2017.2@50-unlabeled', 'coco_train2017.3@1', 'coco_train2017.3@1-unlabeled', 'coco_train2017.3@2', 'coco_train2017.3@2-unlabeled', 'coco_train2017.3@5', 'coco_train2017.3@5-unlabeled', 'coco_train2017.3@10', 'coco_train2017.3@10-unlabeled', 'coco_train2017.3@20', 'coco_train2017.3@20-unlabeled', 'coco_train2017.3@30', 'coco_train2017.3@30-unlabeled', 'coco_train2017.3@40', 'coco_train2017.3@40-unlabeled', 'coco_train2017.3@50', 'coco_train2017.3@50-unlabeled', 'coco_train2017.4@1', 'coco_train2017.4@1-unlabeled', 'coco_train2017.4@2', 'coco_train2017.4@2-unlabeled', 'coco_train2017.4@5', 'coco_train2017.4@5-unlabeled', 'coco_train2017.4@10', 'coco_train2017.4@10-unlabeled', 'coco_train2017.4@20', 'coco_train2017.4@20-unlabeled', 'coco_train2017.4@30', 'coco_train2017.4@30-unlabeled', 'coco_train2017.4@40', 'coco_train2017.4@40-unlabeled', 'coco_train2017.4@50', 'coco_train2017.4@50-unlabeled', 'coco_train2017.5@1', 'coco_train2017.5@1-unlabeled', 'coco_train2017.5@2', 'coco_train2017.5@2-unlabeled', 'coco_train2017.5@5', 'coco_train2017.5@5-unlabeled', 'coco_train2017.5@10', 'coco_train2017.5@10-unlabeled', 'coco_train2017.5@20', 'coco_train2017.5@20-unlabeled', 'coco_train2017.5@30', 'coco_train2017.5@30-unlabeled', 'coco_train2017.5@40', 'coco_train2017.5@40-unlabeled', 'coco_train2017.5@50', 'coco_train2017.5@50-unlabeled', 'coco_train2017.0@100-extra', 'coco_train2017.0@100-extra-unlabeled', 'coco_unlabeled2017', 'coco_unlabeledtrainval20class'] [32m[0706 13:49:52 @logger.py:138][0m Directory '/home/vlamp/Documents/STAC/RESULTS' backuped to '/home/vlamp/Documents/STAC/RESULTS0706-134952' [32m[0706 13:49:52 @logger.py:92][0m Argv: 
/home/vlamp/Documents/STAC/detection/train_stg1_bdd.py --logdir /home/vlamp/Documents/STAC/RESULTS/ --simple_path --config BACKBONE.WEIGHTS=/home/vlamp/Documents/STAC/DATA_STAC/coco/ImageNet-R50-AlignPadding.npz DATA.BASEDIR=/home/vlamp/Documents/STAC/DATA_STAC/coco MODE_MASK=False FRCNN.BATCH_PER_IM=64 PREPROC.TRAIN_SHORT_EDGE_SIZE=[500,800] TRAIN.EVAL_PERIOD=20 TRAIN.AUGTYPE_LAB=default [32m[0706 13:49:54 @train_stg1_bdd.py:87][0m Environment Information:
sys.platform          linux
Python                3.6.9 (default, Apr 18 2020, 01:56:04) [GCC 8.4.0]
Tensorpack            v0.10.1-9-g9c1b1b7b-dirty
Numpy                 1.16.4
TensorFlow            1.14.0/v1.14.0-rc1-22-gaf24dc91b5
TF Compiler Version   4.8.5
TF CUDA support       True
TF MKL support        False
TF XLA support        False
Nvidia Driver         /.singularity.d/libs/libnvidia-ml.so
CUDA                  /usr/local/cuda-10.1/targets/x86_64-linux/lib/libcudart.so.10.1.243
CUDNN                 /usr/lib/x86_64-linux-gnu/libcudnn.so.7.6.4
NCCL
CUDA_VISIBLE_DEVICES  0,1
GPU 0,1               Tesla T4
Free RAM              369.15/376.54 GB
CPU Count             40
cv2                   4.2.0
msgpack               1.0.0
python-prctl          False
list(_C.DATA.TRAIN) = ['train2017'] list(_C.DATA.VAL) = ('val2017',) datasets = ['train2017', 'val2017'] _C.DATA.CLASS_NAMES = ['BG', 'car', 'pedestrian', 'big vehicle', 'bicycle', 'motorcycle'] [32m[0706 13:49:54 @config.py:352][0m Config: ------------------------------------------ {'BACKBONE': {'FREEZE_AFFINE': False, 'FREEZE_AT': 2, 'NORM': 'FreezeBN', 'RESNET_NUM_BLOCKS': [3, 4, 6, 3], 'STRIDE_1X1': False, 'TF_PAD_MODE': False, 'WEIGHTS': '/home/vlamp/Documents/STAC/DATA_STAC/coco/ImageNet-R50-AlignPadding.npz'}, 'CASCADE': {'BBOX_REG_WEIGHTS': [[10.0, 10.0, 5.0, 5.0], [20.0, 20.0, 10.0, 10.0], [30.0, 30.0, 15.0, 15.0]], 'IOUS': [0.5, 0.6, 0.7]}, 'DATA': {'ABSOLUTE_COORD': True, 'BASEDIR': '/home/vlamp/Documents/STAC/DATA_STAC/coco', 'CLASS_NAMES': ['BG', 'car', 'pedestrian', 'big vehicle', 'bicycle', 'motorcycle'], 'NUM_CATEGORY': 5, 'NUM_WORKERS': 24, 'TRAIN': ('train2017',), 'UNLABEL': ('',), 'VAL': ('val2017',)}, 'EVAL': {'PSEUDO_INFERENCE': False}, 'FPN': {'ANCHOR_SIZES': (32, 64, 128, 256, 512), 'ANCHOR_STRIDES': (4, 8, 16, 32, 64), 'CASCADE': False, 'FRCNN_CONV_HEAD_DIM': 256, 'FRCNN_FC_HEAD_DIM': 1024, 'FRCNN_HEAD_FUNC': 'fastrcnn_2fc_head', 'MRCNN_HEAD_FUNC': 'maskrcnn_up4conv_head', 'NORM': 'None', 'NUM_CHANNEL': 256, 'PROPOSAL_MODE': 'Level', 'RESOLUTION_REQUIREMENT': 32}, 'FRCNN': {'BATCH_PER_IM': 64, 'BBOX_REG_WEIGHTS': [10.0, 10.0, 5.0, 5.0], 'FG_RATIO': 0.25, 'FG_THRESH': 0.5}, 'MODE_FPN': True, 'MODE_MASK': False, 'MRCNN': {'ACCURATE_PASTE': True, 'HEAD_DIM': 256}, 'PREPROC': {'MAX_SIZE': 1344.0, 'PIXEL_MEAN': [123.675, 116.28, 103.53], 'PIXEL_STD': [58.395, 57.12, 57.375], 'TEST_SHORT_EDGE_SIZE': 800, 'TRAIN_SHORT_EDGE_SIZE': [500, 800]}, 'RPN': {'ANCHOR_RATIOS': (0.5, 1.0, 2.0), 'ANCHOR_SIZES': (32, 64, 128, 256, 512), 'ANCHOR_STRIDE': 16, 'BATCH_PER_IM': 256, 'CROWD_OVERLAP_THRESH': 9.99, 'FG_RATIO': 0.5, 'HEAD_DIM': 1024, 'MIN_SIZE': 0, 'NEGATIVE_ANCHOR_THRESH': 0.3, 'NUM_ANCHOR': 15, 'POSITIVE_ANCHOR_THRESH': 0.7, 'PROPOSAL_NMS_THRESH': 
0.7, 'TEST_PER_LEVEL_NMS_TOPK': 1000, 'TEST_POST_NMS_TOPK': 1000, 'TEST_PRE_NMS_TOPK': 6000, 'TRAIN_PER_LEVEL_NMS_TOPK': 2000, 'TRAIN_POST_NMS_TOPK': 2000, 'TRAIN_PRE_NMS_TOPK': 12000}, 'TEST': {'FRCNN_NMS_THRESH': 0.5, 'RESULTS_PER_IM': 100, 'RESULT_SCORE_THRESH': 0.05, 'RESULT_SCORE_THRESH_VIS': 0.5}, 'TRAIN': {'AUGTYPE': 'strong', 'AUGTYPE_LAB': 'default', 'BASE_LR': 0.01, 'CHECKPOINT_PERIOD': 20, 'CONFIDENCE': 0.9, 'EVAL_PERIOD': 20, 'GAMMA': 0.1, 'LR_SCHEDULE': [120000, 160000, 180000], 'NO_PRN_LOSS': False, 'NUM_GPUS': 2, 'STAGE': 1, 'STARTING_EPOCH': 1, 'STEPS_PER_EPOCH': 500, 'WARMUP': 1000, 'WARMUP_INIT_LR': 0.0033000000000000004, 'WEIGHT_DECAY': 0.0001, 'WU': 2.0}, 'TRAINER': 'replicated'} [32m[0706 13:49:54 @train_stg1_bdd.py:106][0m Warm Up Schedule (steps, value): [(0, 0.0033000000000000004), (1000, 0.01)] [32m[0706 13:49:54 @train_stg1_bdd.py:107][0m LR Schedule (epochs, value): [(2, 0.01), (960.0, 0.001), (1280.0, 0.00010000000000000002)] loading annotations into memory... Done (t=5.18s) creating index... index created! [32m[0706 13:49:59 @coco.py:60][0m Instances loaded from /home/vlamp/Documents/STAC/DATA_STAC/coco/annotations/instances_train2017.json.
[32m[0706 13:50:05 @data.py:416][0m Filtered 0 images which contain no non-crowd groudtruth boxes. Total #images for training: 69403 [32m[0706 13:50:05 @augmentation.py:171][0m ---------------------------------------------------------------------------------------------------- [32m[0706 13:50:05 @augmentation.py:172][0m Augmentation type default: [] [32m[0706 13:50:05 @augmentation.py:173][0m ---------------------------------------------------------------------------------------------------- [32m[0706 13:50:05 @data.py:107][0m Use affine-enabled TrainingDataPreprocessor_aug [32m[0706 13:50:05 @train_stg1_bdd.py:112][0m Total passes of the training set is: 20.748 [32m[0706 13:50:05 @sessinit.py:294][0m Loading dictionary from /home/vlamp/Documents/STAC/DATA_STAC/coco/ImageNet-R50-AlignPadding.npz ... [32m[0706 13:50:06 @training.py:48][0m [DataParallel] Training a model of 2 towers. [32m[0706 13:50:06 @interface.py:41][0m Automatically applying StagingInput on the DataFlow. [32m[0706 13:50:06 @input_source.py:221][0m Setting up the queue 'QueueInput/input_queue' for CPU prefetching ... [32m[0706 13:50:06 @training.py:108][0m Building graph for training tower 0 on device /gpu:0 ... [32m[0706 13:50:06 @argtools.py:138][0m [5m[31mWRN[0m Some BatchNorm layer uses moving_mean/moving_variance in training. [32m[0706 13:50:06 @registry.py:90][0m 'conv0': [1, 3, ?, ?] --> [1, 64, ?, ?] [32m[0706 13:50:06 @registry.py:90][0m 'pool0': [1, 64, ?, ?] --> [1, 64, ?, ?] [32m[0706 13:50:06 @registry.py:90][0m 'group0/block0/conv1': [1, 64, ?, ?] --> [1, 64, ?, ?] [32m[0706 13:50:06 @registry.py:90][0m 'group0/block0/conv2': [1, 64, ?, ?] --> [1, 64, ?, ?] [32m[0706 13:50:06 @registry.py:90][0m 'group0/block0/conv3': [1, 64, ?, ?] --> [1, 256, ?, ?] [32m[0706 13:50:06 @registry.py:90][0m 'group0/block0/convshortcut': [1, 64, ?, ?] --> [1, 256, ?, ?] [32m[0706 13:50:06 @registry.py:90][0m 'group0/block1/conv1': [1, 256, ?, ?] --> [1, 64, ?, ?] 
[32m[0706 13:50:06 @registry.py:90][0m 'group0/block1/conv2': [1, 64, ?, ?] --> [1, 64, ?, ?] [32m[0706 13:50:06 @registry.py:90][0m 'group0/block1/conv3': [1, 64, ?, ?] --> [1, 256, ?, ?] [32m[0706 13:50:06 @registry.py:90][0m 'group0/block2/conv1': [1, 256, ?, ?] --> [1, 64, ?, ?] [32m[0706 13:50:06 @registry.py:90][0m 'group0/block2/conv2': [1, 64, ?, ?] --> [1, 64, ?, ?] [32m[0706 13:50:06 @registry.py:90][0m 'group0/block2/conv3': [1, 64, ?, ?] --> [1, 256, ?, ?] [32m[0706 13:50:06 @registry.py:90][0m 'group1/block0/conv1': [1, 256, ?, ?] --> [1, 128, ?, ?] [32m[0706 13:50:07 @registry.py:90][0m 'group1/block0/conv2': [1, 128, ?, ?] --> [1, 128, ?, ?] [32m[0706 13:50:07 @registry.py:90][0m 'group1/block0/conv3': [1, 128, ?, ?] --> [1, 512, ?, ?] [32m[0706 13:50:07 @registry.py:90][0m 'group1/block0/convshortcut': [1, 256, ?, ?] --> [1, 512, ?, ?] [32m[0706 13:50:07 @registry.py:90][0m 'group1/block1/conv1': [1, 512, ?, ?] --> [1, 128, ?, ?] [32m[0706 13:50:07 @registry.py:90][0m 'group1/block1/conv2': [1, 128, ?, ?] --> [1, 128, ?, ?] [32m[0706 13:50:07 @registry.py:90][0m 'group1/block1/conv3': [1, 128, ?, ?] --> [1, 512, ?, ?] [32m[0706 13:50:07 @registry.py:90][0m 'group1/block2/conv1': [1, 512, ?, ?] --> [1, 128, ?, ?] [32m[0706 13:50:07 @registry.py:90][0m 'group1/block2/conv2': [1, 128, ?, ?] --> [1, 128, ?, ?] [32m[0706 13:50:07 @registry.py:90][0m 'group1/block2/conv3': [1, 128, ?, ?] --> [1, 512, ?, ?] [32m[0706 13:50:07 @registry.py:90][0m 'group1/block3/conv1': [1, 512, ?, ?] --> [1, 128, ?, ?] [32m[0706 13:50:07 @registry.py:90][0m 'group1/block3/conv2': [1, 128, ?, ?] --> [1, 128, ?, ?] [32m[0706 13:50:07 @registry.py:90][0m 'group1/block3/conv3': [1, 128, ?, ?] --> [1, 512, ?, ?] [32m[0706 13:50:07 @registry.py:90][0m 'group2/block0/conv1': [1, 512, ?, ?] --> [1, 256, ?, ?] [32m[0706 13:50:07 @registry.py:90][0m 'group2/block0/conv2': [1, 256, ?, ?] --> [1, 256, ?, ?] [32m[0706 13:50:07 @registry.py:90][0m 'group2/block0/conv3': [1, 256, ?, ?] 
--> [1, 1024, ?, ?] [32m[0706 13:50:07 @registry.py:90][0m 'group2/block0/convshortcut': [1, 512, ?, ?] --> [1, 1024, ?, ?] [32m[0706 13:50:07 @registry.py:90][0m 'group2/block1/conv1': [1, 1024, ?, ?] --> [1, 256, ?, ?] [32m[0706 13:50:07 @registry.py:90][0m 'group2/block1/conv2': [1, 256, ?, ?] --> [1, 256, ?, ?] [32m[0706 13:50:07 @registry.py:90][0m 'group2/block1/conv3': [1, 256, ?, ?] --> [1, 1024, ?, ?] [32m[0706 13:50:07 @registry.py:90][0m 'group2/block2/conv1': [1, 1024, ?, ?] --> [1, 256, ?, ?] [32m[0706 13:50:07 @registry.py:90][0m 'group2/block2/conv2': [1, 256, ?, ?] --> [1, 256, ?, ?] [32m[0706 13:50:07 @registry.py:90][0m 'group2/block2/conv3': [1, 256, ?, ?] --> [1, 1024, ?, ?] [32m[0706 13:50:07 @registry.py:90][0m 'group2/block3/conv1': [1, 1024, ?, ?] --> [1, 256, ?, ?] [32m[0706 13:50:07 @registry.py:90][0m 'group2/block3/conv2': [1, 256, ?, ?] --> [1, 256, ?, ?] [32m[0706 13:50:07 @registry.py:90][0m 'group2/block3/conv3': [1, 256, ?, ?] --> [1, 1024, ?, ?] [32m[0706 13:50:07 @registry.py:90][0m 'group2/block4/conv1': [1, 1024, ?, ?] --> [1, 256, ?, ?] [32m[0706 13:50:07 @registry.py:90][0m 'group2/block4/conv2': [1, 256, ?, ?] --> [1, 256, ?, ?] [32m[0706 13:50:07 @registry.py:90][0m 'group2/block4/conv3': [1, 256, ?, ?] --> [1, 1024, ?, ?] [32m[0706 13:50:07 @registry.py:90][0m 'group2/block5/conv1': [1, 1024, ?, ?] --> [1, 256, ?, ?] [32m[0706 13:50:07 @registry.py:90][0m 'group2/block5/conv2': [1, 256, ?, ?] --> [1, 256, ?, ?] [32m[0706 13:50:07 @registry.py:90][0m 'group2/block5/conv3': [1, 256, ?, ?] --> [1, 1024, ?, ?] [32m[0706 13:50:07 @registry.py:90][0m 'group3/block0/conv1': [1, 1024, ?, ?] --> [1, 512, ?, ?] [32m[0706 13:50:07 @registry.py:90][0m 'group3/block0/conv2': [1, 512, ?, ?] --> [1, 512, ?, ?] [32m[0706 13:50:07 @registry.py:90][0m 'group3/block0/conv3': [1, 512, ?, ?] --> [1, 2048, ?, ?] [32m[0706 13:50:07 @registry.py:90][0m 'group3/block0/convshortcut': [1, 1024, ?, ?] --> [1, 2048, ?, ?] 
[32m[0706 13:50:07 @registry.py:90][0m 'group3/block1/conv1': [1, 2048, ?, ?] --> [1, 512, ?, ?] [32m[0706 13:50:07 @registry.py:90][0m 'group3/block1/conv2': [1, 512, ?, ?] --> [1, 512, ?, ?] [32m[0706 13:50:07 @registry.py:90][0m 'group3/block1/conv3': [1, 512, ?, ?] --> [1, 2048, ?, ?] [32m[0706 13:50:07 @registry.py:90][0m 'group3/block2/conv1': [1, 2048, ?, ?] --> [1, 512, ?, ?] [32m[0706 13:50:07 @registry.py:90][0m 'group3/block2/conv2': [1, 512, ?, ?] --> [1, 512, ?, ?] [32m[0706 13:50:07 @registry.py:90][0m 'group3/block2/conv3': [1, 512, ?, ?] --> [1, 2048, ?, ?] [32m[0706 13:50:07 @registry.py:80][0m 'fpn' input: [1, 256, ?, ?], [1, 512, ?, ?], [1, 1024, ?, ?], [1, 2048, ?, ?] [32m[0706 13:50:07 @registry.py:90][0m 'fpn/lateral_1x1_c2': [1, 256, ?, ?] --> [1, 256, ?, ?] [32m[0706 13:50:07 @registry.py:90][0m 'fpn/lateral_1x1_c3': [1, 512, ?, ?] --> [1, 256, ?, ?] [32m[0706 13:50:07 @registry.py:90][0m 'fpn/lateral_1x1_c4': [1, 1024, ?, ?] --> [1, 256, ?, ?] [32m[0706 13:50:07 @registry.py:90][0m 'fpn/lateral_1x1_c5': [1, 2048, ?, ?] --> [1, 256, ?, ?] [32m[0706 13:50:07 @registry.py:90][0m 'fpn/upsample_lat5': [1, 256, ?, ?] --> [1, 256, ?, ?] [32m[0706 13:50:07 @registry.py:90][0m 'fpn/upsample_lat4': [1, 256, ?, ?] --> [1, 256, ?, ?] [32m[0706 13:50:08 @registry.py:90][0m 'fpn/upsample_lat3': [1, 256, ?, ?] --> [1, 256, ?, ?] [32m[0706 13:50:08 @registry.py:90][0m 'fpn/posthoc_3x3_p2': [1, 256, ?, ?] --> [1, 256, ?, ?] [32m[0706 13:50:08 @registry.py:90][0m 'fpn/posthoc_3x3_p3': [1, 256, ?, ?] --> [1, 256, ?, ?] [32m[0706 13:50:08 @registry.py:90][0m 'fpn/posthoc_3x3_p4': [1, 256, ?, ?] --> [1, 256, ?, ?] [32m[0706 13:50:08 @registry.py:90][0m 'fpn/posthoc_3x3_p5': [1, 256, ?, ?] --> [1, 256, ?, ?] [32m[0706 13:50:08 @registry.py:90][0m 'fpn/maxpool_p6': [1, 256, ?, ?] --> [1, 256, ?, ?] [32m[0706 13:50:08 @registry.py:93][0m 'fpn' output: [1, 256, ?, ?], [1, 256, ?, ?], [1, 256, ?, ?], [1, 256, ?, ?], [1, 256, ?, ?] 
[32m[0706 13:50:08 @registry.py:80][0m 'rpn' input: [1, 256, ?, ?] [32m[0706 13:50:08 @registry.py:90][0m 'rpn/conv0': [1, 256, ?, ?] --> [1, 256, ?, ?] [32m[0706 13:50:08 @registry.py:90][0m 'rpn/class': [1, 256, ?, ?] --> [1, 3, ?, ?] [32m[0706 13:50:08 @registry.py:90][0m 'rpn/box': [1, 256, ?, ?] --> [1, 12, ?, ?] [32m[0706 13:50:08 @registry.py:93][0m 'rpn' output: [?, ?, 3], [?, ?, 3, 4] [32m[0706 13:50:09 @registry.py:80][0m 'fastrcnn' input: [?, 256, 7, 7] [32m[0706 13:50:10 @registry.py:90][0m 'fastrcnn/fc6': [?, 256, 7, 7] --> [?, 1024] [32m[0706 13:50:10 @registry.py:90][0m 'fastrcnn/fc7': [?, 1024] --> [?, 1024] [32m[0706 13:50:10 @registry.py:93][0m 'fastrcnn' output: [?, 1024] [32m[0706 13:50:10 @registry.py:80][0m 'fastrcnn/outputs' input: [?, 1024] [32m[0706 13:50:10 @registry.py:90][0m 'fastrcnn/outputs/class': [?, 1024] --> [?, 6] [32m[0706 13:50:10 @registry.py:90][0m 'fastrcnn/outputs/box': [?, 1024] --> [?, 24] [32m[0706 13:50:10 @registry.py:93][0m 'fastrcnn/outputs' output: [?, 6], [?, 6, 4] [32m[0706 13:50:10 @regularize.py:97][0m regularize_cost() found 57 variables to regularize. 
[32m[0706 13:50:10 @regularize.py:21][0m The following tensors will be regularized: group1/block0/conv1/W:0, group1/block0/conv2/W:0, group1/block0/conv3/W:0, group1/block0/convshortcut/W:0, group1/block1/conv1/W:0, group1/block1/conv2/W:0, group1/block1/conv3/W:0, group1/block2/conv1/W:0, group1/block2/conv2/W:0, group1/block2/conv3/W:0, group1/block3/conv1/W:0, group1/block3/conv2/W:0, group1/block3/conv3/W:0, group2/block0/conv1/W:0, group2/block0/conv2/W:0, group2/block0/conv3/W:0, group2/block0/convshortcut/W:0, group2/block1/conv1/W:0, group2/block1/conv2/W:0, group2/block1/conv3/W:0, group2/block2/conv1/W:0, group2/block2/conv2/W:0, group2/block2/conv3/W:0, group2/block3/conv1/W:0, group2/block3/conv2/W:0, group2/block3/conv3/W:0, group2/block4/conv1/W:0, group2/block4/conv2/W:0, group2/block4/conv3/W:0, group2/block5/conv1/W:0, group2/block5/conv2/W:0, group2/block5/conv3/W:0, group3/block0/conv1/W:0, group3/block0/conv2/W:0, group3/block0/conv3/W:0, group3/block0/convshortcut/W:0, group3/block1/conv1/W:0, group3/block1/conv2/W:0, group3/block1/conv3/W:0, group3/block2/conv1/W:0, group3/block2/conv2/W:0, group3/block2/conv3/W:0, fpn/lateral_1x1_c2/W:0, fpn/lateral_1x1_c3/W:0, fpn/lateral_1x1_c4/W:0, fpn/lateral_1x1_c5/W:0, fpn/posthoc_3x3_p2/W:0, fpn/posthoc_3x3_p3/W:0, fpn/posthoc_3x3_p4/W:0, fpn/posthoc_3x3_p5/W:0, rpn/conv0/W:0, rpn/class/W:0, rpn/box/W:0, fastrcnn/fc6/W:0, fastrcnn/fc7/W:0, fastrcnn/outputs/class/W:0, fastrcnn/outputs/box/W:0 [32m[0706 13:50:12 @training.py:108][0m Building graph for training tower 1 on device /gpu:1 ... [32m[0706 13:50:14 @regularize.py:97][0m regularize_cost() found 57 variables to regularize. 
[32m[0706 13:50:16 @collection.py:152][0m Size of these collections were changed in tower1: (tf.GraphKeys.MODEL_VARIABLES: 161->194) [32m[0706 13:50:16 @collection.py:165][0m These collections were modified but restored in tower1: (tf.GraphKeys.SUMMARIES: 76->77) [32m[0706 13:50:20 @training.py:350][0m 'sync_variables_from_main_tower' includes 607 operations. [32m[0706 13:50:20 @model_utils.py:67][0m [36mList of Trainable Variables: [0mname shape #elements
group1/block0/conv1/W [1, 1, 256, 128] 32768 group1/block0/conv1/bn/gamma [128] 128 group1/block0/conv1/bn/beta [128] 128 group1/block0/conv2/W [3, 3, 128, 128] 147456 group1/block0/conv2/bn/gamma [128] 128 group1/block0/conv2/bn/beta [128] 128 group1/block0/conv3/W [1, 1, 128, 512] 65536 group1/block0/conv3/bn/gamma [512] 512 group1/block0/conv3/bn/beta [512] 512 group1/block0/convshortcut/W [1, 1, 256, 512] 131072 group1/block0/convshortcut/bn/gamma [512] 512 group1/block0/convshortcut/bn/beta [512] 512 group1/block1/conv1/W [1, 1, 512, 128] 65536 group1/block1/conv1/bn/gamma [128] 128 group1/block1/conv1/bn/beta [128] 128 group1/block1/conv2/W [3, 3, 128, 128] 147456 group1/block1/conv2/bn/gamma [128] 128 group1/block1/conv2/bn/beta [128] 128 group1/block1/conv3/W [1, 1, 128, 512] 65536 group1/block1/conv3/bn/gamma [512] 512 group1/block1/conv3/bn/beta [512] 512 group1/block2/conv1/W [1, 1, 512, 128] 65536 group1/block2/conv1/bn/gamma [128] 128 group1/block2/conv1/bn/beta [128] 128 group1/block2/conv2/W [3, 3, 128, 128] 147456 group1/block2/conv2/bn/gamma [128] 128 group1/block2/conv2/bn/beta [128] 128 group1/block2/conv3/W [1, 1, 128, 512] 65536 group1/block2/conv3/bn/gamma [512] 512 group1/block2/conv3/bn/beta [512] 512 group1/block3/conv1/W [1, 1, 512, 128] 65536 group1/block3/conv1/bn/gamma [128] 128 group1/block3/conv1/bn/beta [128] 128 group1/block3/conv2/W [3, 3, 128, 128] 147456 group1/block3/conv2/bn/gamma [128] 128 group1/block3/conv2/bn/beta [128] 128 group1/block3/conv3/W [1, 1, 128, 512] 65536 group1/block3/conv3/bn/gamma [512] 512 group1/block3/conv3/bn/beta [512] 512 group2/block0/conv1/W [1, 1, 512, 256] 131072 group2/block0/conv1/bn/gamma [256] 256 group2/block0/conv1/bn/beta [256] 256 group2/block0/conv2/W [3, 3, 256, 256] 589824 group2/block0/conv2/bn/gamma [256] 256 group2/block0/conv2/bn/beta [256] 256 group2/block0/conv3/W [1, 1, 256, 1024] 262144 group2/block0/conv3/bn/gamma [1024] 1024 group2/block0/conv3/bn/beta [1024] 1024 
group2/block0/convshortcut/W [1, 1, 512, 1024] 524288 group2/block0/convshortcut/bn/gamma [1024] 1024 group2/block0/convshortcut/bn/beta [1024] 1024 group2/block1/conv1/W [1, 1, 1024, 256] 262144 group2/block1/conv1/bn/gamma [256] 256 group2/block1/conv1/bn/beta [256] 256 group2/block1/conv2/W [3, 3, 256, 256] 589824 group2/block1/conv2/bn/gamma [256] 256 group2/block1/conv2/bn/beta [256] 256 group2/block1/conv3/W [1, 1, 256, 1024] 262144 group2/block1/conv3/bn/gamma [1024] 1024 group2/block1/conv3/bn/beta [1024] 1024 group2/block2/conv1/W [1, 1, 1024, 256] 262144 group2/block2/conv1/bn/gamma [256] 256 group2/block2/conv1/bn/beta [256] 256 group2/block2/conv2/W [3, 3, 256, 256] 589824 group2/block2/conv2/bn/gamma [256] 256 group2/block2/conv2/bn/beta [256] 256 group2/block2/conv3/W [1, 1, 256, 1024] 262144 group2/block2/conv3/bn/gamma [1024] 1024 group2/block2/conv3/bn/beta [1024] 1024 group2/block3/conv1/W [1, 1, 1024, 256] 262144 group2/block3/conv1/bn/gamma [256] 256 group2/block3/conv1/bn/beta [256] 256 group2/block3/conv2/W [3, 3, 256, 256] 589824 group2/block3/conv2/bn/gamma [256] 256 group2/block3/conv2/bn/beta [256] 256 group2/block3/conv3/W [1, 1, 256, 1024] 262144 group2/block3/conv3/bn/gamma [1024] 1024 group2/block3/conv3/bn/beta [1024] 1024 group2/block4/conv1/W [1, 1, 1024, 256] 262144 group2/block4/conv1/bn/gamma [256] 256 group2/block4/conv1/bn/beta [256] 256 group2/block4/conv2/W [3, 3, 256, 256] 589824 group2/block4/conv2/bn/gamma [256] 256 group2/block4/conv2/bn/beta [256] 256 group2/block4/conv3/W [1, 1, 256, 1024] 262144 group2/block4/conv3/bn/gamma [1024] 1024 group2/block4/conv3/bn/beta [1024] 1024 group2/block5/conv1/W [1, 1, 1024, 256] 262144 group2/block5/conv1/bn/gamma [256] 256 group2/block5/conv1/bn/beta [256] 256 group2/block5/conv2/W [3, 3, 256, 256] 589824 group2/block5/conv2/bn/gamma [256] 256 group2/block5/conv2/bn/beta [256] 256 group2/block5/conv3/W [1, 1, 256, 1024] 262144 group2/block5/conv3/bn/gamma [1024] 1024 
group2/block5/conv3/bn/beta [1024] 1024 group3/block0/conv1/W [1, 1, 1024, 512] 524288 group3/block0/conv1/bn/gamma [512] 512 group3/block0/conv1/bn/beta [512] 512 group3/block0/conv2/W [3, 3, 512, 512] 2359296 group3/block0/conv2/bn/gamma [512] 512 group3/block0/conv2/bn/beta [512] 512 group3/block0/conv3/W [1, 1, 512, 2048] 1048576 group3/block0/conv3/bn/gamma [2048] 2048 group3/block0/conv3/bn/beta [2048] 2048 group3/block0/convshortcut/W [1, 1, 1024, 2048] 2097152 group3/block0/convshortcut/bn/gamma [2048] 2048 group3/block0/convshortcut/bn/beta [2048] 2048 group3/block1/conv1/W [1, 1, 2048, 512] 1048576 group3/block1/conv1/bn/gamma [512] 512 group3/block1/conv1/bn/beta [512] 512 group3/block1/conv2/W [3, 3, 512, 512] 2359296 group3/block1/conv2/bn/gamma [512] 512 group3/block1/conv2/bn/beta [512] 512 group3/block1/conv3/W [1, 1, 512, 2048] 1048576 group3/block1/conv3/bn/gamma [2048] 2048 group3/block1/conv3/bn/beta [2048] 2048 group3/block2/conv1/W [1, 1, 2048, 512] 1048576 group3/block2/conv1/bn/gamma [512] 512 group3/block2/conv1/bn/beta [512] 512 group3/block2/conv2/W [3, 3, 512, 512] 2359296 group3/block2/conv2/bn/gamma [512] 512 group3/block2/conv2/bn/beta [512] 512 group3/block2/conv3/W [1, 1, 512, 2048] 1048576 group3/block2/conv3/bn/gamma [2048] 2048 group3/block2/conv3/bn/beta [2048] 2048 fpn/lateral_1x1_c2/W [1, 1, 256, 256] 65536 fpn/lateral_1x1_c2/b [256] 256 fpn/lateral_1x1_c3/W [1, 1, 512, 256] 131072 fpn/lateral_1x1_c3/b [256] 256 fpn/lateral_1x1_c4/W [1, 1, 1024, 256] 262144 fpn/lateral_1x1_c4/b [256] 256 fpn/lateral_1x1_c5/W [1, 1, 2048, 256] 524288 fpn/lateral_1x1_c5/b [256] 256 fpn/posthoc_3x3_p2/W [3, 3, 256, 256] 589824 fpn/posthoc_3x3_p2/b [256] 256 fpn/posthoc_3x3_p3/W [3, 3, 256, 256] 589824 fpn/posthoc_3x3_p3/b [256] 256 fpn/posthoc_3x3_p4/W [3, 3, 256, 256] 589824 fpn/posthoc_3x3_p4/b [256] 256 fpn/posthoc_3x3_p5/W [3, 3, 256, 256] 589824 fpn/posthoc_3x3_p5/b [256] 256 rpn/conv0/W [3, 3, 256, 256] 589824 rpn/conv0/b [256] 256 
rpn/class/W [1, 1, 256, 3] 768 rpn/class/b [3] 3 rpn/box/W [1, 1, 256, 12] 3072 rpn/box/b [12] 12 fastrcnn/fc6/W [12544, 1024] 12845056 fastrcnn/fc6/b [1024] 1024 fastrcnn/fc7/W [1024, 1024] 1048576 fastrcnn/fc7/b [1024] 1024 fastrcnn/outputs/class/W [1024, 6] 6144 fastrcnn/outputs/class/b [6] 6 fastrcnn/outputs/box/W [1024, 24] 24576 fastrcnn/outputs/box/b [24] 24[36m Number of trainable variables: 156 Number of parameters (elements): 41147437 Storage space needed for all trainable variables: 156.97MB[0m [32m[0706 13:50:20 @base.py:207][0m Setup callbacks graph ...
/usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/gradients_util.py:93: UserWarning: Converting sparse IndexedSlices to a dense Tensor of unknown shape. This may consume a large amount of memory. "Converting sparse IndexedSlices to a dense Tensor of unknown shape. " /usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/gradients_util.py:93: UserWarning: Converting sparse IndexedSlices to a dense Tensor of unknown shape. This may consume a large amount of memory. "Converting sparse IndexedSlices to a dense Tensor of unknown shape. " [32m[0706 13:50:27 @argtools.py:138][0m [5m[31mWRN[0m "import prctl" failed! Install python-prctl so that processes can be cleaned with guarantee. [32m[0706 13:50:29 @prof.py:291][0m [HostMemoryTracker] Free RAM in setup_graph() is 364.27 GB. [32m[0706 13:50:29 @tower.py:135][0m Building graph for predict tower 'tower-pred-0' on device /gpu:0 ... [32m[0706 13:50:30 @collection.py:152][0m Size of these collections were changed in tower-pred-0: (tf.GraphKeys.MODEL_VARIABLES: 194->227) [32m[0706 13:50:30 @collection.py:165][0m These collections were modified but restored in tower-pred-0: (tf.GraphKeys.SUMMARIES: 76->77) [32m[0706 13:50:30 @tower.py:135][0m Building graph for predict tower 'tower-pred-1' on device /gpu:1 with variable scope 'tower1'... [32m[0706 13:50:31 @collection.py:152][0m Size of these collections were changed in tower-pred-1: (tf.GraphKeys.MODEL_VARIABLES: 227->260) [32m[0706 13:50:31 @collection.py:165][0m These collections were modified but restored in tower-pred-1: (tf.GraphKeys.SUMMARIES: 76->77) loading annotations into memory... Done (t=0.75s) creating index... index created! [32m[0706 13:50:31 @coco.py:60][0m Instances loaded from /home/vlamp/Documents/STAC/DATA_STAC/coco/annotations/instances_val2017.json.
0%| | 0/9921 [00:00<?, ?it/s] 100%|##########| 9921/9921 [00:00<00:00, 725119.19it/s][32m[0706 13:50:31 @timer.py:45][0m Load annotations for instances_val2017.json finished, time:0.0151 sec. [32m[0706 13:50:31 @data.py:456][0m Found 9921 images for inference. loading annotations into memory... Done (t=0.83s) creating index... index created! [32m[0706 13:50:32 @coco.py:60][0m Instances loaded from /home/vlamp/Documents/STAC/DATA_STAC/coco/annotations/instances_val2017.json.
0%| | 0/9921 [00:00<?, ?it/s] 100%|##########| 9921/9921 [00:00<00:00, 739211.43it/s][32m[0706 13:50:32 @timer.py:45][0m Load annotations for instances_val2017.json finished, time:0.0150 sec. [32m[0706 13:50:32 @data.py:456][0m Found 9921 images for inference. loading annotations into memory... Done (t=0.82s) creating index... index created! [32m[0706 13:50:33 @coco.py:60][0m Instances loaded from /home/vlamp/Documents/STAC/DATA_STAC/coco/annotations/instances_val2017.json.
0%| | 0/9921 [00:00<?, ?it/s] 100%|##########| 9921/9921 [00:00<00:00, 744062.40it/s][32m[0706 13:50:33 @timer.py:45][0m Load annotations for instances_val2017.json finished, time:0.0149 sec. [32m[0706 13:50:33 @data.py:456][0m Found 9921 images for inference. loading annotations into memory... Done (t=0.77s) creating index... index created! [32m[0706 13:50:34 @coco.py:60][0m Instances loaded from /home/vlamp/Documents/STAC/DATA_STAC/coco/annotations/instances_val2017.json.
0%| | 0/9921 [00:00<?, ?it/s] 100%|##########| 9921/9921 [00:00<00:00, 713481.88it/s][32m[0706 13:50:34 @timer.py:45][0m Load annotations for instances_val2017.json finished, time:0.0153 sec. [32m[0706 13:50:34 @data.py:456][0m Found 9921 images for inference. [32m[0706 13:50:34 @summary.py:47][0m [MovingAverageSummary] 73 operations in collection 'MOVING_SUMMARY_OPS' will be run with session hooks. [32m[0706 13:50:34 @summary.py:94][0m Summarizing collection 'summaries' of size 76. [32m[0706 13:50:34 @base.py:228][0m Creating the session ... 2020-07-06 13:50:34.737615: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 AVX512F FMA 2020-07-06 13:50:34.743032: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcuda.so.1 2020-07-06 13:50:34.887781: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x14c78d20 executing computations on platform CUDA. Devices: 2020-07-06 13:50:34.887822: I tensorflow/compiler/xla/service/service.cc:175] StreamExecutor device (0): Tesla T4, Compute Capability 7.5 2020-07-06 13:50:34.887827: I tensorflow/compiler/xla/service/service.cc:175] StreamExecutor device (1): Tesla T4, Compute Capability 7.5 2020-07-06 13:50:34.890055: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2494125000 Hz 2020-07-06 13:50:34.893901: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x14a0c4f0 executing computations on platform Host. Devices: 2020-07-06 13:50:34.893919: I tensorflow/compiler/xla/service/service.cc:175] StreamExecutor device (0):,
2020-07-06 13:50:34.896069: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1640] Found device 0 with properties:
name: Tesla T4 major: 7 minor: 5 memoryClockRate(GHz): 1.59
pciBusID: 0000:3b:00.0
2020-07-06 13:50:34.896771: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1640] Found device 1 with properties:
name: Tesla T4 major: 7 minor: 5 memoryClockRate(GHz): 1.59
pciBusID: 0000:d8:00.0
2020-07-06 13:50:34.897783: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Could not dlopen library 'libcudart.so.10.0'; dlerror: libcudart.so.10.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: :/cm/shared/apps/slurm/current/lib64:/cm/shared/apps/slurm/current/lib64/slurm:/.singularity.d/libs:/usr/local/cuda-10.0/lib64/
2020-07-06 13:50:34.898069: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Could not dlopen library 'libcublas.so.10.0'; dlerror: libcublas.so.10.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: :/cm/shared/apps/slurm/current/lib64:/cm/shared/apps/slurm/current/lib64/slurm:/.singularity.d/libs:/usr/local/cuda-10.0/lib64/
2020-07-06 13:50:34.898242: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Could not dlopen library 'libcufft.so.10.0'; dlerror: libcufft.so.10.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: :/cm/shared/apps/slurm/current/lib64:/cm/shared/apps/slurm/current/lib64/slurm:/.singularity.d/libs:/usr/local/cuda-10.0/lib64/
2020-07-06 13:50:34.898401: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Could not dlopen library 'libcurand.so.10.0'; dlerror: libcurand.so.10.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: :/cm/shared/apps/slurm/current/lib64:/cm/shared/apps/slurm/current/lib64/slurm:/.singularity.d/libs:/usr/local/cuda-10.0/lib64/
2020-07-06 13:50:34.898538: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Could not dlopen library 'libcusolver.so.10.0'; dlerror: libcusolver.so.10.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: :/cm/shared/apps/slurm/current/lib64:/cm/shared/apps/slurm/current/lib64/slurm:/.singularity.d/libs:/usr/local/cuda-10.0/lib64/
2020-07-06 13:50:34.898705: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Could not dlopen library 'libcusparse.so.10.0'; dlerror: libcusparse.so.10.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: :/cm/shared/apps/slurm/current/lib64:/cm/shared/apps/slurm/current/lib64/slurm:/.singularity.d/libs:/usr/local/cuda-10.0/lib64/
2020-07-06 13:50:34.901746: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudnn.so.7
2020-07-06 13:50:34.901764: W tensorflow/core/common_runtime/gpu/gpu_device.cc:1663] Cannot dlopen some GPU libraries. Skipping registering GPU devices...
2020-07-06 13:50:34.901834: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1181] Device interconnect StreamExecutor with strength 1 edge matrix:
2020-07-06 13:50:34.901840: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1187] 0 1
2020-07-06 13:50:34.901845: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1200] 0: N Y
2020-07-06 13:50:34.901848: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1200] 1: Y N
MultiProcessMapDataZMQ successfully cleaned-up.
Traceback (most recent call last):
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 1356, in _do_call
return fn(*args)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 1339, in _run_fn
self._extend_graph()
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 1374, in _extend_graph
tf_session.ExtendSession(self._session)
tensorflow.python.framework.errors_impl.InvalidArgumentError: No OpKernel was registered to support Op 'NcclAllReduce' used by {{node AllReduceGrads/NcclAllReduce}}with these attrs: [shared_name="c0", T=DT_FLOAT, num_devices=2, reduction="sum"]
Registered devices: [CPU, XLA_CPU, XLA_GPU]
Registered kernels:
device='GPU'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/vlamp/Documents/STAC/detection/train_stg1_bdd.py", line 180, in <module>
launch_train_with_config(traincfg, trainer)
File "/usr/local/lib/python3.6/dist-packages/tensorpack/train/interface.py", line 99, in launch_train_with_config
extra_callbacks=config.extra_callbacks)
File "/usr/local/lib/python3.6/dist-packages/tensorpack/train/base.py", line 342, in train_with_defaults
steps_per_epoch, starting_epoch, max_epoch)
File "/usr/local/lib/python3.6/dist-packages/tensorpack/train/base.py", line 313, in train
self.initialize(session_creator, session_init)
File "/usr/local/lib/python3.6/dist-packages/tensorpack/utils/argtools.py", line 168, in wrapper
return func(*args, **kwargs)
File "/usr/local/lib/python3.6/dist-packages/tensorpack/train/tower.py", line 147, in initialize
super(TowerTrainer, self).initialize(session_creator, session_init)
File "/usr/local/lib/python3.6/dist-packages/tensorpack/utils/argtools.py", line 168, in wrapper
return func(*args, **kwargs)
File "/usr/local/lib/python3.6/dist-packages/tensorpack/train/base.py", line 230, in initialize
self.sess = session_creator.create_session()
File "/usr/local/lib/python3.6/dist-packages/tensorpack/tfutils/sesscreate.py", line 88, in create_session
run(tf.global_variables_initializer())
File "/usr/local/lib/python3.6/dist-packages/tensorpack/tfutils/sesscreate.py", line 86, in run
sess.run(op)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 950, in run
run_metadata_ptr)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 1173, in _run
feed_dict_tensor, options, run_metadata)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 1350, in _do_run
run_metadata)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 1370, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InvalidArgumentError: No OpKernel was registered to support Op 'NcclAllReduce' used by node AllReduceGrads/NcclAllReduce (defined at usr/local/lib/python3.6/dist-packages/tensorpack/graph_builder/utils.py:154) with these attrs: [shared_name="c0", T=DT_FLOAT, num_devices=2, reduction="sum"]
Registered devices: [CPU, XLA_CPU, XLA_GPU]
Registered kernels:
device='GPU'
Errors may have originated from an input operation.
Input Source operations connected to node AllReduceGrads/NcclAllReduce:
tower0/gradients/AddN_126 (defined at usr/local/lib/python3.6/dist-packages/tensorpack/tfutils/optimizer.py:29)
/cm/local/apps/slurm/var/spool/job18434303/slurm_script: line 29: t: command not found
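The dso_loader lines in the log above name six CUDA 10.0 libraries that could not be opened, which is why no GPU device was registered and the NcclAllReduce op has no usable kernel. A minimal sketch, not part of the original run, for checking which of those libraries the dynamic loader can find when run inside the same container with the same LD_LIBRARY_PATH. It uses only Python's standard ctypes module; the helper name check_loadable is made up for illustration.

```python
import ctypes

# Sonames copied from the "Could not dlopen library" lines in the log above.
CUDA_10_0_LIBS = [
    "libcudart.so.10.0",
    "libcublas.so.10.0",
    "libcufft.so.10.0",
    "libcurand.so.10.0",
    "libcusolver.so.10.0",
    "libcusparse.so.10.0",
]


def check_loadable(sonames):
    """Return {soname: bool}: True if the dynamic loader can open the library.

    Hypothetical helper for illustration; it only reports, it fixes nothing.
    """
    results = {}
    for name in sonames:
        try:
            ctypes.CDLL(name)  # same search path the process was started with
            results[name] = True
        except OSError:
            results[name] = False
    return results


if __name__ == "__main__":
    for name, ok in check_loadable(CUDA_10_0_LIBS).items():
        print("{}: {}".format(name, "ok" if ok else "MISSING"))
```

If every entry prints MISSING, the CUDA 10.0 toolkit libraries are probably not visible inside the container at the path listed in LD_LIBRARY_PATH (/usr/local/cuda-10.0/lib64/ in this log), which matches the earlier suggestion to check the CUDA version against the installed TensorFlow.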