lfz / DSB2017

The solution of team 'grt123' in DSB2017
MIT License

CUDA out of memory error when my GPU memory still has 4 GB left #78

Closed abbyQu closed 6 years ago

abbyQu commented 6 years ago

While running the test, I watched the GPU memory and found that the script exited when memory usage reached 2 GB. My GPU is a 1060 with 6 GB of memory, so how did that happen?

System Info

RuntimeError: cuda runtime error (2): out of memory at /opt/conda/conda-bld/pytorch_1501953625411/work/pytorch-0.1.12/torch/libTHC/THCstorage.cu:66

PyTorch or Caffe2: PyTorch
How you installed PyTorch: conda
Build command you used (if compiling from source): (none)
OS: Ubuntu 14.04
PyTorch version: 0.1.10
Python version: 2.7
CUDA/cuDNN version: 8.0/5.1
GCC version (if compiling from source): 4.9

lfz commented 6 years ago

Train or test? Please try testing first.

Also, try reducing the batch size.

abbyQu commented 6 years ago

Thank you! I was running the test: python main.py

batch_size =1

This is my config_submit:

    config = {'datapath': '/home/qrf/DSB2017-master/test',
              'preprocess_result_path': './prep_result/',
              'outputfile': 'prediction.csv',
              'detector_model': 'net_detector',
              'detector_param': './model/detector.ckpt',
              'classifier_model': 'net_classifier',
              'classifier_param': './model/classifier.ckpt',
              'n_gpu': 1,
              'n_worker_preprocessing': 2,
              'use_exsiting_preprocessing': False,
              'skip_preprocessing': False,
              'skip_detect': False}
abbyQu commented 6 years ago

I traced the memory usage. During preprocessing, no GPU memory was used. When it printed "end preprocessing", usage increased slowly to 239 MB, then suddenly jumped to 2 GB and the script exited with the "out of memory" error.

lfz commented 6 years ago

The batch size is controlled by the console parameter -b; see main.py.


abbyQu commented 6 years ago

Thanks a lot… I haven't found any parameter called "b" or "-b" in the project. I was wondering if the 1060 itself simply cannot handle such a large amount of computation. Can you give any clue about the basic requirements for testing? Can a GTX 1080 manage it? Thanks!

lfz commented 6 years ago

https://github.com/lfz/DSB2017/blob/master/training/detector/main.py#L32
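(For context: that link points to the training detector's argument parsing. A minimal sketch of how such a -b / --batch-size flag is typically declared with argparse is shown below; the default value and help text are placeholders, not the repo's exact code.)

    # sketch of a -b / --batch-size flag declared with argparse;
    # the default below is a placeholder, not necessarily the repo's setting
    import argparse

    parser = argparse.ArgumentParser(description='detector training/testing')
    parser.add_argument('-b', '--batch-size', default=16, type=int, metavar='N',
                        help='mini-batch size')
    args = parser.parse_args()
    print(args.batch_size)   # e.g. `python main.py -b 1` would print 1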

abbyQu commented 6 years ago

thank you!!!

huangmozhilv commented 6 years ago

Hi @lfz, my platform is Debian 8 with 4 GeForce GTX 1080 GPUs, each with about 8 GB of memory. In addition, I have 12 CPU threads. To run python main.py in the root folder, I set test_loader = DataLoader(dataset, batch_size=1, shuffle=False, num_workers=10, pin_memory=False, collate_fn=collate). The problem occurs again. Could you give any clues to fix the issue? Thanks in advance.

lfz commented 6 years ago

Could you give me the entire command and path you used, and the output log?


huangmozhilv commented 6 years ago

My test data contains only two samples from stage1; here is a screenshot of the root folder and the command. [screenshot]

lfz commented 6 years ago

Hi, please print the shape of "input" before line 52 of test_detect.py to make sure that the shape is 1x1x128x128x128.

If it is, try reducing "sidelen" in line 50 of main.py.
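(A minimal way to add that check, assuming `input` is a torch tensor or Variable at that point in test_detect.py:)

    # insert just before the forward pass, around line 52 of test_detect.py
    print('input shape:', input.size())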

huangmozhilv commented 6 years ago

Thank you. My input size is not the same: [screenshot of the printed shape]

According to your code, I think the first dimension depends on the number of GPUs. What do the other dimensions mean? How do I change it to get a size of 1x1x128x128x128?

lfz commented 6 years ago

Oh, that's OK; it should be 208. 128 is the cube size used during training; when testing, the size is 208.

208 = 144 + 2*32

Make sure that all 4 of your GPUs are being used properly; change "n_gpu" to 1 to test this.

If all your GPUs work correctly, try changing the sidelen to a smaller number.
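(For reference, a rough sketch of how the test crop size follows from sidelen and the margin, using the numbers quoted above; the exact variable names in main.py may differ.)

    # sketch only: how the test crop size relates to sidelen and margin
    sidelen = 144                # length of the non-overlapping core of each crop
    margin = 32                  # overlap added on each side of the crop
    crop = sidelen + 2 * margin  # 144 + 2*32 = 208, the cube fed to the detector

    # reducing sidelen shrinks each crop and the GPU memory needed per forward pass,
    # e.g. sidelen = 112 gives 112 + 2*32 = 176
    print(crop, 112 + 2 * margin)   # 208 176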

huangmozhilv commented 6 years ago

It works after reducing sidelen to 112, but the prediction does not seem accurate. [screenshot of the prediction output]

lfz commented 6 years ago

Please run all cases to see the overall score; it might simply be a hard case for everyone.

To be clear, a final score of 0.4 is not a high score; the overall accuracy is just above 80%, against a chance level of 70%.
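(Assuming the "score" here refers to the competition's log-loss metric, a quick sanity check of what 0.4 implies for the average probability assigned to the true label:)

    import math
    # a log loss of 0.4 means the geometric-mean probability assigned to the
    # true label is about exp(-0.4)
    print(math.exp(-0.4))   # ~0.67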


wjx2 commented 6 years ago

@lfz Hi, I found that in training the input size is 128x128x128, but in testing the input size is 208x208x208. Why are the train and test sizes different? To my knowledge, the train and test sizes should be the same. Thanks in advance.