ultralytics / yolov3

YOLOv3 in PyTorch > ONNX > CoreML > TFLite
https://docs.ultralytics.com

Segmentation fault (core dumped) #891

Closed. Hwijune closed this issue 4 years ago

Hwijune commented 4 years ago

hi, @glenn-jocher

I keep getting a segmentation fault error.

Sometimes training gets past 1 epoch, but most of the time it stops before that.

How can I fix it?

Alternatively, before reaching 1 epoch, I would like to save checkpoints at a scheduled iteration interval.

My environment: PyTorch 1.4.0, Python 3.7.6, NVIDIA driver 440, CUDA 10.2, OpenCV 4.1.2

Namespace(accumulate=1, adam=False, arc='default', batch_size=64, bucket='', cache_images=False, cfg='/home/phj8498/darknet/cfg/yolov3-spp3-mid-mask.cfg', data='/home/phj8498/darknet/data/3class_mid.data', device='0,1,2,3,4,5,6,7', epochs=5, evolve=False, img_size=[608], multi_scale=False, name='', nosave=False, notest='True', rect='True', resume='True', single_cls=False, var=None, weights='weights/last.pt')

Using CUDA Apex
device0 _CudaDeviceProperties(name='GeForce RTX 2080 Ti', total_memory=11019MB)
device1 _CudaDeviceProperties(name='GeForce RTX 2080 Ti', total_memory=11019MB)
device2 _CudaDeviceProperties(name='GeForce RTX 2080 Ti', total_memory=11019MB)
device3 _CudaDeviceProperties(name='GeForce RTX 2080 Ti', total_memory=11019MB)
device4 _CudaDeviceProperties(name='GeForce RTX 2080 Ti', total_memory=11019MB)
device5 _CudaDeviceProperties(name='GeForce RTX 2080 Ti', total_memory=11019MB)
device6 _CudaDeviceProperties(name='GeForce RTX 2080 Ti', total_memory=11019MB)
device7 _CudaDeviceProperties(name='GeForce RTX 2080 Ti', total_memory=11019MB)

 Caching labels (1.63932e+06 found, 0 missing, 85598 empty, 68 duplicate, for 1.72492e+06 images): 100%|████████████| 1724917/1724917 [02:29<00:00, 11543.61it/s]

 Caching labels (3053 found, 0 missing, 1544 empty, 0 duplicate, for 4597 images): 100%|██████████████████████████████████| 4597/4597 [00:00<00:00, 14925.84it/s]

Model Summary: 225 layers, 6.38728e+07 parameters, 6.38728e+07 gradients
Using 8 dataloader workers
Starting training for 5 epochs...

 Epoch   gpu_mem      GIoU       obj       cls     total   targets  img_size

 1/4     6.19G      2.82     0.779      5.83      9.43       284       608:  54%|████████████████▊              | 14644/26952 [4:23:07<3:45:40,  1.10s/it] Segmentation fault (core dumped)
glenn-jocher commented 4 years ago

@Hwijune see https://github.com/ultralytics/yolov3/issues/725, may be related.

Hwijune commented 4 years ago

@glenn-jocher thank you, I will test it.

 #from torch.utils.tensorboard import SummaryWriter
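
For context, the workaround from #725 is to comment out the TensorBoard import. A minimal sketch of why that is safe to do, assuming train.py creates the writer inside a try/except and guards each logging call; the tb_writer name and layout here are assumptions, not the repo's verbatim code:

```python
tb_writer = None  # assumed name for the TensorBoard writer used during training

try:
    # Commented out while testing the segfault workaround from #725:
    # from torch.utils.tensorboard import SummaryWriter
    # tb_writer = SummaryWriter()
    pass
except ImportError:
    pass

# Every logging call is guarded, so a disabled writer becomes a no-op:
if tb_writer is not None:
    tb_writer.add_scalar('train/loss', 0.0, 0)  # placeholder tag/value/step
```
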
glenn-jocher commented 4 years ago

@Hwijune yes that's a good idea. Did that produce any change for you?

Hwijune commented 4 years ago

It got past 1 epoch, but then it stopped again.

I think a checkpoint should be saved every 1000 iterations,

so that training can resume even if it stops.

glenn-jocher commented 4 years ago

@Hwijune you can add this functionality by copying this code and placing it within your own logic: https://github.com/ultralytics/yolov3/blob/8b6c8a53182b2415fd61459fc9a0ccbdef8dc904/train.py#L341-L354
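
For reference, a minimal sketch of saving a resumable checkpoint every N iterations inside the batch loop; the helper name save_checkpoint and the checkpoint keys are illustrative assumptions, and the exact keys that --resume expects are in the linked train.py lines:

```python
import torch

def save_checkpoint(path, epoch, model, optimizer):
    """Write a checkpoint that can be reloaded to resume training after a crash."""
    chkpt = {
        'epoch': epoch,
        # unwrap nn.DataParallel if it is being used
        'model': model.module.state_dict() if hasattr(model, 'module') else model.state_dict(),
        'optimizer': optimizer.state_dict(),
    }
    torch.save(chkpt, path)

# Inside the batch loop in train.py (epoch, i, model, optimizer already exist there):
# if i > 0 and i % 1000 == 0:
#     save_checkpoint('weights/last.pt', epoch, model, optimizer)
```
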

Multi-GPU works well in the VM instances we've tested using V100s, but we have not tested multi-GPU with 2080 Ti cards, so it's possible the problem originates there. Do you see it when running on a single GPU?

Hwijune commented 4 years ago

@glenn-jocher thanks, I'll test it:

  1. Single 2080 Ti GPU test
  2. Modified train.py code test
porterjenkins commented 4 years ago

@Hwijune I have the same setup as you (e.g., four 2080 Ti GPUs) and am seeing similar errors. If I reduce the batch size dramatically (down to 4), I can get it to run. But even so, it often crashes somewhat randomly. I would like to train with larger batch sizes and higher-resolution images if possible. Curious if you have made any progress you can share?

glenn-jocher commented 4 years ago

@Hwijune did commenting out the TF summarywriter fix the issue for you?

Hwijune commented 4 years ago

@glenn-jocher @porterjenkins

#from torch.utils.tensorboard import SummaryWriter

The TF SummaryWriter test didn't help.

It's the same as before; I don't know what to do yet.

glenn-jocher commented 4 years ago

@Hwijune I had similar issues on a multi 2080ti machine, ended up "solving" it by just training on a single GPU, i.e. train.py --device 0.

Hwijune commented 4 years ago

How can I use multiple GPUs effectively?

When I tested, the time per epoch was similar regardless of the number of GPUs.

1 epoch -> 2, 4, 8 GPUs

glenn-jocher commented 4 years ago

@Hwijune multi-gpu:

python3 train.py --device 0,1
python3 train.py --device 0,1,2,3
etc.

If a 2/4/8-GPU epoch takes the same amount of time, you are dataloader-limited and need a faster SSD, more CPU cores, etc. For the fastest dataloading, cache all images: python3 train.py --cache
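
One quick way to confirm the dataloader is the bottleneck is to time it with no model in the loop. A rough sketch, assuming a dataset object compatible with torch.utils.data.DataLoader (for example one that, like the repo's LoadImagesAndLabels, may supply its own collate_fn):

```python
import time
from torch.utils.data import DataLoader, Dataset

def measure_loader_speed(dataset: Dataset, batch_size: int = 64,
                         num_workers: int = 8, n_batches: int = 100) -> float:
    """Iterate the loader on its own and return images per second."""
    loader = DataLoader(dataset, batch_size=batch_size, num_workers=num_workers,
                        pin_memory=True,
                        collate_fn=getattr(dataset, 'collate_fn', None))
    t0 = time.time()
    for i, _ in enumerate(loader):
        if i + 1 == n_batches:
            break
    rate = (i + 1) * batch_size / (time.time() - t0)
    print(f'{rate:.1f} images/s with {num_workers} workers')
    return rate
```

If the measured rate is well below what the GPUs consume per second during training, more workers, a faster disk, or --cache will help more than adding GPUs.
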

Hwijune commented 4 years ago

Server specification:

DDR4 256 GB RAM, 1.8 TB SSD, Intel Xeon Silver 4214 CPU @ 2.20 GHz

 nw = min([os.cpu_count(), batch_size if batch_size > 1 else 0, 8])  # number of workers

Should I adjust the maximum number of dataloader workers here?

Lornatang commented 4 years ago

@Hwijune The number of workers depends on the number of GPUs you have. PyTorch developers suggest workers = 4 * GPU_count.
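
A sketch combining that heuristic with the cap quoted above from the dataloader setup; the variable names follow that line, and the exact formula is a tuning suggestion rather than the repo's code:

```python
import os
import torch

batch_size = 64
gpu_count = max(torch.cuda.device_count(), 1)

# Original cap (limits workers to at most 8):
# nw = min([os.cpu_count(), batch_size if batch_size > 1 else 0, 8])

# Raised cap using the 4 * GPU_count rule of thumb:
nw = min([os.cpu_count(), batch_size if batch_size > 1 else 0, 4 * gpu_count])
print(f'Using {nw} dataloader workers')
```
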

Hwijune commented 4 years ago

@Lornatang oh good, I'll test it:

workers = 4 * GPU_count
github-actions[bot] commented 4 years ago

This issue is stale because it has been open 30 days with no activity. Remove Stale label or comment or this will be closed in 5 days.

Hwijune commented 4 years ago

The segfaults are less frequent after installing the latest version.

However, errors still occur.

glenn-jocher commented 11 months ago

@Hwijune, sorry to hear that the issue persists. It might be beneficial to post the specific error output so that the community can assist in diagnosing the problem.