@Hwijune see https://github.com/ultralytics/yolov3/issues/725, may be related.
@glenn-jocher thank you, I will test it.
#from torch.utils.tensorboard import SummaryWriter
@Hwijune yes that's a good idea. Did that produce any change for you?
I got past the first epoch, but then training stopped.
I think a checkpoint should be saved every 1000 iterations,
so that training can resume even if it stops.
@Hwijune you can add this functionality by copying this code and placing it within your own logic: https://github.com/ultralytics/yolov3/blob/8b6c8a53182b2415fd61459fc9a0ccbdef8dc904/train.py#L341-L354
Multi-GPU works well in VM instances we've tested using V100s, but we have not tested multi-GPU using 2080ti cards, so it's possible the problem may be originating there. Do you see it when running a single GPU?
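For reference, a minimal sketch of what periodic checkpoint saving might look like, assuming the usual `model`/`optimizer`/`epoch` objects from a PyTorch training script (this is not the repo's exact code; the save path and helper name are hypothetical):

```python
import torch

def save_checkpoint(model, optimizer, epoch, iteration, path='last_iter.pt'):
    """Save a rolling checkpoint so training can resume after a crash."""
    torch.save({
        'epoch': epoch,
        'iteration': iteration,
        'model': model.state_dict(),
        'optimizer': optimizer.state_dict(),
    }, path)

# Inside the training loop, e.g. every 1000 iterations:
# if i > 0 and i % 1000 == 0:
#     save_checkpoint(model, optimizer, epoch, i)
```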
@glenn-jocher thanks. I'll test it
@Hwijune I have the same setup as you (i.e., four 2080ti GPUs) and am seeing similar errors. If I reduce the batch size dramatically (down to 4), I can get it to run, but even so it often crashes somewhat randomly. I would like to train with larger batch sizes and higher-res images if possible. Curious if you have made any progress you can share?
@Hwijune did commenting out the TF SummaryWriter fix the issue for you?
@glenn-jocher @porterjenkins
#from torch.utils.tensorboard import SummaryWriter
Commenting out the TF SummaryWriter didn't help; the result is the same. I don't know what to do yet.
@Hwijune I had similar issues on a multi-2080ti machine, and ended up "solving" it by just training on a single GPU, i.e. train.py --device 0.
How can I use multiple GPUs?
When I tested, the epoch time was similar regardless of the number of GPUs.
1 epoch -> 2, 4, 8 GPUs
@Hwijune multi-gpu:
python3 train.py --device 0,1
python3 train.py --device 0,1,2,3
etc.
If a 2/4/8-GPU epoch takes the same amount of time in each case, you are dataloader-limited and need a faster SSD, more CPU cores, etc. For the fastest dataloading, cache all images:
python3 train.py --cache
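To verify whether the dataloader really is the bottleneck, one rough check (a sketch with a hypothetical stand-in dataset, not code from this repo) is to time a pass over the DataLoader with no model work at all; if this takes about as long as a training epoch, the GPUs are being starved:

```python
import time
import torch
from torch.utils.data import DataLoader, TensorDataset

# Hypothetical stand-in dataset; replace with your own dataset/dataloader.
dataset = TensorDataset(torch.zeros(64, 3, 416, 416))
dataloader = DataLoader(dataset, batch_size=16, num_workers=8)

t0 = time.time()
for batch in dataloader:
    pass  # no forward/backward pass: measures pure data-loading throughput
print(f'data-only pass: {time.time() - t0:.2f}s')
```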
Server specification: DDR4 256 GB RAM, 1.8 TB SSD, Intel Xeon Silver 4214 CPU @ 2.20 GHz
nw = min([os.cpu_count(), batch_size if batch_size > 1 else 0, 8]) # number of workers
Should I adjust the maximum number of dataloader workers?
@Hwijune The number of workers depends on the number of GPUs you have. PyTorch developers suggest workers = 4 * GPU_count.
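A small sketch (example batch size assumed; not the repo's actual logic) comparing the repo-style worker cap from above with the 4-workers-per-GPU rule of thumb:

```python
import os
import torch

batch_size = 64  # example value
gpu_count = max(torch.cuda.device_count(), 1)

# Repo-style cap: no more workers than CPU cores, the batch size, or 8.
nw = min(os.cpu_count() or 1, batch_size if batch_size > 1 else 0, 8)

# Rule of thumb from this thread: 4 workers per GPU.
nw_suggested = 4 * gpu_count

print(nw, nw_suggested)
```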
@Lornatang oh good, I'll test it.
workers = (4 * GPU_count)
This issue is stale because it has been open 30 days with no activity. Remove Stale label or comment or this will be closed in 5 days.
The errors are less frequent after installing the latest version, but they still occur.
@Hwijune, sorry to hear that the issue persists. It might be beneficial to post the specific error output so that the community can assist in diagnosing the problem.
hi, @glenn-jocher
There is always a segmentation fault error.
Sometimes it gets past 1 epoch, but mostly training just stops.
How can I fix it?
Alternatively, I would like to save checkpoints at scheduled iterations before the first epoch completes.
My environment: PyTorch 1.4.0, Python 3.7.6, NVIDIA driver 440, CUDA 10.2, OpenCV 4.1.2
Using CUDA Apex
device0 _CudaDeviceProperties(name='GeForce RTX 2080 Ti', total_memory=11019MB)
device1 _CudaDeviceProperties(name='GeForce RTX 2080 Ti', total_memory=11019MB)
device2 _CudaDeviceProperties(name='GeForce RTX 2080 Ti', total_memory=11019MB)
device3 _CudaDeviceProperties(name='GeForce RTX 2080 Ti', total_memory=11019MB)
device4 _CudaDeviceProperties(name='GeForce RTX 2080 Ti', total_memory=11019MB)
device5 _CudaDeviceProperties(name='GeForce RTX 2080 Ti', total_memory=11019MB)
device6 _CudaDeviceProperties(name='GeForce RTX 2080 Ti', total_memory=11019MB)
device7 _CudaDeviceProperties(name='GeForce RTX 2080 Ti', total_memory=11019MB)
Model Summary: 225 layers, 6.38728e+07 parameters, 6.38728e+07 gradients
Using 8 dataloader workers
Starting training for 5 epochs...