open-mmlab / mmdetection

OpenMMLab Detection Toolbox and Benchmark
https://mmdetection.readthedocs.io
Apache License 2.0

Default validation hook fails on multi-node training cluster #3424

Closed vdabravolski closed 4 years ago

vdabravolski commented 4 years ago

Thanks for your error report and we appreciate it a lot.

Checklist

  1. I have searched related issues but cannot get the expected help.
  2. The bug has not been fixed in the latest version.

Describe the bug

I'm running distributed training on Amazon SageMaker (an AWS ML service). Ideally, I'd like to train on COCO2017 from scratch on 4 P3.16xlarge nodes (each with 8 GPUs).

After starting the training (each training node invokes the command below), the training process goes as expected and I can see that the model is training successfully. However, once training completes and the script tries to run validation, it fails with the stack trace shown below (see the Error traceback section).

I suspect that the error is caused by the validation hook not being adapted for a multi-node environment, so it cannot find the validation results produced by training processes outside the first node.

If that's the case, I'd like to know what approach I can take to run validation in a multi-node environment. Do I need to create a custom validation hook?
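To make my suspicion concrete, here is a minimal simulation in plain Python. It is not MMDetection's code. It assumes that each rank writes its partial results to a directory on its own node and that rank 0 then reads every part from its local copy of that directory. The file naming (`part_{rank}.pkl`) and all helper names are hypothetical.

```python
import os
import pickle
import tempfile

NUM_NODES = 4
GPUS_PER_NODE = 8
WORLD_SIZE = NUM_NODES * GPUS_PER_NODE


def write_part(node_dir, rank, results):
    """Each rank dumps its partial results into the directory it can see."""
    with open(os.path.join(node_dir, f"part_{rank}.pkl"), "wb") as f:
        pickle.dump(results, f)


def collect_on_rank0(node_dir, world_size):
    """Rank 0 looks for one part file per rank in its own directory."""
    found, missing = [], []
    for rank in range(world_size):
        path = os.path.join(node_dir, f"part_{rank}.pkl")
        if os.path.exists(path):
            with open(path, "rb") as f:
                found.extend(pickle.load(f))
        else:
            missing.append(rank)
    return found, missing


def simulate(shared_filesystem):
    """Model every node's filesystem as a sub-directory of one temp dir."""
    with tempfile.TemporaryDirectory() as root:
        node_dirs = []
        for node in range(NUM_NODES):
            name = "shared" if shared_filesystem else f"node{node}"
            node_dir = os.path.join(root, name)
            os.makedirs(node_dir, exist_ok=True)
            node_dirs.append(node_dir)
        for rank in range(WORLD_SIZE):
            node = rank // GPUS_PER_NODE
            write_part(node_dirs[node], rank, [f"result-from-rank-{rank}"])
        # Rank 0 lives on node 0 and only sees that node's directory.
        return collect_on_rank0(node_dirs[0], WORLD_SIZE)


if __name__ == "__main__":
    for shared in (False, True):
        found, missing = simulate(shared)
        print(f"shared={shared}: found {len(found)} parts, missing ranks {missing}")
```

With separate per-node directories, rank 0 finds only the 8 parts written on its own node and ranks 8 to 31 are missing. With a shared directory it finds all 32. If the assumption holds, the two obvious options would be a work directory on a filesystem shared by all nodes, or collecting results over the network instead of through files.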

Reproduction

  1. What command or script did you run?

     See the training container below. Each container kicks off training with the command:

     python -m torch.distributed.launch --nnodes 4 --node_rank 0 --nproc_per_node 8 --master_addr algo-1 --master_port 55555 /opt/ml/code/mmdetection/tools/train.py /opt/ml/code/updated_config.py --launcher pytorch --work-dir /opt/ml/output

  2. Did you make any modifications on the code or config? Did you understand what you have modified?

     I used configs/mask_rcnn/mask_rcnn_r50_fpn_1x_coco.py; the only modification was decreasing the number of epochs to 1 (to speed up testing cycles). I used "--options" in tools/train.py to override the default number of training epochs.

  3. What dataset did you use?

     COCO2017 training and validation only.
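The launch command in step 1 is shown with `--node_rank 0`, which is the value for the first node. As far as I understand `torch.distributed.launch`, every node needs its own rank. Below is a sketch of how the per-node command could be assembled. The function name and the way the host list is obtained are hypothetical, and it assumes the hosts are named `algo-1` through `algo-N` as in the `--master_addr` value above.

```python
def build_launch_command(hosts, current_host, nproc_per_node=8, master_port=55555):
    """Build the launcher argv for one node (hypothetical helper).

    hosts: names of all nodes, assumed to look like "algo-<number>".
    current_host: name of the node this command will run on.
    """
    # Order hosts by numeric suffix so that "algo-10" sorts after "algo-2".
    ordered = sorted(hosts, key=lambda h: int(h.rsplit("-", 1)[1]))
    node_rank = ordered.index(current_host)
    return [
        "python", "-m", "torch.distributed.launch",
        "--nnodes", str(len(ordered)),
        "--node_rank", str(node_rank),
        "--nproc_per_node", str(nproc_per_node),
        "--master_addr", ordered[0],
        "--master_port", str(master_port),
        "/opt/ml/code/mmdetection/tools/train.py",
        "/opt/ml/code/updated_config.py",
        "--launcher", "pytorch",
        "--work-dir", "/opt/ml/output",
    ]


if __name__ == "__main__":
    all_hosts = ["algo-1", "algo-2", "algo-3", "algo-4"]
    for host in all_hosts:
        print(host, "->", " ".join(build_launch_command(all_hosts, host)))
```

Only `--node_rank` differs between nodes; the master address and port stay the same so that all 32 processes join one process group.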

Environment

  1. Please run python mmdet/utils/collect_env.py to collect necessary environment information and paste it here.
  2. You may add additional information that may be helpful for locating the problem, such as
    • How you installed PyTorch [e.g., pip, conda, source]
    • Other environment variables that may be related (such as $PATH, $LD_LIBRARY_PATH, $PYTHONPATH, etc.)

See the Dockerfile used for training:

# Use Sagemaker PyTorch container as base image
# https://github.com/aws/sagemaker-pytorch-container/blob/master/docker/1.5.0/py3/Dockerfile.gpu
FROM 763104351884.dkr.ecr.us-east-2.amazonaws.com/pytorch-training:1.5.0-gpu-py36-cu101-ubuntu16.04
LABEL author="vadimd@amazon.com"

############# Installing MMDetection from source ############

WORKDIR /opt/ml/code
RUN pip install --upgrade --force-reinstall  torch torchvision cython
RUN pip install mmcv-full==latest+torch1.5.0+cu101 -f https://openmmlab.oss-accelerate.aliyuncs.com/mmcv/dist/index.html

RUN git clone https://github.com/open-mmlab/mmdetection
RUN cd mmdetection/ && \
    pip install -e .

# to address https://github.com/pytorch/pytorch/issues/37377
ENV MKL_THREADING_LAYER GNU
ENV MMDETECTION /opt/ml/code/mmdetection

############# Configuring Sagemaker ##############
COPY container_training /opt/ml/code

ENV SAGEMAKER_SUBMIT_DIRECTORY /opt/ml/code
ENV SAGEMAKER_PROGRAM mmdetection_train.py

WORKDIR /

Error traceback

If applicable, paste the error traceback here.
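A side note on the log below: the evaluation progress counter finishes at 5024/5000 rather than 5000/5000. My guess (an assumption, I have not verified it in the code) is that the distributed sampler pads the 5000 validation images so that each of the 4 x 8 = 32 processes receives the same number of samples. The arithmetic is consistent with that:

```python
import math


def padded_total(num_samples, world_size):
    """Samples per rank and padded total when every rank gets an equal share."""
    per_rank = math.ceil(num_samples / world_size)
    return per_rank, per_rank * world_size


if __name__ == "__main__":
    per_rank, total = padded_total(5000, 4 * 8)
    print(per_rank, total)
```

This prints `157 5024`, which matches the final counter in the log. If the guess is right, the extra 24 entries are duplicates that would have to be dropped once the per-rank results are merged.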


  | 2020-07-27T12:53:53.971-04:00 | 2020-07-27 16:53:53,080 - mmdet - INFO - Epoch [1][50/1833]#011lr: 3.956e-03, eta: 0:44:16, time: 1.490, data_time: 0.864, memory: 4248, loss_rpn_cls: 0.4857, loss_rpn_bbox: 0.1058, loss_cls: 1.2197, acc: 87.8541, loss_bbox: 0.0982, loss_mask: 0.7520, loss: 2.6615
  | 2020-07-27T12:54:25.004-04:00 | 2020-07-27 16:54:24,289 - mmdet - INFO - Epoch [1][100/1833]#011lr: 7.952e-03, eta: 0:30:32, time: 0.624, data_time: 0.074, memory: 4266, loss_rpn_cls: 0.1839, loss_rpn_bbox: 0.0986, loss_cls: 0.5224, acc: 93.7102, loss_bbox: 0.2167, loss_mask: 0.6856, loss: 1.7071
  | 2020-07-27T12:54:57.029-04:00 | 2020-07-27 16:54:56,318 - mmdet - INFO - Epoch [1][150/1833]#011lr: 1.195e-02, eta: 0:25:43, time: 0.637, data_time: 0.062, memory: 4339, loss_rpn_cls: 0.1365, loss_rpn_bbox: 0.0938, loss_cls: 0.5116, acc: 92.4256, loss_bbox: 0.2711, loss_mask: 0.6411, loss: 1.6541
  | 2020-07-27T12:55:29.061-04:00 | 2020-07-27 16:55:28,801 - mmdet - INFO - Epoch [1][200/1833]#011lr: 1.594e-02, eta: 0:23:08, time: 0.650, data_time: 0.076, memory: 4427, loss_rpn_cls: 0.1147, loss_rpn_bbox: 0.0933, loss_cls: 0.5229, acc: 91.0609, loss_bbox: 0.3305, loss_mask: 0.5951, loss: 1.6565
  | 2020-07-27T12:56:01.105-04:00 | 2020-07-27 16:56:01,096 - mmdet - INFO - Epoch [1][250/1833]#011lr: 1.994e-02, eta: 0:21:22, time: 0.650, data_time: 0.073, memory: 4427, loss_rpn_cls: 0.1050, loss_rpn_bbox: 0.0882, loss_cls: 0.5459, acc: 90.2491, loss_bbox: 0.3622, loss_mask: 0.5571, loss: 1.6584
  | 2020-07-27T12:56:34.141-04:00 | 2020-07-27 16:56:33,369 - mmdet - INFO - Epoch [1][300/1833]#011lr: 2.394e-02, eta: 0:19:59, time: 0.645, data_time: 0.051, memory: 4427, loss_rpn_cls: 0.0941, loss_rpn_bbox: 0.0843, loss_cls: 0.5474, acc: 89.8784, loss_bbox: 0.3770, loss_mask: 0.5166, loss: 1.6193
  | 2020-07-27T12:57:06.158-04:00 | 2020-07-27 16:57:06,005 - mmdet - INFO - Epoch [1][350/1833]#011lr: 2.793e-02, eta: 0:18:53, time: 0.653, data_time: 0.058, memory: 4427, loss_rpn_cls: 0.0891, loss_rpn_bbox: 0.0851, loss_cls: 0.5413, acc: 89.5242, loss_bbox: 0.3876, loss_mask: 0.4835, loss: 1.5866
  | 2020-07-27T12:57:39.182-04:00 | 2020-07-27 16:57:38,628 - mmdet - INFO - Epoch [1][400/1833]#011lr: 3.193e-02, eta: 0:17:54, time: 0.652, data_time: 0.061, memory: 4429, loss_rpn_cls: 0.0857, loss_rpn_bbox: 0.0799, loss_cls: 0.5016, acc: 89.5015, loss_bbox: 0.3774, loss_mask: 0.4631, loss: 1.5076
  | 2020-07-27T12:58:11.222-04:00 | 2020-07-27 16:58:11,196 - mmdet - INFO - Epoch [1][450/1833]#011lr: 3.592e-02, eta: 0:17:02, time: 0.651, data_time: 0.079, memory: 4429, loss_rpn_cls: 0.0807, loss_rpn_bbox: 0.0797, loss_cls: 0.4778, acc: 89.4353, loss_bbox: 0.3698, loss_mask: 0.4412, loss: 1.4492
  | 2020-07-27T12:58:44.249-04:00 | 2020-07-27 16:58:43,800 - mmdet - INFO - Epoch [1][500/1833]#011lr: 3.992e-02, eta: 0:16:13, time: 0.652, data_time: 0.054, memory: 4429, loss_rpn_cls: 0.0786, loss_rpn_bbox: 0.0784, loss_cls: 0.4712, acc: 89.4401, loss_bbox: 0.3598, loss_mask: 0.4271, loss: 1.4152
  | 2020-07-27T12:59:17.265-04:00 | 2020-07-27 16:59:16,266 - mmdet - INFO - Epoch [1][550/1833]#011lr: 4.000e-02, eta: 0:15:27, time: 0.650, data_time: 0.075, memory: 4429, loss_rpn_cls: 0.0783, loss_rpn_bbox: 0.0779, loss_cls: 0.4464, acc: 89.6727, loss_bbox: 0.3518, loss_mask: 0.4175, loss: 1.3719
  | 2020-07-27T12:59:49.278-04:00 | 2020-07-27 16:59:48,655 - mmdet - INFO - Epoch [1][600/1833]#011lr: 4.000e-02, eta: 0:14:43, time: 0.648, data_time: 0.075, memory: 4429, loss_rpn_cls: 0.0732, loss_rpn_bbox: 0.0759, loss_cls: 0.4224, acc: 89.7290, loss_bbox: 0.3440, loss_mask: 0.4122, loss: 1.3277
  | 2020-07-27T13:00:21.302-04:00 | 2020-07-27 17:00:21,202 - mmdet - INFO - Epoch [1][650/1833]#011lr: 4.000e-02, eta: 0:14:01, time: 0.651, data_time: 0.066, memory: 4435, loss_rpn_cls: 0.0718, loss_rpn_bbox: 0.0770, loss_cls: 0.4118, acc: 89.6956, loss_bbox: 0.3463, loss_mask: 0.4005, loss: 1.3074
  | 2020-07-27T13:00:54.335-04:00 | 2020-07-27 17:00:53,932 - mmdet - INFO - Epoch [1][700/1833]#011lr: 4.000e-02, eta: 0:13:21, time: 0.655, data_time: 0.082, memory: 4435, loss_rpn_cls: 0.0733, loss_rpn_bbox: 0.0754, loss_cls: 0.4049, acc: 89.6129, loss_bbox: 0.3482, loss_mask: 0.3948, loss: 1.2966
  | 2020-07-27T13:01:27.370-04:00 | 2020-07-27 17:01:26,786 - mmdet - INFO - Epoch [1][750/1833]#011lr: 4.000e-02, eta: 0:12:42, time: 0.657, data_time: 0.079, memory: 4435, loss_rpn_cls: 0.0704, loss_rpn_bbox: 0.0760, loss_cls: 0.3889, acc: 89.6921, loss_bbox: 0.3421, loss_mask: 0.3874, loss: 1.2649
  | 2020-07-27T13:01:59.414-04:00 | 2020-07-27 17:01:59,364 - mmdet - INFO - Epoch [1][800/1833]#011lr: 4.000e-02, eta: 0:12:04, time: 0.652, data_time: 0.059, memory: 4435, loss_rpn_cls: 0.0673, loss_rpn_bbox: 0.0724, loss_cls: 0.3781, acc: 89.9177, loss_bbox: 0.3351, loss_mask: 0.3750, loss: 1.2281
  | 2020-07-27T13:02:32.438-04:00 | 2020-07-27 17:02:32,123 - mmdet - INFO - Epoch [1][850/1833]#011lr: 4.000e-02, eta: 0:11:26, time: 0.654, data_time: 0.069, memory: 4435, loss_rpn_cls: 0.0654, loss_rpn_bbox: 0.0723, loss_cls: 0.3795, acc: 89.8394, loss_bbox: 0.3376, loss_mask: 0.3711, loss: 1.2260
  | 2020-07-27T13:03:05.488-04:00 | 2020-07-27 17:03:04,981 - mmdet - INFO - Epoch [1][900/1833]#011lr: 4.000e-02, eta: 0:10:49, time: 0.658, data_time: 0.069, memory: 4435, loss_rpn_cls: 0.0697, loss_rpn_bbox: 0.0726, loss_cls: 0.3715, acc: 89.9232, loss_bbox: 0.3322, loss_mask: 0.3649, loss: 1.2109
  | 2020-07-27T13:03:38.517-04:00 | 2020-07-27 17:03:37,967 - mmdet - INFO - Epoch [1][950/1833]#011lr: 4.000e-02, eta: 0:10:12, time: 0.660, data_time: 0.062, memory: 4435, loss_rpn_cls: 0.0675, loss_rpn_bbox: 0.0758, loss_cls: 0.3804, acc: 89.8636, loss_bbox: 0.3318, loss_mask: 0.3710, loss: 1.2266
  | 2020-07-27T13:04:11.544-04:00 | 2020-07-27 17:04:10,796 - mmdet - INFO - Epoch [1][1000/1833]#011lr: 4.000e-02, eta: 0:09:36, time: 0.657, data_time: 0.062, memory: 4435, loss_rpn_cls: 0.0656, loss_rpn_bbox: 0.0709, loss_cls: 0.3643, acc: 90.0182, loss_bbox: 0.3273, loss_mask: 0.3621, loss: 1.1902
  | 2020-07-27T13:04:43.557-04:00 | 2020-07-27 17:04:43,236 - mmdet - INFO - Epoch [1][1050/1833]#011lr: 4.000e-02, eta: 0:09:00, time: 0.649, data_time: 0.082, memory: 4435, loss_rpn_cls: 0.0641, loss_rpn_bbox: 0.0718, loss_cls: 0.3637, acc: 90.1061, loss_bbox: 0.3236, loss_mask: 0.3581, loss: 1.1812
  | 2020-07-27T13:05:16.588-04:00 | 2020-07-27 17:05:16,085 - mmdet - INFO - Epoch [1][1100/1833]#011lr: 4.000e-02, eta: 0:08:24, time: 0.657, data_time: 0.080, memory: 4435, loss_rpn_cls: 0.0628, loss_rpn_bbox: 0.0704, loss_cls: 0.3603, acc: 90.0443, loss_bbox: 0.3250, loss_mask: 0.3565, loss: 1.1749
  | 2020-07-27T13:05:49.633-04:00 | 2020-07-27 17:05:49,091 - mmdet - INFO - Epoch [1][1150/1833]#011lr: 4.000e-02, eta: 0:07:49, time: 0.660, data_time: 0.069, memory: 4435, loss_rpn_cls: 0.0617, loss_rpn_bbox: 0.0698, loss_cls: 0.3542, acc: 90.0363, loss_bbox: 0.3246, loss_mask: 0.3512, loss: 1.1614
  | 2020-07-27T13:06:22.696-04:00 | 2020-07-27 17:06:21,917 - mmdet - INFO - Epoch [1][1200/1833]#011lr: 4.000e-02, eta: 0:07:14, time: 0.657, data_time: 0.080, memory: 4435, loss_rpn_cls: 0.0641, loss_rpn_bbox: 0.0715, loss_cls: 0.3570, acc: 89.8932, loss_bbox: 0.3266, loss_mask: 0.3535, loss: 1.1726
  | 2020-07-27T13:06:55.708-04:00 | 2020-07-27 17:06:54,969 - mmdet - INFO - Epoch [1][1250/1833]#011lr: 4.000e-02, eta: 0:06:39, time: 0.661, data_time: 0.065, memory: 4435, loss_rpn_cls: 0.0646, loss_rpn_bbox: 0.0699, loss_cls: 0.3432, acc: 90.2833, loss_bbox: 0.3157, loss_mask: 0.3492, loss: 1.1427
  | 2020-07-27T13:07:28.742-04:00 | 2020-07-27 17:07:28,001 - mmdet - INFO - Epoch [1][1300/1833]#011lr: 4.000e-02, eta: 0:06:04, time: 0.660, data_time: 0.063, memory: 4435, loss_rpn_cls: 0.0615, loss_rpn_bbox: 0.0703, loss_cls: 0.3492, acc: 90.0636, loss_bbox: 0.3210, loss_mask: 0.3464, loss: 1.1484
  | 2020-07-27T13:08:00.767-04:00 | 2020-07-27 17:08:00,739 - mmdet - INFO - Epoch [1][1350/1833]#011lr: 4.000e-02, eta: 0:05:29, time: 0.655, data_time: 0.064, memory: 4435, loss_rpn_cls: 0.0608, loss_rpn_bbox: 0.0706, loss_cls: 0.3445, acc: 90.0633, loss_bbox: 0.3231, loss_mask: 0.3460, loss: 1.1450
  | 2020-07-27T13:08:34.833-04:00 | 2020-07-27 17:08:34,132 - mmdet - INFO - Epoch [1][1400/1833]#011lr: 4.000e-02, eta: 0:04:55, time: 0.668, data_time: 0.060, memory: 4436, loss_rpn_cls: 0.0608, loss_rpn_bbox: 0.0707, loss_cls: 0.3359, acc: 90.2179, loss_bbox: 0.3172, loss_mask: 0.3419, loss: 1.1265
  | 2020-07-27T13:09:07.858-04:00 | 2020-07-27 17:09:07,001 - mmdet - INFO - Epoch [1][1450/1833]#011lr: 4.000e-02, eta: 0:04:21, time: 0.657, data_time: 0.058, memory: 4436, loss_rpn_cls: 0.0587, loss_rpn_bbox: 0.0681, loss_cls: 0.3417, acc: 90.2161, loss_bbox: 0.3189, loss_mask: 0.3422, loss: 1.1296
  | 2020-07-27T13:09:40.905-04:00 | 2020-07-27 17:09:39,955 - mmdet - INFO - Epoch [1][1500/1833]#011lr: 4.000e-02, eta: 0:03:46, time: 0.659, data_time: 0.061, memory: 4436, loss_rpn_cls: 0.0589, loss_rpn_bbox: 0.0684, loss_cls: 0.3358, acc: 90.2434, loss_bbox: 0.3175, loss_mask: 0.3394, loss: 1.1200
  | 2020-07-27T13:10:12.920-04:00 | 2020-07-27 17:10:12,908 - mmdet - INFO - Epoch [1][1550/1833]#011lr: 4.000e-02, eta: 0:03:12, time: 0.658, data_time: 0.052, memory: 4436, loss_rpn_cls: 0.0566, loss_rpn_bbox: 0.0684, loss_cls: 0.3350, acc: 90.4003, loss_bbox: 0.3130, loss_mask: 0.3396, loss: 1.1125
  | 2020-07-27T13:10:45.949-04:00 | 2020-07-27 17:10:45,732 - mmdet - INFO - Epoch [1][1600/1833]#011lr: 4.000e-02, eta: 0:02:38, time: 0.658, data_time: 0.059, memory: 4436, loss_rpn_cls: 0.0583, loss_rpn_bbox: 0.0665, loss_cls: 0.3287, acc: 90.4561, loss_bbox: 0.3083, loss_mask: 0.3338, loss: 1.0956
  | 2020-07-27T13:11:18.994-04:00 | 2020-07-27 17:11:18,854 - mmdet - INFO - Epoch [1][1650/1833]#011lr: 4.000e-02, eta: 0:02:04, time: 0.662, data_time: 0.065, memory: 4436, loss_rpn_cls: 0.0575, loss_rpn_bbox: 0.0666, loss_cls: 0.3332, acc: 90.2748, loss_bbox: 0.3120, loss_mask: 0.3341, loss: 1.1034
  | 2020-07-27T13:11:52.023-04:00 | 2020-07-27 17:11:51,693 - mmdet - INFO - Epoch [1][1700/1833]#011lr: 4.000e-02, eta: 0:01:30, time: 0.656, data_time: 0.087, memory: 4436, loss_rpn_cls: 0.0568, loss_rpn_bbox: 0.0678, loss_cls: 0.3334, acc: 90.3896, loss_bbox: 0.3106, loss_mask: 0.3340, loss: 1.1026
  | 2020-07-27T13:12:25.072-04:00 | 2020-07-27 17:12:24,563 - mmdet - INFO - Epoch [1][1750/1833]#011lr: 4.000e-02, eta: 0:00:56, time: 0.657, data_time: 0.066, memory: 4440, loss_rpn_cls: 0.0544, loss_rpn_bbox: 0.0654, loss_cls: 0.3215, acc: 90.5726, loss_bbox: 0.3049, loss_mask: 0.3308, loss: 1.0771
  | 2020-07-27T13:12:58.154-04:00 | 2020-07-27 17:12:57,368 - mmdet - INFO - Epoch [1][1800/1833]#011lr: 4.000e-02, eta: 0:00:22, time: 0.656, data_time: 0.053, memory: 4440, loss_rpn_cls: 0.0574, loss_rpn_bbox: 0.0675, loss_cls: 0.3164, acc: 90.6313, loss_bbox: 0.3071, loss_mask: 0.3317, loss: 1.0800
  | 2020-07-27T13:13:29.224-04:00 | 2020-07-27 17:13:28,858 - mmdet - INFO - Saving checkpoint at 1 epochs
  | 2020-07-27T13:13:56.237-04:00 | [ ] 0/5000, elapsed: 0s, ETA:
  | 2020-07-27T13:13:56.237-04:00 | [ ] 1/5000, 0.1 task/s, elapsed: 8s, ETA: 40935s
  | ... (progress-bar updates for 2/5000 through 381/5000 omitted) ...
  | 2020-07-27T13:13:56.237-04:00 | [>> ] 382/5000, 40.3 task/s, elapsed: 9s, ETA:

<LOG is redacted >

0s#015[>>>>>>>>>>>>>>>>>>>>>>>>>>] 5024/5000, 214.8 task/s, elapsed: 23s, ETA: 0s
Traceback (most recent call last):
  | 2020-07-27T13:13:56.241-04:00 | File "/opt/ml/code/mmdetection/tools/train.py", line 153, in <module>
  | 2020-07-27T13:13:56.241-04:00 | main()
  | 2020-07-27T13:13:56.241-04:00 | File "/opt/ml/code/mmdetection/tools/train.py", line 149, in main
  | 2020-07-27T13:13:56.241-04:00 | meta=meta)
  | 2020-07-27T13:13:56.241-04:00 | File "/opt/ml/code/mmdetection/mmdet/apis/train.py", line 128, in train_detector
  | 2020-07-27T13:13:56.241-04:00 | runner.run(data_loaders, cfg.workflow, cfg.total_epochs)
  | 2020-07-27T13:13:56.241-04:00 | File "/opt/conda/lib/python3.6/site-packages/mmcv/runner/epoch_based_runner.py", line 122, in run
  | 2020-07-27T13:13:56.241-04:00 | epoch_runner(data_loaders[i], **kwargs)
  | 2020-07-27T13:13:56.241-04:00 | File "/opt/conda/lib/python3.6/site-packages/mmcv/runner/epoch_based_runner.py", line 46, in train
  | 2020-07-27T13:13:56.241-04:00 | self.call_hook('after_train_epoch')
  | 2020-07-27T13:13:56.241-04:00 | File "/opt/conda/lib/python3.6/site-packages/mmcv/runner/base_runner.py", line 282, in call_hook
  | 2020-07-27T13:13:56.241-04:00 | getattr(hook, fn_name)(self)
  | 2020-07-27T13:13:56.241-04:00 | File "/opt/ml/code/mmdetection/mmdet/core/evaluation/eval_hooks.py", line 71, in after_train_epoch
  | 2020-07-27T13:13:56.242-04:00 | gpu_collect=self.gpu_collect)
  | 2020-07-27T13:13:56.242-04:00 | File "/opt/ml/code/mmdetection/mmdet/apis/test.py", line 113, in multi_gpu_test
  | 2020-07-27T13:13:56.242-04:00 | results = collect_results_cpu(results, len(dataset), tmpdir)
  | 2020-07-27T13:13:56.242-04:00 | File "/opt/ml/code/mmdetection/mmdet/apis/test.py", line 147, in collect_results_cpu
  | 2020-07-27T13:13:56.242-04:00 | part_list.append(mmcv.load(part_file))
  | 2020-07-27T13:13:56.242-04:00 | File "/opt/conda/lib/python3.6/site-packages/mmcv/fileio/io.py", line 41, in load
  | 2020-07-27T13:13:56.242-04:00 | obj = handler.load_from_path(file, **kwargs)
  | 2020-07-27T13:13:56.242-04:00 | File "/opt/conda/lib/python3.6/site-packages/mmcv/fileio/handlers/pickle_handler.py", line 14, in load_from_path
  | 2020-07-27T13:13:56.242-04:00 | filepath, mode='rb', **kwargs)
  | 2020-07-27T13:13:56.242-04:00 | File "/opt/conda/lib/python3.6/site-packages/mmcv/fileio/handlers/base.py", line 20, in load_from_path
  | 2020-07-27T13:13:56.242-04:00 | with open(filepath, mode) as f:
  | 2020-07-27T13:13:56.242-04:00 | FileNotFoundError: [Errno 2] No such file or directory: '/opt/ml/output/.eval_hook/part_8.pkl'
  | 2020-07-27T13:14:02.244-04:00 | Traceback (most recent call last):
  | 2020-07-27T13:14:02.244-04:00 | File "/opt/conda/lib/python3.6/runpy.py", line 193, in _run_module_as_main
  | 2020-07-27T13:14:02.244-04:00 | "__main__", mod_spec)
  | 2020-07-27T13:14:02.244-04:00 | File "/opt/conda/lib/python3.6/runpy.py", line 85, in _run_code
  | 2020-07-27T13:14:02.244-04:00 | exec(code, run_globals)
  | 2020-07-27T13:14:02.244-04:00 | File "/opt/conda/lib/python3.6/site-packages/torch/distributed/launch.py", line 263, in <module>
  | 2020-07-27T13:14:02.244-04:00 | main()
  | 2020-07-27T13:14:02.244-04:00 | File "/opt/conda/lib/python3.6/site-packages/torch/distributed/launch.py", line 259, in main
  | 2020-07-27T13:14:02.244-04:00 | cmd=cmd)
  | 2020-07-27T13:14:02.244-04:00 | subprocess.CalledProcessError: Command '['/opt/conda/bin/python', '-u', '/opt/ml/code/mmdetection/tools/train.py', '--local_rank=7', '/opt/ml/code/updated_config.py', '--launcher', 'pytorch', '--work-dir', '/opt/ml/output']' returned non-zero exit status 1.

Bug fix If you have already identified the reason, you can provide the information here. If you are willing to create a PR to fix it, please also leave a comment here and that would be much appreciated!

hellock commented 4 years ago

By default, we assume that multi-node training runs on shared storage. If the nodes have no shared storage, you can specify gpu_collect = True in the evaluation field of the config file.
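
To illustrate why the traceback above ends at part_8.pkl, here is a minimal simulation using only the standard library. It is not mmdetection code: the function names and the per-node directory layout are hypothetical, and it only assumes what the traceback shows, namely that rank 0 loads one part_<rank>.pkl file per process from a tmpdir on its own file system.

```python
import os
import pickle
import tempfile

GPUS_PER_NODE = 8   # as in the report: 8 GPUs per node
NUM_NODES = 4       # as in the report: 4 nodes
WORLD_SIZE = GPUS_PER_NODE * NUM_NODES


def write_part(node_dir, rank):
    """Each rank dumps its partial results into the tmpdir it can see."""
    os.makedirs(node_dir, exist_ok=True)
    with open(os.path.join(node_dir, f"part_{rank}.pkl"), "wb") as f:
        pickle.dump([f"result-from-rank-{rank}"], f)


def collect_on_rank0(node_dir, world_size):
    """Rank 0 reads part_0 .. part_{world_size - 1} from its own tmpdir."""
    parts = []
    for rank in range(world_size):
        with open(os.path.join(node_dir, f"part_{rank}.pkl"), "rb") as f:
            parts.append(pickle.load(f))
    return parts


def simulate(shared_storage):
    """Return the first missing part file name, or None if collection works."""
    with tempfile.TemporaryDirectory() as root:
        for rank in range(WORLD_SIZE):
            node = rank // GPUS_PER_NODE
            # Shared storage: every node writes to the same directory.
            # No shared storage: each node has a private copy of the path.
            sub = "shared" if shared_storage else f"node{node}"
            write_part(os.path.join(root, sub, ".eval_hook"), rank)
        rank0_sub = "shared" if shared_storage else "node0"
        try:
            collect_on_rank0(os.path.join(root, rank0_sub, ".eval_hook"),
                             WORLD_SIZE)
            return None
        except FileNotFoundError as err:
            return os.path.basename(err.filename)
```

Under these assumptions, ranks 0 to 7 live on the first node, so the first file rank 0 cannot find is the one written by rank 8 on the second node, which matches the reported error.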

vdabravolski commented 4 years ago

Thanks, that addressed my issue. Closing the ticket now. For reference, this setting needs to be updated in the base model config file: evaluation.gpu_collect=True
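
As a sketch of what that setting might look like in an mmdetection 2.x style Python config: only gpu_collect=True comes from this thread. The interval and metric values below are assumptions and should match whatever the base config already defines.

```python
# Hypothetical override in the training config (e.g. updated_config.py).
evaluation = dict(
    interval=1,               # assumption: evaluate after every epoch
    metric=['bbox', 'segm'],  # assumption: usual metrics for Mask R-CNN on COCO
    gpu_collect=True,         # collect results across GPUs, not via tmpdir files
)
```

With this set, evaluation should no longer depend on every node seeing the same .eval_hook directory.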

HawkRong commented 4 years ago

@vdabravolski Hello. I noticed you launched your training job separately on each node instead of using the slurm_train.sh script provided by mmdetection. I tried that script in mmdetection 1.0 to launch training over multiple nodes of a cluster managed with Slurm, but my training just won't start. I suspect Slurm failed to create the training processes. Have you experienced similar issues, given that you did not use the slurm_train.sh script either?

vdabravolski commented 4 years ago

@HawkRong, I didn't use Slurm at all. I used Amazon Sagemaker as the training cluster, and Sagemaker doesn't support Slurm. Therefore, on each compute node I just launched a number of separate processes equal to the number of GPU devices. I used Python's subprocess module to manage the individual training processes. You can find my training script here: https://github.com/vdabravolski/mmdetection-sagemaker/blob/master/container_training/mmdetection_train.py#L213-L234

HawkRong commented 4 years ago

@vdabravolski Since the cluster available to me only supports Slurm, and the launch script in mmdetection 1.0 does not seem to work in my case, I plan to mimic your approach. If I use the 'sbatch' command in Slurm to launch several 1-node jobs separately, each executing 'python -m torch.distributed.launch --nproc_per_node=x --nnodes=y --node_rank=z ...', will that work? Another relevant question: will 'torch.distributed.init_process_group' take care of waiting for all the separate processes I launched on different nodes?
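
A rough sketch of the per-node launch described above: the helper below assembles the same torch.distributed.launch command line used in the original report, with the node rank passed in explicitly. The helper name and its arguments are hypothetical; the flags are the ones quoted in this thread.

```python
import sys


def build_launch_cmd(nnodes, node_rank, nproc_per_node,
                     master_addr, master_port, train_args):
    """Build the argument list for one node's torch.distributed.launch call."""
    if not 0 <= node_rank < nnodes:
        raise ValueError(f"node_rank {node_rank} not in [0, {nnodes})")
    return [
        sys.executable, "-m", "torch.distributed.launch",
        "--nnodes", str(nnodes),
        "--node_rank", str(node_rank),
        "--nproc_per_node", str(nproc_per_node),
        "--master_addr", master_addr,
        "--master_port", str(master_port),
    ] + list(train_args)


# Example: the command for the first of four 8-GPU nodes.
cmd = build_launch_cmd(
    nnodes=4, node_rank=0, nproc_per_node=8,
    master_addr="algo-1", master_port=55555,
    train_args=["tools/train.py", "updated_config.py", "--launcher", "pytorch"],
)
```

Each node would run this with its own node_rank and the same master address and port, for example via subprocess.run(cmd, check=True). Since separately submitted 1-node jobs would each see themselves as node 0 from Slurm's point of view, the rank has to be supplied by hand. To my understanding, init_process_group blocks until all expected processes have joined or a timeout expires, so the jobs would also need to be scheduled at roughly the same time.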

HawkRong commented 4 years ago

@vdabravolski The validation went well using mmdetection 1.0.

FantasyJXF commented 4 years ago

By default, we assume that multi-node training runs on shared storage. If the nodes have no shared storage, you can specify gpu_collect = True in the evaluation field of the config file.

I trained Mask R-CNN with mmdet 2.5.0 on 2 GPUs and hit the error below during testing. I followed your suggestion to specify gpu_collect = True in the evaluation field of the config file, and then the training process went well. Error message:

FileNotFoundError: [Errno 2] No such file or directory: 'xxxxxxxxxxxxx.eval_hook/part_1.pkl'

LMerCy commented 3 years ago

@FantasyJXF

I have also run into this problem. Did you ever solve it?