I had experienced similarly poor GPU loading. In my experiments, four things improved my GPU load, but even so, the load stayed around 40%.
The things that helped me were: a) using more CPU workers, which improved file IO; 4 CPUs worked best for this DCASE data, b) setting `use_multiprocessing=False`, which gave roughly a 4x improvement over setting it to `True`, c) increasing the number of epochs per `fit_generator()` call, and d) increasing the batch size. Even after all this, the GPU was waiting for data most of the time, and when it got the data, it processed it in no time. I just couldn't find the right configuration to overcome the file IO bottleneck. In case you do find a good balance, let me know and I will update this code accordingly.
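For reference, the configuration that worked best for me looks roughly like the sketch below. It is only a sketch: `model` and `train_generator` are placeholders for your own model and generator, not this repository's actual names, and the batch size (point d) is set inside the generator itself.

```python
# Sketch of the fit_generator() settings discussed above.
# `model` and `train_generator` are placeholders defined elsewhere.
model.fit_generator(
    generator=train_generator,
    steps_per_epoch=len(train_generator),
    epochs=50,                  # more epochs per fit_generator() call (point c)
    workers=4,                  # 4 CPU workers helped file IO the most (point a)
    use_multiprocessing=False,  # threads were ~4x faster than processes here (point b)
    max_queue_size=10,          # batches pre-fetched while the GPU is busy
)
```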
Thank you for the fast reply! I noticed a performance drop using multiprocessing, too. I think there is a lot of confusion about this in the community: in various GitHub issues, most users reported no performance gains. I'll try again with the Sequential API, but I suspect this won't change anything. Others recommended using h5py. I'll try that too and report results!
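In case it helps others, the h5py suggestion amounts to reading pre-computed features from a single HDF5 file instead of many small files. A minimal sketch of what I plan to try (the dataset names `features` and `labels` and the class name are my own placeholders):

```python
import h5py
from keras.utils import Sequence

class H5Generator(Sequence):
    """Serves batches from one HDF5 file instead of many small files."""

    def __init__(self, h5_path, batch_size):
        self.h5_path = h5_path
        self.batch_size = batch_size
        with h5py.File(h5_path, 'r') as f:
            self.n_samples = f['features'].shape[0]
        self.file = None  # opened lazily, once per worker thread

    def __len__(self):
        return self.n_samples // self.batch_size

    def __getitem__(self, idx):
        if self.file is None:
            self.file = h5py.File(self.h5_path, 'r')
        start = idx * self.batch_size
        stop = start + self.batch_size
        # h5py reads only the requested slice from disk
        return self.file['features'][start:stop], self.file['labels'][start:stop]
```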
I investigated further: switching to the Sequential API and using h5py did not improve the training speed. However, I noticed a significant performance gain using CuDNNGRU in place of GRU. The downside is that recurrent dropout is not implemented in the cuDNN RNN ops (see this issue), apart from experimental implementations.
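The swap itself is a one-line change (sketch, assuming the TensorFlow backend with a CUDA-capable GPU):

```python
from keras.layers import GRU, CuDNNGRU

# Plain GRU: supports recurrent_dropout, but uses the generic kernel.
rnn = GRU(128, return_sequences=True, recurrent_dropout=0.25)

# CuDNNGRU: fused cuDNN kernel, much faster, but recurrent_dropout
# is not available and a GPU is required.
rnn = CuDNNGRU(128, return_sequences=True)
```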
Thanks for your experiments, @leanderme. They will help others trying similar approaches.
Hi, I have a dataset of ~3000 audio files (all zero-padded to 60 seconds). The data didn't fit into memory, so I've adapted your sed-crnn repository to use the data generator from this project. I only care about the SED labels, so I've removed everything related to sound event localization.
With `seq_len = 512` and `batch_size = 512`, it still takes roughly 130 seconds per epoch (~3 s/step). I'm using a 1080 Ti, and the system has 64 GB of RAM. With these parameters, the GPU usage is around 30%.

My questions: Did you experience similar training times? Did you manage to increase the load on your GPU while training? I'm wondering if the data generator is the issue here. Did you experiment with multiprocessing?
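For what it's worth, I check whether the generator alone is the bottleneck by timing it in isolation, without the model (sketch; `train_generator` stands in for my adapted generator):

```python
import time

# If this is close to the ~3 s/step seen during training, the input
# pipeline, not the GPU, is the bottleneck.
t0 = time.time()
for i in range(len(train_generator)):
    x, y = train_generator[i]
print('%.2f s per batch on average' % ((time.time() - t0) / len(train_generator)))
```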