uber / petastorm

Petastorm is a library that enables single-machine or distributed training and evaluation of deep learning models directly from datasets in Apache Parquet format. It supports ML frameworks such as TensorFlow, PyTorch, and PySpark and can be used from pure Python code.
Apache License 2.0

Guidance on How to Tune BatchedDataLoader #570

Closed andrewredd closed 4 years ago

andrewredd commented 4 years ago

Hi All,

I've read through the docs, issues, and pull requests but still don't feel I have a good handle on how to optimize a data loader in PyTorch. I'm using PyTorch 1.5 and Cloudera HDFS on a machine with an Nvidia Tesla V100, 80 cores, and 700 GB of memory.

I'm trying to read a smallish table of 36M rows with ~2,000 values per row (they are grouped into five array columns). I've made sure the Parquet block size is ~256 MB.

This is what I have for code:

dl = BatchedDataLoader(make_batch_reader(file_name,
                                         workers_count=10,
                                         schema_fields=col_names,
                                         hdfs_driver='libhdfs',
                                         reader_pool_type='process'),
                       batch_size=batch_size,
                       shuffling_queue_capacity=shuffling_queue_capacity)

Problem: I'm trying to figure out how to keep a single GPU running at full memory capacity from the cluster. The setup above is heavily IO bound. From some very basic tuning, it seems I can get through an epoch faster using a large batch size (500K) and a shuffling_queue_capacity of 10x that. Even with the large batch size I only use about half of the GPU memory (17 GB/32 GB). This setup processes a batch every ~20 s, but then the GPU sits idle for ~30 s while it waits for another batch (I wish I knew how to benchmark all this better). I've experimented with larger and smaller shuffle queues, but the loader seems to burn through the queue very quickly and then be bound by IO out of the cluster.
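
A minimal sketch of how the wall time could be split into "waiting for the loader" vs. "GPU compute" per step; here loader is the BatchedDataLoader above and train_step is a placeholder for the forward/backward/optimizer step, so treat it as a sketch of the measurement rather than the actual training code:

import time
import torch

def time_data_vs_compute(loader, train_step, max_steps=100):
    # Rough split of wall time into "blocked on the loader" vs "GPU compute".
    data_wait = compute = 0.0
    t0 = time.time()
    for step, batch in enumerate(loader):
        data_wait += time.time() - t0      # time spent waiting for the next batch

        t1 = time.time()
        train_step(batch)                  # placeholder forward/backward/optimizer step
        torch.cuda.synchronize()           # flush async CUDA work so the timing is honest
        compute += time.time() - t1

        if step + 1 >= max_steps:
            break
        t0 = time.time()

    total = data_wait + compute
    print(f'data wait: {data_wait:.1f}s, compute: {compute:.1f}s '
          f'({data_wait / total:.0%} of the time spent waiting on IO)')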

I've also tried a much smaller batch size (64); Petastorm cranks through batches continuously, but the GPU memory is barely used.

Questions: 1) Are large batches the best way to get high throughput out of the cluster? 2) Do I need to add more workers? I've increased the count to 50 and haven't seen an improvement. 3) I read that the 'process' pool type is superior since I'm not doing any C++-style image processing. Is this the right approach? 4) When I added a transform_fn to the batched data loader I almost immediately overflowed my GPU. Should I be making my batches smaller if I take that approach?
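
On questions 1 and 4, one way to take the guesswork out of sizing batches is to watch PyTorch's GPU memory counters while tuning batch_size. This is just a sketch using standard torch.cuda calls; where it gets called inside the training loop is up to you:

import torch

def report_gpu_memory(tag=''):
    # Current and peak GPU memory in GiB, so the cost of batch_size and
    # transform_fn shows up as numbers instead of OOM surprises.
    alloc = torch.cuda.memory_allocated() / 2**30
    peak = torch.cuda.max_memory_allocated() / 2**30
    print(f'{tag} allocated: {alloc:.1f} GiB, peak: {peak:.1f} GiB')

# Example usage around one training step (hypothetical placement):
#   torch.cuda.reset_peak_memory_stats()
#   ...run one batch through the model...
#   report_gpu_memory('after first batch')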

Would love to contribute a tutorial when we figure this out if that would be helpful.

The next step would be to put this on a much larger multi-GPU machine, but I need to figure out these throughput questions first.

Thanks for the great work!

andrewredd commented 4 years ago

Seems similar to this issue https://github.com/uber/petastorm/issues/493

@selitvin do you have any guidance in terms of diagnosing the questions above?

selitvin commented 4 years ago

Sorry for the late response. I was away last week and was not monitoring my emails.

Thank you for the detailed description of your case, and I appreciate you going through the existing issues and documentation!

I'd ask the following question:

Let r_io be the rate (rows/second) at which the reader can deliver data, and r_nn the rate at which the training loop can consume it. Do we have r_io > r_nn?
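
For the r_nn side of that comparison, a minimal sketch of how the training loop's own consumption rate could be estimated by feeding synthetic batches of the on-disk shape, so IO is taken out of the picture entirely; train_step, the 2000-value row width, and the batch size are placeholders for this particular setup:

import time
import torch

def estimate_r_nn(train_step, batch_size=500_000, row_width=2000, steps=20):
    # Feed random batches shaped like the real data straight to the GPU side
    # to see how many rows/sec the model itself can consume.
    batch = torch.randn(batch_size, row_width, device='cuda')

    train_step(batch)              # warm-up step (CUDA context, cudnn autotune)
    torch.cuda.synchronize()

    start = time.time()
    for _ in range(steps):
        train_step(batch)          # placeholder forward/backward/optimizer step
    torch.cuda.synchronize()
    return steps * batch_size / (time.time() - start)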

Hope this helps...

andrewredd commented 4 years ago

Thank you for the guidance @selitvin! Update on r_io

I've run the following code to benchmark r_io:

from itertools import product

pools = ['thread', 'process']
workers = [i for i in range(20, 81, 20)]
for i in product(pools, workers):
    print(i)

import time

from petastorm.reader import make_batch_reader
from petastorm.pytorch import BatchedDataLoader

for pool_type, workers_count in product(pools, workers):
    with make_batch_reader(file_name, 
                               workers_count=workers_count, 
                               schema_fields=schema_cols, 
                               hdfs_driver='libhdfs',
                               reader_pool_type = pool_type) as reader:

        r_io_sum = 0
        samples = 5
        for sample_idx in range(samples):
            num_rows = 0
            post_first_batch = False
            print(f'SAMPLE: {sample_idx} -- workers_count: {workers_count} and pool_type: {pool_type}')
            batch_idx = 0
            while True:
                batch = next(reader)

                # exclude the first batch from time computation in case
                # reader lags on first batch of reader
                if not post_first_batch:
                    post_first_batch = True
                    start = time.time()

                # print batches to make sure the process isn't hanging
                if batch_idx % 10 == 0:
                    print(batch_idx)

                # sum total number of rows to calculate throughput
                num_rows += batch[0].shape[0]

                if batch_idx > 50:
                    break

                batch_idx += 1

            end = time.time()
            r_io = num_rows/(end - start)
            print(f'\t r_io: {r_io}')
            r_io_sum += r_io
        print(f'FINISHED: average r_io was {r_io_sum / samples}')

Which yielded the following output:

SAMPLING -- workers_count: 20 and pool_type: thread
     SAMPLE 0/9 -- r_io: 47289.3997842017
     SAMPLE 1/9 -- r_io: 48842.93403938571
     SAMPLE 2/9 -- r_io: 48546.25535884053
     SAMPLE 3/9 -- r_io: 47612.4815459365
     SAMPLE 4/9 -- r_io: 40063.17078407439
     SAMPLE 5/9 -- r_io: 49382.76953992936
     SAMPLE 6/9 -- r_io: 46786.332368129006
     SAMPLE 7/9 -- r_io: 48050.99179485772
     SAMPLE 8/9 -- r_io: 48326.967940064846
     SAMPLE 9/9 -- r_io: 42193.66452616447
FINISHED: average r_io was 46709.49676815842

SAMPLING -- workers_count: 50 and pool_type: thread  
     SAMPLE 0/9 -- r_io: 14967.043967968786
     SAMPLE 1/9 -- r_io: 15508.153836147667
     SAMPLE 2/9 -- r_io: 14742.278548151298
     SAMPLE 3/9 -- r_io: 16603.15669550825
     SAMPLE 4/9 -- r_io: 16091.641570850566
     SAMPLE 5/9 -- r_io: 15430.678891548883
     SAMPLE 6/9 -- r_io: 16523.03313438565
     SAMPLE 7/9 -- r_io: 21826.005453270616
     SAMPLE 8/9 -- r_io: 22088.700325769678
     SAMPLE 9/9 -- r_io: 16995.26939655195
FINISHED: average r_io was 17077.596182015335

SAMPLING -- workers_count: 80 and pool_type: thread  
     SAMPLE 0/9 -- r_io: 6308.844691935815
     SAMPLE 1/9 -- r_io: 5982.248499435853
     SAMPLE 2/9 -- r_io: 5627.0357041572715
     SAMPLE 3/9 -- r_io: 5725.003200946084
     SAMPLE 4/9 -- r_io: 6344.455604555025
     SAMPLE 5/9 -- r_io: 6337.626240406156
     SAMPLE 6/9 -- r_io: 5954.369076002141
     SAMPLE 7/9 -- r_io: 5789.844728023085
     SAMPLE 8/9 -- r_io: 6183.318914167719
     SAMPLE 9/9 -- r_io: 6120.812918811555
FINISHED: average r_io was 6037.355957844071

SAMPLING -- workers_count: 20 and pool_type: process 
     SAMPLE 0/9 -- r_io: 25138.236878261065
     SAMPLE 1/9 -- r_io: 23611.909485802513
     SAMPLE 2/9 -- r_io: 25261.459363242815
     SAMPLE 3/9 -- r_io: 25907.096433257884
     SAMPLE 4/9 -- r_io: 23173.003867129046
     SAMPLE 5/9 -- r_io: 26955.650560117658
     SAMPLE 6/9 -- r_io: 28025.459349069155
     SAMPLE 7/9 -- r_io: 24698.44730601666
     SAMPLE 8/9 -- r_io: 22864.565694072713
     SAMPLE 9/9 -- r_io: 24143.33538095894
FINISHED: average r_io was 24977.916431792844

SAMPLING -- workers_count: 50 and pool_type: process 
     SAMPLE 0/9 -- r_io: 26813.751892380726
     SAMPLE 1/9 -- r_io: 25272.543999493835
     SAMPLE 2/9 -- r_io: 26577.61466215444
     SAMPLE 3/9 -- r_io: 27982.130119654983
     SAMPLE 4/9 -- r_io: 24728.14891720711
     SAMPLE 5/9 -- r_io: 25970.6300840258
     SAMPLE 6/9 -- r_io: 28529.835520128672
     SAMPLE 7/9 -- r_io: 25037.357654011445
     SAMPLE 8/9 -- r_io: 26641.47463352383
     SAMPLE 9/9 -- r_io: 27916.105393833364
FINISHED: average r_io was 26546.959287641417

SAMPLING -- workers_count: 80 and pool_type: process
     SAMPLE 0/9 -- r_io: 20892.490186938205
     SAMPLE 1/9 -- r_io: 24667.224219708984
     SAMPLE 2/9 -- r_io: 24918.51854776131
     SAMPLE 3/9 -- r_io: 23008.608614715413
     SAMPLE 4/9 -- r_io: 23604.119538046947
     SAMPLE 5/9 -- r_io: 25003.16456680183
     SAMPLE 6/9 -- r_io: 28074.600471135745
     SAMPLE 7/9 -- r_io: 26454.70882168297
     SAMPLE 8/9 -- r_io: 25425.623173821612
     SAMPLE 9/9 -- r_io: 25714.620004098764
FINISHED: average r_io was 24776.367814471178

The code takes multiple samples of each configuration because a single run of 50 batches returned a different throughput each time (depending, I imagine, on the availability of the cluster). Interestingly, the grid search indicated that the thread pool with 20 workers outperformed both the process pool with more workers and the thread pool with more workers. Does this indicate a bug in the code? I'm seeing consistent results across runs that reach nearly double the throughput I currently get with 80 workers and process pooling.

Going to move on to r_nn next.

selitvin commented 4 years ago

As for the benchmarking code, I'd suggest adding several "warm-up" cycles: there are queues that fill up at the beginning, and you want to measure the steady state of the system, after all the queues are full.

You might also want to set num_epochs=20 (or whatever number makes sense) and not recreate the reader in between samples.

For this benchmark it makes sense that the thread pool is faster, since the inter-process communication with a process pool is implemented in pyzmq. Although fast, it cannot beat an in-process thread pool. It's also a good sign that the thread pool is faster, because it means we are not doing many GIL-intensive Python operations.

Things might start looking worse (hopefully not by much) once we go through BatchedDataLoader. That's another interesting data point.
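
To make that concrete, a sketch of the same benchmark with warm-up batches and num_epochs set on the reader so it is not recreated between samples; file_name, schema_cols, batch_size, and shuffling_queue_capacity are the same placeholders as above, and the assumption that the loader yields a dict of column tensors is noted in the code:

import time

from petastorm.reader import make_batch_reader
from petastorm.pytorch import BatchedDataLoader

WARMUP_BATCHES = 10
MEASURE_BATCHES = 50

with make_batch_reader(file_name, workers_count=20, schema_fields=schema_cols,
                       hdfs_driver='libhdfs', reader_pool_type='thread',
                       num_epochs=20) as reader:
    loader = iter(BatchedDataLoader(reader,
                                    batch_size=batch_size,
                                    shuffling_queue_capacity=shuffling_queue_capacity))

    for _ in range(WARMUP_BATCHES):       # let the worker and shuffle queues fill up
        next(loader)

    num_rows = 0
    start = time.time()
    for _ in range(MEASURE_BATCHES):      # measure steady-state throughput only
        batch = next(loader)
        # assuming the loader yields a dict of column tensors of length batch_size
        num_rows += len(next(iter(batch.values())))

    print(f'steady-state rows/sec through BatchedDataLoader: '
          f'{num_rows / (time.time() - start):.0f}')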

andrewredd commented 4 years ago

Is there a way to alter the pool type or the number of workers without recreating the reader? I think reader creation slows down in later iterations because the worker pool isn't getting cleared, since I'm running in Jupyter.


selitvin commented 4 years ago

Oops, I misread your code. Recreating the reader is indeed necessary when you change the pool type / worker count. I think your code is correct and the benchmark properly estimates r_io.