stas00 opened this issue 3 years ago
Benchmark results from DGX-2 node, which has 8 Micron 9200 NVMe raided into a single volume.
| (GB/s) | Peak | 1-process | multi-process |
|---|---|---|---|
| Read | 28 | 25.3 | 25.6 |
| Write | 24.8 | 19.2 | 21.7 |
The sweep results suggest that ZeRO-Infinity can be configured to do offloading at a read rate of 3GB/sec and a write rate of 2.6GB/sec. So you want to configure the asynchronous I/O module similarly to the configurations that achieve these numbers. Specifically, you want to add the following to the deepspeed config:
aio: {
"block_size": 262144,
"queue_depth": 32,
"thread_count": 1,
"single_submit": false,
"overlap_events": true
}
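For orientation, here is a minimal sketch (as a Python dict; everything outside the aio section is an illustrative assumption, not a recommendation) of where this section sits in a full DeepSpeed config:
# Sketch only: "aio" is a top-level config entry; the offload settings below
# are illustrative placeholders for an NVMe-offload setup.
ds_config = {
    "zero_optimization": {
        "stage": 3,
        "offload_optimizer": {"device": "nvme", "nvme_path": "/local_nvme"},
        "offload_param": {"device": "nvme", "nvme_path": "/local_nvme"},
    },
    "aio": {
        "block_size": 262144,
        "queue_depth": 32,
        "thread_count": 1,
        "single_submit": False,
        "overlap_events": True,
    },
}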
Unfortunately, I just noticed a bug in the write sweep script, which may lower the write perf. Basically, it is not doing the multi-process sweep because of this oversight.
I will merge #1001 to address this issue asap. This PR also avoids deleting the log folder, but rather creates an aio_perf_sweep
subfolder. This subfolder is deleted on reruns though.
The sweep results suggest that ZeRO-Infinity can be configured to do offloading at a read rate of 3GB/sec and a write rate of 2.6GB/sec. So you want to configure the asynchronous I/O module similarly to the configurations that achieve these numbers. Specifically, you want to add the following to the deepspeed config:
aio: { "block_size": 262144, "queue_depth": 32, "thread_count": 1, "single_submit": false, "overlap_events": true }
Oh, I missed that new config section!
So how does a user correlate their benchmark results to the above config that you prepared based on my benchmark results?
The description of each param in the asynchronous I/O module docs is very terse, and ideally we need a few paragraphs explaining how to choose those values - which are good defaults, and which numbers should be changed according to one's results.
Thank you!
I looked again through your paper - please correct me if I'm wrong, but it looks like we need at least 3GB/s NVME<->CPU bandwidth per GPU, so really my NVME setup is only barely good for a single GPU to meet the efficiency standard you define in the paper. Am I wrong?
Especially given your report earlier:
Benchmark results from DGX-2 node, which has 8 Micron 9200 NVMe raided into a single volume.

| (GB/s) | Peak | 1-process | multi-process |
|---|---|---|---|
| Read | 28 | 25.3 | 25.6 |
| Write | 24.8 | 19.2 | 21.7 |
So practically it's not enough to just have a single NVMe to benefit from ZeRO-Infinity if I understand it correctly.
1) I just added the section into the docs based on your feedback, so you did not miss it.
2) Sorry, I was not clear on how I came up with
aio: {
"block_size": 262144,
"queue_depth": 32,
"thread_count": 1,
"single_submit": false,
"overlap_events": true
}
thread_count=1 is a reasonable default, since this is a per-rank configuration.
The rest are based on the results of your sweep, as follows:
Your best read config was ('read', 'block', 'overlap', 1, 1, 32, 262144) = 3.168102406435208, which corresponds to single_submit=false, overlap_events=true, queue_depth=32, block_size=262144.
Your best write config was ('write', '400MB', 'block', 'overlap', 8, 1, 32, 262144) = 2.5923189261116324, corresponding to single_submit=false, overlap_events=true, queue_depth=32, block_size=262144.
Unfortunately, users don't currently have the ability to have separate read and write configurations, so they need to combine the best of both. Fortunately, in this case, and in most cases, the best read and write configurations are the same or similar.
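To make that mapping explicit, here is a small sketch (not part of DeepSpeed, just an illustration of the rule above) that turns the relevant fields of a sweep result into the aio settings:
def aio_config_from_sweep_result(submit_mode, overlap_mode, queue_depth, block_size):
    """Map sweep result fields to aio config values.

    submit_mode is 'single' or 'block'; overlap_mode is 'overlap' or 'sequential'.
    """
    return {
        "single_submit": submit_mode == "single",
        "overlap_events": overlap_mode == "overlap",
        "queue_depth": queue_depth,
        "block_size": block_size,
        "thread_count": 1,  # reasonable per-rank default
    }

# ('read', 'block', 'overlap', 1, 1, 32, 262144) maps via
# aio_config_from_sweep_result('block', 'overlap', 32, 262144) to
# single_submit=False, overlap_events=True, queue_depth=32, block_size=262144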
Another challenge that is more obvious to me now is that reasonable defaults are hard to set because of device and system differences. Prior to your experiments, block_size=1M
had consistently seemed optimal across two clusters, but in your case, block_size=256K
seems to be optimal.
Does this help?
This helps a lot, thank you!
Can we make parse_aio_stats.py
take in both read and write reports and generate the recommended config for the user?
I looked again through your paper - please correct me if I'm wrong, but it looks like we need at least 3GB/s NVME<->CPU bandwidth per GPU, so really my NVME setup is only barely good for a single GPU to meet the efficiency standard you define in the paper. Am I wrong? So practically it's not enough to just have a single NVMe to benefit from ZeRO-Infinity if I understand it correctly.
Can you clarify the efficiency standard you are referring to in the paper?
Whether or not a single 3GB/sec NVMe is sufficient for ZeRO-Infinity depends on a number of factors including:
1) What is offloaded to NVMe. For example if only optimizer state is offloaded, then only the CPU would access NVMe.
2) How well the asynchronous NVMe accesses overlap with forward and backward (and optimizer update). ZeRO-Infinity leverages asynchrony to hide/minimize the NVMe latencies.
3) Model and batch sizes, since they increase forward and backward time more than they increase optimizer and NVMe time. We have seen cases where optimizer time (which includes NVMe) is much smaller than forward/backward time.
4) If possible, scaling to more nodes, which linearly increases the aggregate NVMe bandwidth.
I looked again through your paper - please correct me if I'm wrong, but it looks like we need at least 3GB/s NVME<->CPU bandwidth per GPU, so really my NVME setup is only barely good for a single GPU to meet the efficiency standard you define in the paper. Am I wrong? So practically it's not enough to just have a single NVMe to benefit from ZeRO-Infinity if I understand it correctly.
Can you clarify the efficiency standard you are referring to in the paper?
And it was mentioned several times in previous sections.
Whether or not a single 3GB/sec NVMe is sufficient for ZeRO-Infinity depends on a number of factors including:
1. What is offloaded to NVMe. For example if only optimizer state is offloaded, then only the CPU would access NVMe.
2. How well the asynchronous NVMe accesses overlap with forward and backward (and optimizer update). ZeRO-Infinity leverages asynchrony to hide/minimize the NVMe latencies.
3. Model and batch sizes, since they increase forward and backward time more than they increase optimizer and NVMe time. We have seen cases where optimizer time (which includes NVMe) is much smaller than forward/backward time.
4. If possible, scaling to more nodes, which linearly increases the aggregate NVMe bandwidth.
Thank you for sharing these considerations / questions to ask, @tjruwase.
How do we translate these into something actionable for the user? That is, what exact steps do they follow to set up each of these values:
"offload_optimizer": {
"device": "nvme",
"nvme_path": "/local_nvme",
"pin_memory": true,
"buffer_count": 4,
"fast_init": false
},
"offload_param": {
"device": "nvme",
"nvme_path": "/local_nvme",
"pin_memory": true,
"buffer_count": 5,
"buffer_size": 1e8,
"max_in_cpu": 1e9
}
besides device, nvme_path and pin_memory, which don't need explanation.
And you addressed how to get these numbers/flags - which is great!
"aio": {
"block_size": 262144,
"queue_depth": 32,
"thread_count": 1,
"single_submit": false,
"overlap_events": true
}
With the updated benchmark I get slightly worse results than before for write (was 2.59), and it has now switched to single_submit=true as the best:
python parse_aio_stats.py --logdir write-logs/aio_perf_sweep --metric write_speed | sort -k10 -n | tail -10
('write', '400MB', 'single', 'overlap', 1, 1, 4, 1048576) = 2.542780607573091
('write', '400MB', 'block', 'overlap', 1, 1, 32, 1048576) = 2.549606370281151
('write', '400MB', 'single', 'overlap', 1, 1, 16, 524288) = 2.560568126052968
('write', '400MB', 'block', 'overlap', 1, 1, 16, 1048576) = 2.5607282070838893
('write', '400MB', 'single', 'overlap', 1, 1, 8, 524288) = 2.569547474836188
('write', '400MB', 'block', 'overlap', 1, 1, 8, 524288) = 2.577944913420765
('write', '400MB', 'block', 'overlap', 1, 1, 4, 262144) = 2.580567932852312
('write', '400MB', 'single', 'overlap', 1, 1, 4, 262144) = 2.584932481576203
('write', '400MB', 'block', 'overlap', 1, 1, 32, 262144) = 2.5864627469800396
('write', '400MB', 'single', 'overlap', 1, 1, 32, 262144) = 2.586675086832965
The read benchmark's best throughput hasn't changed, but the winning config has changed as well!
('read', 'block', 'overlap', 1, 1, 32, 262144) = 3.159043494691866
('read', 'block', 'overlap', 1, 1, 4, 1048576) = 3.1590617679099946
('read', 'block', 'overlap', 1, 1, 8, 1048576) = 3.1595369457938087
('read', 'single', 'overlap', 1, 1, 8, 262144) = 3.1604938271604937
('read', 'block', 'overlap', 1, 1, 8, 262144) = 3.1612316918107815
('read', 'block', 'overlap', 1, 1, 16, 524288) = 3.1612926877741097
('read', 'single', 'overlap', 1, 1, 8, 524288) = 3.1613170868185194
('read', 'block', 'overlap', 1, 1, 8, 524288) = 3.1615855011664906
('read', 'single', 'overlap', 1, 1, 16, 131072) = 3.1634717867128006
('read', 'single', 'overlap', 1, 1, 32, 131072) = 3.1637100215689946
So am I correct that I now need to change my config to:
"single_submit": true,
But what should I do with block_size - the read benchmark is at 131072, whereas the write one is at 262144 - how do we reconcile this?
Also, why does the read benchmark run echo 1 > /proc/sys/vm/drop_caches, but the write one doesn't? Is it not necessary because it writes and the cache is then always fully invalidated?
@tjruwase, what do you think about putting the 400MB column last or removing it completely - since it's always the same number it doesn't tell the user anything? Then it'd be easier to see the two sets aligned. Or, alternatively, have the same column for read too?
@stas00, regarding the changing best configs and results, I think since the perf differences are so small I would put it down to noise. Also, as you noticed, the top 10 (or more) perfs are very similar, so it seems to me that the most significant factors are overlap=true, queue_depth >= 4, block_size >= 256K. All of these observations strengthen your call for users to be shielded from this information overload. So I will focus on having the script generate an optimal aio config setting, while hiding the optimality logic in the script. Does that sound reasonable?
Regarding the write benchmark not dropping the caches: I did not do that because I thought caching was more significant for reads as opposed to writes. If possible, can you please check whether adding cache dropping to the write benchmark makes much difference? I will update the script in my next PR.
@stas00, regarding the changing best configs and results, I think since the perf differences are so small I would put it down to noise. Also, as you notice the top 10 (or more) perfs are very similar, so it seems to me that the most significant factors are
overlap=true, queue_depth >=4, block_size >= 256K
. All of these observations strengthen your call for users to be shielded from this information overload. So I will focus on having the script generate an optimal aio config setting, while hiding the optimality logic in the script. Does that sound reasonable?
That sounds fantastic! Thank you, @tjruwase
I'd also add that since we currently have a single config, perhaps the final script should take the output of both parsers? Or take the read and write log dirs, run the parser on them and dump the recommended config, so the user will need to run:
and of course the first 2 can also be merged into the 3rd item as the next stage. But this would be a great start.
Regarding the write benchmark not dropping the caches: I did not do that because I thought caching was more significant for reads as opposed to writes. If possible, can you please check whether adding cache dropping to the write benchmark makes much difference? I will update the script in my next PR.
I will test and get back to you.
Oh, and we probably should have the instruction be sudo ./run_read_sweep.sh input.file read-logs so the sudo prompt doesn't come as a strange surprise after the script has started.
OK, so adding cache invalidation to the write benchmark had a negligible impact - a difference on the order of 1e-2. Hence, it's probably safe to skip it (it also slows down the overall run time, I think).
('write', '400MB', 'block', 'overlap', 1, 1, 2, 524288) = 2.5284379827435344
('write', '400MB', 'single', 'overlap', 1, 1, 8, 1048576) = 2.536109060119592
('write', '400MB', 'block', 'overlap', 1, 1, 4, 524288) = 2.5423465809286765
('write', '400MB', 'block', 'overlap', 1, 1, 8, 1048576) = 2.551528129258322
('write', '400MB', 'single', 'overlap', 1, 1, 32, 524288) = 2.5574265894943213
('write', '400MB', 'single', 'overlap', 1, 1, 4, 524288) = 2.572638084590551
('write', '400MB', 'block', 'overlap', 1, 1, 32, 524288) = 2.575145071954432
('write', '400MB', 'block', 'overlap', 1, 1, 16, 262144) = 2.5767529201574613
('write', '400MB', 'single', 'overlap', 1, 1, 16, 262144) = 2.577214990758583
('write', '400MB', 'block', 'overlap', 1, 1, 8, 262144) = 2.583110769162854
Here are results from an A100 hyperplane server from Lambda, using DeepSpeed master and the instructions collected above!
Micron 7300 2TB NVMe (Max Read 3GB/s, Max Write 1.9 GB/s)
('read', 'block', 'overlap', 2, 1, 8, 524288) = 2.0683565061893523
('read', 'block', 'overlap', 2, 1, 16, 524288) = 2.0690931110843103
('read', 'block', 'overlap', 2, 1, 16, 1048576) = 2.071279891429738
('read', 'block', 'sequential', 2, 1, 16, 1048576) = 2.0751389262701263
('read', 'block', 'sequential', 2, 1, 32, 524288) = 2.0761578914021417
('read', 'block', 'sequential', 2, 1, 32, 1048576) = 2.0790717269594086
For most of the configurations the difference is negligible. Range of 0.33-2.07 GB/s
('write', '400MB', 'block', 'sequential', 1, 1, 4, 131072) = 1.9950197565890813
Again looking at the first 100ish tail outputs, the difference is negligible. Range of 1.23-1.995 GB/s
I think we can potentially reduce the grid search space to home in on suggestions initially. It might also be a good idea to compare the same configurations across our 3 environments to see what the differential is compared to the max throughput.
@SeanNaren, thanks for sharing your results. This is great since it is a different device from @stas00 and myself, so this is helping to evaluate the perf portability. Please see a few observations and questions below.
Am I reading correctly that the peak hardware write rate is 1.9GB/s, while the benchmark is reporting 1.99?
The read perf at 67% of the peak rate is much lower than what @stas00 and I have observed, and I would like some help further understanding what is going on. I will defer the questions to the end of this post.
I agree that reducing the search space is critical, as @stas00 already noted. However, your results showing sequential > overlap deviate from our observations, making it harder for me to confidently propose things to prune. So far the following seems like a reasonable reduced space: block, [sequential|overlap], queue_depth=[4,8,16], block_size>=[256K,512K,1M]. What do you guys (@stas00) think?
Regarding further investigation of the relatively poor read performance, can you help with the following:
Did the benchmark have exclusive access to the device, or is the device shared with other system components, such as the OS?
Can you run an equivalent fio experiment to the best read configuration using the following fio config file:
[global]
bs=1M
iodepth=32
direct=1
ioengine=libaio
group_reporting
time_based
runtime=120
numjobs=2
name=raw-read
rw=read
directory=/local_nvme/
thread
[job1]
filename=random_400MB.pt
3. I agree that reducing the search space is critical, as @stas00 already noted. However, your results which show sequential > overlap deviate from our observations, making it harder for me to confidently propose things to prune. So far the following seems like a reasonable reduced space: block, [sequential|overlap], queue_depth=[4,8,16], block_size>=[256K,512K,1M]. What do you guys (@stas00) think?
Perhaps we should try and find a few more devices to acquire more data before we prune? Currently Sean's device seems to be a one-off - basing a choice on 3 inputs is a bit hard and we might miss something.
@stas00, I completely agree. It would be awesome to find additional devices. Unfortunately, I don't have access to such diversity here. Also, by pruning I meant only the default settings of the search script; users will have the option of defining their own space.
I added "4. Contribute your data" instructions to the OP - let's see if we can get some contributions.
I made a call to community inviting to contribute: https://discuss.huggingface.co/t/deepspeed-zero-infinity-looking-for-nvme-device-benchmarks/5787
I need to use sparse-attn/support-latest-triton as I'm running an RTX 3090.
Setup is OK (as per below), but none of the DS optimisers will JIT compile (they all return an error at runtime re: the sm_86 arch).
Pretty sure the problem is still outstanding for this card:
--------------------------------------------------
DeepSpeed C++/CUDA extension op report
--------------------------------------------------
NOTE: Ops not installed will be just-in-time (JIT) compiled at
runtime if needed. Op compatibility means that your system
meet the required dependencies to JIT install the op.
--------------------------------------------------
JIT compiled ops requires ninja
ninja .................. [OKAY]
--------------------------------------------------
op name ................ installed .. compatible
--------------------------------------------------
cpu_adam ............... [NO] ....... [OKAY]
fused_adam ............. [NO] ....... [OKAY]
fused_lamb ............. [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
sparse_attn ............ [NO] ....... [OKAY]
transformer ............ [NO] ....... [OKAY]
utils .................. [NO] ....... [OKAY]
Thank you for wanting to help us to gather the data, @thefazzer!
I have the same card, it works without problems if you have the right torch/cuda setup.
Let's not derail this thread - can we discuss this in a separate issue here on DeepSpeed? Please post the output of python -m torch.utils.collect_env and tag @stas00 so I won't miss it - I will help you get it working and then we can get back here.
Thank you!
- Am I reading correctly that the peak hardware write rate is 1.9GB/s, while the benchmark is reporting 1.99?
I looked all over the internet for this specific NVMe, and the reported max write was 1.9GB/s and these were the numbers collated from my run! I can confirm with Lambda on this.
- I agree that reducing the search space is critical, as @stas00 already noted. However, your results which show
sequential > overlap
deviates from our observations, making it harder for me to confidently propose things to prune. So far the following seems like the reasonable reduced space:block, [sequential|overlap], queue_depth=[4,8,16], block_size>=[256K,512K,1M]
. What do you guys (@stas00) think?
I haven't looked into this, but cloud providers may provide another standard for NVMe if we're just trying to collect data points.
Regarding further investigation of the relatively poor read performance, can you help with the following:
- Did the benchmark have exclusive access to the device, or is the device shared with other system components, such as the OS?
This is shared with the OS as a root drive, there was nothing else running on the node (other than sys processes) at the time of running this benchmark.
- Can you run an equivalent fio experiment to the best read configuration using the following fio config file:
More than happy to; however, I'm unsure how to run this! Maybe I missed something, but if you could explain, I can run it!
@SeanNaren, we have seen that device sharing with OS as a root drive does impact performance, especially read, even if nothing else is running on the node.
For fio, update the directory and filename fields of the config file (e.g., config.fio), then install and run as follows:
setup: sudo apt-get install fio
run: fio config.fio
@tjruwase, could we please change the benchmark to bail out with an error message if it can't run? Otherwise it dumps the error into the benchmark log files, and it will do so for all ~400 files without the user knowing it's not really running...
cat write-logs/aio_perf_sweep/write_400MB_single_overlap_t1_p1_d1_bs128K.txt
Testing deepspeed_aio python frontend
args = Namespace(block_size='128K', gpu=False, handle=True, io_parallel=1, loops=1, overlap_events=True, queue_depth=1, read_file=None, single_submit=True, threads=1, validate=False, write_file='write-test-data/ds_aio_write_400MB.pt', write_size='400M')
tid 0: schedule = {'pre': <function pre_handle_write at 0x7f517fcfe790>, 'post': <function post_handle at 0x7f517fcfe820>, 'main': <function main_parallel_write at 0x7f517fcfe940>}
tid 0: running pre-task
tid 0: Allocate tensor of size 419430400 bytes
tid 0: Write file write-test-data/ds_aio_write_400MB.pt.0 of size 419430400 bytes from buffer on device cpu
[WARNING] async_io requires the libraries: ['libaio-dev'] but are missing.
multiprocessing.pool.RemoteTraceback:
"""
Traceback (most recent call last):
File "/home/stas/anaconda3/lib/python3.8/multiprocessing/pool.py", line 125, in worker
result = (True, func(*args, **kwds))
File "/home/stas/anaconda3/lib/python3.8/multiprocessing/pool.py", line 48, in mapstar
return list(map(*args))
File "/home/stas/hf/DeepSpeed/csrc/aio/py_test/ds_aio_handle.py", line 144, in _aio_handle_tasklet
ctxt = schedule["pre"]((args, tid))
File "/home/stas/hf/DeepSpeed/csrc/aio/py_test/ds_aio_handle.py", line 57, in pre_handle_write
ctxt = pre_handle(args, tid, False)
File "/home/stas/hf/DeepSpeed/csrc/aio/py_test/ds_aio_handle.py", line 32, in pre_handle
handle = AsyncIOBuilder().load().aio_handle(args.block_size,
File "/home/stas/hf/DeepSpeed/deepspeed/ops/op_builder/builder.py", line 215, in load
return self.jit_load(verbose)
File "/home/stas/hf/DeepSpeed/deepspeed/ops/op_builder/builder.py", line 219, in jit_load
raise RuntimeError(
RuntimeError: Unable to JIT load the async_io op due to it not being compatible due to hardware/software issue.
"""
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "./test_ds_aio.py", line 117, in <module>
main()
File "./test_ds_aio.py", line 113, in main
multiprocess_function(args, False)
File "/home/stas/hf/DeepSpeed/csrc/aio/py_test/ds_aio_handle.py", line 174, in aio_handle_multiprocessing
pool_results = p.map(_aio_handle_tasklet, pool_params)
File "/home/stas/anaconda3/lib/python3.8/multiprocessing/pool.py", line 364, in map
return self._map_async(func, iterable, mapstar, chunksize).get()
File "/home/stas/anaconda3/lib/python3.8/multiprocessing/pool.py", line 771, in get
raise self._value
RuntimeError: Unable to JIT load the async_io op due to it not being compatible due to hardware/software issue.
Also, as I mentioned earlier, the warning:
[WARNING] async_io requires the libraries: ['libaio-dev'] but are missing.
should probably be an assert
Thank you!
Hmm, let me take a look at both issues today.
Oh, and btw, my earlier suggestion to move sudo outside the shell script was a bad idea - since then it tries to run from a different environment and chances are it won't work. So sudo inside the read benchmark it is.
Perhaps we should just warn the user once with an echo explaining why we are asking for sudo, perhaps:
echo FYI: if a sudo password prompt pops up, this is to enable flushing of the io cache
@stas00, two points on error handling.
1) I am able to make the benchmark break early with something like below. Is this sufficient?
bash run_write_sweep.sh 400 /local_nvme/aio_test_write /tmp/write_400MB_1p
sync; sudo bash -c 'echo 1 > /proc/sys/vm/drop_caches'
python ./test_ds_aio.py --write_file /local_nvme/aio_test_write/ds_aio_write_400MB.pt --write_size 400M --io_parallel 1 --queue_depth 1 --block_size 128K --single_submit --overlap_events --handle --threads 1 &> /tmp/write_400MB_1p/aio_perf_sweep/write_single_overlap_t1_p1_d1_bs128K.txt
sync
Benchmark failed - for error details examine /tmp/write_400MB_1p/aio_perf_sweep/write_single_overlap_t1_p1_d1_bs128K.txt
2) Turning the missing libaio-dev into an assert is problematic because there is higher-level code (e.g., unit tests) that is able to handle this situation without failing. On the other hand, the perf script (test_ds_aio.py) is not such higher-level code, so perhaps we could make it fail gracefully in such situations. Then again, perhaps (1) should take care of this for the purpose of perf sweep runs. What do you think?
I propose a different approach: before running the benchmark, have a test script that simply validates that everything in the env is ready, e.g. it would fail if it can't build the required extensions. That test script could exercise a small part of the benchmark that needs all the ingredients to work.
So, for example, ds_report tells us that:
[WARNING] async_io requires the libraries: ['libaio-dev'] but are missing.
perhaps that same function that printed that warning can be used, except it'd assert?
And then this test script can be run at the beginning of both benchmarks.
Turning the missing libaio-dev into an assert is problematic because there are high level codes (e.g., unit tests) that are able to handle this situation without failing.
So, as mentioned above - if ds_report knows it, shouldn't that be enough to assert?
op name ................ installed .. compatible
--------------------------------------------------
[WARNING] async_io requires the libraries: ['libaio-dev'] but are missing.
async_io ............... [NO] ....... [NO]
^^^^^
In that case, it seems pretty trivial to detect. How about below:
Python 3.6.9 (default, Jul 17 2020, 12:50:27)
[GCC 8.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import deepspeed
>>> from deepspeed.ops.aio import AsyncIOBuilder
>>> AsyncIOBuilder().is_compatible()
[WARNING] async_io requires the libraries: ['libaio-dev'] but are missing.
False
>>>
Sounds perfect - so just asserting if it's False!
When libaio-dev is available, we get
Python 3.6.9 (default, Jul 17 2020, 12:50:27)
[GCC 8.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import deepspeed
>>> from deepspeed.ops.aio import AsyncIOBuilder
>>> assert AsyncIOBuilder().is_compatible()
>>> AsyncIOBuilder().is_compatible()
True
>>>
Okay, does this work:
bash run_write_sweep.sh 400 /local_nvme/aio_test_write /tmp/write_400MB_1p
[WARNING] async_io requires the libraries: ['libaio-dev'] but are missing.
Traceback (most recent call last):
File "./validate_async_io.py", line 3, in <module>
assert AsyncIOBuilder().is_compatible()
AssertionError
Failing because environment is not properly configured
The benchmark calls this validation script:
cat validate_async_io.py
import deepspeed
from deepspeed.ops.aio import AsyncIOBuilder
assert AsyncIOBuilder().is_compatible()
Excellent! Can we expand the error to explain how to correct this?
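For example, something along these lines (the exact message wording is just a suggestion):
import deepspeed
from deepspeed.ops.aio import AsyncIOBuilder

# Fail early with an actionable hint instead of a bare AssertionError.
assert AsyncIOBuilder().is_compatible(), (
    "async_io is not compatible on this system; "
    "on Ubuntu try: sudo apt-get install libaio-dev"
)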
@tjruwase, here is one more data point. Thanks to @PeterAJansen for letting me use his rig to run the benchmarks.
8.0TB Intel SSD DC P4510 Series U.2 PCIe 3.1 x4 NVMe Solid State Drive
This drive is shared with the OS.
Took forever to run:
python parse_aio_stats.py --logdir read-logs/aio_perf_sweep --metric read_speed | sort -k9 -n | tail -10
('read', 'block', 'sequential', 4, 1, 16, 524288) = 22.845748667482386
('read', 'block', 'overlap', 4, 1, 4, 1048576) = 22.91260235083524
('read', 'block', 'overlap', 8, 1, 8, 262144) = 23.101150369327243
('read', 'single', 'overlap', 8, 1, 4, 524288) = 23.179753085546434
('read', 'single', 'sequential', 4, 1, 16, 524288) = 23.269917694596533
('read', 'block', 'overlap', 4, 1, 16, 524288) = 23.735983542433285
('read', 'block', 'sequential', 4, 1, 32, 524288) = 23.98723335724639
('read', 'single', 'overlap', 4, 1, 4, 1048576) = 24.020202538512006
('read', 'block', 'overlap', 4, 1, 32, 262144) = 24.36219131844153
('read', 'single', 'overlap', 8, 1, 4, 1048576) = 24.74214349355925
python parse_aio_stats.py --logdir write-logs/aio_perf_sweep --metric write_speed | sort -k10 -n | tail -10
('write', '400MB', 'single', 'overlap', 2, 1, 4, 524288) = 1.636919406396914
('write', '400MB', 'block', 'overlap', 2, 1, 32, 1048576) = 1.6438171995812199
('write', '400MB', 'block', 'overlap', 2, 1, 1, 524288) = 1.656672266766299
('write', '400MB', 'single', 'overlap', 2, 1, 32, 1048576) = 1.6900269430564896
('write', '400MB', 'block', 'sequential', 2, 1, 32, 1048576) = 1.6957397436932433
('write', '400MB', 'single', 'sequential', 2, 1, 32, 1048576) = 1.7143051461067413
('write', '400MB', 'block', 'sequential', 2, 1, 2, 131072) = 1.728030341782759
('write', '400MB', 'block', 'sequential', 2, 1, 1, 524288) = 1.7425579142058247
('write', '400MB', 'block', 'sequential', 2, 1, 2, 262144) = 1.7809131201578299
('write', '400MB', 'block', 'sequential', 2, 1, 1, 1048576) = 1.855609271024396
The read results make no sense - 10 times faster than advertised?
The fastest raw data output is:
Testing deepspeed_aio python frontend
args = Namespace(block_size='1M', gpu=False, handle=True, io_parallel=1, loops=1, overlap_events=True, queue_depth=4, read_file='input.file', single_submit=True, threads=8, validate=False, write_file=None, write_size=None)
Using /home/stas/.cache/torch_extensions as PyTorch extensions root...
Emitting ninja build file /home/stas/.cache/torch_extensions/async_io/build.ninja...
Building extension module async_io...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Using /home/stas/.cache/torch_extensions as PyTorch extensions root...
Emitting ninja build file /home/stas/.cache/torch_extensions/async_io/build.ninja...
Building extension module async_io...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Using /home/stas/.cache/torch_extensions as PyTorch extensions root...
Loading extension module async_io...
Time to load async_io op: 1.550898790359497 seconds
tid 0: schedule = {'pre': <function pre_handle_read at 0x7fcfb7bcd280>, 'post': <function post_handle at 0x7fcfb7bcd3a0>, 'main': <function main_parallel_read at 0x7fcfb7bcd430>}
tid 0: running pre-task
tid 0: Allocate tensor of size 419430400 bytes
tid 0: Read file input.file of size 419430400 bytes from buffer on device cpu
Using /home/stas/.cache/torch_extensions as PyTorch extensions root...
Loading extension module async_io...
Time to load async_io op: 0.9619369506835938 seconds
tid 0: created deepspeed aio handle
tid 0: running main task 0
tid 0: running post-task
Using /home/stas/.cache/torch_extensions as PyTorch extensions root...
Loading extension module async_io...
Time to load async_io op: 0.36820435523986816 seconds
Using /home/stas/.cache/torch_extensions as PyTorch extensions root...
Loading extension module async_io...
Time to load async_io op: 0.8932552337646484 seconds
Loading extension module async_io...
Time to load async_io op: 0.9489636421203613 seconds
Loading extension module async_io...
Time to load async_io op: 0.8349876403808594 seconds
Using /home/stas/.cache/torch_extensions as PyTorch extensions root...
Loading extension module async_io...
Time to load async_io op: 1.5706250667572021 seconds
Using /home/stas/.cache/torch_extensions as PyTorch extensions root...
Loading extension module async_io...
Time to load async_io op: 1.5751519203186035 seconds
Task Read Latency = 0.12534189224243164 sec
Task Read Speed = 24.931808065859904 GB/sec
E2E Read Latency = 0.12630271911621094 sec
E2E Read Speed = 24.74214349355925 GB/sec
Also, I wonder if we can find a way to run the read benchmark w/o needing sudo - I think some users might not be able to acquire sudo privileges to run the benchmark. I had to ask for it to be able to run the benchmark on someone's machine.
Since we only have one config for both read+write, and write should typically be the slower one, perhaps those users could just skip the read benchmark?
@stas00, thanks for sharing this strange read perf. I suspect something is very wrong, especially since write is 24X worse than reads. Can we try the following things:
Inspect the single-process read results. These would be the ones with 1 as the first number in the results tuple. For example, the best read result is an 8-process run: ('read', 'single', 'overlap', 8, 1, 4, 1048576) = 24.74214349355925.
Run fio using the following config, since fio should be an upper-bound on performance.
setup: sudo apt-get install fio
run: fio config.fio
[global]
bs=1M
iodepth=4
direct=1
ioengine=libaio
group_reporting
time_based
runtime=120
numjobs=8
name=raw-read
rw=read
directory=/local_nvme/
thread
[job1]
filename=random_400MB.pt
Also I wonder if we can find a way to run the read benchmark w/o needing
sudo
- I think some users might not be able to acquire sudo privileges to run the benchmark. I had to ask for it to be able to run the benchmark on someone's machine.
This is a bit tricky because sudo is generally required to disable disk cache which is required to measure accurate I/O perf. We can optionally skip cache disabling if the user runs without sudo. This may not be so bad because in the real workload, disk cache would be part of the steady-state environment anyways. Could that work?
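As a sketch of what that could look like (a hypothetical helper, not current DeepSpeed code):
import subprocess

def maybe_drop_caches(use_sudo: bool) -> None:
    # Drop the page cache before a read run so results aren't inflated by caching;
    # if the user can't (or won't) use sudo, warn and carry on with cached reads.
    if not use_sudo:
        print("WARNING: page cache not dropped; read speeds may be overstated.")
        return
    subprocess.run(
        ["sudo", "bash", "-c", "sync; echo 1 > /proc/sys/vm/drop_caches"],
        check=True,
    )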
Since we only have one config for both read+write and write typically should be the slowest one, perhaps those could just skip the read benchmark?
Yes, this makes sense.
@stas00, thanks for sharing this strange read perf. I suspect something is very wrong, especially since write is 24X worse than reads. Can we try the following things:
I don't think it's the write that's the problem. The advertised speed is ~3GB/s for both, so write is in that order of magnitude, but 2x slower than advertised. Read is the one that's broken. I thought perhaps it was a RAID, but nothing indicates that, and write should then be about as fast as read.
- Inspect the single-process read results. These would be ones with
1
as the first number in the results tuple. For example, the best read result is an 8-process run('read', 'single', 'overlap', 8, 1, 4, 1048576) = 24.74214349355925
.
I'm not sure what you're asking. I pasted this exact result here https://github.com/microsoft/DeepSpeed/issues/998#issuecomment-828044091
- Run fio using the following config, since fio should be an upper-bound on performance. setup:
sudo apt-get install fio
run:fio config.fio
[global] bs=1M iodepth=4 direct=1 ioengine=libaio group_reporting time_based runtime=120 numjobs=8 name=raw-read rw=read directory=/local_nvme/ thread [job1] filename=random_400MB.pt
fio ./config.fio
job1: (g=0): rw=read, bs=(R) 1024KiB-1024KiB, (W) 1024KiB-1024KiB, (T) 1024KiB-1024KiB, ioengine=libaio, iodepth=4
...
fio-3.16
Starting 8 threads
job1: you need to specify size=
fio: pid=0, err=22/file:filesetup.c:1007, func=total_file_size, error=Invalid argument
job1: you need to specify size=
fio: pid=0, err=22/file:filesetup.c:1007, func=total_file_size, error=Invalid argument
job1: you need to specify size=
fio: pid=0, err=22/file:filesetup.c:1007, func=total_file_size, error=Invalid argument
job1: you need to specify size=
fio: pid=0, err=22/file:filesetup.c:1007, func=total_file_size, error=Invalid argument
job1: you need to specify size=
fio: pid=0, err=22/file:filesetup.c:1007, func=total_file_size, error=Invalid argument
job1: you need to specify size=
fio: pid=0, err=22/file:filesetup.c:1007, func=total_file_size, error=Invalid argument
job1: you need to specify size=
fio: pid=0, err=22/file:filesetup.c:1007, func=total_file_size, error=Invalid argument
job1: you need to specify size=
fio: pid=0, err=22/file:filesetup.c:1007, func=total_file_size, error=Invalid argument
I think it's asking for a size= entry in the config - what should I give it?
Also I wonder if we can find a way to run the read benchmark w/o needing
sudo
- I think some users might not be able to acquire sudo privileges to run the benchmark. I had to ask for it to be able to run the benchmark on someone's machine.This is a bit tricky because sudo is generally required to disable disk cache which is required to measure accurate I/O perf. We can optionally skip cache disabling if the user runs without sudo. This may not be so bad because in the real workload, disk cache would be part of the steady-state environment anyways. Could that work?
If the user can't get sudo - well, then that's the best they can do in that situation. Do we have to use such large files for a good benchmark? Would 10 or 50MB be representative enough?
If in the end we don't do 400 checks but, say, 50, then creating a new random file might be fast enough. Cache flushing takes time too.
Alternatively, could we create the data on the fly? Then it'd be even faster - just feed X bytes from /dev/urandom.
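For example (a hypothetical sketch; the file name and size are placeholders):
import os

def generate_read_file(path: str, size_mb: int = 400) -> None:
    # Roughly equivalent to: dd if=/dev/urandom of=<path> count=<size_mb> bs=1M
    chunk = 1024 * 1024
    with open(path, "wb") as f:
        for _ in range(size_mb):
            f.write(os.urandom(chunk))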
Since we only have one config for both read+write and write typically should be the slowest one, perhaps those could just skip the read benchmark?
Yes, this makes sense.
Except, as we see now, we want to check both - as is the case with the strange results on the currently discussed NVMe.
I think it's asking for a size= entry in the config - what should I give it?
My bad. I forgot to mention that you need to update filename=random_400MB.pt
in the config to refer to a valid file. Can you try doing that?
Same error.
So the solution to make fio work for read mode was to remove filename=input.file, so:
cat config-read.fio
[global]
bs=256K
iodepth=8
direct=1
ioengine=libaio
group_reporting
time_based
runtime=120
numjobs=8
name=raw-write
rw=read
size=400MB
directory=/local_nvme
thread
[job1]
and with fio the results were identical to @tjruwase's benchmark - 1.6GB/s for write and 22GB/s for read.
We suspect some other caching is happening and echo 1 > /proc/sys/vm/drop_caches
wasn't flushing the cache. So we aren't sure how to proceed in this case.
p.s. to run, just do fio config-read.fio
I have a new datapoint:
advertised sequential:
benchmark:
('read', 'block', 'overlap', 8, 1, 16, 524288) = 3.1826879182571366
('read', 'single', 'sequential', 4, 1, 32, 524288) = 3.1830187194683055
('read', 'single', 'sequential', 8, 1, 4, 524288) = 3.183159408111869
('read', 'single', 'sequential', 8, 1, 32, 524288) = 3.183571496406362
('read', 'single', 'overlap', 8, 1, 32, 524288) = 3.1842559687793655
('read', 'block', 'sequential', 8, 1, 4, 1048576) = 3.186065629957262
('read', 'single', 'sequential', 8, 1, 8, 524288) = 3.1869479852411087
('read', 'single', 'sequential', 8, 1, 16, 524288) = 3.1871184701481194
('read', 'block', 'overlap', 8, 1, 32, 524288) = 3.1883115998575535
('read', 'block', 'sequential', 8, 1, 32, 262144) = 3.1886552069504663
('write', '400MB', 'block', 'sequential', 8, 1, 1, 524288) = 2.8125252129907627
('write', '400MB', 'single', 'overlap', 4, 1, 16, 1048576) = 2.8150501405033723
('write', '400MB', 'block', 'overlap', 8, 1, 4, 262144) = 2.8169207809802828
('write', '400MB', 'single', 'overlap', 8, 1, 4, 524288) = 2.8170436816718287
('write', '400MB', 'single', 'overlap', 4, 1, 8, 262144) = 2.817757686425568
('write', '400MB', 'single', 'overlap', 4, 1, 32, 1048576) = 2.81786066862275
('write', '400MB', 'single', 'sequential', 8, 1, 1, 1048576) = 2.8204170374948143
('write', '400MB', 'block', 'sequential', 8, 1, 4, 524288) = 2.822379293823784
('write', '400MB', 'block', 'overlap', 8, 1, 8, 524288) = 2.8248610705514827
('write', '400MB', 'single', 'overlap', 8, 1, 1, 524288) = 2.8253707393748972
@stas00, I wanted to bring this back to the fore. I am working on the two major pieces of feedback listed below; please correct me if I missed anything.
1. Restricting the sweep to a reduced search space of promising configurations, such as:
{
"block_size": ["128K", "256K", "1M"],
"queue_depth": [4, 8, 32],
"threads": [8, 16],
"overlap_events": [true],
"single_submit": [false],
"io_parallel": [1]
}
2. Generating the optimal aio param section by analyzing the read and write log files.
I have finished (1), which outputs the following on -h:
usage: perf_sweep.py [-h] [--sweep_config SWEEP_CONFIG]
[--read_file READ_FILE] [--write_file WRITE_FILE]
[--write_size WRITE_SIZE] [--disable_cache] --log_dir
LOG_DIR
optional arguments:
-h, --help show this help message and exit
--sweep_config SWEEP_CONFIG
Performance sweep configuration file (json).
--read_file READ_FILE
File to read for performance measurements.
--write_file WRITE_FILE
File to write for performance measurements.
--write_size WRITE_SIZE
Number of bytes to write.
--disable_cache Disable page cache, requires sudo access.
--log_dir LOG_DIR Output directory for performance log files.
As always, I will appreciate your highly valued thoughts.
Looks great, @tjruwase - awesome work!
perf_sweep.py
generate_aio_param.py # hallucinating that would be the name of (2) above
Using aio_bench_ or some other very unique prefix naming should be pretty safe.
The read file can be generated on the fly if it's not there already, with a printout
Generating read file, which may take a minute or two: dd if=/dev/urandom of=aio-read.file count=400 bs=1M
so that the user knows the program isn't hanging...
What about sudo access? e.g.:
To disable caching and get more precise results you will be asked for your `sudo` password once.
If you don't have `sudo` access, please use `--disable_cache False` option.
Except this won't work for the flag you designed - OK, then perhaps:
optional arguments:
[...]
--no_sudo            use this when sudo isn't available, at the cost of the read
                     results being cached and reported faster than the real speed.
And then:
To disable caching and get more precise results you will be asked for your `sudo` password once.
If you don't have `sudo` access, please use the `--no_sudo` option; note that the read results will then be reported faster than they really are.
Bottom line, no sudo should be the last resort, since the read results won't be correct w/o it.
Thanks for the feedback. I am digesting it.
I think we can get good defaults for most things, but I think the user still needs to specify a folder path on the device. Even if the script generates the read file as suggested below, we still need a destination folder, right? Or what did you have in mind?
Read file can be generated on the fly if it's not there already, with a printout
Generating read file, which may take a minute or two: dd if=/dev/urandom of=aio-read.file count=400 bs=1M
Well, until now I was just copying the whole benchmark into a folder on the NVMe; if you look at the instructions in the OP, they say:
cd /somewhere/on/nvme/drive/you/want/to/test
git clone https://github.com/microsoft/DeepSpeed
cd DeepSpeed
and then it's just a local folder.
But you're correct that my approach might not be ideal, and it's best to require a single argument that is a path on an NVMe drive so it's explicit - which can be . if one followed my suggestion.
So perhaps:
perf_sweep.py /mnt/nvme1/benchmark
generate_aio_param.py /mnt/nvme1/benchmark
@stas00, based on your feedback, what do you think of a different argument approach:
--io_path <folder on device to be evaluated>
--io_size <io transfer size [default 400M]>
--read [boolean flag to measure read performance][default True]
--write [boolean flag to measure write performance][default True]
With this proposal the only required argument will be --io_path. Also, it requires only one call to perform both the read and write tests.
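For example (assuming the script keeps its current perf_sweep.py name), running the whole sweep would then be just:
python perf_sweep.py --io_path /mnt/nvme1/benchmark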
That looks perfect, @tjruwase!
Let's use this issue to gather instructions on how to profile one's CPU<->NVMe setup.
(@tjruwase and I have been editing this post)
You need to do this on every new CPU/NVMe setup in order to configure the aio param section.
The following NVMe benchmark measures end-to-end performance of how fast it can read/write CPU<->NVMe, so make sure to test this on the actual system that you intend to use it on.
For this demonstration we are going to use:
1. Preparation
You may have to also install libaio-dev if the Deepspeed NVMe driver fails to build. On Ubuntu it's just: sudo apt-get install libaio-dev
Depending on the speed of your NVMe, each benchmark could run for 30min or longer.
Important: make sure you're not doing any other I/O on the device you're testing or you will get incorrect results.
2. Run Read Benchmark
This benchmark assumes the current working directory is on the NVMe drive. If it's not, copy the csrc/aio/py_test folder to your NVMe drive and run the test there. You can, of course, use it to test non-NVMe drives (e.g. SSD).
The tail of the list should show the fastest speeds.
Here is the best result for the read benchmark:
3. Run Write Benchmark
The write report's best result:
4. Contribute your data
We need more read/write data for various devices to figure out how to make the configuration process automated.
If you're contributing your data, please post:
Important: please make sure not to do any other I/O on the device under benchmark.
5. Derive the aio params block
Now we need to figure out how to use the results of the benchmark to configure aio.
Here is the final result:
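(Presumably the block referred to here is the aio section derived earlier in this thread:)
aio: {
"block_size": 262144,
"queue_depth": 32,
"thread_count": 1,
"single_submit": false,
"overlap_events": true
}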
Most of this config block's values come from the benchmark's best results for read and write - i.e. whichever configuration gives us the highest GB/s throughput (the higher the number the better).
Schema of each line in results is as follows:
read or write | single or block event completion | overlap or sequential event submission | # processes | intra-process parallelism | queue depth | block size | GB/sec
The best read config was:
which corresponds to
single_submit=false, overlap_events=true, queue_depth=32, block_size=262144
single_submit=true if the 2nd column is single instead of block.
overlap_events=false if the 3rd column is sequential instead of overlap.
The best write config was:
which corresponds to:
single_submit=false, overlap_events=true, queue_depth=32, block_size=262144
Unfortunately, users don't currently have the ability to have separate read and write configurations, so they need to combine the best of both. Fortunately, in this case, and in most cases, the best read and write configurations are the same or similar.
Reasonable defaults are hard to set because of device and system differences. On many setups we tested, block_size=1M had consistently seemed optimal across two clusters, but in this particular setup block_size=256K seems to be optimal.
Finally, the last remaining config value, thread_count=1, is a reasonable default, since this is a per-rank configuration.
TODO: this config generation can be automated, but we need to figure out what to do if the top read and write benchmark results don't agree.
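One possible (hypothetical) approach: parse both the read and write logs into dicts keyed by (submit mode, overlap mode, queue_depth, block_size) and pick the configuration whose slower direction is fastest:
def recommend_aio_config(read_results, write_results):
    # read_results / write_results: {(submit, overlap, queue_depth, block_size): GB_per_sec}
    # Choose the config that maximizes the slower of its read and write speeds,
    # so one combined aio section serves both directions reasonably well.
    best = max(
        set(read_results) & set(write_results),
        key=lambda cfg: min(read_results[cfg], write_results[cfg]),
    )
    submit, overlap, queue_depth, block_size = best
    return {
        "single_submit": submit == "single",
        "overlap_events": overlap == "overlap",
        "queue_depth": queue_depth,
        "block_size": block_size,
        "thread_count": 1,
    }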
Sample stats: for XPG Gammix s11 pro 2tb NVMe drive with published specs of:
The benchmark records throughput for ~400 different configuration combinations
I tried my 860 Evo SSD and am getting ~0.5 GB/s read throughput - so about ~6x slower.
TODO/Questions to @tjruwase:
[ ] so we have a huge range of numbers - e.g. for read, 1 to 3GB/s - so I suppose this is the effective range depending on the kind of task, and both the low and the high ends should be considered - but how does this correlate to training? Which of the ~400 data points are most relevant? That's too much data for a user to make sense of. Perhaps it should just report the min and the max?
[ ] what are the good numbers? So that users will know whether their NVMe is fast enough. I'm thinking the numbers from the paper?