stas00 opened this issue 3 years ago
Benchmark results from DGX-2 node, which has 8 Micron 9200 NVMe raided into a single volume.
| (GB/s) | Peak | 1-process | multi-process |
|---|---|---|---|
| Read | 28 | 25.3 | 25.6 |
| Write | 24.8 | 19.2 | 21.7 |
The sweep results suggest that ZeRO-Infinity can be configured to do offloading at a read rate of 3GB/sec and a write rate of 2.6GB/sec. So you want to configure the asynchronous I/O module similarly to the configurations that achieve these numbers. Specifically, you want to add the following to the deepspeed config:
aio: {
"block_size": 262144,
"queue_depth": 32,
"thread_count": 1,
"single_submit": false,
"overlap_events": true
}
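For orientation, here is a minimal sketch (as a Python dict; everything outside the aio section is an illustrative assumption, not a recommendation) of where this section sits in a full DeepSpeed config:
# Sketch only: "aio" is a top-level config entry; the offload settings below
# are illustrative placeholders for an NVMe-offload setup.
ds_config = {
    "zero_optimization": {
        "stage": 3,
        "offload_optimizer": {"device": "nvme", "nvme_path": "/local_nvme"},
        "offload_param": {"device": "nvme", "nvme_path": "/local_nvme"},
    },
    "aio": {
        "block_size": 262144,
        "queue_depth": 32,
        "thread_count": 1,
        "single_submit": False,
        "overlap_events": True,
    },
}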
Unfortunately, I just noticed a bug in the write sweep script, which may lower the write perf. Basically, it is not doing the multi-process sweep because of this oversight.
I will merge #1001 to address this issue asap. This PR also avoids deleting the log folder, but rather creates an aio_perf_sweep
subfolder. This subfolder is deleted on reruns though.
The sweep results suggest that ZeRO-Infinity can be configured to do offloading at a read rate of 3GB/sec and a write rate of 2.6GB/sec. So you want to configure the asynchronous I/O module similarly to the configurations that achieve these numbers. Specifically, you want to add the following to the deepspeed config:
aio: { "block_size": 262144, "queue_depth": 32, "thread_count": 1, "single_submit": false, "overlap_events": true }
Oh, I missed that new config section!
So how does a user correlate their benchmark results to the above config that you prepared based on my benchmark results?
The description of each param in the asynchronous I/O module docs is very terse, and ideally we need a few paragraphs explaining how to choose those values - which are good defaults, and which numbers should be changed according to one's results.
Thank you!
I looked again through your paper - please correct me if I'm wrong, but it looks like we need at least 3GB/s NVME<->CPU bandwidth per GPU, so really my NVME setup is only barely good for a single GPU to meet the efficiency standard you define in the paper. Am I wrong?
Especially given your report earlier:
Benchmark results from DGX-2 node, which has 8 Micron 9200 NVMe raided into a single volume.

| (GB/s) | Peak | 1-process | multi-process |
|---|---|---|---|
| Read | 28 | 25.3 | 25.6 |
| Write | 24.8 | 19.2 | 21.7 |
So practically it's not enough to just have a single NVMe to benefit from ZeRO-Infinity if I understand it correctly.
1) I just added the section into the docs based on your feedback, so you did not miss it.
2) Sorry, I was not clear on how I came up with
aio: {
"block_size": 262144,
"queue_depth": 32,
"thread_count": 1,
"single_submit": false,
"overlap_events": true
}
thread_count=1 is a reasonable default, since this is a per-rank configuration.
The rest are based on the results of your sweep, as follows:
Your best read config was ('read', 'block', 'overlap', 1, 1, 32, 262144) = 3.168102406435208, which corresponds to single_submit=false, overlap_events=true, queue_depth=32, block_size=262144.
Your best write config was ('write', '400MB', 'block', 'overlap', 8, 1, 32, 262144) = 2.5923189261116324, corresponding to single_submit=false, overlap_events=true, queue_depth=32, block_size=262144.
Unfortunately, users don't currently have the ability to have separate read and write configurations, so they need to combine the best of both. Fortunately, in this case, and in most cases, the best read and write configurations are the same or similar.
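To make that mapping explicit, here is a small sketch (not part of DeepSpeed, just an illustration of the rule above) that turns the relevant fields of a sweep result into the aio settings:
def aio_config_from_sweep_result(submit_mode, overlap_mode, queue_depth, block_size):
    """Map sweep result fields to aio config values.

    submit_mode is 'single' or 'block'; overlap_mode is 'overlap' or 'sequential'.
    """
    return {
        "single_submit": submit_mode == "single",
        "overlap_events": overlap_mode == "overlap",
        "queue_depth": queue_depth,
        "block_size": block_size,
        "thread_count": 1,  # reasonable per-rank default
    }

# ('read', 'block', 'overlap', 1, 1, 32, 262144) maps via
# aio_config_from_sweep_result('block', 'overlap', 32, 262144) to
# single_submit=False, overlap_events=True, queue_depth=32, block_size=262144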
Another challenge that is more obvious to me now is that reasonable defaults are hard to set because of device and system differences. Prior to your experiments, block_size=1M
had consistently seemed optimal across two clusters, but in your case, block_size=256K
seems to be optimal.
Does this help?
This helps a lot, thank you!
Can we make parse_aio_stats.py
take in both read and write reports and generate the recommended config for the user?
I looked again through your paper - please correct me if I'm wrong, but it looks like we need at least 3GB/s NVME<->CPU bandwidth per GPU, so really my NVME setup is only barely good for a single GPU to meet the efficiency standard you define in the paper. Am I wrong? So practically it's not enough to just have a single NVMe to benefit from ZeRO-Infinity if I understand it correctly.
Can you clarify the efficiency standard you are referring to in the paper?
Whether or not a single 3GB/sec NVMe is sufficient for ZeRO-Infinity depends on a number of factors including:
1) What is offloaded to NVMe. For example if only optimizer state is offloaded, then only the CPU would access NVMe.
2) How well the asynchronous NVMe accesses overlap with forward and backward (and optimizer update). ZeRO-Infinity leverages asynchrony to hide/minimize the NVMe latencies.
3) Model and batch sizes, since they increase forward and backward time more than they increase optimizer and NVMe time. We have seen cases where optimizer time (which includes NVMe) is much smaller than forward/backward time.
4) If possible, scaling to more nodes, which linearly increases the aggregate NVMe bandwidth.
I looked again through your paper - please correct me if I'm wrong, but it looks like we need at least 3GB/s NVME<->CPU bandwidth per GPU, so really my NVME setup is only barely good for a single GPU to meet the efficiency standard you define in the paper. Am I wrong? So practically it's not enough to just have a single NVMe to benefit from ZeRO-Infinity if I understand it correctly.
Can you clarify the efficiency standard you are referring to in the paper?
And it was mentioned several times in previous sections.
Whether or not a single 3GB/sec NVMe is sufficient for ZeRO-Infinity depends on a number of factors including:
1. What is offloaded to NVMe. For example if only optimizer state is offloaded, then only the CPU would access NVMe.
2. How well the asynchronous NVMe accesses overlap with forward and backward (and optimizer update). ZeRO-Infinity leverages asynchrony to hide/minimize the NVMe latencies.
3. Model and batch sizes, since they increase forward and backward time more than they increase optimizer and NVMe time. We have seen cases where optimizer time (which includes NVMe) is much smaller than forward/backward time.
4. If possible, scaling to more nodes, which linearly increases the aggregate NVMe bandwidth.
Thank you for sharing these considerations / questions to ask, @tjruwase.
How do we translate these into something actionable for the user? That is, what exact steps do they follow to set up each of these values:
"offload_optimizer": {
"device": "nvme",
"nvme_path": "/local_nvme",
"pin_memory": true,
"buffer_count": 4,
"fast_init": false
},
"offload_param": {
"device": "nvme",
"nvme_path": "/local_nvme",
"pin_memory": true,
"buffer_count": 5,
"buffer_size": 1e8,
"max_in_cpu": 1e9
}
besides device, nvme_path and pin_memory, which don't need explanation.
And you addressed how to get these numbers/flags - which is great!
"aio": {
"block_size": 262144,
"queue_depth": 32,
"thread_count": 1,
"single_submit": false,
"overlap_events": true
}
With the updated benchmark I get slightly worse results than before for write (was 2.59), and it has now switched to single_submit=true as the best:
python parse_aio_stats.py --logdir write-logs/aio_perf_sweep --metric write_speed | sort -k10 -n | tail -10
('write', '400MB', 'single', 'overlap', 1, 1, 4, 1048576) = 2.542780607573091
('write', '400MB', 'block', 'overlap', 1, 1, 32, 1048576) = 2.549606370281151
('write', '400MB', 'single', 'overlap', 1, 1, 16, 524288) = 2.560568126052968
('write', '400MB', 'block', 'overlap', 1, 1, 16, 1048576) = 2.5607282070838893
('write', '400MB', 'single', 'overlap', 1, 1, 8, 524288) = 2.569547474836188
('write', '400MB', 'block', 'overlap', 1, 1, 8, 524288) = 2.577944913420765
('write', '400MB', 'block', 'overlap', 1, 1, 4, 262144) = 2.580567932852312
('write', '400MB', 'single', 'overlap', 1, 1, 4, 262144) = 2.584932481576203
('write', '400MB', 'block', 'overlap', 1, 1, 32, 262144) = 2.5864627469800396
('write', '400MB', 'single', 'overlap', 1, 1, 32, 262144) = 2.586675086832965
The read benchmark's best throughput hasn't changed, but the winning config has changed as well!
('read', 'block', 'overlap', 1, 1, 32, 262144) = 3.159043494691866
('read', 'block', 'overlap', 1, 1, 4, 1048576) = 3.1590617679099946
('read', 'block', 'overlap', 1, 1, 8, 1048576) = 3.1595369457938087
('read', 'single', 'overlap', 1, 1, 8, 262144) = 3.1604938271604937
('read', 'block', 'overlap', 1, 1, 8, 262144) = 3.1612316918107815
('read', 'block', 'overlap', 1, 1, 16, 524288) = 3.1612926877741097
('read', 'single', 'overlap', 1, 1, 8, 524288) = 3.1613170868185194
('read', 'block', 'overlap', 1, 1, 8, 524288) = 3.1615855011664906
('read', 'single', 'overlap', 1, 1, 16, 131072) = 3.1634717867128006
('read', 'single', 'overlap', 1, 1, 32, 131072) = 3.1637100215689946
So am I correct that I now need to change my config to:
"single_submit": true,
But what should I do with block_size - the read benchmark is at 131072, whereas the write one is at 262144 - how do we reconcile this?
Also, why does the read benchmark run echo 1 > /proc/sys/vm/drop_caches, but the write one doesn't? Is it not necessary because it writes and the cache is then always fully invalidated?
@tjruwase, what do you think about putting the 400MB column last or removing it completely - since it's always the same number it doesn't tell the user anything? Then it'd be easier to see the two sets aligned. Or, alternatively, have the same column for read too?
@stas00, regarding the changing best configs and results, I think since the perf differences are so small I would put it down to noise. Also, as you noticed, the top 10 (or more) perfs are very similar, so it seems to me that the most significant factors are overlap=true, queue_depth >= 4, block_size >= 256K. All of these observations strengthen your call for users to be shielded from this information overload. So I will focus on having the script generate an optimal aio config setting, while hiding the optimality logic in the script. Does that sound reasonable?
Regarding the write benchmark not dropping the caches: I did not do that because I thought caching was more significant for reads as opposed to writes. If possible, can you please check whether adding cache dropping to the write benchmark makes much difference? I will update the script in my next PR.
@stas00, regarding the changing best configs and results, I think since the perf differences are so small I would put it down to noise. Also, as you notice the top 10 (or more) perfs are very similar, so it seems to me that the most significant factors are
overlap=true, queue_depth >=4, block_size >= 256K
. All of these observations strengthen your call for users to be shielded from this information overload. So I will focus on having the script generate an optimal aio config setting, while hiding the optimality logic in the script. Does that sound reasonable?
That sounds fantastic! Thank you, @tjruwase
I'd also add that since we currently have a single config, perhaps the final script should take the output of both parsers? Or take the read and write log dirs, run the parser on them and dump the recommended config, so the user will need to run:
and of course the first 2 can also be merged into the 3rd item as the next stage. But this would be a great start.
Regarding the write benchmark not dropping the caches: I did not do that because I thought caching was more significant for reads as opposed to writes. If possible, can you please check whether adding cache dropping to the write benchmark makes much difference? I will update the script in my next PR.
I will test and get back to you.
Oh, and we probably should have the instruction be sudo ./run_read_sweep.sh input.file read-logs so the sudo prompt doesn't come as a strange surprise after the script has started.
OK, so adding cache invalidation to the write benchmark had a negligible impact - a difference on the order of 1e-2. Hence, it's probably safe to skip it (it also slows down the overall run time, I think).
('write', '400MB', 'block', 'overlap', 1, 1, 2, 524288) = 2.5284379827435344
('write', '400MB', 'single', 'overlap', 1, 1, 8, 1048576) = 2.536109060119592
('write', '400MB', 'block', 'overlap', 1, 1, 4, 524288) = 2.5423465809286765
('write', '400MB', 'block', 'overlap', 1, 1, 8, 1048576) = 2.551528129258322
('write', '400MB', 'single', 'overlap', 1, 1, 32, 524288) = 2.5574265894943213
('write', '400MB', 'single', 'overlap', 1, 1, 4, 524288) = 2.572638084590551
('write', '400MB', 'block', 'overlap', 1, 1, 32, 524288) = 2.575145071954432
('write', '400MB', 'block', 'overlap', 1, 1, 16, 262144) = 2.5767529201574613
('write', '400MB', 'single', 'overlap', 1, 1, 16, 262144) = 2.577214990758583
('write', '400MB', 'block', 'overlap', 1, 1, 8, 262144) = 2.583110769162854
Here are results from an A100 hyperplane server from Lambda, using DeepSpeed master and the instructions collected above!
Micron 7300 2TB NVMe (Max Read 3GB/s, Max Write 1.9 GB/s)
('read', 'block', 'overlap', 2, 1, 8, 524288) = 2.0683565061893523
('read', 'block', 'overlap', 2, 1, 16, 524288) = 2.0690931110843103
('read', 'block', 'overlap', 2, 1, 16, 1048576) = 2.071279891429738
('read', 'block', 'sequential', 2, 1, 16, 1048576) = 2.0751389262701263
('read', 'block', 'sequential', 2, 1, 32, 524288) = 2.0761578914021417
('read', 'block', 'sequential', 2, 1, 32, 1048576) = 2.0790717269594086
For most of the configurations the difference is negligible. Range of 0.33-2.07 GB/s
('write', '400MB', 'block', 'sequential', 1, 1, 4, 131072) = 1.9950197565890813
Again looking at the first 100ish tail outputs, the difference is negligible. Range of 1.23-1.995 GB/s
I think we can potentially reduce the grid search space to home in on suggestions initially. It might also be a good idea to compare the same configurations across our 3 environments to see what the differential is compared to the max throughput.
@SeanNaren, thanks for sharing your results. This is great since it is a different device from @stas00 and myself, so this is helping to evaluate the perf portability. Please see a few observations and questions below.
Am I reading correctly that the peak hardware write rate is 1.9GB/s, while the benchmark is reporting 1.99?
The read perf at 67% of the peak rate is much lower than what @stas00 and I have observed, and I would like some help further understanding what is going on. I will defer the questions to the end of this post.
I agree that reducing the search space is critical, as @stas00 already noted. However, your results showing sequential > overlap deviate from our observations, making it harder for me to confidently propose things to prune. So far the following seems like a reasonable reduced space: block, [sequential|overlap], queue_depth=[4,8,16], block_size>=[256K,512K,1M]. What do you guys (@stas00) think?
Regarding further investigation of the relatively poor read performance, can you help with the following:
Did the benchmark have exclusive access to the device, or is the device shared with other system components, such as the OS?
Can you run an equivalent fio experiment to the best read configuration using the following fio config file:
[global]
bs=1M
iodepth=32
direct=1
ioengine=libaio
group_reporting
time_based
runtime=120
numjobs=2
name=raw-read
rw=read
directory=/local_nvme/
thread
[job1]
filename=random_400MB.pt
3. I agree that reducing the search space is critical, as @stas00 already noted. However, your results which show sequential > overlap deviate from our observations, making it harder for me to confidently propose things to prune. So far the following seems like a reasonable reduced space: block, [sequential|overlap], queue_depth=[4,8,16], block_size>=[256K,512K,1M]. What do you guys (@stas00) think?
Perhaps we should try and find a few more devices to acquire more data before we prune? Currently Sean's device seems to be a one-off - basing a choice on 3 inputs is a bit hard and we might miss something.
@stas00, I completely agree. It would be awesome to find additional devices. Unfortunately, I don't have access to such diversity here. Also, by pruning I meant only the default settings of the search script; users will have the option of defining their own space.
I added "4. Contribute your data" instructions to the OP - let's see if we can get some contributions.
I made a call to community inviting to contribute: https://discuss.huggingface.co/t/deepspeed-zero-infinity-looking-for-nvme-device-benchmarks/5787
I need to use sparse-attn/support-latest-triton as I'm running an RTX 3090.
Setup is OK (as per below), but none of the DS optimisers will JIT compile (they all return an error at runtime re: the sm_86 arch).
Pretty sure the problem is still outstanding for this card:
--------------------------------------------------
DeepSpeed C++/CUDA extension op report
--------------------------------------------------
NOTE: Ops not installed will be just-in-time (JIT) compiled at
runtime if needed. Op compatibility means that your system
meet the required dependencies to JIT install the op.
--------------------------------------------------
JIT compiled ops requires ninja
ninja .................. [OKAY]
--------------------------------------------------
op name ................ installed .. compatible
--------------------------------------------------
cpu_adam ............... [NO] ....... [OKAY]
fused_adam ............. [NO] ....... [OKAY]
fused_lamb ............. [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
sparse_attn ............ [NO] ....... [OKAY]
transformer ............ [NO] ....... [OKAY]
utils .................. [NO] ....... [OKAY]
Thank you for wanting to help us to gather the data, @thefazzer!
I have the same card, it works without problems if you have the right torch/cuda setup.
Let's not derail this thread - can we discuss this in a separate issue here on DeepSpeed? Please post the output of python -m torch.utils.collect_env and tag @stas00 so I won't miss it - I will help you get it working and then we can get back here.
Thank you!
- Am I reading correctly that the peak hardware write rate is 1.9GB/s, while the benchmark is reporting 1.99?
I looked all over the internet for this specific NVMe, and the reported max write was 1.9GB/s and these were the numbers collated from my run! I can confirm with Lambda on this.
- I agree that reducing the search space is critical, as @stas00 already noted. However, your results which show
sequential > overlap
deviates from our observations, making it harder for me to confidently propose things to prune. So far the following seems like the reasonable reduced space:block, [sequential|overlap], queue_depth=[4,8,16], block_size>=[256K,512K,1M]
. What do you guys (@stas00) think?
I haven't looked into this, but cloud providers may provide another standard for NVMe if we're just trying to collect data points.
Regarding further investigation of the relatively poor read performance, can you help with the following:
- Did the benchmark have exclusive access to the device, or is the device shared with other system components, such as the OS?
This is shared with the OS as a root drive, there was nothing else running on the node (other than sys processes) at the time of running this benchmark.
- Can you run an equivalent fio experiment to the best read configuration using the following fio config file:
More than happy to; however, I'm unsure how to run this! Maybe I missed something, but if you could explain, I can run it!
@SeanNaren, we have seen that device sharing with OS as a root drive does impact performance, especially read, even if nothing else is running on the node.
For fio, update the directory and filename fields of the config file (e.g., config.fio), then install and run as follows:
setup: sudo apt-get install fio
run: fio config.fio
@tjruwase, could we please change the benchmark to bail out with an error message if it can't run? Otherwise it dumps the error into the benchmark log files, and it will do so for all ~400 files without the user knowing it's not really running...
cat write-logs/aio_perf_sweep/write_400MB_single_overlap_t1_p1_d1_bs128K.txt
Testing deepspeed_aio python frontend
args = Namespace(block_size='128K', gpu=False, handle=True, io_parallel=1, loops=1, overlap_events=True, queue_depth=1, read_file=None, single_submit=True, threads=1, validate=False, write_file='write-test-data/ds_aio_write_400MB.pt', write_size='400M')
tid 0: schedule = {'pre': <function pre_handle_write at 0x7f517fcfe790>, 'post': <function post_handle at 0x7f517fcfe820>, 'main': <function main_parallel_write at 0x7f517fcfe940>}
tid 0: running pre-task
tid 0: Allocate tensor of size 419430400 bytes
tid 0: Write file write-test-data/ds_aio_write_400MB.pt.0 of size 419430400 bytes from buffer on device cpu
[WARNING] async_io requires the libraries: ['libaio-dev'] but are missing.
multiprocessing.pool.RemoteTraceback:
"""
Traceback (most recent call last):
File "/home/stas/anaconda3/lib/python3.8/multiprocessing/pool.py", line 125, in worker
result = (True, func(*args, **kwds))
File "/home/stas/anaconda3/lib/python3.8/multiprocessing/pool.py", line 48, in mapstar
return list(map(*args))
File "/home/stas/hf/DeepSpeed/csrc/aio/py_test/ds_aio_handle.py", line 144, in _aio_handle_tasklet
ctxt = schedule["pre"]((args, tid))
File "/home/stas/hf/DeepSpeed/csrc/aio/py_test/ds_aio_handle.py", line 57, in pre_handle_write
ctxt = pre_handle(args, tid, False)
File "/home/stas/hf/DeepSpeed/csrc/aio/py_test/ds_aio_handle.py", line 32, in pre_handle
handle = AsyncIOBuilder().load().aio_handle(args.block_size,
File "/home/stas/hf/DeepSpeed/deepspeed/ops/op_builder/builder.py", line 215, in load
return self.jit_load(verbose)
File "/home/stas/hf/DeepSpeed/deepspeed/ops/op_builder/builder.py", line 219, in jit_load
raise RuntimeError(
RuntimeError: Unable to JIT load the async_io op due to it not being compatible due to hardware/software issue.
"""
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "./test_ds_aio.py", line 117, in <module>
main()
File "./test_ds_aio.py", line 113, in main
multiprocess_function(args, False)
File "/home/stas/hf/DeepSpeed/csrc/aio/py_test/ds_aio_handle.py", line 174, in aio_handle_multiprocessing
pool_results = p.map(_aio_handle_tasklet, pool_params)
File "/home/stas/anaconda3/lib/python3.8/multiprocessing/pool.py", line 364, in map
return self._map_async(func, iterable, mapstar, chunksize).get()
File "/home/stas/anaconda3/lib/python3.8/multiprocessing/pool.py", line 771, in get
raise self._value
RuntimeError: Unable to JIT load the async_io op due to it not being compatible due to hardware/software issue.
Also, as I mentioned earlier, the warning:
[WARNING] async_io requires the libraries: ['libaio-dev'] but are missing.
should probably be an assert
Thank you!
Hmm, let me take a look at both issues today.
Oh, and btw, my earlier suggestion to move sudo outside the shell script was a bad idea - since then it tries to run from a different environment and chances are it won't work. So sudo inside the read benchmark it is.
Perhaps we should just warn the user once with an echo explaining why we are asking for sudo, perhaps:
echo FYI: if a sudo password prompt pops up, this is to enable flushing of the io cache
@stas00, two points on error handling.
1) I am able to make the benchmark break early with something like below. Is this sufficient?
bash run_write_sweep.sh 400 /local_nvme/aio_test_write /tmp/write_400MB_1p
sync; sudo bash -c 'echo 1 > /proc/sys/vm/drop_caches'
python ./test_ds_aio.py --write_file /local_nvme/aio_test_write/ds_aio_write_400MB.pt --write_size 400M --io_parallel 1 --queue_depth 1 --block_size 128K --single_submit --overlap_events --handle --threads 1 &> /tmp/write_400MB_1p/aio_perf_sweep/write_single_overlap_t1_p1_d1_bs128K.txt
sync
Benchmark failed - for error details examine /tmp/write_400MB_1p/aio_perf_sweep/write_single_overlap_t1_p1_d1_bs128K.txt
2) Turning the missing libaio-dev into an assert is problematic because there is higher-level code (e.g., unit tests) that is able to handle this situation without failing. On the other hand, the perf script (test_ds_aio.py) is not such higher-level code, so perhaps we could make it fail gracefully in such situations. Then again, perhaps (1) should take care of this for the purpose of perf sweep runs. What do you think?
I propose a different approach: before running the benchmark, have a test script that simply validates that everything in the env is ready, e.g. it would fail if it can't build the required extensions. That test script could exercise a small part of the benchmark that needs all the ingredients to work.
So, for example, ds_report tells us that:
[WARNING] async_io requires the libraries: ['libaio-dev'] but are missing.
perhaps that same function that printed that warning can be used, except it'd assert?
And then this test script can be run at the beginning of both benchmarks.
Turning the missing libaio-dev into an assert is problematic because there are high level codes (e.g., unit tests) that are able to handle this situation without failing.
So, as mentioned above - if ds_report knows it, shouldn't that be enough to assert?
op name ................ installed .. compatible
--------------------------------------------------
[WARNING] async_io requires the libraries: ['libaio-dev'] but are missing.
async_io ............... [NO] ....... [NO]
^^^^^
In that case, it seems pretty trivial to detect. How about below:
Python 3.6.9 (default, Jul 17 2020, 12:50:27)
[GCC 8.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import deepspeed
>>> from deepspeed.ops.aio import AsyncIOBuilder
>>> AsyncIOBuilder().is_compatible()
[WARNING] async_io requires the libraries: ['libaio-dev'] but are missing.
False
>>>
Sounds perfect - so just asserting if it's False!
When libaio-dev is available, we get
Python 3.6.9 (default, Jul 17 2020, 12:50:27)
[GCC 8.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import deepspeed
>>> from deepspeed.ops.aio import AsyncIOBuilder
>>> assert AsyncIOBuilder().is_compatible()
>>> AsyncIOBuilder().is_compatible()
True
>>>
Okay, does this work:
bash run_write_sweep.sh 400 /local_nvme/aio_test_write /tmp/write_400MB_1p
[WARNING] async_io requires the libraries: ['libaio-dev'] but are missing.
Traceback (most recent call last):
File "./validate_async_io.py", line 3, in <module>
assert AsyncIOBuilder().is_compatible()
AssertionError
Failing because environment is not properly configured
The benchmark calls this validation script:
cat validate_async_io.py
import deepspeed
from deepspeed.ops.aio import AsyncIOBuilder
assert AsyncIOBuilder().is_compatible()
Excellent! Can we expand the error to explain how to correct this?
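For example, something along these lines (the exact message wording is just a suggestion):
import deepspeed
from deepspeed.ops.aio import AsyncIOBuilder

# Fail early with an actionable hint instead of a bare AssertionError.
assert AsyncIOBuilder().is_compatible(), (
    "async_io is not compatible on this system; "
    "on Ubuntu try: sudo apt-get install libaio-dev"
)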
@tjruwase, here is one more data point. Thanks to @PeterAJansen for letting me use his rig to run the benchmarks.
8.0TB Intel SSD DC P4510 Series U.2 PCIe 3.1 x4 NVMe Solid State Drive
This drive is shared with the OS.
Took forever to run:
python parse_aio_stats.py --logdir read-logs/aio_perf_sweep --metric read_speed | sort -k9 -n | tail -10
('read', 'block', 'sequential', 4, 1, 16, 524288) = 22.845748667482386
('read', 'block', 'overlap', 4, 1, 4, 1048576) = 22.91260235083524
('read', 'block', 'overlap', 8, 1, 8, 262144) = 23.101150369327243
('read', 'single', 'overlap', 8, 1, 4, 524288) = 23.179753085546434
('read', 'single', 'sequential', 4, 1, 16, 524288) = 23.269917694596533
('read', 'block', 'overlap', 4, 1, 16, 524288) = 23.735983542433285
('read', 'block', 'sequential', 4, 1, 32, 524288) = 23.98723335724639
('read', 'single', 'overlap', 4, 1, 4, 1048576) = 24.020202538512006
('read', 'block', 'overlap', 4, 1, 32, 262144) = 24.36219131844153
('read', 'single', 'overlap', 8, 1, 4, 1048576) = 24.74214349355925
python parse_aio_stats.py --logdir write-logs/aio_perf_sweep --metric write_speed | sort -k10 -n | tail -10
('write', '400MB', 'single', 'overlap', 2, 1, 4, 524288) = 1.636919406396914
('write', '400MB', 'block', 'overlap', 2, 1, 32, 1048576) = 1.6438171995812199
('write', '400MB', 'block', 'overlap', 2, 1, 1, 524288) = 1.656672266766299
('write', '400MB', 'single', 'overlap', 2, 1, 32, 1048576) = 1.6900269430564896
('write', '400MB', 'block', 'sequential', 2, 1, 32, 1048576) = 1.6957397436932433
('write', '400MB', 'single', 'sequential', 2, 1, 32, 1048576) = 1.7143051461067413
('write', '400MB', 'block', 'sequential', 2, 1, 2, 131072) = 1.728030341782759
('write', '400MB', 'block', 'sequential', 2, 1, 1, 524288) = 1.7425579142058247
('write', '400MB', 'block', 'sequential', 2, 1, 2, 262144) = 1.7809131201578299
('write', '400MB', 'block', 'sequential', 2, 1, 1, 1048576) = 1.855609271024396
The read results make no sense - 10 times faster than advertised?
The fastest raw data output is:
Testing deepspeed_aio python frontend
args = Namespace(block_size='1M', gpu=False, handle=True, io_parallel=1, loops=1, overlap_events=True, queue_depth=4, read_file='input.file', single_submit=True, threads=8, validate=False, write_file=None, write_size=None)
Using /home/stas/.cache/torch_extensions as PyTorch extensions root...
Emitting ninja build file /home/stas/.cache/torch_extensions/async_io/build.ninja...
Building extension module async_io...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Using /home/stas/.cache/torch_extensions as PyTorch extensions root...
Emitting ninja build file /home/stas/.cache/torch_extensions/async_io/build.ninja...
Building extension module async_io...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Using /home/stas/.cache/torch_extensions as PyTorch extensions root...
Loading extension module async_io...
Time to load async_io op: 1.550898790359497 seconds
tid 0: schedule = {'pre': <function pre_handle_read at 0x7fcfb7bcd280>, 'post': <function post_handle at 0x7fcfb7bcd3a0>, 'main': <function main_parallel_read at 0x7fcfb7bcd430>}
tid 0: running pre-task
tid 0: Allocate tensor of size 419430400 bytes
tid 0: Read file input.file of size 419430400 bytes from buffer on device cpu
Using /home/stas/.cache/torch_extensions as PyTorch extensions root...
Loading extension module async_io...
Time to load async_io op: 0.9619369506835938 seconds
tid 0: created deepspeed aio handle
tid 0: running main task 0
tid 0: running post-task
Using /home/stas/.cache/torch_extensions as PyTorch extensions root...
Loading extension module async_io...
Time to load async_io op: 0.36820435523986816 seconds
Using /home/stas/.cache/torch_extensions as PyTorch extensions root...
Loading extension module async_io...
Time to load async_io op: 0.8932552337646484 seconds
Loading extension module async_io...
Time to load async_io op: 0.9489636421203613 seconds
Loading extension module async_io...
Time to load async_io op: 0.8349876403808594 seconds
Using /home/stas/.cache/torch_extensions as PyTorch extensions root...
Loading extension module async_io...
Time to load async_io op: 1.5706250667572021 seconds
Using /home/stas/.cache/torch_extensions as PyTorch extensions root...
Loading extension module async_io...
Time to load async_io op: 1.5751519203186035 seconds
Task Read Latency = 0.12534189224243164 sec
Task Read Speed = 24.931808065859904 GB/sec
E2E Read Latency = 0.12630271911621094 sec
E2E Read Speed = 24.74214349355925 GB/sec
Also, I wonder if we can find a way to run the read benchmark w/o needing sudo - I think some users might not be able to acquire sudo privileges to run the benchmark. I had to ask for it to be able to run the benchmark on someone's machine.
Since we only have one config for both read+write, and write should typically be the slower one, perhaps those users could just skip the read benchmark?
@stas00, thanks for sharing this strange read perf. I suspect something is very wrong, especially since write is 24X worse than reads. Can we try the following things:
Inspect the single-process read results. These would be the ones with 1 as the first number in the results tuple. For example, the best read result is an 8-process run: ('read', 'single', 'overlap', 8, 1, 4, 1048576) = 24.74214349355925.
Run fio using the following config, since fio should be an upper-bound on performance.
setup: sudo apt-get install fio
run: fio config.fio
[global]
bs=1M
iodepth=4
direct=1
ioengine=libaio
group_reporting
time_based
runtime=120
numjobs=8
name=raw-read
rw=read
directory=/local_nvme/
thread
[job1]
filename=random_400MB.pt
Also I wonder if we can find a way to run the read benchmark w/o needing
sudo
- I think some users might not be able to acquire sudo privileges to run the benchmark. I had to ask for it to be able to run the benchmark on someone's machine.
This is a bit tricky because sudo is generally required to disable disk cache which is required to measure accurate I/O perf. We can optionally skip cache disabling if the user runs without sudo. This may not be so bad because in the real workload, disk cache would be part of the steady-state environment anyways. Could that work?
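As a sketch of what that could look like (a hypothetical helper, not current DeepSpeed code):
import subprocess

def maybe_drop_caches(use_sudo: bool) -> None:
    # Drop the page cache before a read run so results aren't inflated by caching;
    # if the user can't (or won't) use sudo, warn and carry on with cached reads.
    if not use_sudo:
        print("WARNING: page cache not dropped; read speeds may be overstated.")
        return
    subprocess.run(
        ["sudo", "bash", "-c", "sync; echo 1 > /proc/sys/vm/drop_caches"],
        check=True,
    )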
Since we only have one config for both read+write and write typically should be the slowest one, perhaps those could just skip the read benchmark?
Yes, this makes sense.
@stas00, thanks for sharing this strange read perf. I suspect something is very wrong, especially since write is 24X worse than reads. Can we try the following things:
I don't think it's the write that's the problem. The advertised speed is ~3GB/s for both, so write is in that order of magnitude, but 2x slower than advertised. Read is the one that's broken. I thought perhaps it was a RAID, but nothing indicates that, and write should then be about as fast as read.
- Inspect the single-process read results. These would be ones with
1
as the first number in the results tuple. For example, the best read result is an 8-process run('read', 'single', 'overlap', 8, 1, 4, 1048576) = 24.74214349355925
.
I'm not sure what you're asking. I pasted this exact result here https://github.com/microsoft/DeepSpeed/issues/998#issuecomment-828044091
- Run fio using the following config, since fio should be an upper-bound on performance. setup:
sudo apt-get install fio
run:fio config.fio
[global] bs=1M iodepth=4 direct=1 ioengine=libaio group_reporting time_based runtime=120 numjobs=8 name=raw-read rw=read directory=/local_nvme/ thread [job1] filename=random_400MB.pt
fio ./config.fio
job1: (g=0): rw=read, bs=(R) 1024KiB-1024KiB, (W) 1024KiB-1024KiB, (T) 1024KiB-1024KiB, ioengine=libaio, iodepth=4
...
fio-3.16
Starting 8 threads
job1: you need to specify size=
fio: pid=0, err=22/file:filesetup.c:1007, func=total_file_size, error=Invalid argument
job1: you need to specify size=
fio: pid=0, err=22/file:filesetup.c:1007, func=total_file_size, error=Invalid argument
job1: you need to specify size=
fio: pid=0, err=22/file:filesetup.c:1007, func=total_file_size, error=Invalid argument
job1: you need to specify size=
fio: pid=0, err=22/file:filesetup.c:1007, func=total_file_size, error=Invalid argument
job1: you need to specify size=
fio: pid=0, err=22/file:filesetup.c:1007, func=total_file_size, error=Invalid argument
job1: you need to specify size=
fio: pid=0, err=22/file:filesetup.c:1007, func=total_file_size, error=Invalid argument
job1: you need to specify size=
fio: pid=0, err=22/file:filesetup.c:1007, func=total_file_size, error=Invalid argument
job1: you need to specify size=
fio: pid=0, err=22/file:filesetup.c:1007, func=total_file_size, error=Invalid argument
I think it's asking for a size= entry in the config - what should I give it?
Also I wonder if we can find a way to run the read benchmark w/o needing
sudo
- I think some users might not be able to acquire sudo privileges to run the benchmark. I had to ask for it to be able to run the benchmark on someone's machine.This is a bit tricky because sudo is generally required to disable disk cache which is required to measure accurate I/O perf. We can optionally skip cache disabling if the user runs without sudo. This may not be so bad because in the real workload, disk cache would be part of the steady-state environment anyways. Could that work?
If the user can't get sudo - well, then that's the best they can do in that situation. Do we have to use such large files for a good benchmark? Would 10 or 50MB be representative enough?
If in the end we don't do 400 checks but, say, 50, then creating a new random file might be fast enough. Cache flushing takes time too.
Alternatively, could we create the data on the fly? Then it'd be even faster - just feed X bytes from /dev/urandom.
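For example (a hypothetical sketch; the file name and size are placeholders):
import os

def generate_read_file(path: str, size_mb: int = 400) -> None:
    # Roughly equivalent to: dd if=/dev/urandom of=<path> count=<size_mb> bs=1M
    chunk = 1024 * 1024
    with open(path, "wb") as f:
        for _ in range(size_mb):
            f.write(os.urandom(chunk))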
Since we only have one config for both read+write and write typically should be the slowest one, perhaps those could just skip the read benchmark?
Yes, this makes sense.
Except, as we see now, we want to check both - as is the case with the strange results on the currently discussed NVMe.
I think it's asking for a size= entry in the config - what should I give it?
My bad. I forgot to mention that you need to update filename=random_400MB.pt
in the config to refer to a valid file. Can you try doing that?
Same error.
So the solution to make fio work for read mode was to remove filename=input.file, so:
cat config-read.fio
[global]
bs=256K
iodepth=8
direct=1
ioengine=libaio
group_reporting
time_based
runtime=120
numjobs=8
name=raw-write
rw=read
size=400MB
directory=/local_nvme
thread
[job1]
and with fio the results were identical to @tjruwase's benchmark - 1.6GB/s for write and 22GB/s for read.
We suspect some other caching is happening and echo 1 > /proc/sys/vm/drop_caches
wasn't flushing the cache. So we aren't sure how to proceed in this case.
p.s. to run, just do fio config-read.fio
I have a new datapoint:
advertised sequential:
benchmark:
('read', 'block', 'overlap', 8, 1, 16, 524288) = 3.1826879182571366
('read', 'single', 'sequential', 4, 1, 32, 524288) = 3.1830187194683055
('read', 'single', 'sequential', 8, 1, 4, 524288) = 3.183159408111869
('read', 'single', 'sequential', 8, 1, 32, 524288) = 3.183571496406362
('read', 'single', 'overlap', 8, 1, 32, 524288) = 3.1842559687793655
('read', 'block', 'sequential', 8, 1, 4, 1048576) = 3.186065629957262
('read', 'single', 'sequential', 8, 1, 8, 524288) = 3.1869479852411087
('read', 'single', 'sequential', 8, 1, 16, 524288) = 3.1871184701481194
('read', 'block', 'overlap', 8, 1, 32, 524288) = 3.1883115998575535
('read', 'block', 'sequential', 8, 1, 32, 262144) = 3.1886552069504663
('write', '400MB', 'block', 'sequential', 8, 1, 1, 524288) = 2.8125252129907627
('write', '400MB', 'single', 'overlap', 4, 1, 16, 1048576) = 2.8150501405033723
('write', '400MB', 'block', 'overlap', 8, 1, 4, 262144) = 2.8169207809802828
('write', '400MB', 'single', 'overlap', 8, 1, 4, 524288) = 2.8170436816718287
('write', '400MB', 'single', 'overlap', 4, 1, 8, 262144) = 2.817757686425568
('write', '400MB', 'single', 'overlap', 4, 1, 32, 1048576) = 2.81786066862275
('write', '400MB', 'single', 'sequential', 8, 1, 1, 1048576) = 2.8204170374948143
('write', '400MB', 'block', 'sequential', 8, 1, 4, 524288) = 2.822379293823784
('write', '400MB', 'block', 'overlap', 8, 1, 8, 524288) = 2.8248610705514827
('write', '400MB', 'single', 'overlap', 8, 1, 1, 524288) = 2.8253707393748972
@stas00, I wanted to bring this back to the fore. I am working on the two major pieces of feedback listed below; please correct me if I missed anything.
1. Restricting the sweep to a reduced search space of promising configurations, such as:
{
"block_size": ["128K", "256K", "1M"],
"queue_depth": [4, 8, 32],
"threads": [8, 16],
"overlap_events": [true],
"single_submit": [false],
"io_parallel": [1]
}
2. Generating the optimal aio param section by analyzing the read and write log files.
I have finished (1), which outputs the following on -h:
usage: perf_sweep.py [-h] [--sweep_config SWEEP_CONFIG]
[--read_file READ_FILE] [--write_file WRITE_FILE]
[--write_size WRITE_SIZE] [--disable_cache] --log_dir
LOG_DIR
optional arguments:
-h, --help show this help message and exit
--sweep_config SWEEP_CONFIG
Performance sweep configuration file (json).
--read_file READ_FILE
File to read for performance measurements.
--write_file WRITE_FILE
File to write for performance measurements.
--write_size WRITE_SIZE
Number of bytes to write.
--disable_cache Disable page cache, requires sudo access.
--log_dir LOG_DIR Output directory for performance log files.
As always, I will appreciate your highly valued thoughts.
Looks great, @tjruwase - awesome work!
perf_sweep.py
generate_aio_param.py # hallucinating that would be the name of (2) above
Using aio_bench_ or some other very unique prefix naming should be pretty safe.
The read file can be generated on the fly if it's not there already, with a printout
Generating read file, which may take a minute or two: dd if=/dev/urandom of=aio-read.file count=400 bs=1M
so that the user knows the program isn't hanging...
What about sudo access? e.g.:
To disable caching and get more precise results you will be asked for your `sudo` password once.
If you don't have `sudo` access, please use `--disable_cache False` option.
Except this won't work for the flag you designed - OK, then perhaps:
optional arguments:
[...]
--no_sudo            use this when sudo isn't available, at the cost of the read
                     results being cached and reported faster than the real speed.
And then:
To disable caching and get more precise results you will be asked for your `sudo` password once.
If you don't have `sudo` access, please use the `--no_sudo` option; note that the read results will then be reported faster than they really are.
Bottom line, no sudo should be the last resort, since the read results won't be correct w/o it.
Thanks for the feedback. I am digesting it.
I think we can get good defaults for most things, but I think the user still needs to specify a folder path on the device. Even if the script generates the read file as suggested below, we still need a destination folder, right? Or what did you have in mind?
Read file can be generated on the fly if it's not there already, with a printout
Generating read file, which may take a minute or two: dd if=/dev/urandom of=aio-read.file count=400 bs=1M
Well, until now I was just copying the whole benchmark into a folder on the NVMe; if you look at the instructions in the OP, they say:
cd /somewhere/on/nvme/drive/you/want/to/test
git clone https://github.com/microsoft/DeepSpeed
cd DeepSpeed
and then it's just a local folder.
But you're correct that my approach might not be ideal, and it's best to require a single argument that is a path on an NVMe drive so it's explicit - which can be . if one followed my suggestion.
So perhaps:
perf_sweep.py /mnt/nvme1/benchmark
generate_aio_param.py /mnt/nvme1/benchmark
@stas00, based on your feedback, what do you think of a different argument approach:
--io_path <folder on device to be evaluated>
--io_size <io transfer size [default 400M]>
--read [boolean flag to measure read performance][default True]
--write [boolean flag to measure write performance][default True]
With this proposal the only required argument will be --io_path. Also, it requires only one call to perform both the read and write tests.
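For example (assuming the script keeps its current perf_sweep.py name), running the whole sweep would then be just:
python perf_sweep.py --io_path /mnt/nvme1/benchmark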
That looks perfect, @tjruwase!
Let's use this issue to gather instructions on how to profile one's CPU<->NVMe setup.
(@tjruwase and I have been editing this post)
You need to do this on every new CPU/NVMe setup in order to configure the aio param section.
The following NVMe benchmark measures end-to-end performance of how fast it can read/write CPU<->NVMe, so make sure to test this on the actual system that you intend to use it on.
For this demonstration we are going to use:
1. Preparation
You may have to also install libaio-dev if the Deepspeed NVMe driver fails to build. On Ubuntu it's just: sudo apt-get install libaio-dev
Depending on the speed of your NVMe, each benchmark could run for 30min or longer.
Important: make sure you're not doing any other I/O on the device you're testing or you will get incorrect results.
2. Run Read Benchmark
This benchmark assumes the current working directory is on the NVMe drive. If it's not, copy the csrc/aio/py_test folder to your NVMe drive and run the test there. You can, of course, use it to test non-NVMe drives (e.g. SSD).
The tail of the list should show the fastest speeds.
Here is the best result for the read benchmark:
3. Run Write Benchmark
The write report's best result:
4. Contribute your data
We need more read/write data for various devices to figure out how to make the configuration process automated.
If you're contributing your data, please post:
Important: please make sure not to do any other I/O on the device under benchmark.
5. Derive the aio params block
Now we need to figure out how to use the results of the benchmark to configure aio.
Here is the final result:
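(Presumably the block referred to here is the aio section derived earlier in this thread:)
aio: {
"block_size": 262144,
"queue_depth": 32,
"thread_count": 1,
"single_submit": false,
"overlap_events": true
}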
Most of this config block's values come from the benchmark's best results for read and write - i.e. whichever configuration gives us the highest GB/s throughput (the higher the number the better).
Schema of each line in results is as follows:
read or write | single or block event completion | overlap or sequential event submission | # processes | intra-process parallelism | queue depth | block size | GB/sec
The best read config was:
which corresponds to
single_submit=false, overlap_events=true, queue_depth=32, block_size=262144
single_submit=true if the 2nd column is single instead of block.
overlap_events=false if the 3rd column is sequential instead of overlap.
The best write config was:
which corresponds to:
single_submit=false, overlap_events=true, queue_depth=32, block_size=262144
Unfortunately, users don't currently have the ability to have separate read and write configurations, so they need to combine the best of both. Fortunately, in this case, and in most cases, the best read and write configurations are the same or similar.
Reasonable defaults are hard to set because of device and system differences. On many setups we tested, block_size=1M had consistently seemed optimal across two clusters, but in this particular setup block_size=256K seems to be optimal.
Finally, the last remaining config value, thread_count=1, is a reasonable default, since this is a per-rank configuration.
TODO: this config generation can be automated, but we need to figure out what to do if the top read and write benchmark results don't agree.
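One possible (hypothetical) approach: parse both the read and write logs into dicts keyed by (submit mode, overlap mode, queue_depth, block_size) and pick the configuration whose slower direction is fastest:
def recommend_aio_config(read_results, write_results):
    # read_results / write_results: {(submit, overlap, queue_depth, block_size): GB_per_sec}
    # Choose the config that maximizes the slower of its read and write speeds,
    # so one combined aio section serves both directions reasonably well.
    best = max(
        set(read_results) & set(write_results),
        key=lambda cfg: min(read_results[cfg], write_results[cfg]),
    )
    submit, overlap, queue_depth, block_size = best
    return {
        "single_submit": submit == "single",
        "overlap_events": overlap == "overlap",
        "queue_depth": queue_depth,
        "block_size": block_size,
        "thread_count": 1,
    }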
Sample stats: for XPG Gammix s11 pro 2tb NVMe drive with published specs of:
The benchmark records throughput for ~400 different configuration combinations
I tried my 860 Evo SSD and am getting ~0.5 GB/s read throughput - so about ~6x slower.
TODO/Questions to @tjruwase:
[ ] so we have a huge range of numbers - e.g. for read, 1 to 3GB/s - so I suppose this is the effective range depending on the kind of task, and both the low and the high ends should be considered - but how does this correlate to training? Which of the ~400 data points are most relevant? That's too much data for a user to make sense of. Perhaps it should just report the min and the max?
[ ] what are the good numbers? So that users will know whether their NVMe is fast enough. I'm thinking the numbers from the paper?