hpc / ior

IOR and mdtest

Read/write distribution and file size distribution #191

Open · rldleblanc opened this issue 4 years ago

rldleblanc commented 4 years ago

How would I specify a 10% write / 90% read mix for a job, much like fio does? I need to operate on files rather than blocks, but I would like to simulate this mixed workload. I'd also like the file sizes to be chosen randomly from a distribution, again like fio.

Please help me understand how to accomplish this with IOR.

glennklockwood commented 4 years ago

This is not a feature supported by IOR. You would have to run several IOR jobs concurrently to simulate a workload mix rather than a single workload.
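
Roughly, something like the following (untested; node counts, sizes, and paths are placeholders), launching a read-heavy and a write-heavy IOR at the same time and summing the results by hand:

```sh
# Two independent IOR instances started together to approximate a 90/10
# read/write mix; the read job assumes its files were created earlier by a
# write-only run that used -k (keepFile).
mpirun -np 36 ior -a POSIX -r -k -F -b 1g -t 1m -o /mnt/fs/ior_read &
mpirun -np 4  ior -a POSIX -w    -F -b 1g -t 1m -o /mnt/fs/ior_write &
wait
# Aggregate numbers then have to be combined manually from the two reports.
```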

rldleblanc commented 4 years ago

Can you do that in a script (by not having any kind of sync between RUN commands)? Or do you have to launch multiple IOR runs with mpirun and then sum the results manually?

JulianKunkel commented 4 years ago

Basically, the IOR code has been modularized, allowing two concurrent IOR instances to run. I used that code for a test with the IO-500. However, we have since rewritten the code and nobody took on writing a driver for it. I still consider it important but have no time to complete the feature at the moment. What is your use case, in more detail?
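
(For what it's worth, the existing script mode, ior -f, executes its RUN sections one after another rather than concurrently, so it does not overlap workloads. A minimal sketch with made-up sizes:)

```sh
# Hypothetical script: a write pass followed by a read pass on the same
# files; the two RUN sections still execute sequentially, not concurrently.
cat > mix.ior <<'EOF'
IOR START
    api=POSIX
    testFile=/mnt/fs/ior_data
    filePerProc=1
    blockSize=1g
    transferSize=1m
    keepFile=1
    writeFile=1
    readFile=0
RUN
    writeFile=0
    readFile=1
RUN
IOR STOP
EOF
mpirun -np 16 ior -f mix.ior
```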


rldleblanc commented 4 years ago

We are trying to replicate our workloads and see how many/much we can run on some storage to verify it meets our demands before we cut a PO. A straight read or straight write workload is not something that ever happens in our environment. Our jobs also touch many files, most of which are very small, while other workloads deal with large sequential files. The hope is to generate jobs that match each of these applications, run them individually, then run them concurrently and gather performance data.

JulianKunkel commented 4 years ago

I see. A key issue that I got stuck on last time is how to stop the benchmarks and how to assess the performance, i.e., assume you run three benchmarks together with a configuration as follows:

As for the assessment of performance, I tend to say one would like to see the performance numbers of each benchmark, i.e., a report similar to what running each of them individually would produce. Let me hear what you think.


rldleblanc commented 4 years ago

I think you make some good points. I'm going to lean on my experience with fio, as I'm familiar with it and it does a really good job in most cases (there are shortcomings). In fio, a job may be a mix of reads and writes specified by percentage, so when X bytes are transferred or the time is up, both reads and writes stop at the same time because it is a single job. This works well to replicate a workload that may read a bunch of data, then write a summary, and take maybe 15-45 seconds to run; then run 200k of those across 400 nodes, and by the middle you have this really mixed read/write workload because the tasks are completing at different times. That middle part is what I find most interesting and am trying to replicate.
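
For reference, this is roughly how I would express that mix in fio (all numbers are illustrative):

```sh
# One fio job that is 90% reads / 10% writes; time-based, so reads and
# writes stop together when the runtime expires.
fio --name=mixed --directory=/mnt/fs --rw=randrw --rwmixread=90 \
    --bs=64k --size=1g --numjobs=16 --time_based --runtime=60 \
    --group_reporting
```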

I think there could be value in metrics in the case of stonewalling or wear-out. I can even see the case where you may want a warm-up period and only collect metrics for a certain window. You might like the workload to run for 30 seconds before you start collecting performance data, because you want to make sure that the cache effect is minimized, or there may be some characteristic of the storage that is different at the start of the load. You may then only want to capture 60 seconds of performance data after the warm-up, to reduce any effects from jobs not finishing on time, etc.

If running multiple jobs concurrently, I'd love to see statistics for each job individually as well as an overall metric (although I can always add the numbers manually to get a total). That would let me see how different workloads interact with each other.

A final use case is to torture-test the storage by running this load for 24-48 hours to understand its performance over a long period of time. Having time-series metrics would be really useful here, as we could see how performance changes over time. Simple cases that illustrate this are commodity SSDs that run out of write buffer and have to write directly to TLC, or evict buffers while writes are still coming in; ZFS file systems, which generally degrade around 80-90% full; and tiering systems that need to evict or promote blocks/files after cache exhaustion. Most of the time short tests do not tease out these issues, and with time-series data you can capture them without having to do multiple runs.

Another thing I'd like to be able to do is have files spread across an average of X directories that are Y deep, as some of our jobs generate hundreds of thousands of files spread across a directory tree. I'm not seeing how to do that at the moment.

I really like IOR's ability to run over a large number of nodes easily, that it does file IO rather than block IO, that it can easily verify the data after a run (to check for corruption), and that it has so many tunable knobs. If I can figure out how to do the above, it would be the ultimate testing tool for us.

rldleblanc commented 4 years ago

Do you have some suggestions on how to configure IOR jobs to accomplish what I'm looking for? I can understand that everything I'm looking for may not be available. Thanks!

johnbent commented 4 years ago

Hey Robert,

You might also take a look at mdtest, which comes with IOR. It is basically the metadata equivalent of IOR (which I consider to be a great bandwidth tester). mdtest can create large directory structures and operate on lots of files, but it doesn't really do very much file IO (although it can do some). Finally, I am one of the four organizers of IO500, which is essentially a benchmark suite that runs multiple IOR and mdtest benchmarks serially and computes an overall score. The goal of IO500 is to create a single performance fingerprint for large distributed storage systems. You might also consider it (https://github.com/VI4IO/io-500-dev).
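
As a rough illustration of the kind of mdtest run I mean (every number and path here is a placeholder):

```sh
# Each of 8 ranks creates, stats, reads, and removes 10,000 files of 1 MiB,
# spread over a directory tree 3 levels deep with a branching factor of 5;
# -u gives every rank its own private subtree.
mpirun -np 8 mdtest -n 10000 -w 1048576 -e 1048576 -z 3 -b 5 -u -F \
    -d /mnt/fs/mdtest
```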

For large-scale stress testing, I think your best bet is to somehow run a succession of mixed IOR and mdtest tests. For example, script something up that monitors the job queue and attempts to always maintain a ratio of something like 25% IOR read, 25% IOR write, 20% mdtest create, 20% mdtest stat, 10% mdtest delete. That's a complex undertaking, but it's the best thing I can currently imagine. Computing performance might then be hard; you'll need more scripts to parse all the output files and figure out what was running concurrently. A different way to get aggregate performance might be to query the storage system itself; most storage systems come with some sort of web portal and RESTful API for collecting performance metrics.
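
A very crude sketch of that wrapper idea, simplified to restarting a fixed mix of jobs each round rather than watching the queue (all ratios, sizes, and paths are made up):

```sh
#!/bin/bash
# Keep a mix of IOR and mdtest work on the file system; each round starts
# read, write, and metadata jobs together and waits for all of them.
# The read job assumes its files were created by an earlier "ior -w -k" run.
while true; do
    mpirun -np 8 ior -r -k -F -b 4g -t 1m -o /mnt/fs/ior_read  &
    mpirun -np 2 ior -w    -F -b 4g -t 1m -o /mnt/fs/ior_write &
    mpirun -np 4 mdtest -n 5000 -F -u -d /mnt/fs/md            &
    wait    # next round once every job in this mix has finished
done
```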

Hope this helps. Good luck and I'm looking forward to hearing about what you end up doing so please do share back here if you can after you figure this out,

John


rldleblanc commented 4 years ago

I'm trying to write out millions of files of a certain size. IOR struggles in this regard because it uses a single file per process. Is it possible to have IOR write out multiple files per process? Or would mdtest be a good substitute, with -w and -e set appropriately? I don't see a transfer-size option, but since we are mostly testing sequential access, I don't think it matters too much. Thoughts?

johnbent commented 4 years ago

What size files are you looking to create? If they are very large, then mdtest won't work. IOR allows you to create a very large file with multiple writes, whereas mdtest writes the full file in just a single write, so it is only suitable for files small enough that you can allocate a memory buffer of that size. If you are testing sequential access AND your file size is small enough to be transferred in a single IO AND you want multiple files per process, then mdtest should work great for you. On most file systems, you will also get better performance if you pass '-u' so that each process has its own private sub-directory.
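
For example, something along these lines (sizes, rank counts, and paths made up):

```sh
# 16 ranks x 62,500 files = 1,000,000 files of 1 MiB, each written and read
# back in a single IO; -u gives every rank a private sub-directory.
mpirun -np 16 mdtest -n 62500 -w 1048576 -e 1048576 -u -F \
    -d /mnt/fs/smallfiles
```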

Thanks,

John


rldleblanc commented 4 years ago

Most of the large number of files are between 0.5 MB and 1.0 MB each. I have one test at 4.5 MB and another at 200 MB per file that I don't think would work. I might be able to get the first ones to work. I'll give it a try tomorrow; I may just have to cut out some tests in the end. Thank you for the info.

johnbent commented 4 years ago

I would assume that even the 200MB would work just fine.


rldleblanc commented 4 years ago

A question on mdtest: the rate numbers output at the end of the run are files/sec, correct? So to calculate bandwidth you would just multiply that by the write/read size? Also, does mdtest do the read on the same rank as the write, and could it thus be affected by the page cache? I assume this is the case, since for small jobs I'm getting read rates much higher than expected, whereas larger jobs seem much more reasonable. Thanks!

johnbent commented 4 years ago

Yeah. Pass -N 1 to “shift” the readers to a different node and avoid client side caching.
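
For example (illustrative numbers; this assumes ranks are placed so that neighboring ranks land on different nodes):

```sh
# -N 1 makes each rank read the files its neighboring rank wrote, so the
# read phase does not hit the writer's page cache.
mpirun -np 16 mdtest -n 10000 -w 1048576 -e 1048576 -N 1 -F -u -d /mnt/fs/md
# Bandwidth estimate: multiply the reported file-read rate (files/sec) by the
# per-file size, e.g. 2500 files/s * 1 MiB/file is roughly 2.4 GiB/s.
```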


rldleblanc commented 4 years ago

Thanks! Another question: how compressible is https://github.com/hpc/ior/blob/3a34b2efb7b0d5faafa0802426907cef3c3f3fc3/src/mdtest.c#L214-L218? It seems that shoving an int into a byte array would either just leave the MSBs in the array (which may be all zeros if the int is 64 bits) or the LSBs, which would be less compressible.

johnbent commented 4 years ago

Thanks Robert. I don't think I'm qualified to speak to compressibility, although I suspect that these buffers are highly compressible. Hopefully someone else on the list can speak to this with greater authority.


JulianKunkel commented 3 years ago

@rldleblanc I'm curious whether you found a resolution in the end. I think these are all excellent points for discussion.