atorrez opened this issue 11 years ago
So here are some preliminary findings:
On smog:
aprun -n 16 ./fs_test.smog.x -target plfs://users/dbonnie/plfs_n1/out.%s -io mpi -time 5 -barriers aopen -shift -type 2 -sync -size 1 -nodb -strided 1
A single node writing 1-byte blocks will cause this with only a 5-second write time when using a glib backend (because glib is MUCH faster than posix).
On rrz:
mpirun -n 16 ./fs_test.localhost.x -strided 1 -target plfs:/users/dbonnie/plfs_n1/out.%s -io mpi -time 30 -barriers aopen -shift -type 2 -sync -size 1 -nodb
Causes a failure in 30 seconds (because PanFS even with glib is much slower than smog's Lustre).
I have the sneaking feeling this is an index bug or something because these runs create millions and millions of index entries. I have confirmed that this also occurs with compress_contiguous set to false in the plfsrc.
My plfsrc files contain very similar entries on the two systems:
rrz:
smog:
So my first thought is that this may be present in older releases as well, but it's hard to test. I think Alfred or I will try to confirm that soon with a longer test suite run against 2.3.
Also, the errors change between runs, which makes me think memory corruption in some form or another, although bad_alloc seems to be quite common.
I've narrowed this down a bit more -- I'm almost positive the problem lies in the indexes when they get to a certain size.
So far the following configurations cause the error:
128 procs (16 nodes), 70,000 1-byte writes per proc. Each index is 3.3 MB; total memory usage for indexes should be about 3.4 GB per node (each node has 16 GB)...so this probably shouldn't fail (see the arithmetic sketch after this list).
16 procs (2 nodes), 400,000 1-byte writes per proc. Each index is 28 MB; total memory usage for indexes should be about 3.5 GB per node (each node has 16 GB)...so this probably shouldn't fail either.
1 proc, 25,000,000 1-byte writes. Total index is 1.2 GB.
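To spell out the arithmetic above (this just encodes my assumption that on read every proc ends up holding the full aggregated index at roughly its on-disk size; the hard-coded numbers are from the first case in the list, and nothing here comes from PLFS itself):

// Rough estimate of per-node index memory for the N-1 read case, assuming
// every proc holds the full aggregated (global) index at ~on-disk size.
#include <cstdio>

int main() {
    const double index_per_proc_mb = 3.3;  // on-disk dropping.index size per writer
    const int    total_procs       = 128;  // writers (and readers)
    const int    procs_per_node    = 8;    // 128 procs on 16 nodes

    double global_index_mb = index_per_proc_mb * total_procs;           // ~422 MB
    double per_node_gb     = global_index_mb * procs_per_node / 1000.0; // decimal GB

    std::printf("global index   : ~%.0f MB\n", global_index_mb);
    std::printf("per-node usage : ~%.1f GB (each node has 16 GB)\n", per_node_gb);
    return 0;
}

That reproduces the ~3.4 GB figure, which is why none of these cases look like they should be anywhere near exhausting a 16 GB node.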
So clearly something fishy is going on here. I'm going to test at larger scales and with an older version of PLFS to see if I can narrow this down a bit better.
Updates:
512 procs (64 nodes), 15,000 1-byte writes each. Each index is 704 KB; total usage per node is ~2.75 GB.
All of the above are from RRZ. Smog is a bit different but fails at similar counts per node (though memory usage for indexes at 1088 procs on Smog results in ~12 GB...way more than what fails on RRZ).
The strided access is what's causing every write to be a separate index entry, but it shouldn't be crashing where it is.
Single node on RRZ: 8 procs, 1,000,000 objects each; roughly the same aggregate memory usage at the point of the crash.
Confirmed that the exact same workload fails on PLFS 2.3...about 46 MB of index per proc, so loading it on a single node uses around 2.9 GB of RAM. There's 16 GB on a node on RRZ.
I'm just curious: how did our regression fail to find this in 2.3 but find it in 2.4?
On Mon, Jun 3, 2013 at 5:24 PM, David Bonnie notifications@github.com wrote:
Confirmed that the exact same workload fails on PLFS 2.3.
2.4 is faster than 2.3. The test was right on the edge of failing with 2.3 in terms of object counts after 300 seconds of writing.
That's pretty awesome actually.
Why faster?
On Jun 3, 2013, at 5:30 PM, David Bonnie notifications@github.com wrote:
2.4 is faster than 2.3. The test was right on the edge of failing with 2.3 in terms of object counts after 300 seconds of writing.
Various little things all adding up I think.
I might have jumped the gun a bit on 2.3 failing - I had linked things weirdly for my custom MPI and I think I hooked into that accidentally for my 2.3 test. I'll update when I figure out what I managed to muck up. :)
Okay - fixed my install and I've gotten 2.3 to fail with OpenMPI 1.6.4 on RRZ in the exact same manner as 2.4rc2.
Watching a run on a single node I was able to see what's happening.
- The write phase runs and completes normally with low memory usage.
- The call to MPI_File_open causes a single proc to load the index. Memory usage exceeds 5 GB for this proc.
- The proc that loaded the index then (I assume) distributes this to the other procs, which causes the available memory to drop to 0.
- Hammer...err...Crash time.
Here's what top shows as the memory is rising on one proc:
11459 dbonnie 20 0 134m 13m 4636 R 100.2 0.1 2:45.79 fs_test.localho
11460 dbonnie 20 0 135m 13m 4668 R 100.2 0.1 2:45.92 fs_test.localho
11463 dbonnie 20 0 134m 13m 4632 R 99.9 0.1 2:45.71 fs_test.localho
11457 dbonnie 20 0 134m 13m 4620 R 99.6 0.1 2:45.31 fs_test.localho
11461 dbonnie 20 0 134m 13m 4632 R 99.6 0.1 2:46.01 fs_test.localho
11462 dbonnie 20 0 134m 13m 4616 R 99.6 0.1 2:45.25 fs_test.localho
11456 dbonnie 20 0 2372m 2.2g 5636 R 98.9 14.0 2:45.71 fs_test.localho
11458 dbonnie 20 0 134m 13m 4644 R 98.9 0.1 2:46.02 fs_test.localho
That proc will rise to just over 5.6 GB, then the following happens:
11525 dbonnie 20 0 1581m 1.4g 4860 R 100.6 9.1 3:47.22 fs_test.localho
11523 dbonnie 20 0 4656m 4.4g 5780 R 99.6 28.1 3:46.44 fs_test.localho
11524 dbonnie 20 0 1629m 1.5g 4844 R 99.6 9.3 3:47.21 fs_test.localho
11526 dbonnie 20 0 1531m 1.4g 4824 R 99.6 8.7 3:46.80 fs_test.localho
11527 dbonnie 20 0 1658m 1.5g 4872 R 99.6 9.5 3:46.64 fs_test.localho
11528 dbonnie 20 0 1575m 1.4g 4828 R 99.6 9.0 3:44.96 fs_test.localho
11529 dbonnie 20 0 1612m 1.5g 4844 R 99.6 9.2 3:47.24 fs_test.localho
11530 dbonnie 20 0 1664m 1.5g 4796 R 99.6 9.6 3:47.19 fs_test.localho
The proc that was using 5+ GB of memory starts to drop in memory usage slightly but every other proc's memory usage skyrockets to the point of a crash.
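My mental model of the open path, for anyone following along (this is only a sketch of the pattern I think is happening, not the actual PLFS/ad_plfs code; the two helper functions are hypothetical stand-ins):

// Sketch of the "one rank builds the index, every rank ends up with a full
// copy" pattern that would explain the top output above. NOT PLFS code.
#include <mpi.h>
#include <vector>

static std::vector<char> load_and_merge_droppings() {   // hypothetical: rank 0 reads the hostdir
    return std::vector<char>(422u << 20);                // pretend ~422 MB of serialized index
}
static void deserialize_global_index(const std::vector<char>&) {
    // hypothetical: builds the in-memory maps, where further expansion happens
}

static void open_for_read(MPI_Comm comm) {
    int rank;
    MPI_Comm_rank(comm, &rank);

    std::vector<char> blob;
    unsigned long len = 0;
    if (rank == 0) {
        blob = load_and_merge_droppings();   // memory spikes on one proc (the 5+ GB one in top)
        len = blob.size();
    }

    // Every rank then allocates a full-size buffer and receives the whole thing
    // (the >2 GB int-count limit of MPI_Bcast is glossed over in this sketch)...
    MPI_Bcast(&len, 1, MPI_UNSIGNED_LONG, 0, comm);
    blob.resize(len);
    MPI_Bcast(blob.data(), (int)len, MPI_BYTE, 0, comm);

    // ...and builds its own in-memory copy, so a node with 8 ranks now holds
    // 8 copies of the global index plus the serialized buffers.
    deserialize_global_index(blob);
}

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    open_for_read(MPI_COMM_WORLD);
    MPI_Finalize();
    return 0;
}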
How did this pass regression in 2.3, then? It no longer sounds like a race caused by increased performance.
On Jun 4, 2013, at 2:40 PM, David Bonnie notifications@github.com wrote:
Watching a run on a single node I was able to see what's happening.
The write phase runs and completes normally with low memory usage. The call to MPI_File_open causes a single proc to load the index. Memory usage exceeds 5 GB for this proc. The proc that loaded the index then (I assume) distributes this to the other procs which causes the available memory to drop to 0. Hammer...err...Crash time. Here's what top shows as the memory is rising on one proc:
11459 dbonnie 20 0 134m 13m 4636 R 100.2 0.1 2:45.79 fs_test.localho
11460 dbonnie 20 0 135m 13m 4668 R 100.2 0.1 2:45.92 fs_test.localho
11463 dbonnie 20 0 134m 13m 4632 R 99.9 0.1 2:45.71 fs_test.localho
11457 dbonnie 20 0 134m 13m 4620 R 99.6 0.1 2:45.31 fs_test.localho
11461 dbonnie 20 0 134m 13m 4632 R 99.6 0.1 2:46.01 fs_test.localho
11462 dbonnie 20 0 134m 13m 4616 R 99.6 0.1 2:45.25 fs_test.localho
11456 dbonnie 20 0 2372m 2.2g 5636 R 98.9 14.0 2:45.71 fs_test.localho
11458 dbonnie 20 0 134m 13m 4644 R 98.9 0.1 2:46.02 fs_test.localho
That proc will rise to just over 5.6 GB, then the following happens:
11525 dbonnie 20 0 1581m 1.4g 4860 R 100.6 9.1 3:47.22 fs_test.localho
11523 dbonnie 20 0 4656m 4.4g 5780 R 99.6 28.1 3:46.44 fs_test.localho
11524 dbonnie 20 0 1629m 1.5g 4844 R 99.6 9.3 3:47.21 fs_test.localho
11526 dbonnie 20 0 1531m 1.4g 4824 R 99.6 8.7 3:46.80 fs_test.localho
11527 dbonnie 20 0 1658m 1.5g 4872 R 99.6 9.5 3:46.64 fs_test.localho
11528 dbonnie 20 0 1575m 1.4g 4828 R 99.6 9.0 3:44.96 fs_test.localho
11529 dbonnie 20 0 1612m 1.5g 4844 R 99.6 9.2 3:47.24 fs_test.localho
11530 dbonnie 20 0 1664m 1.5g 4796 R 99.6 9.6 3:47.19 fs_test.localho
The proc that was using 5+ GB of memory starts to drop in memory usage slightly but every other proc's memory usage skyrockets to the point of a crash.
The workload I established isn't time-based, so it hits this regardless of speed.
How big are the dropping files on the backend after the write phase completes? Just curious...
chuck
On Tue, Jun 04, 2013 at 11:40:26AM -0700, David Bonnie wrote:
Watching a run on a single node I was able to see what's happening.
- The write phase runs and completes normally with low memory usage.
- The call to MPI_File_open causes a single proc to load the index. Memory usage exceeds 5 GB for this proc.
- The proc that loaded the index then (I assume) distributes this to the other procs which causes the available memory to drop to 0.
- Hammer...err...Crash time.
Here's the contents of the hostdir:
-rw-r--r-- 1 dbonnie dbonnie 0 Jun 4 12:32 dropping.data.1370370765.643511.rrz004.localdomain.0
-rw-r--r-- 1 dbonnie dbonnie 977K Jun 4 12:35 dropping.data.1370370765.685680.rrz004.localdomain.0
-rw-r--r-- 1 dbonnie dbonnie 977K Jun 4 12:35 dropping.data.1370370765.697148.rrz004.localdomain.7
-rw-r--r-- 1 dbonnie dbonnie 977K Jun 4 12:35 dropping.data.1370370765.697149.rrz004.localdomain.1
-rw-r--r-- 1 dbonnie dbonnie 977K Jun 4 12:35 dropping.data.1370370765.697153.rrz004.localdomain.3
-rw-r--r-- 1 dbonnie dbonnie 977K Jun 4 12:35 dropping.data.1370370765.697154.rrz004.localdomain.4
-rw-r--r-- 1 dbonnie dbonnie 977K Jun 4 12:35 dropping.data.1370370765.697155.rrz004.localdomain.5
-rw-r--r-- 1 dbonnie dbonnie 977K Jun 4 12:35 dropping.data.1370370765.697180.rrz004.localdomain.2
-rw-r--r-- 1 dbonnie dbonnie 977K Jun 4 12:35 dropping.data.1370370765.697778.rrz004.localdomain.6
-rw-r--r-- 1 dbonnie dbonnie 0 Jun 4 12:32 dropping.index.1370370765.643511.rrz004.localdomain.0
-rw-r--r-- 1 dbonnie dbonnie 46M Jun 4 12:35 dropping.index.1370370765.685680.rrz004.localdomain.0
-rw-r--r-- 1 dbonnie dbonnie 46M Jun 4 12:35 dropping.index.1370370765.697148.rrz004.localdomain.7
-rw-r--r-- 1 dbonnie dbonnie 46M Jun 4 12:35 dropping.index.1370370765.697149.rrz004.localdomain.1
-rw-r--r-- 1 dbonnie dbonnie 46M Jun 4 12:35 dropping.index.1370370765.697153.rrz004.localdomain.3
-rw-r--r-- 1 dbonnie dbonnie 46M Jun 4 12:35 dropping.index.1370370765.697154.rrz004.localdomain.4
-rw-r--r-- 1 dbonnie dbonnie 46M Jun 4 12:35 dropping.index.1370370765.697155.rrz004.localdomain.5
-rw-r--r-- 1 dbonnie dbonnie 46M Jun 4 12:35 dropping.index.1370370765.697180.rrz004.localdomain.2
-rw-r--r-- 1 dbonnie dbonnie 46M Jun 4 12:35 dropping.index.1370370765.697778.rrz004.localdomain.6
I wouldn't imagine this is enough data to cause an issue unless the index entries expand drastically in-memory.
On Tue, Jun 04, 2013 at 12:29:28PM -0700, David Bonnie wrote:
Here's the contents of the hostdir: I wouldn't imagine this is enough data to cause an issue unless the index entries expand drastically in-memory.
Yeah, I don't see how you get to 5 GB from that. And if you complete the write phase, then it closes the file, so everything should be written out to the backend before it starts the read phase, right? (You could run fs_test with "-op write" and then do a second "-op read" run to ensure it is all down to the backend...)
chuck
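For reference, the split write/read run chuck suggests would look something like this (flags copied from the earlier rrz command; whether -time carries over sensibly to a separate -op read run is an fs_test detail I haven't verified, so treat this as a sketch):

mpirun -n 8 ./fs_test.localhost.x -strided 1 -target plfs:/users/dbonnie/plfs_n1/out.%s -io mpi -time 30 -barriers aopen -shift -type 2 -sync -size 1 -nodb -op write
mpirun -n 8 ./fs_test.localhost.x -strided 1 -target plfs:/users/dbonnie/plfs_n1/out.%s -io mpi -time 30 -barriers aopen -shift -type 2 -sync -size 1 -nodb -op read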
It's definitely all hitting the backend before doing the open call (I did a write-only test and the results match up with the above backend listing).
On Tue, Jun 04, 2013 at 12:43:50PM -0700, David Bonnie wrote:
It's definitely all hitting the backend before doing the open call (I did a write-only test and the results match up with the above backend listing).
what happens if you generate a global index after a write (e.g. with the "plfs_flatten_index" tool)?
chuck
Flattening the index results in a 367 MB global index. Trying to read it now.
Looks like the same behavior on read. Is there a way to use the global index when doing an MPI read or is that handled automatically?
On Tue, Jun 04, 2013 at 02:25:41PM -0700, David Bonnie wrote:
Looks like the same behavior on read. Is there a way to use the global index when doing an MPI read or is that handled automatically?
It should look for the global index first and use it if it is found.
You could try writing a simple plfs API program that does a plfs_open for RDONLY and look at the memory usage before and after the open. That might help isolate the problem (e.g., is it in the MPI code, or the main lib?).
e.g. something like
#include <stdio.h>    /* printf */
#include <stdlib.h>   /* system */
#include <fcntl.h>    /* O_RDONLY */
#include <unistd.h>   /* getpid */
#include "plfs.h"

int main() {
    Plfs_fd *fd;
    fd = NULL;
    int ret;
    printf("before open\n");
    system("ps axuwwww");   /* or maybe use getpid() to just look at this proc */
    ret = plfs_open(&fd, "/big/file", O_RDONLY, getpid(), 0666, NULL);
    printf("after open, ret was %d\n", ret);
    system("ps axuwwww");
    return 0;
}
chuck
So trying to use the "plfs_flatten_close" MPI-IO hint results in an instant crash on close in fs_test due to a seg fault.
The backtrace:
fs_test.localhost.x:557 terminated with signal 11 at PC=2ba3477729fa SP=7fffe032cd58. Backtrace:
/lib64/libc.so.6(memcpy+0x1da)[0x2ba3477729fa]
/usr/lib64/libstdc++.so.6(_ZNSt15basic_streambufIcSt11char_traitsIcEE6xsputnEPKcl+0x7b)[0x2ba3459abafb]
/usr/lib64/libstdc++.so.6(_ZSt16__ostream_insertIcSt11char_traitsIcEERSt13basic_ostreamIT_T0_ES6_PKS3_l+0x1b5)[0x2ba3459a2935]
/users/dbonnie/plfs/lib/libplfs.so.2.4(_ZN5Index16global_to_streamEPPvPm+0x233)[0x2ba34568c293]
/users/dbonnie/plfs/lib/libplfs.so.2.4(container_index_stream+0x73)[0x2ba34567a893]
/users/dbonnie/source/plfs-regression/inst/mpi/lib/libmpi.so.1(flatten_then_close+0x56)[0x2ba345d3a366]
/users/dbonnie/source/plfs-regression/inst/mpi/lib/libmpi.so.1(ADIOI_PLFS_Close+0x174)[0x2ba345d3a994]
/users/dbonnie/source/plfs-regression/inst/mpi/lib/libmpi.so.1(ADIO_Close+0x3e)[0x2ba345d29f4e]
/users/dbonnie/source/plfs-regression/inst/mpi/lib/libmpi.so.1(mca_io_romio_dist_MPI_File_close+0xef)[0x2ba345d4b6cf]
/users/dbonnie/source/plfs-regression/inst/mpi/lib/libmpi.so.1(+0x8a38e)[0x2ba345ca038e]
/users/dbonnie/source/plfs-regression/inst/mpi/lib/libmpi.so.1(ompi_file_close+0x59)[0x2ba345ca02a9]
/users/dbonnie/source/plfs-regression/inst/mpi/lib/libmpi.so.1(PMPI_File_close+0x78)[0x2ba345ccdf18]
./fs_test.localhost.x(close_file+0x5c4)[0x407296]
./fs_test.localhost.x(read_write_file+0x619)[0x40879f]
./fs_test.localhost.x(main+0x1ae)[0x403f67]
/lib64/libc.so.6(__libc_start_main+0xfd)[0x2ba347707cdd]
./fs_test.localhost.x[0x403c69]
Mostly just putting that here for future reference. Still digging.
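For future reference, the mangled PLFS frame in that trace demangles (e.g. with c++filt) as shown below, so the segfault is happening while Index::global_to_stream is pushing the serialized global index through a C++ stream (the xsputn/__ostream_insert frames above it):

$ echo _ZN5Index16global_to_streamEPPvPm | c++filt
Index::global_to_stream(void**, unsigned long*)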
The "flatten" flag for fs_test is only valid if you use I/O type of PLFS.
I'm not sure about any hint to pass. I don't see anything in the code or template files that mentions using such a hint.
So, you could try to change the io type to plfs and then turn the flatten flag on and see if that works.
Brett
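A hedged guess at what that invocation would look like, reusing the flags from the runs above with the io type switched to plfs and the flatten flag turned on (I haven't verified this exact combination):

mpirun -n 16 ./fs_test.localhost.x -strided 1 -target plfs:/users/dbonnie/plfs_n1/out.%s -io plfs -time 30 -barriers aopen -shift -type 2 -sync -size 1 -nodb -flatten 1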
[ignore this]Actually, using fs_test "--flatten 1" results in a crash too, even with a fairly small workload.[/ignore this]
EDIT: I found the flatten flag by looking in ad_plfs -- so it should work for MPI-IO, no?
Using io of "plfs"? I didn't look at the code closely, but if you use io that is not "plfs" you should just get an exit from the application without any I/O happening and a note that you can use flatten with an io that is not "plfs".
Brett
Using either plfs or mpi the -flatten 1 flag crashes with a seg fault.
Using the mpi-io hint "plfs_flatten_close" clearly starts the flattening process but then crashes with a seg fault as well.
I'm assuming you've checked that workload=n-1 in the plfsrc. Flatten doesn't make sense in the other modes; it shouldn't crash, though.
John
On Wed, Jun 5, 2013 at 11:27 AM, David Bonnie notifications@github.com wrote:
Using either plfs or mpi the -flatten 1 flag crashes with a seg fault.
Using the mpi-io hint "plfs_flatten_close" clearly starts the flattening process but then crashes with a seg fault as well.
On Jun 5, 2013, at 9:20 AM, Brett Kettering wrote:
Using io of "plfs"? I didn't look at the code closely, but if you use io that is not "plfs" you should just get an exit from the application without any I/O happening and a note that you can use flatten with an io that is not "plfs".
Brett
From: David Bonnie <notifications@github.com>
Reply-To: plfs/plfs-core <reply@reply.github.com>
Date: Wed, 5 Jun 2013 08:16:50 -0700
To: plfs/plfs-core <plfs-core@noreply.github.com>
Cc: Brett Kettering <brettk@lanl.gov>
Subject: Re: [plfs-core] Getting bad_alloc Error on N-1 Strided using Certain Block Sizes (#251)
Actually, using fs_test "--flatten 1" results in a crash too, even with a fairly small workload.
OK. So, it's a bug for fs_test to let you use the --flatten flag with io of "mpi".
Do you see this line printed?
"INFO RANK[%d]: Flattening plfs index
where rank is zero (0)?
Only rank zero is allowed to call plfs_flatten_index.
Look at the flatten_file function in fs_test.c. Put some debug statements in there to make sure the target path is correct to determine if fs_test is messing up or plfs_flatten_index is messing up.
Thanks, Brett
Yeah, it prints that for rank 0 only (though with the MPI-IO hint I believe it's trying to flatten on all ranks). This is to an N-1 mount with a single backend; nothing crazy in terms of parameters.
I'll keep digging.
If the hint is trying to flatten on all ranks, then that must be in the ad_* code because fs_test's flatten_file will only call plfs_flatten_index if it is rank 0 or if it's N-N.
Brett
A note on memory usage (running plfs_map) for reading an index from two procs writing 12,000,000 objects of 1-byte each, strided. Each index dropping is ~550 MB.
Someone correct me if I'm wrong, but shouldn't I expect to see approximately 550 MB * [2 nodes] in terms of memory usage when reading the index with plfs_map? Instead, I see nearly 10 GB.
Reading back the index with "--uniform 0" to only read a single rank also results in a huge amount of memory usage in relation to the size of the index (4.6 GB).
Sounds like a gigantic memory leak.
John
On Jun 5, 2013, at 3:57 PM, David Bonnie notifications@github.com wrote:
A note on memory usage (running plfs_map) for reading an index from two procs writing 12,000,000 objects of 1-byte each, strided. Each index dropping is ~550 MB.
Someone correct me if I'm wrong, but shouldn't I expect to see approximately 550 MB * [2 nodes] in terms of memory usage when reading the index with plfs_map? Instead, I see nearly 10 GB.
Reading back the index with "--uniform 0" to only read a single rank also results in a huge amount of memory usage in relation to the size of the index (4.6 GB).
Yeah. If it happens in plfs_map that's cool, because it rules out stuff and is easy to run and debug...
chuck
On Wed, Jun 05, 2013 at 01:07:17PM -0700, John Bent wrote:
Sounds like a gigantic memory leak. John
On Jun 5, 2013, at 3:57 PM, David Bonnie notifications@github.com wrote:
A note on memory usage (running plfs_map) for reading an index from two procs writing 12,000,000 objects of 1-byte each, strided. Each index dropping is ~550 MB.
Someone correct me if I'm wrong, but shouldn't I expect to see approximately 550 MB * [2 nodes] in terms of memory usage when reading the index with plfs_map? Instead, I see nearly 10 GB.
Reading back the index with "--uniform 0" to only read a single rank also results in a huge amount of memory usage in relation to the size of the index (4.6 GB).
I'm tracking down where the memory is being allocated. I don't think it's a leak (as in allocated but not freed); it's more about how the memory is handled during global index creation.
Just an update on this - to reduce memory usage there really just needs to be a rewrite of the global index creation (I don't believe there is really a bug here, just higher memory usage than I initially expected). Perhaps a shared-memory approach on a per-node basis?
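One possible shape of that per-node idea, just to make it concrete (a sketch built on MPI-3 shared-memory windows, not anything that exists in PLFS; the index is reduced to a flat byte blob here):

// Keep ONE copy of the (serialized) global index per node by placing it in an
// MPI-3 shared-memory window instead of letting every rank build its own copy.
#include <mpi.h>
#include <cstring>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    // Group the ranks that share a node so the window is node-local.
    MPI_Comm nodecomm;
    MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
                        MPI_INFO_NULL, &nodecomm);
    int noderank;
    MPI_Comm_rank(nodecomm, &noderank);

    // Only the node leader allocates space; everyone else asks for 0 bytes.
    MPI_Aint bytes = (noderank == 0) ? (1 << 20) : 0;   // placeholder index size
    char *base = nullptr;
    MPI_Win win;
    MPI_Win_allocate_shared(bytes, 1, MPI_INFO_NULL, nodecomm, &base, &win);

    // Non-leader ranks get a pointer into the leader's allocation.
    MPI_Aint size;
    int disp_unit;
    MPI_Win_shared_query(win, 0, &size, &disp_unit, &base);

    if (noderank == 0) {
        std::memset(base, 0, size);   // leader would merge the index droppings into 'base' here
    }
    MPI_Barrier(nodecomm);
    // All node-local ranks can now consult the single shared copy through 'base'.

    MPI_Win_free(&win);
    MPI_Comm_free(&nodecomm);
    MPI_Finalize();
    return 0;
}

Lookups would still need a structure on top of the blob, but the bulk of the index would exist once per node instead of once per rank.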
Have you been able to quantify the memory allocation? Where is it all going?
Earlier you said "Flattening the index results in a 367 MB global index", so how many GB of RAM was a 367 MB index taking? Was it like 5 GB?
chuck
On Mon, Jun 10, 2013 at 10:18:34AM -0700, David Bonnie wrote:
Just an update on this - to reduce memory usage there really just needs to be a rewrite of the global index creation (I don't believe there is really a bug here, just higher memory usage than I initially expected). Perhaps a shared-memory approach on a per-node basis?
The in-memory map accounts for most of the memory pressure (over 90% of the in-use memory is from the global index data structures). The on-disk size roughly triples in memory per process, and that, coupled with the various buffers and mmap'd files, causes memory exhaustion in the cases I highlighted above.
A lot of the memory usage when using plfs_map initially was caused by a massive string construction for the output. I removed that and memory usage fell dramatically (though was still higher than I initially expected) to the quantities above.
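For context on that output-string problem (a generic illustration, not the actual plfs_map code): building one big string for millions of entries holds the entire report in RAM on top of the index itself, whereas streaming each entry out keeps the footprint flat.

// Generic illustration only; none of this is plfs_map code.
#include <cstdio>
#include <sstream>
#include <string>

// Accumulating: the whole report for n entries lives in memory before printing.
std::string describe_all(size_t n_entries) {
    std::ostringstream oss;
    for (size_t i = 0; i < n_entries; i++)
        oss << "entry " << i << " offset " << i << " length 1\n";
    return oss.str();   // for tens of millions of entries this alone is GBs
}

// Streaming: each line is written and forgotten, so memory stays flat.
void describe_streaming(size_t n_entries, std::FILE *out) {
    for (size_t i = 0; i < n_entries; i++)
        std::fprintf(out, "entry %zu offset %zu length 1\n", i, i);
}

int main() {
    describe_streaming(1000, stdout);   // small count, just to exercise the streaming path
    return 0;
}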
What is the ratio of in-memory size to on-disk size?
John
On Jun 10, 2013, at 12:32 PM, David Bonnie notifications@github.com wrote:
The in-memory map results in most of the memory pressure (over 90% of the in-use memory is from the global index data structures). The on-disk size is roughly tripled in-memory per-process so that coupled with the various buffers and mmap'd files causes memory exhaustion with the cases I highlighted above.
A lot of the memory usage when using plfs_map initially was caused by a massive string construction for the output. I removed that and memory usage fell dramatically (though was still higher than I initially expected) to the quantities above.
In my limited tests it's 2-3 times larger in-memory than on-disk.
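That 2-3x ratio is about what you'd expect if every compact on-disk record becomes a node in a map-like structure in memory. A back-of-the-envelope sketch (the ~49-byte on-disk figure is derived from the droppings above, 3.3 MB / 70,000 writes; the in-memory entry layout and the per-node overhead constant are my guesses, not the real PLFS types):

// Why an index entry plausibly gets 2-3x bigger in memory than on disk.
#include <cstdio>
#include <sys/types.h>

struct InMemoryEntry {        // hypothetical analogue of a container index entry
    off_t  logical_offset;
    off_t  physical_offset;
    size_t length;
    double begin_timestamp;
    double end_timestamp;
    pid_t  writer_id;
};

int main() {
    const double on_disk_bytes = 3.3 * 1024 * 1024 / 70000.0;   // ~49 bytes/entry from the droppings
    // A std::map node on a typical 64-bit libstdc++ adds roughly 32-40 bytes of
    // tree bookkeeping (parent/left/right pointers, color, padding) per entry.
    const size_t node_overhead = 40;   // rough, platform-dependent
    const size_t in_memory = sizeof(off_t) /* key */ + sizeof(InMemoryEntry) + node_overhead;

    std::printf("on-disk  : ~%.0f bytes/entry\n", on_disk_bytes);
    std::printf("in-memory: ~%zu bytes/entry -> ~%.1fx expansion\n",
                in_memory, in_memory / on_disk_bytes);
    return 0;
}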
In running a sweep of N-1 strided performance using fs_test and plfs version 2.4rc2 on rrz, I got bad_alloc errors when opening for read. This happened using a 32k block size. I got some errors and some successes using other block sizes.
Here is the command line I used:
mpirun -n 32 /users/atorrez/Testing-ext/fs_test/fs_test/fs_test.rrz.x -strided 1 -target plfs:/var/tmp/plfs.atorrez1/atorrez/out.%s -io mpi -time 300 -barriers aopen -shift -type 2 -sync -size 32768
Dave Bonnie is seeing the same errors on smog and rrz using smaller block sizes and shorter -time parameters.