plfs / plfs-core

LANL no longer develops PLFS. Feel free to fork and develop as you wish.

Update version stamps for droppings and .plfsdebug / plfs_version output + FUSE fixes + documentation fix #317

Closed thewacokid closed 10 years ago

thewacokid commented 11 years ago

Removed the svn tag since it's no longer relevant

thewacokid commented 11 years ago

This merge now includes the in-progress fixes for the FUSE layer. Testing is ongoing; I don't think I'll be able to sort out the last of the thread-safety issues with regard to the index in N-1 mode. See comments here: https://github.com/plfs/plfs-core/issues/166

thewacokid commented 11 years ago

Update for https://github.com/plfs/plfs-core/issues/318 also included in the pull.

thewacokid commented 10 years ago

Okay, question here. I know we use mutexes because they are quick...but wouldn't read/write locks be a better idea for the FUSE layer? I don't have a good grasp of how much slower read/write locks are versus mutexes but I can't imagine it's that much.

It seems like we might be over-locking on critical sections where we really only need to be careful when doing an update (since that's the only time we're unsafe with concurrent access).
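For reference, the difference being weighed is roughly this (a minimal sketch against a hypothetical read-mostly table, not PLFS code):

```cpp
#include <pthread.h>
#include <sys/types.h>
#include <map>
#include <string>

// Hypothetical read-mostly table, e.g. cached mode lookups.
static std::map<std::string, mode_t> mode_cache;

// Mutex version: every lookup serializes, even pure reads.
static pthread_mutex_t cache_mux = PTHREAD_MUTEX_INITIALIZER;

mode_t lookup_with_mutex(const std::string &path) {
    pthread_mutex_lock(&cache_mux);
    std::map<std::string, mode_t>::const_iterator it = mode_cache.find(path);
    mode_t m = (it == mode_cache.end()) ? 0 : it->second;
    pthread_mutex_unlock(&cache_mux);
    return m;
}

// rwlock version: readers run concurrently; only updates exclude.
static pthread_rwlock_t cache_lock = PTHREAD_RWLOCK_INITIALIZER;

mode_t lookup_with_rwlock(const std::string &path) {
    pthread_rwlock_rdlock(&cache_lock);          // shared
    std::map<std::string, mode_t>::const_iterator it = mode_cache.find(path);
    mode_t m = (it == mode_cache.end()) ? 0 : it->second;
    pthread_rwlock_unlock(&cache_lock);
    return m;
}

void update_with_rwlock(const std::string &path, mode_t m) {
    pthread_rwlock_wrlock(&cache_lock);          // exclusive
    mode_cache[path] = m;
    pthread_rwlock_unlock(&cache_lock);
}
```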

johnbent commented 10 years ago

I assume you mean the pthread one and not flock()?

http://linux.die.net/man/3/pthread_rwlock_wrlock

thewacokid commented 10 years ago

Exactly. Since we're already using pthreads it would be a simple change as well -- but I'm not sure if the performance overhead of using rwlocks is worth it in all cases (I'm just not familiar enough with them to make that call).

thewacokid commented 10 years ago

Hmm...after looking at it a bit, not many of the structures would benefit from this. The mode lookups might...perhaps the OpenFiles indexing as well. It depends on how much slower rwlocks are compared to a simple mutex.

thewacokid commented 10 years ago

I'm testing this since it was a relatively simple change - I imagine it might hurt single-threaded access slightly but should help on heavier workloads (like a mount point on an FTA that gets hit by multiple users?).

I'll test single-threaded and multi-threaded access and put up the results compared to the current implementation when the runs finish.

thewacokid commented 10 years ago

Okay, so the multithreaded results are slower with rwlocks versus mutexes...giving up on that idea for now. :)

thewacokid commented 10 years ago

Okay, so an rwlock on the groups was causing the performance degradation. Using an rwlock in place of the write_mutex grants performance essentially identical to not having the lock at all and makes us a bit safer in N-1 mode.
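As a sketch, one plausible shape of that change (names here are simplified placeholders, not the actual FUSE-layer code): data writes take the lock shared, while the sync path, which flushes the in-memory index, takes it exclusively.

```cpp
#include <pthread.h>

// Sketch only. Concurrent data writers through FUSE no longer serialize
// on each other; only the sync path, which mutates the in-memory index,
// excludes everyone else.
static pthread_rwlock_t write_lock = PTHREAD_RWLOCK_INITIALIZER;

void fuse_write_path(/* fd, buf, size, offset */) {
    pthread_rwlock_rdlock(&write_lock);   // shared among data writers
    /* ... call down into plfs_write() here ... */
    pthread_rwlock_unlock(&write_lock);
}

void fuse_sync_path(/* fd */) {
    pthread_rwlock_wrlock(&write_lock);   // exclusive: index gets flushed
    /* ... call down into pfd->sync() here ... */
    pthread_rwlock_unlock(&write_lock);
}
```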

chuckcranor commented 10 years ago

I've lost track of which lock is the problem. is this all at the FUSE level?

chuck

thewacokid commented 10 years ago

Yes, up at the FUSE level (and in particular I'm talking about the lock I was placing around plfs_write and pfd->sync). I had initially tried to dig into the container code to see if I could find the bug there but was unsuccessful.

I kinda turned this into a running stream of notes to keep myself on track and in the process made it very hard to follow!

chuckcranor commented 10 years ago

ok. so you are getting a core dump in Container mode under some load still? But flat file is ok? what kind of stack trace are you getting now?

chuck

thewacokid commented 10 years ago

The only place that ever crashes, now, is when doing a sync through syncIfOpen in FUSE in container mode. I can't get flat file to crash under any kind of load.

The output of the stack dump points to something very similar to this every time it crashes (this is the output from my MacBook but tests on Linux show similar traces):

Thread 5 Crashed:
0   libplfs.2.5.dylib   0x0000000103ee0004  Index::flush() + 20
1   libplfs.2.5.dylib   0x0000000103ee681c  WriteFile::sync() + 172
2   plfs                0x0000000103e44879  Plfs::syncIfOpen(std::string const&) + 329
3   plfs                0x0000000103e44c42  Plfs::getattrhelper(std::string, char const, stat_, Plfsfd) + 34
4   plfs                0x0000000103e45d93  Plfs::fgetattr(char const, stat*) + 643
5   libfuse4x.2.dylib   0x0000000103e73426  fuse_lib_setattr_x + 310
6   libfuse4x.2.dylib   0x0000000103e77a99  do_setattr + 313
7   libfuse4x.2.dylib   0x0000000103e7771f  fuse_ll_process + 849
8   libfuse4x.2.dylib   0x0000000103e75cfc  fuse_do_work + 188
9   libsystem_c.dylib   0x00007fff8bf117a2  _pthread_start + 327
10  libsystem_c.dylib   0x00007fff8befe1e1  thread_start + 13

It takes a lot of load to trigger it consistently and with a release build the fd_consistency test hasn't caught it with 100 iterations any time I've run it.

chuckcranor commented 10 years ago

I haven't looked at the locking recently so maybe this is a dumb question, but what protects this->hostIndex? if Index::flush() does this->hostIndex.clear() then nothing else should be using it when it goes.

chuck

thewacokid commented 10 years ago

That's why it's confusing -- everything involved at the lower levels is protected by mutexes. I couldn't pinpoint where the hostIndex is getting smashed.

chuckcranor commented 10 years ago

Which mutex protects hostIndex from multiple modifiers? that wasn't clear to me from the src.

going forward, it would probably be a good idea to note the locking schemes in the .h files better.

chuck
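A per-member annotation in the header would be one way to make the scheme greppable; a sketch of what that could look like (HostIndexEntry is a stand-in name, not the real type):

```cpp
#include <vector>

// Sketch of the kind of .h annotation meant above.
struct HostIndexEntry { /* logical offset, length, physical offset, ... */ };

class Index {
  private:
    // LOCKING: protected by WriteFile's index_mux, which the *caller*
    // must hold. Every path into flush(), addWrite(),
    // truncateHostIndex(), and memoryFootprintMBs() has to acquire
    // that mutex first; Index itself takes no locks.
    std::vector<HostIndexEntry> hostIndex;
};
```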

thewacokid commented 10 years ago

It's protected from above by Writefile.cpp's index_mux. It's a bit convoluted since the locking is done in the layer above Index.cpp.

chuckcranor commented 10 years ago

hmm, that is convoluted.

hostIndex is only accessed in 4 places:

  1. Index::flush()
  2. Index::memoryFootprintMBs()
  3. Index::addWrite()
  4. Index::truncateHostIndex()

so the caller of each of these must be holding Writefile index_mux before calling?

took a quick look, how about this path:

Container::Truncate() calls Index::rewriteIndex() calls Index::flush()

where is the lock of index_mux in that path?

chuck
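Schematically, the hazard in a path like that (sketch types, not the real classes): any entry point that reaches flush() without its caller holding index_mux can clear the vector out from under a locked addWrite() running in another thread.

```cpp
#include <pthread.h>
#include <vector>

// Sketch of the hazard; in the real code the lock (index_mux) lives in
// WriteFile, one layer above, so Index itself takes no locks.
struct IndexSketch {
    std::vector<int> hostIndex;      // stand-in for the host index entries

    void addWrite(int entry) {       // callers are expected to hold index_mux
        hostIndex.push_back(entry);
    }
    void flush() {                   // same expectation
        /* ... write hostIndex out to the dropping ... */
        hostIndex.clear();
    }
};

// If WriteFile-style callers lock index_mux around addWrite()/flush(),
// they are safe against each other. But a path such as
// Container::Truncate() -> Index::rewriteIndex() -> Index::flush() that
// never acquires index_mux can run clear() concurrently with a locked
// push_back(), which is the kind of corruption a crash inside
// Index::flush() would suggest.
```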

chuckcranor commented 10 years ago

what about this path?

Container_fd::trunc() calls WriteFile::truncate() calls Index::truncateHostIndex()

chuck

thewacokid commented 10 years ago

Hmm...honestly I didn't look down that path because I wasn't calling truncate (not on purpose anyway).

I'll do some quick testing to see if locking down those access paths helps. By the end of sorting through stack traces I must have needed a second set of eyes to make sure I wasn't missing things like this! :)

chuckcranor commented 10 years ago

> Hmm...honestly I didn't look down that path because I wasn't calling truncate (not on purpose anyway).

could you be using open with O_TRUNC? that would call truncate too.

> I'll do some quick testing to see if locking down those access paths helps. By the end of sorting through stack traces I must have needed a second set of eyes to make sure I wasn't missing things like this! :)

the function name overloading doesn't help here either. we've got multiple addWrites, Truncates, etc., which makes grepping/searching around for calls more difficult (for me at any rate).

the good thing, though, is that it should be confined to the ContainerFS files.

chuck

thewacokid commented 10 years ago

Hmm. Calling directly into Index without going through Writefile is going to be an issue (since there's currently no way to lock the mux) and needs some thought.

johnbent commented 10 years ago

Seems like it'd be cleaner to move that mux into Index since it protects Index structures anyway. Would that be problematic since it also protects WriteFile structures?

thewacokid commented 10 years ago

Perhaps, but if it's going to be a major effort, it might be worth waiting till the Index abstraction stuff is done anyway.

Beyond that, I'm somewhat humbled to report that, as of writing this, my test case that failed within 300-400 seconds is happily running along at well over 10 minutes of execution with a mux around the call into truncateHostIndex() in WriteFile::truncate().

I'll let the test run to completion (or failure) and push the change into master ASAP for testing based on the results.
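Roughly the change being described, as a sketch (class and member names are simplified stand-ins based on the discussion above, not necessarily the exact source):

```cpp
#include <pthread.h>
#include <sys/types.h>

// Minimal sketch of the fix: WriteFile::truncate() now holds index_mux
// across the call down into the index, so the truncate path can no
// longer race the addWrite()/flush() callers that already take the
// same lock.
struct IndexSketch {
    int truncateHostIndex(off_t /*offset*/) {
        /* trim the in-memory host index entries */
        return 0;
    }
};

struct WriteFileSketch {
    pthread_mutex_t index_mux;
    IndexSketch    *index;

    int truncate(off_t offset) {
        pthread_mutex_lock(&index_mux);              // the added lock
        int ret = index->truncateHostIndex(offset);
        pthread_mutex_unlock(&index_mux);
        return ret;
    }
};
```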

johnbent commented 10 years ago

That's fantastic. Thanks Dave for pushing this so relentlessly. Thanks Chuck for the invaluable second set of eyes. Hopefully the Index abstraction will reduce the function name overloading some also.

thewacokid commented 10 years ago

Tests died in 1591 seconds.

That's a long shot better than it was before; I'll run it again to confirm and then put in a pull request.

thewacokid commented 10 years ago

Blah. I hate race conditions. 404 seconds for the second iteration. Still, the lock should definitely be there, so I'll put in a pull request for it.

chuckcranor commented 10 years ago

cool! did the stack trace change?

chuck

