Burst Buffer mode broken

johnbent commented 10 years ago

All,

The IOStore code (while awesome) sadly broke burst buffer mode. IOStore currently rejects any and all paths that aren't predefined in the plfsrc. Burst buffer mode works by storing shadow paths in metalinks. So one node has /bb1 as a shadow and another node has /bb2 as a shadow. Now when the first node wants to read a file, it goes to the canonical container where it finds a metalink to /bb2. When it tries to actually do operations on /bb2, it fails because the plfsrc on the first node doesn't list /bb2.

And we don't want the first node to list /bb2 in its plfsrc. The only place that it could put it would be in canonical_backends or shadow_backends. Canonical is no good since we don't (currently) want canonical containers stored in burst buffers. Shadow is no good because we want each node to use a particular shadow (or subset) so we need a different shadow_backends defined with only a subset of shadows on each plfsrc.

One solution is to modify IOStore to allow previously unknown paths, but this means that we'll have to put the IOStore type (glib, posix, etc.) into the metalink.

Another solution is to create a new plfsrc directive called read_only_shadow_backends where we can list the other burst buffers that aren't used for writing. Then the first node will have /bb2 as a read-only shadow and the second node will have /bb1 as a read-only shadow.

chuckcranor commented 10 years ago

I looked at this a bit. If we've got:

node1: {canonical=/m/pana0,shadow=/bb_n1}
node2: {canonical=/m/pana0,shadow=/bb_n2}

if node1 does a read and gets a metalink in the canonical container in /m/pana0 that points to /bb_n2, then what is the read supposed to do?

should the data in a node's burst buffer be available to the other remote nodes prior to the async transfer completing?

notes on the current code path:

- it does support putting the IOStore type in the metalink, but for posix mounts it optimizes the "posix:" out to save space and maintain backwards compat with pre-IOStore plfs. if it put a pvfs, hdfs, or iofs shadow metalink in, then it would have the prefix.
- the failure you are going to get would be:
  - readMetalink() gets the metalink
  - it parses the metalink and calls plfs_phys_backlookup(cp, pmnt, backout, NULL) to look it up
  - that plfs_phys_backlookup() call is going to fail because the backend isn't listed in plfsrc.

basically it is trying to map the metalink back to a specific plfs_backend in the given PlfsMount and not finding it.
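The lookup failure described above can be sketched in a few lines. This is only an illustration under stated assumptions: `Backend`, `ParsedLink`, `parse_metalink_target`, and `back_lookup` are hypothetical names, not the real `plfs_backend` or `plfs_phys_backlookup()`, and the `type:path` target syntax is assumed for the example. The sketch treats a target without a prefix as POSIX and then tries to map it onto one of the backends the node knows about.

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <string>
#include <vector>

// Hypothetical stand-in for a backend listed in plfsrc (not the real plfs_backend).
struct Backend {
    std::string store_type;  // e.g. "posix", "pvfs", "hdfs"
    std::string path;        // backend root, e.g. "/bb_n1"
};

struct ParsedLink {
    std::string store_type;
    std::string path;
};

// Split an optional "type:" prefix off a metalink target. A target with no
// prefix is assumed to be POSIX, mirroring the space-saving convention above.
ParsedLink parse_metalink_target(const std::string& target) {
    if (!target.empty() && target[0] == '/') return {"posix", target};
    const std::size_t colon = target.find(':');
    if (colon == std::string::npos) return {"posix", target};
    return {target.substr(0, colon), target.substr(colon + 1)};
}

// Return the index of the configured backend that owns the target, or
// std::nullopt when no configured backend matches (the failure case).
std::optional<std::size_t> back_lookup(const std::vector<Backend>& backends,
                                       const std::string& target) {
    const ParsedLink link = parse_metalink_target(target);
    for (std::size_t i = 0; i < backends.size(); ++i) {
        const Backend& b = backends[i];
        assert(!b.path.empty());  // backend roots are expected to be non-empty
        if (b.store_type != link.store_type) continue;
        if (link.path.compare(0, b.path.size(), b.path) != 0) continue;
        // Require a path-component boundary so "/bb_n1" does not match "/bb_n10".
        if (link.path.size() == b.path.size() || link.path[b.path.size()] == '/') {
            return i;
        }
    }
    return std::nullopt;
}
```

With node1 configured as `{canonical=/m/pana0, shadow=/bb_n1}`, a target under `/bb_n1` resolves to a backend, while a target under `/bb_n2` yields no match, which is the situation node1 hits when it follows a metalink written by node2.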

it is part of the code that lets plfs run multiple logical mount points at the same time with non-POSIX filesystems (i.e. filesystems that require you to "attach" to them before using them).

the code assumes that the Metalink is pointing to something that is in the current PlfsMount, and thus its plfs_backend has already been allocated and properly attached to (so there is no further init needed to perform I/O). So even if we hit a Metalink with "pvfs://foo/bar/we/have/not/seen" in it we wouldn't be able to do I/O to it because it wouldn't be attached (pvfs client may not even be init'd).

i'm thinking it is not entirely a good idea to let PLFS do backend I/O to filesystems not listed in plfsrc anyway, since you can easily end up with a bad plfsrc and not even know it. so the option of listing a read-only shadow backend seems like the way to go to make this work.

internally, the way it could work is that these backends would be listed in the PlfsMount->backends[] array, but appear in neither PlfsMount->shadow_backends[] nor PlfsMount->canonical_backends[]. there are probably some sanity checks in insert_mount_point that would have to get updated.
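A rough sketch of that internal layout, assuming a simplified mount structure: `MountSketch`, `is_read_only`, `mount_is_sane`, and `make_node1` are hypothetical names for illustration, and the real PlfsMount and insert_mount_point may look quite different. The idea is that a backend is read-only when it is known to the mount but has neither the canonical nor the shadow role.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical simplification of a mount. The field names echo the ones
// mentioned above, but this is not the real PlfsMount layout.
struct MountSketch {
    std::vector<std::string> backends;            // every backend this node may touch
    std::vector<std::size_t> canonical_backends;  // indices into backends
    std::vector<std::size_t> shadow_backends;     // indices into backends (writable shadows)
};

static bool contains(const std::vector<std::size_t>& v, std::size_t i) {
    return std::find(v.begin(), v.end(), i) != v.end();
}

// A backend is read-only when the mount knows it but it is neither a
// canonical backend nor a (writable) shadow backend.
bool is_read_only(const MountSketch& m, std::size_t i) {
    assert(i < m.backends.size());
    return !contains(m.canonical_backends, i) && !contains(m.shadow_backends, i);
}

// The kind of sanity check that might need updating (an assumption): role
// lists must refer to known backends, and at least one canonical backend
// must exist. Backends with no role are allowed.
bool mount_is_sane(const MountSketch& m) {
    auto in_range = [&m](std::size_t i) { return i < m.backends.size(); };
    return !m.canonical_backends.empty() &&
           std::all_of(m.canonical_backends.begin(), m.canonical_backends.end(), in_range) &&
           std::all_of(m.shadow_backends.begin(), m.shadow_backends.end(), in_range);
}

// node1 from the example above, with node2's burst buffer added read-only.
MountSketch make_node1() {
    return MountSketch{{"/m/pana0", "/bb_n1", "/bb_n2"}, {0}, {1}};
}
```

In this sketch node1 still writes only to /bb_n1, but /bb_n2 sits in the full backend list, so a metalink pointing there could be resolved for reads.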

chuck

On Wed, Sep 18, 2013 at 02:34:17PM -0700, John Bent wrote:

The IOStore code (while awesome) sadly broke burst buffer mode. IOStore currently rejects any and all paths that aren't predefined in the plfsrc. Burst buffer mode works by storing shadow paths in metalinks. So one node has /bb1 as a shadow and another node has /bb2 as a shadow. Now when the first node wants to read a file, it goes to the canonical container where it finds a metalink to /bb2. When it tries to actually do operations on /bb2, it fails because the plfsrc on the first node doesn't list /bb2 in the plfsrc.

And we don't want the first node to list /bb2 in its plfsrc. The only place that it could put it would be in canonical_backends or shadow_backends. Canonical is no good since we don't (currently) want canonical containers stored in burst buffers. Shadow is no good because we want each node to use a particular shadow (or subset) so we need a different shadow_backends defined with only a subset of shadows on each plfsrc.

One solution is to modify IOStore to allow previously unknown paths but this means that we'll have to put the IOStore type (glib,posix,etc) into the metalink.

Another solution is to create a new plfsrc directive called read_only_shadow_backends where we can list the other burst buffers that aren't used for writing. Then the first node will have /bb2 as a readonly_shadow and the second node will have /bb1 as a readonly_shadow.

brettkettering commented 10 years ago

I think we need to establish well-defined requirements for what burst buffer mode is in PLFS. Then, we need to design an implementation that makes it overt and supported. We don't want to rely on hiding it under the covers. We want the person who defines the PLFS mounts to be able to specify a mount with the supported components (posix, glibc, hdfs, burst buffer, etc.) that create the functionality needed in a PLFS mount for a given installation.

plfs / plfs-core

Burst Buffer mode broken #314