RIO - Distribution #243

Closed · amarts closed 4 years ago

amarts commented 7 years ago

RIO-Distribution: Relation Inherited Object Distribution

RIO takes the approach of separating the name, inode and data objects in a file system, and provides a mechanism to distribute these independently of each other while retaining relation inheritance as needed. This is done to address scalability and consistency in the distributed nature of Gluster. Further, it attempts to retain the current performance characteristics, and possibly improve them in certain cases.

Like a typical VFS [9] based file system, RIO separates the name and the inode for any object in the file system: the name appears under the parent inode, and the inode itself may reside on another subvolume (i.e., it is distributed elsewhere). In Gluster-specific terms, the inode# is akin to the GFID of the object, and the name is its dentry. Essentially, there is only one copy of the object as the distribution layer views it (unlike the current distribution layer design, where directories exist everywhere). This helps retain consistency guarantees, as there is only one parent object to work with, and enables scale, as directories are not everywhere (reducing the fan-out operations needed for directories).
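
To make the separation concrete, here is a minimal C sketch (illustrative types only, not Gluster source) of the two objects RIO distributes independently: the name entry that hangs under a parent inode, and the GFID-addressed inode itself.

```c
#include <stdint.h>

typedef struct { uint8_t id[16]; } gfid_t;  /* 128-bit GFID */

/* the name object (dentry): lives under its parent's inode */
typedef struct {
    gfid_t parent_gfid;   /* inode this name hangs under */
    char   name[256];     /* the dentry itself */
    gfid_t target_gfid;   /* inode the name resolves to */
} rio_dentry_t;

/* the inode object: exactly one copy cluster-wide, located by its
 * GFID rather than by any of its names */
typedef struct {
    gfid_t   gfid;
    uint32_t type;        /* directory, regular file, ... */
    uint64_t size;        /* other POSIX attributes elided */
} rio_inode_t;

int main(void)
{
    rio_dentry_t file1 = {
        .parent_gfid = {{0x20, 0x12}},  /* under dir1 (GFID: 2012) */
        .name        = "file1",
        .target_gfid = {{0x20, 0x13}},  /* resolves to GFID 2013 */
    };
    rio_inode_t inode = { .gfid = {{0x20, 0x13}}, .type = 1 };
    (void)file1; (void)inode;
    return 0;
}
```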

NOTE: Directory objects are not constrained to reside on the same subvolume as their parent, whereas file objects are constrained (until a rename or a hardlink) to reside with their parent (enabling faster lookups, sans techniques like caching).

As all file inodes under a parent reside on the same subvolume as the parent, the data for these inodes is separated and stored on other subvolumes, to achieve data distribution as well. This brings about a distinction in the types of subvolumes that RIO handles, namely the metadata subvolume (MDS) and the data subvolume (DS). These are not singletons by design, and hence form a metadata cluster (MDC) and a data cluster (DC) (MDC/DC can be co-resident; further, MDS and DS can also be co-resident).
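
A minimal sketch of these placement rules, assuming a simple hash over the GFID bytes (the hash and helper names are invented for illustration, not RIO's layout code; the example GFIDs are borrowed from the trees later in this comment): directory inodes land on an MDS chosen by their own GFID, file inodes stay on their parent's MDS, and file data goes to a DS chosen by the file's GFID.

```c
#include <stdint.h>
#include <stdio.h>

/* toy FNV-1a hash over the 16 GFID bytes */
static int hash_gfid(const uint8_t gfid[16], int nsubvols)
{
    uint32_t h = 2166136261u;
    for (int i = 0; i < 16; i++)
        h = (h ^ gfid[i]) * 16777619u;
    return (int)(h % (uint32_t)nsubvols);
}

/* which MDS holds an inode */
static int mds_for(const uint8_t gfid[16], const uint8_t parent[16],
                   int is_dir, int n_mds)
{
    if (is_dir)
        return hash_gfid(gfid, n_mds);   /* dirs: anywhere, by GFID */
    return hash_gfid(parent, n_mds);     /* files: with their parent */
}

/* which DS holds a file's data, independent of the MDS choice */
static int ds_for(const uint8_t gfid[16], int n_ds)
{
    return hash_gfid(gfid, n_ds);
}

int main(void)
{
    uint8_t dir1[16]  = {0x20, 0x12};    /* dir1  (GFID: 2012) */
    uint8_t file1[16] = {0x20, 0x13};    /* file1 (GFID: 2013) */

    printf("dir1 inode  -> mds %d\n", mds_for(dir1, dir1, 1, 4));
    printf("file1 inode -> mds %d\n", mds_for(file1, dir1, 0, 4));
    printf("file1 data  -> ds  %d\n", ds_for(file1, 4));
    return 0;
}
```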

RIO also brings in GFID-based object location in the cluster; IOW, instead of the name determining which subvolume the object belongs to, its GFID is used for the determination. This simplifies some complex POSIX operations, like rename and link, and further enables the possibility of storing a single layout for all objects, rather than a per-directory layout.
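
As a hedged illustration of the rename point (made-up types, re-declaring the dentry from the sketch above for self-containment): a rename rewrites only the name object; the target GFID is untouched, so the inode and data, which are located by that GFID, never move.

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

typedef struct { uint8_t id[16]; } gfid_t;

/* the name object: the only thing a rename touches */
typedef struct {
    gfid_t parent;        /* inode the name hangs under */
    char   name[256];
    gfid_t target;        /* GFID the name resolves to */
} dentry_t;

static void rename_entry(dentry_t *d, gfid_t new_parent, const char *new_name)
{
    d->parent = new_parent;
    strncpy(d->name, new_name, sizeof(d->name) - 1);
    d->name[sizeof(d->name) - 1] = '\0';
    /* d->target is untouched: wherever the GFID hashed to, the inode
     * and data stay there; no migration is needed for the rename */
}

int main(void)
{
    gfid_t dir0 = {{0x10, 0x11}}, dir2 = {{0x30, 0x14}};
    dentry_t d = { .parent = dir0, .name = "file1", .target = {{0x20, 0x13}} };

    rename_entry(&d, dir2, "file1-renamed");
    printf("'%s' now under GFID %02x%02x; target GFID unchanged\n",
           d.name, d.parent.id[0], d.parent.id[1]);
    return 0;
}
```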

Having a single layout simplifies consistency needs across the clients and the storage cluster, as the cluster layout is a singleton.

Among other scale needs/issues in Gluster, scaling out a volume has always incurred a rebalance, which is optimal in certain cases but non-optimal in others. This leads to a lot of data movement, and thus time taken to rebalance the cluster. In RIO, the layout is already split into multiple units (more than the actual brick count), which enables more optimal data movement; as a result, RIO aims to reduce both the overall amount of data moved during a rebalance and the time taken to rebalance.
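
A toy sketch of the pre-split idea (the unit count and assignment policy here are invented, not RIO's actual scheme): when the hash space is split into more units than bricks, a scale-out reassigns only the units handed to the new brick, bounding how much data moves.

```c
#include <stdio.h>

#define NUNITS 16                /* layout units, deliberately > bricks */

int main(void)
{
    int owner[NUNITS];
    int old_bricks = 3, new_bricks = 4, moved = 0;

    /* initial layout: spread the units across the old bricks */
    for (int u = 0; u < NUNITS; u++)
        owner[u] = u % old_bricks;

    /* scale-out: steal just NUNITS/new_bricks units for the new
     * brick, picked evenly; every other unit (and its data) stays */
    for (int u = 0, stolen = 0;
         u < NUNITS && stolen < NUNITS / new_bricks;
         u += new_bricks, stolen++) {
        owner[u] = new_bricks - 1;
        moved++;
    }

    printf("units moved: %d of %d\n", moved, NUNITS); /* 4 of 16 */
    return 0;
}
```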

The current RIO graph design would look like the following (image added here for clarity): [image: rio-withafr-graph.dot]

On-disk representation of RIO objects

RIO separates the name, inode# and data blocks for an object in the local FS that backs the bricks. It leverages the local FS for directory inodes, IOW it just creates entries within the same, and hence directories do not have data blocks that belong to them or need to be retained by RIO. File inodes will hence have all three objects on the local FS: the name under the parent (with a GFID xattr), the GFID itself (which is its inode), and data block(s) located using the same GFID (removing the need to store block pointers on the inode).
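
A minimal sketch of the fan-out naming visible in the brick view below (the helper is illustrative; real GFIDs are 16 bytes, while the two-byte GFIDs here match the example trees): the leading byte of the GFID becomes a directory, and the remainder names the inode object, so GFID 2013 lands at 20/13.

```c
#include <stdio.h>

/* map a (two-byte, example-sized) GFID to its backing-FS path:
 * top byte -> fan-out directory, low byte -> entry name */
static void gfid_to_brick_path(unsigned int gfid, char *buf, size_t len)
{
    snprintf(buf, len, "%02x/%02x", (gfid >> 8) & 0xff, gfid & 0xff);
}

int main(void)
{
    char path[8];
    gfid_to_brick_path(0x2013, path, sizeof(path)); /* file1's GFID */
    printf("%s\n", path);                           /* prints 20/13 */
    return 0;
}
```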

From a client perspective the namespace may look like the following. NOTE: (GFID: #) will not appear as part of the name; it is added here for GFID# reference when discussing the brick views.

. (mount root/volume root) (GFID: 0001)
├── dir0 (GFID: 1011)
│   └── dir1 (GFID: 2012)
│       └── file1 (GFID: 2013)
├── dir2 (GFID: 3014)
│   ├── dir1 (GFID: 1015)
│   │   └── file2 (GFID: 1016)
│   └── file2 (GFID: 3017)
├── dir3 (GFID: 0018)
└── file0 (GFID: 0019)

A consolidated view of the MDC would look like the following. NOTE: data in () is presented to explain the relations between the objects, and is not part of the name stored on disk.

. (brick root)
├── 00
│   ├── 01          (GFID: 0001, name: / (volume root))
│   │   └── file0   (GFID: 0019)
│   ├── 18          (GFID: 0018, name: dir3)
│   └── 19          (GFID: 0019, name: file0)
├── 10
│   ├── 11          (GFID: 1011, name: dir0)
│   ├── 15          (GFID: 1015, name: dir1)
│   │   └── file2   (GFID: 1016)
│   └── 16          (GFID: 1016, name: file2)
├── 20
│   ├── 12          (GFID: 2012, name: dir1)
│   │   └── file1   (GFID: 2013)
│   └── 13          (GFID: 2013, name: file1)
└── 30
    ├── 14          (GFID: 3014, name: dir2)
    │   └── file2   (GFID: 3017)
    └── 17          (GFID: 3017, name: file2)

Based on the current layout design, the above consolidated view would end up spread across different bricks. The current layout scheme is discussed in [10].

NOTE: RIO was originally named DHT2 (for lack of a better name or imagination); it has since been renamed RIO. Hence, when searching for details around RIO, a possible option is to start with DHT2, or to start with this github issue, where we provide links to the various documents that discuss the details of this feature from when it was named DHT2.

History:

The seed idea for RIO was announced to the lists in [1] (see section DHT2)

Further to this, the core need for a new distribution scheme was presented at the Gluster Developer Summit in Barcelona in 2015 by @jdarcy; the recording of the presentation is at [2]. (RemoveMe: the slide deck is missing; if we get a reference to it, it will be added to [2])

Subsequently, core architecture considerations for RIO were led by @jdarcy and discussed in the community at [3] and [4].

Initial design

The initial design specification/considerations are captured in the presentations at [5]

Further, post the proof-of-concept work done in [6], a revised design was presented at the Berlin Gluster Developer Summit [7]. Some aspects of the learning from this design are captured as individual documents in [8].

Current work

This github issue now tracks the current work (which will land in master; it is currently being developed in the experimental branch).

Links

ShyamsundarR commented 7 years ago

We will redo some commits to point to this issue as we start working on this feature on the experimental branch.

Over time, as this issue gets crowded with code commits, we would split it into sub-issues and start submitting code against those.

Added for attention: @kotreshhr @spalai

gluster-ant commented 7 years ago

A patch https://review.gluster.org/17684 has been posted that references this issue. Commit message: experimental/dht2: DHT2 initialization and layout abstraction

gluster-ant commented 7 years ago

A patch https://review.gluster.org/17684 has been posted that references this issue. Commit message: experimental/rio: RIO initialization and layout abstraction

gluster-ant commented 7 years ago

A patch https://review.gluster.org/17964 has been posted that references this issue. Commit message: experimental/rio: client fop-generator

ShyamsundarR commented 7 years ago

We (the folks developing RIO code on the experimental branch) stopped using the github issue for all the commits, thinking that we would circle back on this when we add the same to master (still debating whether that was a good idea). Anyway, the result is that there is no single place listing the patches submitted, so here they are,

RIO experimental commits:

ShyamsundarR commented 6 years ago

We would like to land a tech preview of RIO in 4.0; this would be the minimum viable functionality that enables users to take a peek at how RIO works and what the bricks look like.

From a timeline perspective, it may not get in if 4.0 branches in mid-December, but if 4.0 branching is around mid-January it would be feasible to land it.

The tech preview would have the following support,

gluster-ant commented 6 years ago

A patch https://review.gluster.org/18811 has been posted that references this issue. Commit message: rio/everywhere: add icreate/namelink fop

gluster-ant commented 6 years ago

A patch https://review.gluster.org/18988 has been posted that references this issue. Commit message: experimental/rio: RIO initialization and layout abstraction

gluster-ant commented 6 years ago

A patch https://review.gluster.org/19449 has been posted that references this issue. Commit message: experimental/rio: Added support for fsync FOP

gluster-ant commented 6 years ago

A patch https://review.gluster.org/20129 has been posted that references this issue. Commit message: posix2: Fix Makefile to include newer posix sources

gluster-ant commented 6 years ago

A patch https://review.gluster.org/20561 has been posted that references this issue. Commit message: tests/riocreate: mark as a known issue

amarts commented 4 years ago

Thinking more about the design, I believe a Data/Metadata separator xlator would be a good 'enabler' for RIO. I wrote up what it would look like in a document. It should be sufficient to solve the 'ls -l' type of problem in GlusterFS, and even the scale-related issues of Gluster, IMO. Feedback welcome.

sheenobu commented 4 years ago

I actually built something similar to this in 2019, right down to the first xlator being metadata and the second xlator being data, AND /gfid flattened paths. I went through the doc to see where it overlapped; things like fsync weren't 'both' in mine, sadly, because I do not look forward to tracking fd associations. I had also swapped the order on some of them, which does not take ACLs into account, so your document was super helpful!

'ls -l' was solved in a very, very hacky way because the xlator was only meant to be on the brick (essentially, allowing posix_pstat to wind 'up' other xlators to get the d_stat iatt):

  "metadisp" xlator -> 
      posix xlator (metadata-0) -> 
          posix_readdirp_fill
            posix_pstat (/item1) -> syncop_stat(data-0, /$gfid1, ...)
            posix_pstat (/item2) -> syncop_stat(data-0, /$gfid2, ...)
            ...

If there is an existing POC, I would love to look and compare; if there isn't, I can put mine up as a WIP with the caveats that 1. I do not know how much time I can dedicate to it after publishing, and 2. it's based on an older version, so it might not work on master.

amarts commented 4 years ago

@sheenobu That is great to hear! There is not much done w.r.t. coding for this effort. But after talking to @kotreshhr, who worked with @ShyamsundarR earlier to get the RIO efforts going, it looks like we can quickly develop a prototype from the changes already done for RIO.

But in any case, having a codebase is useful. You can share your branch with me (or send a PR to my repo, so we can discuss), or send a patch to Gerrit if that is possible. Happy to join hands and brains in this effort, so we can evaluate whether the problem can be solved.

Also, a note of caution from people who have experience in this domain: not all of the scaling and performance problems of 'ls -l' are due to DHT. They also stem from how we store files on the backend with hardlinks. It behaves fine up to a point, but after a threshold of file count, the performance drops.

sheenobu commented 4 years ago

Thanks for the response. https://github.com/amarts/glusterfs/pull/5. I did a PR on your branch, but I can do a patch to Gerrit once I get into the office tomorrow. I think I remember how to do Gerrit. I don't know what we do for the pre-commit Coverity and code-formatting processes. Oh, I found the clang-format docs in coding-standard.md.

For 'ls -l', I assumed this wasn't a perf thing but a readdirp-on-NFS issue. I skipped it since the test case works on FUSE, so far. I did confirm it on the glusterfs master from the github repo, though.

I'm more interested in supporting JBOD setups (1 NVMe/SSD, N HDD drives), so I was happy to see the google doc has a section on brick-level metadata dispersal. But if the same xlator can be useful in RIO, then that's great reuse.

gluster-ant commented 4 years ago

A patch https://review.gluster.org/24071 has been posted that references this issue.

metadisp: initial commit

Summary:

feature/metadisp is an xlator for performing "metadata dispersal" across multiple children. It does this by flattening complex POSIX paths into /$GFID-style paths, then forwarding metadata operations to its first child and data operations to its second child.

The purpose of this xlator is to allow separation of data and metadata, in cases where metadata might be stored in another format (embedded kv?), on another disk (ssd), on another host (dht2).

Change-Id: I392c8bd0c867a3237d144aea327323f700a2728d
Updates: #243
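
To make the routing idea in the commit message concrete, here is a minimal sketch (hypothetical types; the real xlator winds FOPs down to its children rather than printing): flatten the POSIX path to a /$GFID-style name, then send metadata FOPs to the first child and data FOPs to the second.

```c
#include <stdio.h>

enum fop_class { FOP_METADATA, FOP_DATA };

struct child { const char *name; };

/* child 0 stores metadata, child 1 stores data */
static struct child children[2] = { { "metadata-0" }, { "data-0" } };

/* flatten "dir0/dir1/file1" to a /$GFID name; the GFID resolution
 * is faked here, real code would look it up */
static void flatten(const char *path, char *out, size_t len)
{
    (void)path;
    snprintf(out, len, "/%s", "2013");
}

static void dispatch(enum fop_class cls, const char *path)
{
    char flat[64];
    flatten(path, flat, sizeof(flat));
    printf("%s -> %s\n", flat, children[cls == FOP_DATA ? 1 : 0].name);
}

int main(void)
{
    dispatch(FOP_METADATA, "dir0/dir1/file1"); /* stat-like FOPs */
    dispatch(FOP_DATA, "dir0/dir1/file1");     /* read/write FOPs */
    return 0;
}
```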

gluster-ant commented 4 years ago

A patch https://review.gluster.org/24102 has been posted that references this issue.

metadisp: initial commit

Summary:

feature/metadisp is an xlator for performing "metadata dispersal" across multiple children. It does this by flattening complex POSIX paths into /$GFID-style paths, then forwarding metadata operations to its first child and data operations to its second child.

The purpose of this xlator is to allow separation of data and metadata, in cases where metadata might be stored in another format (embedded kv?), on another disk (ssd), on another host (dht2).

Change-Id: I2b72b1d7e48bf5adf98efed63f7b9fcaaa75ee8b
Updates: #243

amarts commented 4 years ago

Gave the flag, as there are surely a lot of documents made available for this feature, and multiple design specs.

amarts commented 4 years ago

Removed the approval flags, as this issue is about RIO, and DS/MDS is a separate effort (to enable RIO faster). We will now track it through #816.

stale[bot] commented 4 years ago

Thank you for your contributions. We noticed that this issue has not had any activity in the last ~6 months! We are marking this issue as stale because it has not had recent activity. It will be closed in 2 weeks if no one responds with a comment here.

stale[bot] commented 4 years ago

Closing this issue, as there has been no update since my last update on the issue. If this issue is still valid, feel free to reopen it.