RIO - Distribution #243

Closed · amarts closed 4 years ago

amarts commented 7 years ago

RIO-Distribution: Relation Inherited Object Distribution

RIO takes the approach of separating the name, inode and data objects in a file system, and provides a mechanism to distribute these independently of each other while retaining relation inheritance as needed. This is done to address scalability and consistency in the distributed nature of Gluster. Further, it attempts to retain the current performance characteristics, and possibly improve them in certain cases.

Like a typical VFS [9] based file system, RIO separates the name and the inode for any object in the file system: the name appears under the parent inode, and the inode itself may reside on another subvolume (i.e., it is distributed elsewhere). In Gluster-specific terms, the inode# is akin to the GFID of the object, and the name is its dentry. Essentially, there is only one copy of the object as the distribution layer views it (unlike the current distribution layer design, where directories exist everywhere). This helps retain consistency guarantees, as there is only one parent object to work with, and enables scale, as directories are not everywhere (reducing the fan-out operations needed for directories).
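
To make the separation concrete, here is a minimal C sketch (illustrative types only, not Gluster source) of the two objects RIO distributes independently: the name entry that hangs under a parent inode, and the GFID-addressed inode itself.

```c
#include <stdint.h>

typedef struct { uint8_t id[16]; } gfid_t;  /* 128-bit GFID */

/* the name object (dentry): lives under its parent's inode */
typedef struct {
    gfid_t parent_gfid;   /* inode this name hangs under */
    char   name[256];     /* the dentry itself */
    gfid_t target_gfid;   /* inode the name resolves to */
} rio_dentry_t;

/* the inode object: exactly one copy cluster-wide, located by its
 * GFID rather than by any of its names */
typedef struct {
    gfid_t   gfid;
    uint32_t type;        /* directory, regular file, ... */
    uint64_t size;        /* other POSIX attributes elided */
} rio_inode_t;

int main(void)
{
    rio_dentry_t file1 = {
        .parent_gfid = {{0x20, 0x12}},  /* under dir1 (GFID: 2012) */
        .name        = "file1",
        .target_gfid = {{0x20, 0x13}},  /* resolves to GFID 2013 */
    };
    rio_inode_t inode = { .gfid = {{0x20, 0x13}}, .type = 1 };
    (void)file1; (void)inode;
    return 0;
}
```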

NOTE: Directory objects are not constrained to reside on the same subvolume as their parent, whereas file objects are constrained (until a rename or a hardlink) to reside with their parent (enabling faster lookups, sans techniques like caching).

As all file inodes under a parent reside on the same subvolume as the parent, the data for these inodes is separated and stored on other subvolumes, to achieve data distribution as well. This brings about a distinction in the types of subvolumes that RIO handles, namely the metadata subvolume (MDS) and the data subvolume (DS). These are not singletons by design, and hence form a metadata cluster (MDC) and a data cluster (DC) (MDC/DC can be co-resident; further, MDS and DS can also be co-resident).
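
A minimal sketch of these placement rules, assuming a simple hash over the GFID bytes (the hash and helper names are invented for illustration, not RIO's layout code; the example GFIDs are borrowed from the trees later in this comment): directory inodes land on an MDS chosen by their own GFID, file inodes stay on their parent's MDS, and file data goes to a DS chosen by the file's GFID.

```c
#include <stdint.h>
#include <stdio.h>

/* toy FNV-1a hash over the 16 GFID bytes */
static int hash_gfid(const uint8_t gfid[16], int nsubvols)
{
    uint32_t h = 2166136261u;
    for (int i = 0; i < 16; i++)
        h = (h ^ gfid[i]) * 16777619u;
    return (int)(h % (uint32_t)nsubvols);
}

/* which MDS holds an inode */
static int mds_for(const uint8_t gfid[16], const uint8_t parent[16],
                   int is_dir, int n_mds)
{
    if (is_dir)
        return hash_gfid(gfid, n_mds);   /* dirs: anywhere, by GFID */
    return hash_gfid(parent, n_mds);     /* files: with their parent */
}

/* which DS holds a file's data, independent of the MDS choice */
static int ds_for(const uint8_t gfid[16], int n_ds)
{
    return hash_gfid(gfid, n_ds);
}

int main(void)
{
    uint8_t dir1[16]  = {0x20, 0x12};    /* dir1  (GFID: 2012) */
    uint8_t file1[16] = {0x20, 0x13};    /* file1 (GFID: 2013) */

    printf("dir1 inode  -> mds %d\n", mds_for(dir1, dir1, 1, 4));
    printf("file1 inode -> mds %d\n", mds_for(file1, dir1, 0, 4));
    printf("file1 data  -> ds  %d\n", ds_for(file1, 4));
    return 0;
}
```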

RIO also brings in GFID-based object location in the cluster; IOW, instead of the name determining which subvolume the object belongs to, its GFID is used for the determination. This simplifies some complex POSIX operations, like rename and link, and further enables the possibility of storing a single layout for all objects, rather than a per-directory layout.
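
As a hedged illustration of the rename point (made-up types, re-declaring the dentry from the sketch above for self-containment): a rename rewrites only the name object; the target GFID is untouched, so the inode and data, which are located by that GFID, never move.

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

typedef struct { uint8_t id[16]; } gfid_t;

/* the name object: the only thing a rename touches */
typedef struct {
    gfid_t parent;        /* inode the name hangs under */
    char   name[256];
    gfid_t target;        /* GFID the name resolves to */
} dentry_t;

static void rename_entry(dentry_t *d, gfid_t new_parent, const char *new_name)
{
    d->parent = new_parent;
    strncpy(d->name, new_name, sizeof(d->name) - 1);
    d->name[sizeof(d->name) - 1] = '\0';
    /* d->target is untouched: wherever the GFID hashed to, the inode
     * and data stay there; no migration is needed for the rename */
}

int main(void)
{
    gfid_t dir0 = {{0x10, 0x11}}, dir2 = {{0x30, 0x14}};
    dentry_t d = { .parent = dir0, .name = "file1", .target = {{0x20, 0x13}} };

    rename_entry(&d, dir2, "file1-renamed");
    printf("'%s' now under GFID %02x%02x; target GFID unchanged\n",
           d.name, d.parent.id[0], d.parent.id[1]);
    return 0;
}
```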

Having a single layout simplifies consistency needs across the clients and the storage cluster, as the cluster layout is a singleton.

Among other scale needs/issues in Gluster, scaling out a volume has always incurred a rebalance, which is optimal in certain cases but non-optimal in others. This leads to a lot of data movement, and thus time taken to rebalance the cluster. In RIO, the layout is already split into multiple units (more than the actual brick count), which enables more optimal data movement; as a result, RIO aims to reduce both the overall amount of data moved during a rebalance and the time taken to rebalance.
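
A toy sketch of the pre-split idea (the unit count and assignment policy here are invented, not RIO's actual scheme): when the hash space is split into more units than bricks, a scale-out reassigns only the units handed to the new brick, bounding how much data moves.

```c
#include <stdio.h>

#define NUNITS 16                /* layout units, deliberately > bricks */

int main(void)
{
    int owner[NUNITS];
    int old_bricks = 3, new_bricks = 4, moved = 0;

    /* initial layout: spread the units across the old bricks */
    for (int u = 0; u < NUNITS; u++)
        owner[u] = u % old_bricks;

    /* scale-out: steal just NUNITS/new_bricks units for the new
     * brick, picked evenly; every other unit (and its data) stays */
    for (int u = 0, stolen = 0;
         u < NUNITS && stolen < NUNITS / new_bricks;
         u += new_bricks, stolen++) {
        owner[u] = new_bricks - 1;
        moved++;
    }

    printf("units moved: %d of %d\n", moved, NUNITS); /* 4 of 16 */
    return 0;
}
```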

The current RIO graph design would look like the following (image added here for clarity): [image: rio-withafr-graph.dot]

On-disk representation of RIO objects

RIO separates the name, inode# and data blocks for an object in the local FS that backs the bricks. It leverages the local FS for directory inodes, IOW it just creates entries within the same, and hence directories do not have data blocks that belong to them or need to be retained by RIO. File inodes will hence have all three objects on the local FS: the name under the parent (with a GFID xattr), the GFID itself (which is its inode), and data block(s) located using the same GFID (removing the need to store block pointers on the inode).
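
A minimal sketch of the fan-out naming visible in the brick view below (the helper is illustrative; real GFIDs are 16 bytes, while the two-byte GFIDs here match the example trees): the leading byte of the GFID becomes a directory, and the remainder names the inode object, so GFID 2013 lands at 20/13.

```c
#include <stdio.h>

/* map a (two-byte, example-sized) GFID to its backing-FS path:
 * top byte -> fan-out directory, low byte -> entry name */
static void gfid_to_brick_path(unsigned int gfid, char *buf, size_t len)
{
    snprintf(buf, len, "%02x/%02x", (gfid >> 8) & 0xff, gfid & 0xff);
}

int main(void)
{
    char path[8];
    gfid_to_brick_path(0x2013, path, sizeof(path)); /* file1's GFID */
    printf("%s\n", path);                           /* prints 20/13 */
    return 0;
}
```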

From a client perspective the namespace may look like the following. NOTE: (GFID: #) will not appear as part of the name; it is added here for GFID# reference when discussing the brick views.

. (mount root/volume root) (GFID: 0001)
├── dir0 (GFID: 1011)
│   └── dir1 (GFID: 2012)
│       └── file1 (GFID: 2013)
├── dir2 (GFID: 3014)
│   ├── dir1 (GFID: 1015)
│   │   └── file2 (GFID: 1016)
│   └── file2 (GFID: 3017)
├── dir3 (GFID: 0018)
└── file0 (GFID: 0019)

A consolidated view of the MDC would look like the following. NOTE: data in () is presented to explain the relations between the objects, and is not part of the name stored on disk.

. (brick root)
├── 00
│   ├── 01          (GFID: 0001, name: / (volume root))
│   │   └── file0   (GFID: 0019)
│   ├── 18          (GFID: 0018, name: dir3)
│   └── 19          (GFID: 0019, name: file0)
├── 10
│   ├── 11          (GFID: 1011, name: dir0)
│   ├── 15          (GFID: 1015, name: dir1)
│   │   └── file2   (GFID: 1016)
│   └── 16          (GFID: 1016, name: file2)
├── 20
│   ├── 12          (GFID: 2012, name: dir1)
│   │   └── file1   (GFID: 2013)
│   └── 13          (GFID: 2013, name: file1)
└── 30
    ├── 14          (GFID: 3014, name: dir2)
    │   └── file2   (GFID: 3017)
    └── 17          (GFID: 3017, name: file2)

Based on the current layout design, the above consolidated view would end up spread across different bricks. The current layout scheme is discussed in [10].

NOTE: RIO was originally named DHT2 (for lack of a better name or imagination); it has since been renamed RIO. Hence, when searching for details around RIO, a possible option is to start with DHT2, or to start with this github issue, where we provide links to the various documents that discuss the details of this feature from when it was named DHT2.

History:

The seed idea for RIO was announced to the lists in [1] (see section DHT2)

Further to this, the core need for a new distribution scheme was presented at the Gluster Developer Summit in Barcelona in 2015 by @jdarcy; the recording of the presentation is at [2]. (RemoveMe: the slide deck is missing; if we get a reference to it, it will be added to [2])

Subsequently, core architecture considerations for RIO were led by @jdarcy and discussed in the community at [3] and [4].

Initial design

The initial design specification/considerations are captured in the presentations at [5]

Further, post the proof-of-concept work done in [6], a revised design was presented at the Berlin Gluster Developer Summit [7]. Some aspects of the learning from this design are captured as individual documents in [8].

Current work

This github issue now tracks the current work (which will land in master; it is currently being developed in the experimental branch).

Links

ShyamsundarR commented 7 years ago

We will redo some commits to point to this issue as we start working on this feature on the experimental branch.

Over time, as this issue gets crowded with code commits, we would split it into sub-issues and start submitting code against those.

Added for attention: @kotreshhr @spalai

gluster-ant commented 7 years ago

A patch https://review.gluster.org/17684 has been posted that references this issue. Commit message: experimental/dht2: DHT2 initialization and layout abstraction

gluster-ant commented 7 years ago

A patch https://review.gluster.org/17684 has been posted that references this issue. Commit message: experimental/rio: RIO initialization and layout abstraction

gluster-ant commented 7 years ago

A patch https://review.gluster.org/17964 has been posted that references this issue. Commit message: experimental/rio: client fop-generator

ShyamsundarR commented 7 years ago

We (the folks developing RIO code on the experimental branch) stopped using the github issue for all the commits, thinking that we would circle back on this when we add the same to master (still debating whether that was a good idea). Anyway, the result is that there is no single place listing the patches submitted, so here they are,

RIO experimental commits:

ShyamsundarR commented 6 years ago

We would like to land a tech preview of RIO in 4.0; this would be the minimum viable functionality that enables users to take a peek at how RIO works and what the bricks look like.

From a timeline perspective, it may not get in if 4.0 branches in mid-December, but if 4.0 branching is around mid-January it would be feasible to land it.

The tech preview would have the following support,

gluster-ant commented 6 years ago

A patch https://review.gluster.org/18811 has been posted that references this issue. Commit message: rio/everywhere: add icreate/namelink fop

gluster-ant commented 6 years ago

A patch https://review.gluster.org/18988 has been posted that references this issue. Commit message: experimental/rio: RIO initialization and layout abstraction

gluster-ant commented 6 years ago

A patch https://review.gluster.org/19449 has been posted that references this issue. Commit message: experimental/rio: Added support for fsync FOP

gluster-ant commented 6 years ago

A patch https://review.gluster.org/20129 has been posted that references this issue. Commit message: posix2: Fix Makefile to include newer posix sources

gluster-ant commented 6 years ago

A patch https://review.gluster.org/20561 has been posted that references this issue. Commit message: tests/riocreate: mark as a known issue

amarts commented 4 years ago

Thinking more about the design, I believe a Data/Metadata separator xlator would be a good 'enabler' for RIO. I wrote up what it would look like in a document. It should be sufficient to solve the 'ls -l' type of problem in GlusterFS, and even the scale-related issues of Gluster, IMO. Feedback welcome.

sheenobu commented 4 years ago

I actually built something similar to this in 2019, right down to the first xlator being metadata and the second xlator being data, AND /gfid flattened paths. I went through the doc to see where it overlapped; things like fsync weren't 'both' in mine, sadly, because I do not look forward to tracking fd associations. I had also swapped the order on some of them, which does not take ACLs into account, so your document was super helpful!

'ls -l' was solved in a very, very hacky way because the xlator was only meant to be on the brick (essentially, allowing posix_pstat to wind 'up' other xlators to get the d_stat iatt):

  "metadisp" xlator -> 
      posix xlator (metadata-0) -> 
          posix_readdirp_fill
            posix_pstat (/item1) -> syncop_stat(data-0, /$gfid1, ...)
            posix_pstat (/item2) -> syncop_stat(data-0, /$gfid2, ...)
            ...

If there is an existing POC, I would love to look and compare; if there isn't, I can put mine up as a WIP with the caveats that 1. I do not know how much time I can dedicate to it after publishing, and 2. it's based on an older version, so it might not work on master.

amarts commented 4 years ago

@sheenobu That is great to hear! There is not much done w.r.t. coding for this effort. But after talking to @kotreshhr, who worked with @ShyamsundarR earlier to get the RIO efforts going, it looks like we can quickly develop a prototype from the changes already done for RIO.

But in any case, having a codebase is useful. You can share your branch with me (or send a PR to my repo, so we can discuss), or send a patch to Gerrit if that is possible. Happy to join hands and brains in this effort, so we can evaluate whether the problem can be solved.

Also, a note of caution from people who have experience in this domain: not all of the scaling and performance problems of 'ls -l' are due to DHT. They also stem from how we store files on the backend with hardlinks. It behaves fine up to a point, but after a threshold of file count, the performance drops.

sheenobu commented 4 years ago

Thanks for the response. https://github.com/amarts/glusterfs/pull/5. I did a PR on your branch, but I can do a patch to Gerrit once I get into the office tomorrow. I think I remember how to do Gerrit. I don't know what we do for the pre-commit Coverity and code-formatting processes. Oh, I found the clang-format docs in coding-standard.md.

For 'ls -l', I assumed this wasn't a perf thing but a readdirp-on-NFS issue. I skipped it since the test case works on FUSE, so far. I did confirm it on the glusterfs master from the github repo, though.

I'm more interested in supporting JBOD setups (1 NVMe/SSD, N HDD drives), so I was happy to see the google doc has a section on brick-level metadata dispersal. But if the same xlator can be useful in RIO, then that's great reuse.

gluster-ant commented 4 years ago

A patch https://review.gluster.org/24071 has been posted that references this issue.

metadisp: initial commit

Summary:

feature/metadisp is an xlator for performing "metadata dispersal" across multiple children. It does this by flattening complex POSIX paths into /$GFID-style paths, then forwarding metadata operations to its first child and data operations to its second child.

The purpose of this xlator is to allow separation of data and metadata, in cases where metadata might be stored in another format (embedded kv?), on another disk (ssd), on another host (dht2).

Change-Id: I392c8bd0c867a3237d144aea327323f700a2728d
Updates: #243
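
To make the routing idea in the commit message concrete, here is a minimal sketch (hypothetical types; the real xlator winds FOPs down to its children rather than printing): flatten the POSIX path to a /$GFID-style name, then send metadata FOPs to the first child and data FOPs to the second.

```c
#include <stdio.h>

enum fop_class { FOP_METADATA, FOP_DATA };

struct child { const char *name; };

/* child 0 stores metadata, child 1 stores data */
static struct child children[2] = { { "metadata-0" }, { "data-0" } };

/* flatten "dir0/dir1/file1" to a /$GFID name; the GFID resolution
 * is faked here, real code would look it up */
static void flatten(const char *path, char *out, size_t len)
{
    (void)path;
    snprintf(out, len, "/%s", "2013");
}

static void dispatch(enum fop_class cls, const char *path)
{
    char flat[64];
    flatten(path, flat, sizeof(flat));
    printf("%s -> %s\n", flat, children[cls == FOP_DATA ? 1 : 0].name);
}

int main(void)
{
    dispatch(FOP_METADATA, "dir0/dir1/file1"); /* stat-like FOPs */
    dispatch(FOP_DATA, "dir0/dir1/file1");     /* read/write FOPs */
    return 0;
}
```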

gluster-ant commented 4 years ago

A patch https://review.gluster.org/24102 has been posted that references this issue.

metadisp: initial commit

Summary:

feature/metadisp is an xlator for performing "metadata dispersal" across multiple children. It does this by flattening complex POSIX paths into /$GFID-style paths, then forwarding metadata operations to its first child and data operations to its second child.

The purpose of this xlator is to allow separation of data and metadata, in cases where metadata might be stored in another format (embedded kv?), on another disk (ssd), on another host (dht2).

Change-Id: I2b72b1d7e48bf5adf98efed63f7b9fcaaa75ee8b
Updates: #243

amarts commented 4 years ago

Gave the flag, as there are surely a lot of documents made available for this feature, and multiple design specs.

amarts commented 4 years ago

Removed the approval flags, as this issue is about RIO, and DS/MDS is a separate effort (to enable RIO faster). We will now track it through #816.

stale[bot] commented 4 years ago

Thank you for your contributions. We noticed that this issue has not had any activity in the last ~6 months! We are marking this issue as stale because it has not had recent activity. It will be closed in 2 weeks if no one responds with a comment here.

stale[bot] commented 4 years ago

Closing this issue, as there has been no update since my last update on the issue. If this issue is still valid, feel free to reopen it.