amarts opened this issue 6 years ago
Certainly this feature would mean a world of simplification from a CSI-based deployment standpoint. It would mean that a gluster-block node driver wouldn't need to bundle any special packages, and could just make use of the FUSE client in the glusterfs node driver.
We briefly discussed other ways (in addition to loopback itself) to accomplish a similar behavior:
These would all accomplish the goal of isolating the "block-ness" within the CSI mount container, which would then only "talk gluster" to the outside world. Certainly, loopback would be the easiest to prototype, but one of the alternatives may prove more robust for production cases. More research is needed for those though :-)
CC: @nixpanic
One of the potential issues with a loopback device backed by a network filesystem is that the system may deadlock when it comes under memory pressure: flushing dirty pages of the loop device requires the network filesystem to make progress, which may itself need memory. This has been reported for loopback mounts on NFS exports. The problem manifests itself like:
This is something that needs careful reviewing and testing.
I had posted this to a non-public list a while back; I've cleaned it up and am posting it here again. It's long, but represents what we did with loopback devices and Gluster in an Openshift cluster.
Here is where we (@JohnStrunk and I) were with our investigation into using loopback devices to serve RWO PVs in an Openshift cloud instance environment. The learnings are not specific to that environment, hence posting them here.
The mode of delivery was based on a k8s Flexvol driver, and the work can be found at [1]. This is mostly irrelevant to the discussion below, but better to get it out of the way.
The intention in looking at loopback devices was to address scale and small-file performance issues in the said Openshift clusters. The scale requirement was supporting 5k 1GB PVs from a single 3-node replicated volume, backed by a 1TB EBS volume, with 50% of the disk reserved for snapshots (so, in essence, 500GB of disk for Gluster). The performance issue can be summarized as the time to untar a medium-sized tarball to a PV backed by Gluster in FUSE mode (used as a shared file system), versus running it against a local FS backed by a FUSE-hosted file used as a block device; the latter of course wins (and has its own set of learnings to poke and prod).
Fact: any local FS backed by a file over a network is faster for small-file workloads, as many operations actually happen in the page cache and are flushed only when needed, reducing a lot of the network chatter that is otherwise evident in any network file system (NFS, SMB, Gluster, etc.).
It was pointed out earlier by @vbellur that loop devices on Fedora scale into the thousands (I forget the exact number, but I think he mentioned 8k). With this data, loop devices started to look interesting again, as part of the scale needs were met.
I re-established the scale test and scaled up to 1k devices. Further, I read up on loopctl and the loop device kernel code to start looking at this more factually than empirically [7].
One of the interesting discoveries (empirical again, hence a discovery) was that I was able to create the backing files, XFS-format, and mount ~815 loop devices on a 1GB RAM virtual (vagrant-based) machine running RHEL 7; to test up to 1k loop devices, I had to increase RAM to 2GB (which also gave it enough headroom for other processes). This test, though, only creates and mounts the devices; it does not perform IO on the mounts. This is useful when considering production environments and the machine requirements we need to set forth for them.
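The create/format/mount cycle of that scale test can be sketched as follows. Paths and the device count here are illustrative (the real test went to ~815-1000 devices), and the attach/format/mount steps are shown as comments since they need root and real `/dev/loopN` nodes:

```shell
#!/bin/sh
# Sketch of the loop device scale test. Only sparse backing files are
# created here; the privileged steps are shown as comments.
set -e
N=8
BASE="${TMPDIR:-/tmp}/loop-scale-test"
mkdir -p "$BASE"
i=1
while [ "$i" -le "$N" ]; do
    f="$BASE/backing-$i.img"
    truncate -s 1G "$f"     # sparse 1GB backing file, ~0 blocks on disk
    # As root on a real host, each file then becomes a mounted device:
    #   dev=$(losetup --find --show "$f")   # attach the next free /dev/loopN
    #   mkfs.xfs -q "$dev"
    #   mkdir -p "$BASE/mnt-$i" && mount "$dev" "$BASE/mnt-$i"
    i=$((i + 1))
done
ls "$BASE" | grep -c 'backing-.*\.img'   # -> 8
```

Because the files are sparse, even thousands of 1GB backing files cost almost no disk until IO happens, which matches the observation that memory, not disk, was the limiting factor in the create-and-mount test.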
The next test/check was to ensure that loop devices support discard, so that an in-use PV keeps its backing file size optimal as it churns data. Loop devices on Linux do support discard, and this works as required, so this need was satisfied. This is key as we over-provision in the said Openshift environment by a factor of 10x: 1GB * 5000 (PVs) = ~5TB against a 500GB drive, so we need to over-provision (thin-provision) and ensure consumed space stays optimal and is not wasted.
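A quick sketch of what discard does to the backing file (file names illustrative): a discard issued through the loop device becomes a hole punched in the backing file, so the unprivileged part below shows the effect directly, with the root-only end-to-end path in comments.

```shell
#!/bin/sh
# Demonstrate space reclamation on a sparse backing file, the same
# mechanism a discard through the loop device uses.
set -e
f="${TMPDIR:-/tmp}/discard-demo.img"
truncate -s 64M "$f"
dd if=/dev/zero of="$f" bs=1M count=16 conv=notrunc status=none  # consume 16MB
echo "before: $(du -k "$f" | cut -f1)K"   # ~16384K physically used
fallocate --punch-hole --offset 0 --length 16M "$f"
echo "after:  $(du -k "$f" | cut -f1)K"   # back to ~0K; logical size unchanged
# As root, the end-to-end path would be:
#   dev=$(losetup --find --show "$f")
#   mkfs.xfs -q "$dev" && mount -o discard "$dev" /mnt/pv
#   ... create and delete data under /mnt/pv ..., or run: fstrim /mnt/pv
#   and watch `du` on the backing file shrink accordingly.
```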
Also, thin provisioning was possible as the backing file is created as a sparse file, with just a logical size of 1GB.
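The sparse-file property is easy to verify (path illustrative): the logical size is 1GB while no blocks are allocated until data is actually written.

```shell
#!/bin/sh
# A freshly truncated file has full logical size but (near) zero
# allocated blocks -- this is what makes thin provisioning free.
set -e
f="${TMPDIR:-/tmp}/thin-pv.img"
truncate -s 1G "$f"
stat -c 'logical=%s bytes, allocated=%b blocks' "$f"
```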
Some further loop device kernel code reading, and a look at other block devices that support a similar kernel abstraction, helped us understand that Ceph kRBD also uses a similar approach. At this point loop devices started looking like more than just a toy.
The investigation then moved to write-ordering, and ensuring we do not break it in Gluster. Write-ordering in the Linux kernel currently works as described in [2]. In the loop device driver, a flush request gets transformed into an fsync on the backing file [3]. So, again, a loop device backed by a Gluster file (accessed via a FUSE mount) satisfies this property (we do support fsync through our stack).
The next problem to tackle was double caching [4], and ensuring we do not get into a double-cache situation with Gluster. This was simpler: turning off the various data caches in the FUSE graph is almost enough (I would like to revisit this, if we go down this path, to ensure we are good at all layers in the client) to address the problem on the client machine. As Gluster in this setup is not co-located with the client containers (unlike CNS), we did not need to bother about brick-side XFS/device caching and page/buffer caching respectively.
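As a sketch of that cache tuning, this is one plausible starting point, assuming a volume named `blockvol` mounted from `server1` (both placeholders); the exact set of translators to disable needs the careful revisit mentioned above:

```shell
# Turn off the client-side data-caching translators (list may not be
# exhaustive -- it needs review at all layers, as noted above).
gluster volume set blockvol performance.write-behind off
gluster volume set blockvol performance.io-cache off
gluster volume set blockvol performance.read-ahead off
gluster volume set blockvol performance.quick-read off

# Additionally, the FUSE mount can bypass the kernel page cache entirely:
mount -t glusterfs -o direct-io-mode=enable server1:/blockvol /mnt/blockvol
```

With these off, dirty data lives only in the page cache of the XFS filesystem sitting on the loop device, avoiding a second copy in the Gluster client graph.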
The final problem to tackle was fencing the block device, IOW ensuring that there is only a single writer to it. There is currently no built-in solution to this problem. The case is: when a client that was accessing the block device hangs or becomes unreachable, and the block device needs to be mounted elsewhere (assume another node), we need to fence the first client from ever writing to the block device again. This means ensuring that the loop device is definitively unmounted by the first mount before the second mount is allowed to access it; and in situations where the first mount is not reachable, either letting the second mount fail (which requires administrator intervention, a non-starter for such workloads) or letting it succeed and fencing the first mount.
Fencing the first mount needs a mechanism like a single active owner implemented in Gluster (or a single open fd, or similar), or the loopback device code needs changes to add mandatory locking or similar constructs to be a functional solution. As this involves code changes, and would land in a RHGS release later than when we want this feature in production, the loopback device line of investigation was stalled.
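To illustrate why the existing advisory locking primitives are not a fencing solution, here is a small `flock` sketch (paths illustrative): it enforces a single writer only while the lock holder cooperates and stays alive, whereas a hung client simply keeps the lock held, which is exactly the failure case described above.

```shell
#!/bin/sh
# Advisory locking gives single-writer semantics among live, cooperating
# clients, but cannot evict a hung holder -- hence the need for a real
# fencing mechanism (single active owner, mandatory locking, or similar).
set -e
f="${TMPDIR:-/tmp}/fence-demo.img"
truncate -s 1M "$f"
(
    flock -n 9 || { echo "first holder: lock busy"; exit 1; }
    echo "first holder: lock acquired"
    # A second client trying to take the same lock is refused while held:
    flock -n "$f" -c 'echo "second holder: lock acquired"' \
        || echo "second holder: refused (lock busy)"
) 9<>"$f"
```

If the first holder hangs instead of exiting, the lock is never released and the "second holder" can never take over cleanly, so advisory locks alone cannot provide the takeover-with-fencing behavior the workload needs.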
Work to do, things to think about:
Overall, if fencing is addressed, loopback devices can help up to a point in addressing the more general-purpose use case of block devices on Gluster. The remaining problems in the current iSCSI-based mechanism (like snapshots and cloning) still need to be addressed for both. Further, in GCS-like environments the double caching has to be eliminated, as there is a fair probability that the client accessing the loopback device is on the same node as the brick serving it.
[1] loopback based PVs Flexvol github PR: https://github.com/gluster/gluster-subvol/pull/18
[2] Linux kernel write-ordering: https://lwn.net/Articles/400541/
[3] fsync in loop devices: https://github.com/torvalds/linux/blob/master/drivers/block/loop.c#L572
[4] Double caching:
[5] Fencing 101: https://en.wikipedia.org/wiki/Fencing_(computing)
[6] Small scale performance of loop Vs single-AZ Vs multi-AZ configurations: https://docs.google.com/spreadsheets/d/1P9B90RXc2BnicR9qfNTeKuQ--8m4K7eLYixgkW_d6t4/edit#gid=0
[7] loop id allocation (needs more code reading): https://github.com/torvalds/linux/blob/master/drivers/block/loop.c#L1772
I am aware of many discussions where we did consider using 'loopback' devices (`losetup`) as a block device, where it just uses a file on glusterfs as the backend. This has both challenges and benefits. Happy to discuss this further as an experimental option for GCS.
People with thoughts on this, please share your opinions, observations, and requirements, so we can collect them and see if we can come up with a design that everyone agrees on!
@JohnStrunk @obnoxxx @raghavendra-talur @vbellur @pkalever @pranithk @ShyamsundarR @jarrpa @phlogistonjohn @humblec @lxbsz @aravindavk @Madhu-1 @atinmu @poornimag