amarts opened this issue 6 years ago
Certainly this feature would mean a world of simplification from a CSI-based deployment standpoint. It would mean that a gluster-block node driver wouldn't need to bundle any special packages, and could just make use of the FUSE client in the glusterfs node driver.
We briefly discussed other ways (in addition to loopback itself) to accomplish a similar behavior:
These would all accomplish the goal of isolating the "block-ness" within the CSI mount container, which would then only "talk gluster" to the outside world. Certainly, loopback would be the easiest to prototype, but one of the alternatives may prove more robust for production cases. More research is needed for those though :-)
CC: @nixpanic
One of the potential issues with a loopback device backed by a network filesystem is that the system may deadlock when it comes under memory pressure: flushing dirty pages of the loop device requires the network filesystem to make progress, which may itself need memory. This has been reported for loopback mounts on NFS exports. The problem manifests itself like:
This is something that needs careful reviewing and testing.
I had posted this to a non-public list a while back; I've cleaned it up and am posting it here again. It's long, but represents what we did with loopback devices and Gluster in an Openshift cluster.
Here is where we (@JohnStrunk and I) were with our investigation into using loopback devices to serve RWO PVs in an Openshift cloud instance environment. The learnings are not specific to that environment, hence posting them here.
The mode of delivery was based on a k8s Flexvol driver, and the work can be found at [1]. This is mostly irrelevant to the discussion below, but better to get it out of the way.
The intention in looking at loopback devices was to address scale and small-file performance issues in the said Openshift clusters. The scale requirement was supporting 5k 1GB PVs from a single 3-node replicated volume, backed by a 1TB EBS volume, with 50% of the disk reserved for snapshots (so, in essence, 500GB of disk for Gluster). The performance issue can be summarized as the time to untar a medium-sized tarball to a PV backed by Gluster in FUSE mode (used as a shared file system), versus running it against a local FS backed by a FUSE-hosted file used as a block device; the latter of course wins (and has its own set of learnings to poke and prod).
Fact: any local FS backed by a file over a network is faster for small-file workloads, as many operations actually happen in the page cache and are flushed only when needed, reducing a lot of the network chatter that is otherwise evident in any network file system (NFS, SMB, Gluster, etc.).
It was pointed out earlier by @vbellur that loop devices on Fedora scale into the thousands (I forget the exact number, but I think he mentioned 8k). With this data, loop devices started to look interesting again, as part of the scale needs were met.
I re-established the scale test and scaled up to 1k devices. Further, I read up on loopctl and the loop device kernel code to start looking at this more factually than empirically [7].
One of the interesting discoveries (empirical again, hence a discovery) was that I was able to create the backing files, XFS-format, and mount ~815 loop devices on a 1GB RAM virtual (vagrant-based) machine running RHEL 7; to test up to 1k loop devices, I had to increase RAM to 2GB (which also gave it enough headroom for other processes). This test, though, only creates and mounts the devices; it does not perform IO on the mounts. This is useful when considering production environments and the machine requirements we need to set forth for them.
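The create/format/mount cycle of that scale test can be sketched as follows. Paths and the device count here are illustrative (the real test went to ~815-1000 devices), and the attach/format/mount steps are shown as comments since they need root and real `/dev/loopN` nodes:

```shell
#!/bin/sh
# Sketch of the loop device scale test. Only sparse backing files are
# created here; the privileged steps are shown as comments.
set -e
N=8
BASE="${TMPDIR:-/tmp}/loop-scale-test"
mkdir -p "$BASE"
i=1
while [ "$i" -le "$N" ]; do
    f="$BASE/backing-$i.img"
    truncate -s 1G "$f"     # sparse 1GB backing file, ~0 blocks on disk
    # As root on a real host, each file then becomes a mounted device:
    #   dev=$(losetup --find --show "$f")   # attach the next free /dev/loopN
    #   mkfs.xfs -q "$dev"
    #   mkdir -p "$BASE/mnt-$i" && mount "$dev" "$BASE/mnt-$i"
    i=$((i + 1))
done
ls "$BASE" | grep -c 'backing-.*\.img'   # -> 8
```

Because the files are sparse, even thousands of 1GB backing files cost almost no disk until IO happens, which matches the observation that memory, not disk, was the limiting factor in the create-and-mount test.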
The next test/check was to ensure that loop devices support discard, so that an in-use PV keeps its backing file size optimal as it churns data. Loop devices on Linux do support discard, and this works as required, so this need was satisfied. This is key as we over-provision in the said Openshift environment by a factor of 10x: 1GB * 5000 (PVs) = ~5TB against a 500GB drive, so we need to over-provision (thin-provision) and ensure consumed space stays optimal and is not wasted.
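A quick sketch of what discard does to the backing file (file names illustrative): a discard issued through the loop device becomes a hole punched in the backing file, so the unprivileged part below shows the effect directly, with the root-only end-to-end path in comments.

```shell
#!/bin/sh
# Demonstrate space reclamation on a sparse backing file, the same
# mechanism a discard through the loop device uses.
set -e
f="${TMPDIR:-/tmp}/discard-demo.img"
truncate -s 64M "$f"
dd if=/dev/zero of="$f" bs=1M count=16 conv=notrunc status=none  # consume 16MB
echo "before: $(du -k "$f" | cut -f1)K"   # ~16384K physically used
fallocate --punch-hole --offset 0 --length 16M "$f"
echo "after:  $(du -k "$f" | cut -f1)K"   # back to ~0K; logical size unchanged
# As root, the end-to-end path would be:
#   dev=$(losetup --find --show "$f")
#   mkfs.xfs -q "$dev" && mount -o discard "$dev" /mnt/pv
#   ... create and delete data under /mnt/pv ..., or run: fstrim /mnt/pv
#   and watch `du` on the backing file shrink accordingly.
```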
Also, thin provisioning was possible as the backing file is created as a sparse file, with just a logical size of 1GB.
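The sparse-file property is easy to verify (path illustrative): the logical size is 1GB while no blocks are allocated until data is actually written.

```shell
#!/bin/sh
# A freshly truncated file has full logical size but (near) zero
# allocated blocks -- this is what makes thin provisioning free.
set -e
f="${TMPDIR:-/tmp}/thin-pv.img"
truncate -s 1G "$f"
stat -c 'logical=%s bytes, allocated=%b blocks' "$f"
```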
Some further loop device kernel code reading, and a look at other block devices that support a similar kernel abstraction, helped us understand that Ceph kRBD also uses a similar approach. At this point loop devices started looking like more than just a toy.
The investigation then moved to write-ordering, and ensuring we do not break it in Gluster. Write-ordering in the Linux kernel currently works as described in [2]. In the loop device driver, a flush request gets transformed into an fsync on the backing file [3]. So, again, a loop device backed by a Gluster file (accessed via a FUSE mount) satisfies this property (we do support fsync through our stack).
The next problem to tackle was double caching [4], and ensuring we do not get into a double-cache situation with Gluster. This was simpler: turning off the various data caches in the FUSE graph is almost enough (I would like to revisit this, if we go down this path, to ensure we are good at all layers in the client) to address the problem on the client machine. As Gluster in this setup is not co-located with the client containers (unlike CNS), we did not need to bother about brick-side XFS/device caching and page/buffer caching respectively.
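As a sketch of that cache tuning, this is one plausible starting point, assuming a volume named `blockvol` mounted from `server1` (both placeholders); the exact set of translators to disable needs the careful revisit mentioned above:

```shell
# Turn off the client-side data-caching translators (list may not be
# exhaustive -- it needs review at all layers, as noted above).
gluster volume set blockvol performance.write-behind off
gluster volume set blockvol performance.io-cache off
gluster volume set blockvol performance.read-ahead off
gluster volume set blockvol performance.quick-read off

# Additionally, the FUSE mount can bypass the kernel page cache entirely:
mount -t glusterfs -o direct-io-mode=enable server1:/blockvol /mnt/blockvol
```

With these off, dirty data lives only in the page cache of the XFS filesystem sitting on the loop device, avoiding a second copy in the Gluster client graph.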
The final problem to tackle was fencing the block device, IOW ensuring that there is only a single writer to it. There is currently no built-in solution to this problem. The case is: when a client that was accessing the block device hangs or becomes unreachable, and the block device needs to be mounted elsewhere (assume another node), we need to fence the first client from ever writing to the block device again. This means ensuring that the loop device is definitively unmounted by the first mount before the second mount is allowed to access it; and in situations where the first mount is not reachable, either letting the second mount fail (which requires administrator intervention, a non-starter for such workloads) or letting it succeed and fencing the first mount.
Fencing the first mount needs a mechanism like a single active owner implemented in Gluster (or a single open fd, or similar), or the loopback device code needs changes to add mandatory locking or similar constructs to be a functional solution. As this involves code changes, and would land in a RHGS release later than when we want this feature in production, the loopback device line of investigation was stalled.
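To illustrate why the existing advisory locking primitives are not a fencing solution, here is a small `flock` sketch (paths illustrative): it enforces a single writer only while the lock holder cooperates and stays alive, whereas a hung client simply keeps the lock held, which is exactly the failure case described above.

```shell
#!/bin/sh
# Advisory locking gives single-writer semantics among live, cooperating
# clients, but cannot evict a hung holder -- hence the need for a real
# fencing mechanism (single active owner, mandatory locking, or similar).
set -e
f="${TMPDIR:-/tmp}/fence-demo.img"
truncate -s 1M "$f"
(
    flock -n 9 || { echo "first holder: lock busy"; exit 1; }
    echo "first holder: lock acquired"
    # A second client trying to take the same lock is refused while held:
    flock -n "$f" -c 'echo "second holder: lock acquired"' \
        || echo "second holder: refused (lock busy)"
) 9<>"$f"
```

If the first holder hangs instead of exiting, the lock is never released and the "second holder" can never take over cleanly, so advisory locks alone cannot provide the takeover-with-fencing behavior the workload needs.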
Work to do, things to think about:
Overall, if fencing is addressed, loopback devices can help up to a point in addressing the more general-purpose use case of block devices on Gluster. The remaining problems in the current iSCSI-based mechanism (like snapshots and cloning) still need to be addressed for both. Further, in GCS-like environments the double caching has to be eliminated, as there is a fair probability that the client accessing the loopback device is on the same node as the brick serving it.
[1] loopback based PVs Flexvol github PR: https://github.com/gluster/gluster-subvol/pull/18
[2] Linux kernel write-ordering: https://lwn.net/Articles/400541/
[3] fsync in loop devices: https://github.com/torvalds/linux/blob/master/drivers/block/loop.c#L572
[4] Double caching:
[5] Fencing 101: https://en.wikipedia.org/wiki/Fencing_(computing)
[6] Small scale performance of loop Vs single-AZ Vs multi-AZ configurations: https://docs.google.com/spreadsheets/d/1P9B90RXc2BnicR9qfNTeKuQ--8m4K7eLYixgkW_d6t4/edit#gid=0
[7] loop id allocation (needs more code reading): https://github.com/torvalds/linux/blob/master/drivers/block/loop.c#L1772
I am aware of many discussions where we did consider using 'loopback' devices (`losetup`) as a block device, where it just uses a file on glusterfs as the backend. This has both challenges and benefits. Happy to discuss this further as an experimental option for GCS.
People with thoughts on this, please share your opinions, observations, and requirements, so we can collect them and see if we can come up with a design that everyone agrees on!
@JohnStrunk @obnoxxx @raghavendra-talur @vbellur @pkalever @pranithk @ShyamsundarR @jarrpa @phlogistonjohn @humblec @lxbsz @aravindavk @Madhu-1 @atinmu @poornimag