gluster / gluster-kubernetes

GlusterFS Native Storage Service for Kubernetes
Apache License 2.0

How to make volumes persistent? #498

Closed MikiTesi closed 6 years ago

MikiTesi commented 6 years ago

I have deployed gluster on my kubernetes cluster and everything worked flawlessly.

At first I used heketi-cli to create volumes, but I soon realised that it lacked the ability to specify the node on which to create the volume. Specifying the node is essential to the project I'm working on, so I stopped using heketi-cli and started using the gluster command line instead (by accessing a glusterfs pod in kubernetes). Once inside a glusterfs pod, I used the following command to create a volume on a specific node (Host) and in a specific directory (Brick): $ gluster volume create VolumeName transport tcp Host:/Brick force. Then I started the volume, used it in a kubernetes deployment, and everything worked great.
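
For reference, this is roughly the flow described above, run from inside one of the glusterfs pods; VolumeName, Host, and Brick are placeholders:

```sh
# Sketch of the manual flow described above (placeholder values, not real ones).
# Run from inside one of the glusterfs pods.
gluster volume create VolumeName transport tcp Host:/Brick force
gluster volume start VolumeName
gluster volume info VolumeName   # verify the volume is started and lists the brick
```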

After a while I realised that my volume was saved in a directory (Brick) inside the glusterfs pod. This means that as soon as I restarted the machine hosting the glusterfs pod, my volume was gone: a pod is ephemeral, and so are its contents. So my volume was not persistent.

So, my question is, how do I make a volume actually persistent? Am I doing something wrong?

phlogistonjohn commented 6 years ago

This issue confuses me a bit, because if you're actually using the containers and pods set up by gk-deploy then none of this should be a problem; the deployment is designed to handle persistence (assuming the block devices are real enough).

For the sake of argument, here's how the situation is supposed to work:

By doing all the above the devices that heketi manages for bricks and the volumes managed by gluster will be persistent as long as the block devices themselves are not ephemeral.

MikiTesi commented 6 years ago

Thanks for the reply. Just to be clear, I didn't use the single command $ ./gk-deploy -g to set up Gluster and Heketi, but rather I followed a step-by-step tutorial I found on the Internet. Though, I can assure you the tutorial followed the gluster-kubernetes project and only made use of files from this project.

I have four nodes (all remote virtual machines): one master and three storage nodes. When I created the three virtual machines for the storage, I provided each one of them with a secondary empty device called /dev/vdb.

My topology file correctly describes each node and its own /dev/vdb device. I'm pretty sure this worked, because I remember, after I provided the topology file, the output said each node and device had been added to the cluster successfully.
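
For readers following along, a minimal sketch of what one node entry in a gluster-kubernetes topology.json typically looks like; the hostnames here are placeholders, and the real file in this thread lists three such nodes, each with its /dev/vdb device:

```sh
# Minimal, illustrative topology.json with a single node (placeholder values).
cat > topology.json <<'EOF'
{
  "clusters": [
    {
      "nodes": [
        {
          "node": {
            "hostnames": {
              "manage": ["node1"],
              "storage": ["10.244.1.0"]
            },
            "zone": 1
          },
          "devices": ["/dev/vdb"]
        }
      ]
    }
  ]
}
EOF
```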

On each one of the three storage nodes I currently have one glusterfs pod, as it should be.

The way I see it, there are two possible explanations for my problem: 1) either I didn't configure something properly and my pod cannot access the node's /dev/vdb; 2) or I'm simply making a mistake while creating a volume.

As for explanation number 1, let me know if I can do anything to check that everything is configured properly. As for explanation number 2, this is the exact command I use to create a volume, as said earlier: $ gluster volume create VolumeName transport tcp Host:/Brick force. Is there something wrong with this command? Can I place the brick wherever I want, or do I have restrictions?

On a last note, let me know if you know of a way to create a volume on a specific node by using heketi-cli instead. Maybe that'll simplify things.

MikiTesi commented 6 years ago

I have a new clue. When I first configured Heketi, it automatically created a default volume called heketidbstorage, which was saved in three bricks, one per node. I have found out that this volume is actually persistent, in the sense that it "survived" even after I restarted my nodes. All its three bricks are still there. This means that the persistence of gluster is actually working.

Therefore I'm probably just making a mistake when creating volumes manually. Is there a specific directory where I should place the brick of the volume I'm creating?

For the sake of clarity, here is where the first brick of heketidbstorage resides: Brick1: 10.244.2.0:/var/lib/heketi/mounts/vg_eb884ed17038ca04b72641e458395e94/brick_dc8f4dcbb5ad1697da08ccd86ec609de/brick

nixpanic commented 6 years ago

On Thu, Jul 05, 2018 at 12:01:45PM -0700, MikiTesi wrote:

Therefore I'm probably just making a mistake when creating volumes manually. Is there a specific directory where I should place the brick of the volume I'm creating?

The brick should be placed on an (XFS) filesystem that resides on an (LVM) logical volume on /dev/vdb. The location of the mountpoint is not really important.

If you do not need replication to guarantee the availability of the data, you can indeed create volumes with --durability=none. However, most users do care about the data, and a single system being unavailable should not affect the availability of the data. Hence replica-3 is the recommended default.

It is possible to select the 'cluster' where heketi creates the volume. So, in case you always want to create volumes without replication, you could create one or more clusters, each with a single Gluster pod. This can be done in the topology.json, or with 'heketi-cli cluster create' and then 'heketi-cli node add'. The big advantage is that Heketi will then be responsible for configuring the LVM+xfs+mountpoint and all the other pieces, and you do not have to do that manually.
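
A rough sketch of that suggestion, assuming heketi-cli is already pointed at the heketi service (HEKETI_CLI_SERVER and credentials set); the IDs in angle brackets are returned by the earlier commands, the hostnames and device are placeholders, and exact flags may vary by heketi version:

```sh
# Create a one-node cluster and a non-replicated volume on it (illustrative only).
heketi-cli cluster create

heketi-cli node add --cluster=<cluster-id> --zone=1 \
    --management-host-name=node1 --storage-host-name=10.244.1.0

heketi-cli device add --node=<node-id> --name=/dev/vdb

# With a single node in the cluster, replication is not possible, hence durability=none.
heketi-cli volume create --size=2 --durability=none --clusters=<cluster-id>
```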

MikiTesi commented 6 years ago

Sorry for the noobish question, but how do I get an XFS filesystem on /dev/vdb? I remember I partitioned my /dev/vdb device with XFS by using the following command on each node: $ mkfs.xfs /dev/vdb. Though, this messed everything up: when I tried to add each node through the topology file, it said that the /dev/vdb devices could not be added because they were not empty.

So I wiped each /dev/vdb device. When the /dev/vdb devices were empty, I finally managed to add them through the topology file successfully.

Since the method I used to get an XFS filesystem was wrong, what is the right way to do it?

Thanks for the suggestion about heketi-cli by the way, that's really smart. Still, I'd prefer to stick to your first option, because I'll have to work on the volumes and move them later on, and I think it'll give me more freedom.

phlogistonjohn commented 6 years ago

Just to be clear, I didn't use the single command $ ./gk-deploy -g to set up Gluster and Heketi, but rather I followed a step-by-step tutorial I found on the Internet. Though, I can assure you the tutorial followed the gluster-kubernetes project and only made use of files from this project.

OK, thanks for the info. It's a bit tricky to debug a scenario where the origin is unknown to us; providing a link to this tutorial may help us understand the issue better.

As for explanation number 1, let me know if I can do anything to check that everything is configured properly.

It's not perfect but providing the output of 'heketi-cli topology info' may help.

As for explanation number 2, this is the exact command I use to create a volume, as said earlier: $ gluster volume create VolumeName transport tcp Host:/Brick force. Is there something wrong with this command? Can I place the brick wherever I want, or do I have restrictions? ... Therefore I'm probably just making a mistake when creating volumes manually. Is there a specific directory where I should place the brick of the volume I'm creating?

For the sake of clarity, here is where the first brick of heketidbstorage resides: Brick1: 10.244.2.0:/var/lib/heketi/mounts/vg_eb884ed17038ca04b72641e458395e94/brick_dc8f4dcbb5ad1697da08ccd86ec609de/brick

OK, it's good to hear that heketidbstorage is working as expected. This implies that the issue isn't with the environment. When you provided us the gluster volume create command, I assume you meant you are still creating volumes directly through gluster and skipping heketi. If so, when you created the brick did you create storage within the lvm vg that heketi created? Or did you just run the command without creating a brick file system first?

I may be biased but I think you are better off allowing heketi to manage as much of the system as you'll let it. It really ought to help manage the low-level aspects of gluster and free you up to manage the more interesting aspects IMO.

On a last note, let me know if you know of a way to create a volume on a specific node by using heketi-cli instead. Maybe that'll simplify things.

I am still quite confused why you want to create volumes on specific nodes. I must assume you don't want replicated volumes, but as @nixpanic mentions, that is supported by heketi; it's just picking the exact node that isn't.

phlogistonjohn commented 6 years ago

Sorry for the noobish question, but how do I get an XFS filesystem on /dev/vdb? I remember I partitioned my /dev/vdb device with XFS by using the following command on each node: $ mkfs.xfs /dev/vdb. Though, this messed everything up: when I tried to add each node through the topology file, it said that the /dev/vdb devices could not be added because they were not empty.

So I wiped each /dev/vdb device. When the /dev/vdb devices were empty, I finally managed to add them through the topology file successfully.

You're running into a lot of extra work by trying to do some of the low-level steps manually. Heketi requires full control of the block devices (OK, there's wiggle room in this statement, but it's close to true). When you give it /dev/vdb it will set it up as a PV, create a VG on it, and add an LV per brick. Newer versions let you give it a parameter to wipe the device first, but heketi still assumes control.

Since you wrote a filesystem to the device and did not provide the parameter to wipe it, heketi just complained and errored out.

Since the method I used to get an XFS filesystem was wrong, what is the right way to do it?

You can wipe the block device yourself (with wipefs -a, for example) and provide it to heketi already clean, or you can use the newer wipe option. Personally, I'd go with wipefs -a prior to providing heketi with the topology file, because I know that works on all versions.
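
For example, something along these lines on each storage node, assuming the topology file is named topology.json; wipefs -a is destructive, so only run it against the device heketi is meant to own:

```sh
# Wipe any leftover filesystem signatures from the device heketi will manage
# (destructive; /dev/vdb as in this thread), then load the topology.
wipefs -a /dev/vdb

heketi-cli topology load --json=topology.json
```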

Thanks for the suggestion about heketi-cli by the way, that's really smart. Still, I'd prefer to stick to your first option, because I'll have to work on the volumes and move them later on, and I think it'll give me more freedom.

I still think that many of your problems are arising from the fact that you are trying to mix "manual" steps into an automatic system (glusterfs+heketi). You might consider letting the system do its automatic thing for now and getting familiar with how that works before deciding you need that extra freedom. Alternatively, you can try to set up a more traditional glusterfs scenario with VMs that run gluster packages (rather than containers) by following the gluster docs, and get familiar with gluster itself.

phlogistonjohn commented 6 years ago

One more note: since you are deploying gluster in k8s I can only assume that you want to use gluster storage for your pods. Please be aware that you cannot use dynamic provisioning without heketi; the dynamic provisioner is a layer around the heketi API, and the heketi service handles practically all of the provisioning logic. So if you do decide you need manual control of the gluster volumes, you will need to statically provision all of the volumes.
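
For anyone reading along, static provisioning of a gluster volume in Kubernetes looks roughly like the sketch below, using the in-tree glusterfs volume plugin; the object names, IPs, size, and volume name are placeholders, and the Endpoints object has to exist in the namespace of the pods that will mount the volume:

```sh
# Illustrative static provisioning of an existing gluster volume (placeholder values).
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: Endpoints
metadata:
  name: glusterfs-cluster
subsets:
  - addresses:
      - ip: 10.244.1.0
      - ip: 10.244.2.0
      - ip: 10.244.3.0
    ports:
      - port: 1
---
apiVersion: v1
kind: PersistentVolume
metadata:
  name: gluster-manual-pv
spec:
  capacity:
    storage: 2Gi
  accessModes:
    - ReadWriteMany
  persistentVolumeReclaimPolicy: Retain
  glusterfs:
    endpoints: glusterfs-cluster
    path: VolumeName
EOF
```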

MikiTesi commented 6 years ago

OK, thanks for the info. It's a bit tricky to debug a scenario where the origin is unknown to us; providing a link to this tutorial may help us understand the issue better.

Here is the link to the tutorial: https://techdev.io/en/developer-blog/deploying-glusterfs-in-your-bare-metal-kubernetes-cluster . Though, I skipped the first part where it explains how to create fake block devices, because I provided all my virtual machines with real /dev/vdb devices instead. And I also skipped the final part where it explains how to configure dynamic provisioning, because I need static provisioning instead.

It's not perfect but providing the output of 'heketi-cli topology info' may help.

Here's the output:

Cluster Id: fdf8edf7ff40e205567ada0b17db35aa

File:  true
Block: true

Volumes:

Name: heketidbstorage
Size: 2
Id: b5cec2f747caace3e6717795d6fbe68c
Cluster Id: fdf8edf7ff40e205567ada0b17db35aa
Mount: 10.244.1.0:heketidbstorage
Mount Options: backup-volfile-servers=10.244.2.0,10.244.3.0
Durability Type: replicate
Replica: 3
Snapshot: Disabled

    Bricks:
        Id: 5a85c10cb78b7e124cb064ae982e3baf
        Path: /var/lib/heketi/mounts/vg_b4f087c2e27454ea3408b445e494e75d/brick_5a85c10cb78b7e124cb064ae982e3baf/brick
        Size (GiB): 2
        Node: 0ed02e17c13d2f1ec7cea54c1395a2e2
        Device: b4f087c2e27454ea3408b445e494e75d

        Id: dc8f4dcbb5ad1697da08ccd86ec609de
        Path: /var/lib/heketi/mounts/vg_eb884ed17038ca04b72641e458395e94/brick_dc8f4dcbb5ad1697da08ccd86ec609de/brick
        Size (GiB): 2
        Node: 305388a2d549c5673a4c121dcee3b9c9
        Device: eb884ed17038ca04b72641e458395e94

        Id: f80e60dbfc3d470d0abf01d9ef48e443
        Path: /var/lib/heketi/mounts/vg_2352a5aec1aba175e1d4038db20457b1/brick_f80e60dbfc3d470d0abf01d9ef48e443/brick
        Size (GiB): 2
        Node: 5f35839be1a7915c2bcd6fec2bd84301
        Device: 2352a5aec1aba175e1d4038db20457b1

Nodes:

Node Id: 0ed02e17c13d2f1ec7cea54c1395a2e2
State: online
Cluster Id: fdf8edf7ff40e205567ada0b17db35aa
Zone: 1
Management Hostnames: c1b21b38
Storage Hostnames: 10.244.1.0
Devices:
    Id:b4f087c2e27454ea3408b445e494e75d   Name:/dev/vdb            State:online    Size (GiB):14      Used (GiB):2       Free (GiB):12      
        Bricks:
            Id:5a85c10cb78b7e124cb064ae982e3baf   Size (GiB):2       Path: /var/lib/heketi/mounts/vg_b4f087c2e27454ea3408b445e494e75d/brick_5a85c10cb78b7e124cb064ae982e3baf/brick

Node Id: 305388a2d549c5673a4c121dcee3b9c9
State: online
Cluster Id: fdf8edf7ff40e205567ada0b17db35aa
Zone: 1
Management Hostnames: a22c3a9e
Storage Hostnames: 10.244.2.0
Devices:
    Id:eb884ed17038ca04b72641e458395e94   Name:/dev/vdb            State:online    Size (GiB):14      Used (GiB):2       Free (GiB):12      
        Bricks:
            Id:dc8f4dcbb5ad1697da08ccd86ec609de   Size (GiB):2       Path: /var/lib/heketi/mounts/vg_eb884ed17038ca04b72641e458395e94/brick_dc8f4dcbb5ad1697da08ccd86ec609de/brick

Node Id: 5f35839be1a7915c2bcd6fec2bd84301
State: online
Cluster Id: fdf8edf7ff40e205567ada0b17db35aa
Zone: 1
Management Hostnames: mysqlservermia
Storage Hostnames: 10.244.3.0
Devices:
    Id:2352a5aec1aba175e1d4038db20457b1   Name:/dev/vdb            State:online    Size (GiB):14      Used (GiB):2       Free (GiB):12      
        Bricks:
            Id:f80e60dbfc3d470d0abf01d9ef48e443   Size (GiB):2       Path: /var/lib/heketi/mounts/vg_2352a5aec1aba175e1d4038db20457b1/brick_f80e60dbfc3d470d0abf01d9ef48e443/brick

When you provided us the gluster volume create command, I assume you meant you are still creating volumes directly through gluster and skipping heketi. If so, when you created the brick did you create storage within the lvm vg that heketi created? Or did you just run the command without creating a brick file system first?

You are correct, I am skipping heketi, and I'm creating volumes directly through gluster. I just ran the command without creating a brick file system first...so I guess this is the problem. I literally just placed the brick in a random directory, like this: 10.244.1.0:/brick1. Where and how should I create a brick file system?

I am still quite confused why you want to create volumes on specific nodes. I must assume you don't want replicated volumes...

It's a matter of data locality. I want the data to be close to the application that needs it.

You can wipe the block device yourself (with wipefs -a for example) and provide it to heketi clean already or you can use the newer wipe option.

Don't worry about this. I've redone everything from scratch and I'm sure I provided empty devices to the topology. In fact, I remember getting a message saying each /dev/vdb device had been added successfully in the end.

Please be aware that you cannot use dynamic provisioning without heketi; the dynamic provisioner is a layer around the heketi API, and the heketi service handles practically all of the provisioning logic. So if you do decide you need manual control of the gluster volumes, you will need to statically provision all of the volumes.

Since dynamic provisioning won't let me choose the node where to place a volume, I think I will be fine using static provisioning. Correct me if I'm wrong, though.

I hope I've answered all your questions and provided all the necessary information. Thank you so much for the support so far.

phlogistonjohn commented 6 years ago

You are correct, I am skipping heketi, and I'm creating volumes directly through gluster. I just ran the command without creating a brick file system first...so I guess this is the problem. I literally just placed the brick in a random directory, like this: 10.244.1.0:/brick1. Where and how should I create a brick file system?

Yes, a brick path dropped in the root filesystem in a container pod is ephemeral.

Please review Gluster's guide for setting up storage, especially the section on Formatting and Mounting Bricks.

Note that if you had done this on a VM rather than in a container, you would not have lost the data on reboot, but the data would have resided on the same file system as the OS, and this is highly discouraged as well.

As I mentioned in one of the earlier messages, trying to do this manually using the same block device you specified in the topology file will conflict with heketi.

It's a matter of data locality. I want the data to be close to the application that needs it.

Ah! That clarifies things a bit. Do you plan on having a very large cluster? Do you plan on using replicated volumes at all?

Since dynamic provision won't let me choose the node where to place a volume, I think I will be fine using static provisioning. Correct me if I'm wrong, though.

OK, if you are going to go down the static provisioning road you need to either eliminate heketi from the system or avoid using the block devices you've provided to heketi when setting things up manually. IOW, provide new devices for the manual config (/dev/vdc for example) and use the formatting and mounting guide linked above for those manually configured bricks.
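
The brick setup from that guide boils down to something like the following, shown here for a hypothetical spare /dev/vdc with placeholder VG/LV/mount names:

```sh
# Manual brick setup on a spare device, per the Formatting and Mounting Bricks guide
# (placeholder names throughout; run on the storage host).
pvcreate /dev/vdc
vgcreate vg_bricks /dev/vdc
lvcreate -L 10G -n lv_brick1 vg_bricks

# Gluster recommends XFS with a 512-byte inode size for bricks.
mkfs.xfs -i size=512 /dev/vg_bricks/lv_brick1

mkdir -p /data/brick1
mount /dev/vg_bricks/lv_brick1 /data/brick1
# Add an /etc/fstab entry so the mount survives reboots.

mkdir -p /data/brick1/brick   # this is the path you pass to "gluster volume create"
```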

Personally, I would benchmark your manual configuration versus the automatic configuration with and without replication. Then I would decide if you get a big enough performance boost from the locality of the volume when considering the additional costs of hand-rolling your configuration.

I hope I've answered all your questions and provided all the necessary information. Thank you so much for the support so far.

Now that I understand your use case a bit better I appreciate your concern about the locality of data. However, that won't stop me from continuing to recommend you consider the costs of that approach:

Hope all that is informative, and good luck.

MikiTesi commented 6 years ago

I've re-read all of your messages and the links to the gluster docs you provided in your last message. I do admit I didn't quite grasp the complexity that is hidden behind heketi at first, but it now appears clear to me that I'm better off using heketi, rather than taking on all the extra work of implementing my own solution manually.

To answer some of your concerns:

Do you plan on having a very large cluster?

I don't, I'm only going to work on a cluster made up of three nodes. Though, I'd like my solution to be scalable, so that it could theoretically be extended to a larger cluster in the future.

If you don't plan on pinning your application pods to said nodes, they may move and you'll lose locality anyway

I'll be able to pin each pod to a specific node by using a label and a nodeSelector, both of which are features offered by Kubernetes.
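
For example, something like this, where the node name, label, and pod are placeholders:

```sh
# Illustrative pinning of a pod to a labelled node (placeholder names).
kubectl label node node1 disktype=local-gluster

kubectl apply -f - <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: pinned-app
spec:
  nodeSelector:
    disktype: local-gluster
  containers:
    - name: app
      image: busybox
      command: ["sleep", "3600"]
EOF
```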

Do you plan on using replicated volumes at all?

I do.

Right now I believe that my best option could be to do what @nixpanic suggested in one of the earlier messages. That is, using heketi and creating one cluster per node. Therefore, each cluster will be made up of exactly one node. So whenever I create a volume (with heketi) I can specify the cluster on which to create the volume, which basically translates to specifying the node.

Right off the top of your head, do you think I will run into some limitations/disadvantages if I do follow this approach?

More specifically, I'm interested in these points:

  1. Will I be able to replicate a volume over different clusters?
  2. Can a single node belong to more than one cluster at the same time?
  3. Will I be able to move a volume from one cluster to another? This is the most critical point, since I don't think there even exists a way to move a volume by using heketi at all. I've only found resources about gluster being able to migrate volumes by using the gluster volume replace-brick command (see the sketch of that command below). But I wonder if I will be able to integrate this gluster command with my approach.
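
For context, a rough sketch of the gluster command referenced in point 3, with placeholder volume name, hosts, and brick paths; this assumes it is run from a gluster peer (or from inside one of the glusterfs pods):

```sh
# Hypothetical illustration of brick migration; VolumeName and the brick
# host:path pairs are placeholders, not values from this thread.
gluster volume replace-brick VolumeName \
    10.244.1.0:/old/brick/path \
    10.244.2.0:/new/brick/path \
    commit force
```
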
MikiTesi commented 6 years ago

I have great news! You can forget my previous message; I think I found another solution. While looking around, I found that my default heketidbstorage volume was saved in this brick: Brick2: 10.244.1.0:/var/lib/heketi/mounts/vg_b4f087c2e27454ea3408b445e494e75d/brick_5a85c10cb78b7e124cb064ae982e3baf/brick

This led me to run a simple test, skipping heketi. I accessed a glusterfs pod and ran the gluster volume create command, placing the new volume's brick in this directory: /var/lib/heketi/mounts/vg_b4f087c2e27454ea3408b445e494e75d/. I attached the volume to a kubernetes pod, wrote data on it, restarted the pod, and restarted the whole machine, just to be sure. After the restart, everything was still there. So I guess I finally managed to place a volume where I wanted, and it all worked smoothly, because the data persisted.

Let me know if you see anything wrong with this, please. If not, then I guess I'm done. I'd like to thank you all for your great support. I'm sorry I opened this issue for something that wasn't really an issue with your project after all. I'd kindly ask you to wait a couple of days before closing it, because I'm going to run a few more tests, and if I have any more questions I'll ask them here. But I hope that's not the case :)

jarrpa commented 6 years ago

@MikiTesi I will echo what the others have said: what you're trying to do is a Very Bad Idea, and not something we would suggest using in production at all. It is not recommended to manually create GlusterFS volumes in a cluster that is being managed by heketi; doing so throws off several of heketi's functionalities that rely on knowing the full state of the storage cluster.

The one-node-per-cluster idea is as close as you're going to get. However, to respond to some of your questions: no, you can't have the same node in more than one cluster, and you can't have one volume span multiple clusters. If you want to make use of any replication, you need more than one node in a cluster. And if you're using replication, as John mentioned, you will have no performance improvement on writes but may get some slight colocation benefit on reads.

This is all to say, if you continue down this road we will reserve the right to not help you further. :) If you're fine with that, please feel free to close this issue.