quantum / esos

An open source, high performance, block-level storage platform.
http://www.esos-project.com/

Add -W flag to lvcreate from TUI #225

Open mastergregor opened 5 years ago

mastergregor commented 5 years ago

While re-organizing space on an ESOS box, I ran into an issue where lvcreate hung indefinitely. I had invoked an LV creation from the TUI, and used a separate shell to dig a little. Upon closer inspection, since I was re-using space that had previously been GPT formatted, I think I ran into an issue that is described briefly here:

https://github.com/saltstack/salt/pull/41798

It looks like lvcreate is invoked from the TUI with the --name, --size and --type options. In cases where the space already contains a recognizable partition table or file system, lvcreate will hang waiting for user input.

Now, from a separate shell, I killed the lvcreate process, and created the LV manually from the command line, this time with the -W flag.

I was trying this on ESOS version 1.3.9, and the version of lvcreate that is included has a -W option to proceed with the creation in these circumstances.

It might be worthwhile to consider adding the -W flag to lvcreate when it is invoked from the TUI. I am not sure whether there are any consequences to including this flag, though.
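
For illustration only, here is a rough sketch of what a non-interactive invocation could look like (the VG/LV names and size below are placeholders, not what the TUI actually uses; per the lvcreate(8) man page, --yes may also be needed to suppress the per-signature confirmation prompts):

```sh
# Sketch only -- "big_vg", "new_lv", and the size are placeholder values.
# -W y (--wipesignatures y) wipes any detected signatures on the new LV;
# --yes auto-confirms the per-signature prompts so nothing blocks on stdin.
lvcreate --yes -W y --type linear --name new_lv --size 3T big_vg
```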

msmith626 commented 5 years ago
   -W|--wipesignatures {y|n}
          Controls detection and subsequent wiping of signatures on newly
          created Logical Volume. There's a prompt for each signature
          detected to confirm its wiping (unless --yes is used where LVM
          assumes 'yes' answer for each prompt automatically). If this
          option is not specified, then by default -W | --wipesignatures y
          is assumed each time the zeroing is done (-Z | --zero y). This
          default behaviour can be controlled by
          allocation/wipe_signatures_when_zeroing_new_lvs setting found
          in lvm.conf(5).
          If blkid wiping is used (allocation/use_blkid_wiping setting in
          lvm.conf(5)) and LVM2 is compiled with blkid wiping support,
          then blkid(8) library is used to detect the signatures (use
          blkid -k command to list the signatures that are recognized).
          Otherwise, native LVM2 code is used to detect signatures (MD
          RAID, swap and LUKS signatures are detected only in this case).
          Logical volume is not wiped if the read only flag is set.

Can you give some background on the usage / scenario that adding the '-W' flag would help you overcome or handle? This is the first request we've had for this...


mastergregor commented 5 years ago

Sure, I can explain how I got to the situation where I think I needed it.

I created 2 LVs and set them up as targets on two separate FC links. I use straight card-to-card connections, no switched fabric for now. Both of these LVs were set up with vdisk_blockio, and set up on Windows as ReFS drives with a single partition each. The LVs were 2TB (let's call this one LV1) and 6TB (let's call this one LV2). Since the size is more than 2TB, GPT was used for both drives. Let me try some ASCII graphics to illustrate :)

+--------------+---------------------------+
|  LV1 (2TB)   |         LV2 (6TB)         |
+--------------+---------------------------+

This setup has worked quite nicely for a few months now, and is used in a small business setting, in production, as shared storage for documentation, a DB and media files.

A few weeks ago, I got another 6 drives to populate the enclosure, and now I had another 13TB of storage to allocate. I set up another LV with 13TB of storage; let's call that one LV3. This was going to be a bigger storage area for media, so it was set up as a Windows ReFS drive again, and exported as LUN1 to the same box LV2 was presented to (LV2 was LUN0, of course). The content of LV2 was then moved to LV3, and LV2 was deleted, freeing up the space. So, to illustrate again, the storage looked like this:

+--------------+---------------------------+--------------------------------------+
|  LV1 (2TB)   |        FREE SPACE         |              LV3 (13TB)              |
+--------------+---------------------------+--------------------------------------+

Now, I decided to dedicate this free space to some Linux boxes (web server, DB storage, etc.), and to make 2 separate LVs from it, each being 3TB.

When I tried to create the first new LV from the chunk of now-free space, via the TUI of the ESOS box, lvcreate got stuck forever (well, for a good 4 hours before I killed it manually).

What I think happened: as LVM tried to create the LV from the free space, it encountered a GPT (GUID partition table) signature left over from the previous LV, since that would have been in the beginning sectors of the free space. It then went to prompt for confirmation, but since it was invoked from a subshell (with output redirected to /dev/null :) it never got the answer it wanted, so it waited forever.

I was able to kill it from a separate ssh session, and then manually issue lvcreate with the same params, which confirmed the problem. I did not complete the LV creation, but instead looked into it a bit and found a similar problem in the Salt project. When I put -W on the command line, the creation went through without a prompt.

Now, I am not sure if "-W" is appropriate, or if maybe some other behavior would be better here, like zeroing out the partition, setting it in lvm.conf, or some other mechanism for continuing in these cases. Maybe ask the user through the TUI if they are sure?
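
If the lvm.conf route were preferred, the relevant knobs (as named in the lvcreate(8) excerpt above) would look roughly like this; note this is only a sketch, and it controls the default wiping behaviour, not the interactive prompting itself:

```
# Sketch of /etc/lvm/lvm.conf -- setting names as documented in lvm.conf(5)
allocation {
    # Wipe detected signatures whenever a new LV is zeroed (-Z y)
    wipe_signatures_when_zeroing_new_lvs = 1
    # Use the blkid library to detect signatures, if LVM2 was built with blkid wiping support
    use_blkid_wiping = 1
}
```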

I was just bringing up a scenario where lvcreate might get stuck when invoked from the TUI. If you have a test env handy, you could try to reproduce the issue and verify that I did not have a glitch of some sort. At this point I am not really willing to mess with a setup that is in daily use :)

msmith626 commented 5 years ago

I agree the possible interactive behavior when running the commands needs to be solved, but I'm not sure I agree with using '--yes' to always wipe the device. I expect this could be a safety catch, so we probably don't want to wipe unconditionally.

What about if we just get it to return an error if this can't be done... then the user can try running it at the shell and determine how they'd like to handle it.

   -q|--quiet
          Suppress output and log messages. Overrides -d and -v. Repeat
          once to also suppress any prompts with answer 'no'.

This option might be what we want in the TUI... answer "no" to any interactive prompts automatically. It would be ideal if this also returned a non-zero status. Any chance you could test this behavior on another non-production system?
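
As a rough sketch of how the TUI side might drive it (the names and size are placeholders, and the exit-status behavior is exactly the part that needs verifying):

```sh
# Sketch only -- "big_vg", "new_lv", and the size are placeholders.
# Two -q's: suppress output and auto-answer 'no' to any prompt, so the call never blocks.
if ! lvcreate -qq --type linear --name new_lv --size 3T big_vg; then
    echo "lvcreate failed (or a prompt was auto-answered 'no'); try it manually at the shell." >&2
    exit 1
fi
```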

--Marc


mastergregor commented 5 years ago

Quiet might be a good option - I wanted to suggest that lvcreate could fail, and then the TUI could prompt for confirmation of a wipe and run it again with --yes, but after some coffee, I think that would only open the door for potential strange behaviors in unknown situations at this point. So I do like the suggestion to just error out and let the user do it manually, since sensitive systems like storage should be handled by people who know what they are doing anyhow.

As for testing, it will be a month or two before I can test it properly. I am thinking of getting a sister enclosure for an HA setup, but this has to wait for Q1-2019 funding, since I would need an FC switch at that time as well. It is the Christmas season anyhow, time for skiing and family :)

Gregor


msmith626 commented 5 years ago

Okay, thanks, I'll leave this open for now. I hope to have some time in the coming weeks to reproduce this and test the '--quiet' flag. If it returns an exit status of 0 when input is needed, then it is no good for our situation. So we'll check that first. =)

--Marc


mastergregor commented 5 years ago

I got around to trying a few things, but I think the results are not really what is wanted here:

(screenshot attachment: lvcreatecapture)

It looks like "-q -q" makes lvcreate perform as expected, and it completes "successfully" :)

I will keep this setup in its present state for now, so we can try out different options for lvcreate when there is an existing signature on the volume space.

Ideas?

mastergregor commented 5 years ago

I did not really make it clear in the previous post that "lvcreate -q -q" does create a logical volume, but it does not wipe the existing signature of a previous file system on it.

(screenshot attachment: capture1)

I feel this was an important piece of information to supply :)

In my case I did not mind that there was an existing FS on it, since I was putting an ext4 partition on it anyhow, but others may want a different behavior in certain situations.
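
For anyone wanting to check what was left behind without touching it, something like this should do (the device path is a placeholder for the actual VG/LV):

```sh
# Placeholder device path -- substitute the real VG/LV.
# wipefs -n (--no-act) lists detected signatures without erasing anything;
# blkid prints the filesystem / partition-table type it sees.
wipefs -n /dev/big_vg/new_lv
blkid /dev/big_vg/new_lv
```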

msmith626 commented 5 years ago

Thank you for your work in testing this. The "-q -q" makes lvcreate (or any LVM command) simply answer "no" when prompted. This may not be the best in every situation either, especially if it answers "no" to a prompt about creating a logical volume (something other than wiping data on the volume); then the volume doesn't get created, and it still exits with 0.

That said, I think forcing an answer of "no" for all LVM commands is probably better than hanging at the shell waiting for input. But it would be nice to know the other cases where the logical volume doesn't actually get created because a "no" answer was forced, since we would want the command to exit non-zero in those situations.
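
One sketch of how the TUI could guard against that (names and size are placeholders): confirm the LV actually exists after lvcreate returns, rather than trusting the exit status alone:

```sh
# Sketch only -- "big_vg", "new_lv", and the size are placeholders.
lvcreate -qq --name new_lv --size 3T big_vg
# Even if lvcreate exits 0 after a forced 'no', lvs returns non-zero
# when the logical volume was never actually created.
if ! lvs big_vg/new_lv >/dev/null 2>&1; then
    echo "LV big_vg/new_lv was not created" >&2
    exit 1
fi
```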

Let me think about this a little more...
