openebs / lvm-localpv

Dynamically provision Stateful Persistent Node-Local Volumes & Filesystems for Kubernetes that is integrated with a backend LVM2 data storage stack.
Apache License 2.0
251 stars 96 forks source link

Issue with JSON decode on vgs combined stream ouput #247

Closed kro-cat closed 1 year ago

kro-cat commented 1 year ago

I recently ran into a bug which is causing the lvm driver to crash unexpectedly on my systems.

I've traced it to pkg/lvm/lvm_util.go line 666, in ListLVMVolumeGroup()

STDERR and STDOUT are combined via cmd.CombinedOutput(). The vgs program may produce non-critical warning messages on STDERR which don't affect the operation of this driver. Warning messages also don't affect the return code of the vgs program. Any warning messages printed to STDERR will fail the JSON decode step in decodeVgsJSON().

kro-cat commented 1 year ago

Quick and dirty implementation I did to mitigate this temporarily.

diff --git a/pkg/lvm/lvm_util.go b/pkg/lvm/lvm_util.go
index 0bb4877..e9b52ed 100644
--- a/pkg/lvm/lvm_util.go
+++ b/pkg/lvm/lvm_util.go
@@ -17,6 +17,7 @@ limitations under the License.
 package lvm

 import (
+       "bytes"
        "encoding/json"
        "fmt"
        "os"
@@ -663,11 +664,16 @@ func ListLVMVolumeGroup(reloadCache bool) ([]apis.VolumeGroup, error) {
                "--units", "b",
        }
        cmd := exec.Command(VGList, args...)
-       output, err := cmd.CombinedOutput()
+       var execOut bytes.Buffer
+       // var execErr bytes.Buffer
+       cmd.Stdout = &execOut
+       // cmd.Stderr = &execErr // do something with this useful information
+       err := cmd.Run()
        if err != nil {
                klog.Errorf("lvm: list volume group cmd %v: %v", args, err)
                return nil, err
        }
+       output := execOut.Bytes()
        return decodeVgsJSON(output)
 }
abhilashshetty04 commented 1 year ago

@kro-cat Thank you for reporting the issue. Can it be reproduced easily? If yes, please help us with the steps.

Since you have found the fix for it also. Would you like to raise a PR for this?

kro-cat commented 1 year ago

@abhilashshetty04 Of course! I'll have a PR up in a bit.

I'm convinced this is a niche issue. My current hypothesis is that it's due to the fact that the host machine has a different lvm system_id than the lvm-driver container. When lvm detects a foreign system id on a volume group, it will print a warning on STDERR; this is the kind of message that led me to this issue.

Since the system id can be set when using vgcreate, this shouldn't necessarily cause problems for volume groups meant to be managed by lvm-localpv so long as I'm aware of the correct system id to set (I think, by default, it's an empty string).

I haven't tested that yet, so I can't be certain that it's entirely correct.

kro-cat commented 1 year ago

By "in a bit" I guess I really meant to say "when I'm confident it actually solves the problem I'm having instead of just hiding it"

abhilashshetty04 commented 1 year ago

@kro-cat , Thanks for replying. We pre-provision vg (volume group) anyway right? So what do you mean by lvm-driver detects vg not managed by itself?

kro-cat commented 1 year ago

@abhilashshetty04 The daemonset can see the any of the host's lvm devices due to its /dev hostPath mount.

I'll try my best to explain why that might cause a problem.

In my particular use case, the host uses lvm logical volumes for its own filesystem (/var, /home, stuff like that). These volume groups cannot be managed by lvm-localpv, so I don't include them in its configuration. Additionally, the host's lvm.conf has been configured beyond defaults; notably, the hosts each have a unique systemid configured. These are the kinds of volumes whose volume groups create this issue.

When the lvm-driver pod executes vgs, it's intention is to discover the volume groups it's configured to manage. However, because the host's own volume group (and lvm.conf) are configured outside of lvm-driver, this has the potential to cause vgs (which shows all volume groups in /dev) to display warnings or errors about the differences or incompatibilities (foreign systemid, no lvmlockd, and others).

kro-cat commented 1 year ago

I think I should follow-up on my previous comment. The issue here is that lvm-driver crashes unexpectedly when warning messages are mixed with JSON output from vgs. I've put a lot of effort into explaining why the warnings messages are produced, but that's not the point of this issue, nor is that necessarily an issue with lvm-localpv itself. The PR for this issue addresses the unexpected crashing: split the STDERR output from vgs and print it to the log. This provides much-needed diagnostic information while not affecting the health of the daemonset.

Mjolkhare commented 1 year ago

@kro-cat @Abhinandan-Purkait I have the same issue with lvs and pvs. I think all commands should use RunCommandSplit instead of CombindOutput.

kro-cat commented 1 year ago

@kro-cat @Abhinandan-Purkait I have the same issue with lvs and pvs. I think all commands should use RunCommandSplit instead of CombindOutput.

@Mjolkhare I think I'll set up a PR this weekend for that since I never did that refactor in pull #250, unless of course someone beats me to it.

Abhinandan-Purkait commented 1 year ago

@kro-cat Thanks for being on top of this.