openshift / cluster-api-provider-libvirt

Apache License 2.0
36 stars 56 forks source link

BUG 1855288: fix memory leak that leads to unresponsive libvirt URI #199

Closed dbenoit17 closed 4 years ago

dbenoit17 commented 4 years ago

The libvirt C library will maintain open connections against the URI based on an internal reference counting scheme. This pull request adds calls to Free() for a storage pool and a network object which otherwise cause unused connections to linger and stack, causing an unresponsive URI at scale.

dbenoit17 commented 4 years ago

/retest

dbenoit17 commented 4 years ago

/retest

dbenoit17 commented 4 years ago

/retest

Prashanth684 commented 4 years ago

/lgtm

Prashanth684 commented 4 years ago

/cc @praveenkumar @cfergeau This is needed for our multi-arch CI which runs on libvirt. We were noticing that after running a few jobs the libvirt daemon would hang because of these open connections.

openshift-ci-robot commented 4 years ago

@dbenoit17: This pull request references Bugzilla bug 1855288, which is valid. The bug has been moved to the POST state. The bug has been updated to refer to the pull request using the external bug tracker.

3 validation(s) were run on this bug * bug is open, matching expected state (open) * bug target release (4.6.0) matches configured target release for branch (4.6.0) * bug is in the state NEW, which is one of the valid states (NEW, ASSIGNED, ON_DEV, POST, POST)
In response to [this](https://github.com/openshift/cluster-api-provider-libvirt/pull/199): >BUG 1855288: fix memory leak that leads to unresponsive libvirt URI Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.
enxebre commented 4 years ago

/bugzilla resfresh

Prashanth684 commented 4 years ago

/cherry-pick release-4.5 /cherry-pick release-4.4 /cherry-pick release-4.3 /cherry-pick release-4.2

openshift-cherrypick-robot commented 4 years ago

@Prashanth684: once the present PR merges, I will cherry-pick it on top of release-4.5 in a new PR and assign it to you.

In response to [this](https://github.com/openshift/cluster-api-provider-libvirt/pull/199#issuecomment-656172491): >/cherry-pick release-4.5 >/cherry-pick release-4.4 >/cherry-pick release-4.3 >/cherry-pick release-4.2 Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.
Prashanth684 commented 4 years ago

Looking over this file, it seems there are more places where we leak pools

diff --git a/pkg/cloud/libvirt/client/client.go b/pkg/cloud/libvirt/client/client.go
index 7ff8af70..1aa2a546 100644
--- a/pkg/cloud/libvirt/client/client.go
+++ b/pkg/cloud/libvirt/client/client.go
@@ -194,6 +194,7 @@ func (client *libvirtClient) CreateDomain(ctx context.Context, input CreateDomai
        if err != nil {
                return fmt.Errorf("can't retrieve volume %s for pool %s: %v", input.VolumeName, client.poolName, err)
        }
+       defer diskVolume.Free()
        if err := setDisks(&domainDef, diskVolume); err != nil {
                return fmt.Errorf("Failed to setDisks: %s", err)
        }
@@ -214,6 +215,7 @@ func (client *libvirtClient) CreateDomain(ctx context.Context, input CreateDomai
                if err != nil {
                        return fmt.Errorf("error getting ignition volume: %v", err)
                }
+               defer ignVolume.Free()
                ignVolumePath, err := ignVolume.GetPath()
                if err != nil {
                        return fmt.Errorf("error getting ignition volume path: %v", err)
@@ -387,6 +389,8 @@ func (client *libvirtClient) CreateVolume(input CreateVolumeInput) error {
        if err == nil {
                return fmt.Errorf("storage volume '%s' already exists", input.VolumeName)
        }
+       /* probably not quite right in the input.BaseVolumeName == "" case */
+       defer volume.Free()

        volumeDef := newDefVolume(input.VolumeName)
        volumeDef.Target.Format.Type = input.VolumeFormat
@@ -417,6 +421,7 @@ func (client *libvirtClient) CreateVolume(input CreateVolumeInput) error {
                if err != nil {
                        return fmt.Errorf("Can't retrieve volume %s", input.BaseVolumeName)
                }
+               defer baseVolume.Free()
                var baseVolumeInfo *libvirt.StorageVolInfo
                baseVolumeInfo, err = baseVolume.GetInfo()
                if err != nil {

(the CreateVolume one probably can't be used as is).

This PR is mentioning memory leaks, but I was under the impression that the main issue was that too many sockets to libvirtd were opened? Or was the number of sockets just a symptom, and the real problem the amount of memory consumed on the system?

There would be socket objects lingering because the memory was not freed, so the sockets would be kept open causing the daemon to be unresponsive. Freeing this memory also closed the socket. So, in that sense it is a leak. But in terms of "memory" as such I think it was fine. So, these memory leaks are good to cleanup, but might not impact as much.

cfergeau commented 4 years ago

There would be socket objects lingering because the memory was not freed, so the sockets would be kept open. Freeing this memory also closed the socket. So, in that sense it is a leak. But in terms of "memory" as such I think it was fine. So, these memory leaks are good to cleanup, but might not impact as much.

Yes, freeing the object will release the references they keep to the libvirt connection, which will allow this connection to be freed, which will close the sockets. If the issue really was that too many sockets were opened, it would be useful to mention this in the commit log.

dbenoit17 commented 4 years ago

/retest

dbenoit17 commented 4 years ago

/retest

Prashanth684 commented 4 years ago

/test e2e-libvirt

Prashanth684 commented 4 years ago

/test e2e-libvirt

Prashanth684 commented 4 years ago

/test e2e-libvirt

dbenoit17 commented 4 years ago

/test e2e-libvirt

dbenoit17 commented 4 years ago

/test e2e-libvirt

dbenoit17 commented 4 years ago

/test e2e-libvirt

dbenoit17 commented 4 years ago

/test e2e-libvirt

dbenoit17 commented 4 years ago

/test e2e-libvirt

jaypoulz commented 4 years ago

/test e2e-libvirt

cfergeau commented 4 years ago

/test e2e-libvirt

cfergeau commented 4 years ago

There would be socket objects lingering because the memory was not freed, so the sockets would be kept open. Freeing this memory also closed the socket. So, in that sense it is a leak. But in terms of "memory" as such I think it was fine. So, these memory leaks are good to cleanup, but might not impact as much.

Yes, freeing the object will release the references they keep to the libvirt connection, which will allow this connection to be freed, which will close the sockets. If the issue really was that too many sockets were opened, it would be useful to mention this in the commit log.

Can we have a real commit log for the 'leak fixes' commit which goes along the lines of this thread?

andymcc commented 4 years ago

/test e2e-libvirt

cfergeau commented 4 years ago

/lgtm /approve

openshift-ci-robot commented 4 years ago

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: cfergeau

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files: - ~~[OWNERS](https://github.com/openshift/cluster-api-provider-libvirt/blob/master/OWNERS)~~ [cfergeau] Approvers can indicate their approval by writing `/approve` in a comment Approvers can cancel approval by writing `/approve cancel` in a comment
openshift-ci-robot commented 4 years ago

@dbenoit17: All pull requests linked via external trackers have merged: openshift/cluster-api-provider-libvirt#199. Bugzilla bug 1855288 has been moved to the MODIFIED state.

In response to [this](https://github.com/openshift/cluster-api-provider-libvirt/pull/199): >BUG 1855288: fix memory leak that leads to unresponsive libvirt URI Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.
openshift-cherrypick-robot commented 4 years ago

@Prashanth684: new pull request created: #202

In response to [this](https://github.com/openshift/cluster-api-provider-libvirt/pull/199#issuecomment-656172491): >/cherry-pick release-4.5 >/cherry-pick release-4.4 >/cherry-pick release-4.3 >/cherry-pick release-4.2 Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.
sdodson commented 4 years ago

/cherrypick release-4.4

openshift-cherrypick-robot commented 4 years ago

@sdodson: new pull request created: #204

In response to [this](https://github.com/openshift/cluster-api-provider-libvirt/pull/199#issuecomment-661966272): >/cherrypick release-4.4 Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.
sdodson commented 4 years ago

/cherrypick release-4.3

openshift-cherrypick-robot commented 4 years ago

@sdodson: new pull request created: #205

In response to [this](https://github.com/openshift/cluster-api-provider-libvirt/pull/199#issuecomment-661974492): >/cherrypick release-4.3 Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.