oVirt / vdsm

The Virtual Desktop Server Manager
GNU General Public License v2.0
160 stars 201 forks source link

vdsm: get gluster volume info from any gluster peer #412

Closed josgutie closed 4 months ago

josgutie commented 5 months ago

The function _get_gluster_volinfo query the glusterfs volume info the the storage server, this is translated to the gluster client adding the parameter --remote-host which limits the query to one server, so we are converting the storage server as a single point of failure, if it is not available, it can led to cluster outtage. The proposed changed let the cluster cli to use any available gluster peer.

josgutie commented 5 months ago

Why did you send a new PR when we have #408?

You need to answer my questions in #408. Please close this PR so we can continue the review on the existing PR.

Because the PR was bad formed, and I guess that it's more clean this PR.

Here is your question in the other PR,

I think this was added years ago because not specifying the server could be a problem. Not sure that the gluster cluster can cope with a case when one of the server is down.

Gluster cluster can cope with one of the server down, the issue is triggering because adding the --remote-host to the gluster cli limits the gluster query to this host, avoiding querying to the backup servers. Maybe it's a mechanism to avoid to register not updated info.

_Please check git history to understand when and why self.volfileserver was added.

In the git history the function was always called with the sef._volfileserver, as you can see here https://github.com/oVirt/vdsm/commit/907560b766835e585a162aa8b36d786e17f62384.

Also how did you test this change? Please describe what was tested - how you simulated the issue and tested that this change solves the issue, and how you tested that this change does not cause regressions in other flows.

Regarding the tests, I will describe in a new comment if you don't mind.

nirs commented 5 months ago

@josgutie ok lets continue here, but please do not close PRs and submit new one like this. If there was an issue with the previous PR it can be easily fixed by pushing a new version.

josgutie commented 5 months ago

$ grep -E "storage|mnt_options" /etc/ovirt-hosted-engine/hosted-engine.conf storage=rhhi03.storage01.lab:/engine mnt_options=backup-volfile-servers=rhhi01.storage01.lab:rhhi02.storage01.lab

$ df -h | grep glusterSD rhhi03.storage01.lab:/engine 100G 14G 87G 14% /rhev/data-center/mnt/glusterSD/rhhi03.storage01.lab:_engine rhhi03.storage01.lab:/vmstore 150G 2.6G 148G 2% /rhev/data-center/mnt/glusterSD/rhhi03.storage01.lab:_vmstore rhhi01.storage01.lab:/data 50G 908M 50G 2% /rhev/data-center/mnt/glusterSD/rhhi01.storage01.lab:_data

josgutie commented 5 months ago
nirs commented 5 months ago

@nirs suggested testing a new deployment, it will take some time, but I guess that I could check the scenario.

Test looks good, and sure we want to test the common operations like creating a new storage domain, putting storage domain to maintenance and activating it.

But I think you change is not complete - it assumes that backup-volfile-servers= is configured with all the nodes. In this case it seems that gluster does use any of the nodes and does not need the volfileserver argument (otherwise it would not work). But if I remember correctly, backup-volfile-servers= is optional on engine side, so your change must handle the case when it is not configured on engine side.

So the logic seems to be:

if backup_volfile_servers:
    ommit the volfileserver argument
else:
    use the volfileserver argument

Also not using backup-volfile-servers seems very fragile, so we can consider warning about it when connecting to the storage.

josgutie commented 5 months ago

@nirs the gluster command doesn't have any parameter to point to the backup servers, you can review the following client documentation Gluster Command Line Interface. In fact it's hard to find the definition of the --remote-host gluster client parameter.

Regarding the suggested change, the problem is if the volfile server is down, the call will fail because --remote-host limits the call to the volfile server and omit calling to the backup servers.

We could deal with connection issue in that part of the code, but from my point of view, it's a gluster cli responsibility not vdsm.

I will create and destroy a gluster storage domain.

josgutie commented 5 months ago

New tests completed:

The only thing left to test is to perform a full RHHI 1.8 deployment.

josgutie commented 4 months ago

Hi, A new RHHI 1.8 deployment test completed. Scenario:

In this new deployment I've created a new brick, glusterfs volume and glusterfs storage domain.

If you need further details of the tests performed, do not hesitate to ask.

aesteve-rh commented 4 months ago

/ost

aesteve-rh commented 4 months ago

@josgutie Thanks, your test battery looks correct. @nirs wdyt? do you still have concerns regarding the update?

@josgutie Could you check the module tests that are failing in the pipelines? Probably expectations changed after the update.

aesteve-rh commented 4 months ago

/ost

aesteve-rh commented 4 months ago

@josgutie linter failed, mind the longer lines. Limit is 79 characters.

I was also wondering if you could add a small test that throws an exceptions and triggers the retry with the backup servers. So that at least we add some coverage for the new code.

aesteve-rh commented 4 months ago

/ost

aesteve-rh commented 4 months ago

/ost

aesteve-rh commented 4 months ago

/ost

aesteve-rh commented 4 months ago

/ost

tinez commented 4 months ago

/ost el8stream basic-suite-master

tinez commented 4 months ago

/ost basic-suite-master el8stream

michalskrivanek commented 4 months ago

el8stream OST is not working, el9stream should pass

michalskrivanek commented 4 months ago

/ost basic-suite-master el9stream

aesteve-rh commented 4 months ago

/ost

aesteve-rh commented 4 months ago

@nirs do you have any comment on this PR? Otherwise, I will integrate.

josgutie commented 4 months ago

Hi Nir, PR is updated, my doubt in on the test coverage, but it should work in both cases, where backup server are defined and not.

aesteve-rh commented 4 months ago

/ost

aesteve-rh commented 4 months ago

/ost

aesteve-rh commented 4 months ago

/ost