neoave / mrack

Multicloud use-case based multihost async provisioner for CIs and testing during development
Apache License 2.0

Issue: Test inventory is not produced until machines are checked to be alive which complicates debugging #236

Open jakub-vavra-cz opened 1 year ago

jakub-vavra-cz commented 1 year ago

Mrack does not produce the test inventory until it is sure that the machines are alive (using ssh/winrm). The issue is that when machines are provisioned but not accessible (wrong credentials, for example), mrack (te) is stuck in provisioning, and when it is killed with CTRL+C there is no inventory to use for teardown to get rid of the machines. Also, without a test inventory, debugging is more complicated, as one needs to fish the machine info out of mrack.log.

Would it be possible to shift the inventory creation to the step where: "All hosts reached provisioning final state (ACTIVE or ERROR)"?
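The proposed ordering can be sketched roughly as follows. This is only an illustration of the idea, not mrack's actual code: `write_inventory`, `provision_then_check`, the `check_alive` callback, and the JSON inventory layout are all hypothetical names and formats chosen for the example.

```python
import json
import tempfile
from pathlib import Path


def write_inventory(hosts, path):
    """Dump host info to a JSON file (hypothetical format, not mrack's)."""
    Path(path).write_text(json.dumps({"hosts": hosts}, indent=2))


def provision_then_check(hosts, inventory_path, check_alive):
    """Write the inventory first, then run the liveness check.

    Returns the names of the hosts that failed the check.
    """
    # Step 1: all hosts reached a final state, so persist them right away.
    write_inventory(hosts, inventory_path)
    # Step 2: the liveness check may hang or be interrupted,
    # but the inventory file already exists for debugging or teardown.
    return [h["name"] for h in hosts if not check_alive(h)]


hosts = [
    {"name": "srv1", "ip": "192.0.2.10", "status": "ACTIVE"},
    {"name": "srv2", "ip": "192.0.2.11", "status": "ACTIVE"},
]
inventory_path = Path(tempfile.mkdtemp()) / "inventory.json"

# Pretend srv2 is unreachable (e.g. wrong credentials).
failed = provision_then_check(hosts, inventory_path, lambda h: h["name"] != "srv2")
```

With this ordering, an interrupted or failed check still leaves an inventory file listing both machines.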

Tiboris commented 1 year ago

Hey @jakub-vavra-cz, thanks for describing the problem. We will discuss this in the team, but I believe the check has its purpose. However, if you wish, you can disable the feature for your local runs by setting the post_provisioning_check values in your provisioning-config.yaml file:

post_provisioning_check:
    ssh:
        # Default configurations for every host
        enabled: True # True | False
        disabled_providers: ["podman"] # Per provider override to `enabled: True`
        enabled_providers: [] # Would be relevant if 'enabled' is 'False'
        port: 22
        timeout: 10 # minutes

        # Overrides

        # Groups

        # group:
        #     ad:
        #         timeout: 20 # minutes
        #         # enabled: False # an example disabling check for ad group

        # If we want to override based on OS
        os:
            win-2012r2:
                timeout: 15 # minutes
                # enabled: False  # an example to disable for distro
            win-2016:
                timeout: 15 # minutes
            win-2019:
                timeout: 15 # minutes
            win-2012r2-latest:
                timeout: 20 # minutes
            win-2016-latest:
                timeout: 20 # minutes
            win-2019-latest:
                timeout: 20 # minutes
            win-2022-latest:
                timeout: 20 # minutes
            # fedora-34:  # an example
            #     enabled: False
            #     timeout: 1
            #     enabled_providers: []
            #     disabled_providers: ["static"]

TL;DR: you can disable the check with:

post_provisioning_check:
    ssh:
        # Default configurations for every host
        enabled: False

NOTE: If you let mrack finish, it will also clean up the reserved resources when the ssh check fails.

jakub-vavra-cz commented 1 year ago

The check definitely has a purpose and is useful. I was just thinking that it could be done after the inventory is written. As for letting mrack finish: I either want to debug the issue, in which case having the inventory would be useful, or I want to give up and drop the machines without waiting 20 minutes to try again.

pvoborni commented 1 year ago

Debugging can be done while mrack is still running, without the inventory. Open mrack.log in a different terminal; it contains the full ssh command mrack is executing. That command can then be run manually, e.g. with more verbose output.

pvoborni commented 1 year ago

That said, I think the described behavior implies a more serious bug: mrack doesn't even try to clean up the VMs if it is killed by an interrupt signal.
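For illustration, cleanup on interrupt could look something like the sketch below. It relies on Python's default behavior of raising `KeyboardInterrupt` on CTRL+C; `run_with_cleanup` and the `provision`, `wait_until_alive`, and `destroy` callbacks are hypothetical stand-ins, not mrack functions.

```python
def run_with_cleanup(provision, wait_until_alive, destroy):
    """Provision hosts and destroy them if the wait is interrupted."""
    hosts = provision()
    try:
        wait_until_alive(hosts)
    except KeyboardInterrupt:
        # CTRL+C arrived during the liveness check: release the machines,
        # then re-raise so the caller still sees the interruption.
        destroy(hosts)
        raise
    return hosts


destroyed = []


def fake_wait(hosts):
    # Simulate the user pressing CTRL+C while the check is running.
    raise KeyboardInterrupt


try:
    run_with_cleanup(lambda: ["srv1", "srv2"], fake_wait, destroyed.extend)
except KeyboardInterrupt:
    interrupted = True
else:
    interrupted = False
```

Re-raising after cleanup keeps the exit behavior the user expects from CTRL+C while making sure no machines are left behind.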

dav-pascual commented 1 week ago

@jakub-vavra-cz Reviewing issues. Is this still relevant for you?

jakub-vavra-cz commented 1 week ago

Yes, I still consider the lack of inventory, and machines hanging around with no way to deprovision them, a serious defect in the design.