What are the advantages of gathering facts beforehand ?

julienlavergne commented 2 years ago

Is your feature request related to a problem? Please describe

I have been modifying a local version of pyinfra in order to solve different issues (some of them posted as github issues here), and I came to the conclusion that there is not much advantages in gathering facts before operations. Actually, most of the limitations and weird behaviors I see would be solved by gathering facts along the way.

I would be interested to discuss to pros and cons and contribute in modifying the way pyinfra work if necessary.

Describe the solution you'd like

I have a small list of the advantages of gathering facts before operations that requires them:

Since there is no need to come up with a list of facts to gather beforehand, the whole problem of ordering operations goes away. Especially, the code that browse the stack to come up with operation order is unnecessary, and this code does not output the correct order in several cases.
There is no need for the preserve_loop_order magic anymore. It was anyway very counter-intuitive to have operations in loops not executed in the expected order.
Most usage of assume_present arguments becomes unnecessary since facts will reflect the correct state of the machine right before an operation is performed.
The execution flow is easier to understand, for example, simple things like creating a directory and checking if that directory exists would simply works.
The whole concept of nested operations is moot or becomes very limited. It seems the main use case was to run an operation, get the output and perform more operations based on the output. But with facts gathered along the way, the output of the command is available immediately and conditional logic can be written directly within the main python script (instead of the callback).
The is no need to support dynamic facts, since facts are dynamic by default.
We can still operate a cache to make sure facts are not gathered unnecessary and, just like ansible, provide ways to invalidate the cache for arbitrary operations like server.shell.

Fizzadar commented 2 years ago

Thank you for writing this up @julienlavergne - this has recently been on my mind also. I'm going to document the context of why this is the case below (this will be long I think :)), and give my thoughts.

Firstly, let's split the problem into two distinct parts:

Ordering operations correctly
Gathering facts on operations (pre-execution vs. execution)

Historical Context

Prior to v2, pyinfra relied on line-number ordering and pre-execution fact gathering to achieve it's high performance. The reason for this is that operations were generated on hosts sequentially, rather than in parallel. As facts were required, they were gathered in parallel on all hosts (whether or not they need the specific fact). For example:

# inventory.py
web_servers = ["web-01", "web-02"]
db_servers = ["db-01", "db-02"]

# deploy.py
if host in inventory.get_group("web_servers"):
    files.file(path="web-file")
else:
    files.file(path="db-file")

To generate the commands to execute the following would happen (v0.x, v1.x):

Connect to all 4 servers in parallel
For each server sequentially: a. execute deploy.py with host set to that server b. as facts are required (files.File), load them on all 4 servers in parallel
Now we have our operation order & commands to run
Execute all the operations in parallel across the hosts

Because facts were loaded in parallel (2.a), each iteration of 2 got quicker and quicker as most/all facts were pre-cached for the current host. This is why facts were(are) gathered before execution.

The above example also highlights why operation order is generated from line numbers - because the same code (deploy.py) is executed for multiple hosts, the order in which operations are called is inconsistent. In the above example the user expects execution flow to behave like:

Execute files.file(path=web-file) on web-01 & web-02
Wait for completion of 1
Execute files.file(path=db-file) on db-01 & db-01

By taking operation order as they are called this would not be possible. Note: this does not affect deploys against a single host target, where operation call order would work.

What can we do now?

Back to v2 and your points above, I'll split my thoughts into the two problems above:

Operation ordering

Since there is no need to come up with a list of facts to gather beforehand, the whole problem of ordering operations goes away. Especially, the code that browse the stack to come up with operation order is unnecessary, and this code does not output the correct order in several cases.

There is no need for the preserve_loop_order magic anymore. It was anyway very counter-intuitive to have operations in loops not executed in the expected order.

Unfortunately I don't think we can avoid this without breaking operation execution flow, particularly where there are multiple code paths for different hosts involved in a deploy. The line/stack ordering enforces "correct" ordering - except loops and context processors. The general assumption being that deploy files are generally "simplified Python" consisting of operation calls, conditional statements and functions. I'm not a fan of this gotcha and would be keen to investigate alternatives!

While I don't see a way to remove the line ordering mechanism, I would like to have it automatically handle loops and context processors if possible. In v0.x pyinfra would modify the ast of deploy code before execution to achieve ordering without line numbers and that may be a workable solution. Alternatively it might be possible to modify the loop detection code to automatically re-order them as expected.

Fact gathering

The whole concept of nested operations is moot or becomes very limited. It seems the main use case was to run an operation, get the output and perform more operations based on the output. But with facts gathered along the way, the output of the command is available immediately and conditional logic can be written directly within the main python script (instead of the callback).

Because of the operation ordering issue, it's still not possible to provide output from an operation immediately. The deploy code must be run once before any operations are actually executed to generate the order, which unfortunately makes it impossible to have the output included.

Most usage of assume_present arguments becomes unnecessary since facts will reflect the correct state of the machine right before an operation is performed.

I would absolutely love to remove this, it's a real pain and a massive gotcha. v2 makes it entirely possible to do by having operations (re)collect facts at execution time. The only drawback is the list of changes pre-execution may not be correct; ie if you do a dry run deploy first you expect the number of commands proposed to match those executed, and collecting facts at execution may break this. One option could be to display "up to X" commands per operation, because we can make reasonable assumptions that certain facts will change (files) and others will not (system OS).

Thoughts

Collecting some thoughts below on the more general philosophy of pyinfra and how it works.

I do think the "dry run" pyinfra offers is a powerful tool that has a lot of unused potential. On a basic level pyinfra could support terraform style approval steps. Even more interesting would be the idea of creating a diff file that can then be moved somewhere else for execution - pyinfra needn't even be the tool doing the execution.

The whole two-stage deploy mechanism has consistently provided complexity over the last 7(!) years, but has also enabled writing almost-normal Python code to generate operations that execute in a similar way to tools like Ansible. I've yet to encounter something that wasn't possible (but have seen things not possible in other tools). Examples & documentation would help a lot here I think.

Today pyinfra seems to be a hybrid of a Ansible/SaltStack-like mostly-state-base ddeployment tool and Fabric/Parallel-SSH command execution tool. This is definitely both an advantage in terms of high flexibility but also a disadvantage because it comes with some gotchas that make it "almost like Python" at times.

I hope this provides some context, please let me know if anything doesn't make sense and would love to hear thoughts from any pyinfra users on the above. Ultimately I think any changes to these systems are on the table assuming enough support and technical possibility :)

julienlavergne commented 2 years ago

Coming back on your example:

Connect to all 4 servers in parallel

For each server sequentially: a. execute deploy.py with host set to that server b. as facts are required (files.File), load them on all 4 servers in parallel

Now we have our operation order & commands to run

Execute all the operations in parallel across the hosts

The flow I am referring to is the following:

In parallel on all 4 servers:

Connect to the host
Execute deploy.py, which gather facts and run commands along the way

If I take the different points you mention, I think they can be supported with this flow:

Dry Run: Seems to be trivially supported by running the deploy.py without executing the commands. You can even have a dry_run variable set to true for that purpose that user can inspect in order to adjust the execution in case of dry run.
Speed: It is not slower than the current flow, since everything is done in parallel.
Multiple code paths for different hosts: It does no matter what deploy.py contains for different hosts, the execution is totally independent on every hosts.
Operation order is correct per host according to expected python execution order, and it supports all python features. There is no attempt to synchronize operations across hosts except at the end, if one host finish before others, very well.

Fizzadar commented 2 years ago

Multiple code paths for different hosts: It does no matter what deploy.py contains for different hosts, the execution is totally independent on every hosts. Operation order is correct per host according to expected python execution order, and it supports all python features. There is no attempt to synchronize operations across hosts except at the end, if one host finish before others, very well.

The problem with doing this is it breaks a number of scenarios in which the order across multiple costs is essential. Really anything involving a control / worker node setup. For example bootstrapping an Elasticsearch cluster might look like:

Install ES apt package on all four nodes control-1, data-1, data-2, data-3
Setup config on all fur nodes
Start ES with some init config on control-1
Wait for this to complete, then start ES on the three data nodes

There are many similar examples of this where operations on one node must complete before operations on another. The operation order as it currently exists handles this.

julienlavergne commented 2 years ago

Pyinfra is not the right tool for that. I am doing similar things on my side, and there is better ways to do it than pyinfra.

Setup/configuration and actually deploying/running an application are two different things that comes with their own challenges and pyinfra is not equipped at all to deal with the challenges of bringing a cluster of applications up.

Typically, bringing up an ES cluster need to deal with scenarios where you add/remove nodes to the cluster, handle failure cases if the master fails, rollback to a previous working state etc..

Regardless, if such a case happen, pyinfra could expose an interface to synchronize the execution across all hosts at any time during execution. asyncio does provide the necessary synchronization mechanisms for that. I imagine this is not going to be a very popular feature. I have the feeling that mass configuration of multiple independent host in parallel is much more common.

Fizzadar commented 2 years ago

I can absolutely say pyinfra works incredibly well for the ES use case, it's one of the places it was first used at scale within a company environment to manage tens of large (300+ node) clusters.

I am aware of its use in production today for all sorts of setups including ES, MariaDB and Kubernetes clusters. The operation ordering is relied upon to provide consistent idempotent deployment of these services. Because of this ops like adding/removing nodes can be achieved simply by updating the inventory and re-running.

julienlavergne commented 2 years ago

To come back to the original question. What can be done when splitting the execution in 2 steps that cannot be done without this split ? I think it important to answer that because this split comes at the great cost of:

not being able to react dynamically to output and exit code of operations.
having things like preserve_loop_order and assume_present
not supporting some native python constructs with loops, context manager, exceptions
having to develop some specific features to workaround it like dynamic facts

ubipo commented 2 years ago

I think at least the following would be difficult with single-run execution:

the 'same'[1] operation on each host:
- cannot be displayed as a group in the summary output (a)
- cannot be interactively approved/disapproved as a group (b)
synchronization of inter-host dependent operation is more difficult (c)
generating a diff file for later execution (d)

The 'same'[1] operation cannot be grouped in the output (a)

We could use some command line ANSI escape code magic to update previously printed operations for hosts that have now also hit them. This might look something like:

--> Starting operation: Apt/Packages (packages=['vim'])
    [host-1] No changes
    [host-2] *executing* <insert loading animation here>

--> Starting operation: Apt/Packages (packages=['vim'], present=False)
    [host-1] *executing* <insert loading animation here>
    [host-2] *waiting for execution* <insert loading animation here>

...2 seconds later...

--> Starting operation: Apt/Packages (packages=['vim'])
    [host-1] No changes
    [host-2] Success  <-- this line changed even though it was already printed

--> Starting operation: Apt/Packages (packages=['vim'], present=False)
    [host-1] Success  <-- same here
    [host-2] Success  <-- and here

The 'same'[1] operation cannot be interactively approved/disapproved as a group (b)

While not ideal, this could be solved with a synchronization key:

burn_server_op_name = 'Set server on fire'
if host in inventory.get_group("flammable_nodes"):
  pyinfra.operations.server.burn_server(name=burn_server_op_name, appoval_needed=True)
else:
  pyinfra.operations.skip(name=burn_server_op_name)

--> Starting operation: Burn server (key='Set server on fire')
    Do you want to continue with 'Set server on fire' with the following changes:
    [host-1] Success: <some information about the planned execution>
    [host-2] Success: <some information about the planned execution>
    [host-3] Skipped
    Continue? [y/N]

The key could just be name, or it could be something separate like key or sync_key.

Inter-host dependent operations (c)

As @julienlavergne pointed out this could be solved using a synchronization API:

if host in inventory.get_group("control_nodes"):
  # control stuff (1)
else:
  # data stuff

pyinfra.wait(key = "Wait for control-1")

if host not in inventory.get_group("control_nodes"):
  # data stuff that depends on (1) ("start ES on the three data nodes")

asyncio.Event doesn't need key because it's used in the same execution context. The different host runs of pyinfra however do not share the same context (I think?).

I think using a key is fine, but we could move per-host code into the same context like so:

control_done_event = asyncio.Event()

async def deploy(host):
  if host in inventory.get_group("control_nodes"):
    # control stuff (1)
    control_done_event.set()
  else:
    # data stuff
    await control_done_event.wait()
    # data stuff that depends on (1) ("start ES on the three data nodes")

Where this file is executed once and then deploy is called for each host. This calling for each host could event be delegated to the the deploy file itself:

await asyncio.gather(deploy(host) for host in hosts_to_execute_magic_variable)

Although I don't know how pyinfra would be able to figure out which host an operation is being executed for.

Generating a diff file for later execution (d)

For this one I cannot really think of a solution.

[1] I guess operations called with different arguments are considered the same.

P.S. Amazing project, thanks for all the hard work! :heart:

mvgijssel commented 1 year ago

Just ran into the limitation of the two-phase deployment as well trying to use a custom fact which relies on a tool being installed in the same deployment here (https://github.com/mvgijssel/setup/pull/288/files#r1205699420). Also was expecting host.reload_fact(...) to solve this for me, but digging a bit further I guess it makes sense this method doesn't solve it.

If you are contemplating changing the execution strategy maybe you need to take a look at how Pulumi does it. It's similar to Terraform, but then also writeable in Python 🎉. Basically turning all operations into promises and executing stuff once all dependencies are resolved, giving you a N-phase deployment!

link2xt commented 1 year ago

Just filed an issue about the fact not being updated properly by the apt operation: #987 It seems that fact-gathering makes things more stateful and causes this kind of consistency problems between the "facts" and the real system.

Fizzadar commented 1 year ago

I’m convinced, it has become clear the disadvantages and confusion of the current execution strategy outweigh the advantages.

To remedy this I am proposing that pyinfra v3 will switch to executing operations and facts live. It will also be able to retain diff functionality simply by collecting facts twice if so desired. I’ve started working on this in #996, is working but tests/etc are not.

Fizzadar commented 5 months ago

v3 implements this, beta is out and pending full release imminently.

pyinfra-dev / pyinfra