Closed julienlavergne closed 5 months ago
Thank you for writing this up @julienlavergne - this has recently been on my mind also. I'm going to document the context of why this is the case below (this will be long I think :)), and give my thoughts.
Firstly, let's split the problem into two distinct parts:
Prior to v2, pyinfra relied on line-number ordering and pre-execution fact gathering to achieve it's high performance. The reason for this is that operations were generated on hosts sequentially, rather than in parallel. As facts were required, they were gathered in parallel on all hosts (whether or not they need the specific fact). For example:
# inventory.py
web_servers = ["web-01", "web-02"]
db_servers = ["db-01", "db-02"]
# deploy.py
if host in inventory.get_group("web_servers"):
files.file(path="web-file")
else:
files.file(path="db-file")
To generate the commands to execute the following would happen (v0.x, v1.x):
deploy.py
with host
set to that server
b. as facts are required (files.File
), load them on all 4 servers in parallelBecause facts were loaded in parallel (2.a), each iteration of 2 got quicker and quicker as most/all facts were pre-cached for the current host. This is why facts were(are) gathered before execution.
The above example also highlights why operation order is generated from line numbers - because the same code (deploy.py
) is executed for multiple hosts, the order in which operations are called is inconsistent. In the above example the user expects execution flow to behave like:
files.file(path=web-file)
on web-01 & web-02files.file(path=db-file)
on db-01 & db-01By taking operation order as they are called this would not be possible. Note: this does not affect deploys against a single host target, where operation call order would work.
Back to v2 and your points above, I'll split my thoughts into the two problems above:
- Since there is no need to come up with a list of facts to gather beforehand, the whole problem of ordering operations goes away. Especially, the code that browse the stack to come up with operation order is unnecessary, and this code does not output the correct order in several cases.
- There is no need for the
preserve_loop_order
magic anymore. It was anyway very counter-intuitive to have operations in loops not executed in the expected order.
Unfortunately I don't think we can avoid this without breaking operation execution flow, particularly where there are multiple code paths for different hosts involved in a deploy. The line/stack ordering enforces "correct" ordering - except loops and context processors. The general assumption being that deploy files are generally "simplified Python" consisting of operation calls, conditional statements and functions. I'm not a fan of this gotcha and would be keen to investigate alternatives!
While I don't see a way to remove the line ordering mechanism, I would like to have it automatically handle loops and context processors if possible. In v0.x pyinfra would modify the ast
of deploy code before execution to achieve ordering without line numbers and that may be a workable solution. Alternatively it might be possible to modify the loop detection code to automatically re-order them as expected.
- The whole concept of nested operations is moot or becomes very limited. It seems the main use case was to run an operation, get the output and perform more operations based on the output. But with facts gathered along the way, the output of the command is available immediately and conditional logic can be written directly within the main python script (instead of the callback).
Because of the operation ordering issue, it's still not possible to provide output from an operation immediately. The deploy code must be run once before any operations are actually executed to generate the order, which unfortunately makes it impossible to have the output included.
- Most usage of
assume_present
arguments becomes unnecessary since facts will reflect the correct state of the machine right before an operation is performed.
I would absolutely love to remove this, it's a real pain and a massive gotcha. v2 makes it entirely possible to do by having operations (re)collect facts at execution time. The only drawback is the list of changes pre-execution may not be correct; ie if you do a dry run deploy first you expect the number of commands proposed to match those executed, and collecting facts at execution may break this. One option could be to display "up to X" commands per operation, because we can make reasonable assumptions that certain facts will change (files) and others will not (system OS).
Collecting some thoughts below on the more general philosophy of pyinfra and how it works.
I do think the "dry run" pyinfra offers is a powerful tool that has a lot of unused potential. On a basic level pyinfra could support terraform style approval steps. Even more interesting would be the idea of creating a diff file that can then be moved somewhere else for execution - pyinfra needn't even be the tool doing the execution.
The whole two-stage deploy mechanism has consistently provided complexity over the last 7(!) years, but has also enabled writing almost-normal Python code to generate operations that execute in a similar way to tools like Ansible. I've yet to encounter something that wasn't possible (but have seen things not possible in other tools). Examples & documentation would help a lot here I think.
Today pyinfra seems to be a hybrid of a Ansible/SaltStack-like mostly-state-base ddeployment tool and Fabric/Parallel-SSH command execution tool. This is definitely both an advantage in terms of high flexibility but also a disadvantage because it comes with some gotchas that make it "almost like Python" at times.
I hope this provides some context, please let me know if anything doesn't make sense and would love to hear thoughts from any pyinfra users on the above. Ultimately I think any changes to these systems are on the table assuming enough support and technical possibility :)
Coming back on your example:
- Connect to all 4 servers in parallel
- For each server sequentially: a. execute
deploy.py
withhost
set to that server b. as facts are required (files.File
), load them on all 4 servers in parallel- Now we have our operation order & commands to run
- Execute all the operations in parallel across the hosts
The flow I am referring to is the following:
In parallel on all 4 servers:
deploy.py
, which gather facts and run commands along the wayIf I take the different points you mention, I think they can be supported with this flow:
deploy.py
without executing the commands. You can even have a dry_run
variable set to true
for that purpose that user can inspect in order to adjust the execution in case of dry run.deploy.py
contains for different hosts, the execution is totally independent on every hosts.Multiple code paths for different hosts: It does no matter what deploy.py contains for different hosts, the execution is totally independent on every hosts. Operation order is correct per host according to expected python execution order, and it supports all python features. There is no attempt to synchronize operations across hosts except at the end, if one host finish before others, very well.
The problem with doing this is it breaks a number of scenarios in which the order across multiple costs is essential. Really anything involving a control / worker node setup. For example bootstrapping an Elasticsearch cluster might look like:
control-1
, data-1
, data-2
, data-3
control-1
There are many similar examples of this where operations on one node must complete before operations on another. The operation order as it currently exists handles this.
Pyinfra is not the right tool for that. I am doing similar things on my side, and there is better ways to do it than pyinfra.
Setup/configuration and actually deploying/running an application are two different things that comes with their own challenges and pyinfra is not equipped at all to deal with the challenges of bringing a cluster of applications up.
Typically, bringing up an ES cluster need to deal with scenarios where you add/remove nodes to the cluster, handle failure cases if the master fails, rollback to a previous working state etc..
Regardless, if such a case happen, pyinfra could expose an interface to synchronize the execution across all hosts at any time during execution. asyncio does provide the necessary synchronization mechanisms for that. I imagine this is not going to be a very popular feature. I have the feeling that mass configuration of multiple independent host in parallel is much more common.
I can absolutely say pyinfra works incredibly well for the ES use case, it's one of the places it was first used at scale within a company environment to manage tens of large (300+ node) clusters.
I am aware of its use in production today for all sorts of setups including ES, MariaDB and Kubernetes clusters. The operation ordering is relied upon to provide consistent idempotent deployment of these services. Because of this ops like adding/removing nodes can be achieved simply by updating the inventory and re-running.
To come back to the original question. What can be done when splitting the execution in 2 steps that cannot be done without this split ? I think it important to answer that because this split comes at the great cost of:
I think at least the following would be difficult with single-run execution:
We could use some command line ANSI escape code magic to update previously printed operations for hosts that have now also hit them. This might look something like:
--> Starting operation: Apt/Packages (packages=['vim'])
[host-1] No changes
[host-2] *executing* <insert loading animation here>
--> Starting operation: Apt/Packages (packages=['vim'], present=False)
[host-1] *executing* <insert loading animation here>
[host-2] *waiting for execution* <insert loading animation here>
...2 seconds later...
--> Starting operation: Apt/Packages (packages=['vim'])
[host-1] No changes
[host-2] Success <-- this line changed even though it was already printed
--> Starting operation: Apt/Packages (packages=['vim'], present=False)
[host-1] Success <-- same here
[host-2] Success <-- and here
While not ideal, this could be solved with a synchronization key:
burn_server_op_name = 'Set server on fire'
if host in inventory.get_group("flammable_nodes"):
pyinfra.operations.server.burn_server(name=burn_server_op_name, appoval_needed=True)
else:
pyinfra.operations.skip(name=burn_server_op_name)
--> Starting operation: Burn server (key='Set server on fire')
Do you want to continue with 'Set server on fire' with the following changes:
[host-1] Success: <some information about the planned execution>
[host-2] Success: <some information about the planned execution>
[host-3] Skipped
Continue? [y/N]
The key could just be name
, or it could be something separate like key
or sync_key
.
As @julienlavergne pointed out this could be solved using a synchronization API:
if host in inventory.get_group("control_nodes"):
# control stuff (1)
else:
# data stuff
pyinfra.wait(key = "Wait for control-1")
if host not in inventory.get_group("control_nodes"):
# data stuff that depends on (1) ("start ES on the three data nodes")
asyncio.Event doesn't need key
because it's used in the same execution context. The different host runs of pyinfra however do not share the same context (I think?).
I think using a key is fine, but we could move per-host code into the same context like so:
control_done_event = asyncio.Event()
async def deploy(host):
if host in inventory.get_group("control_nodes"):
# control stuff (1)
control_done_event.set()
else:
# data stuff
await control_done_event.wait()
# data stuff that depends on (1) ("start ES on the three data nodes")
Where this file is executed once and then deploy
is called for each host. This calling for each host could event be delegated to the the deploy file itself:
await asyncio.gather(deploy(host) for host in hosts_to_execute_magic_variable)
Although I don't know how pyinfra would be able to figure out which host an operation is being executed for.
For this one I cannot really think of a solution.
[1] I guess operations called with different arguments are considered the same.
P.S. Amazing project, thanks for all the hard work! :heart:
Just ran into the limitation of the two-phase deployment as well trying to use a custom fact which relies on a tool being installed in the same deployment here (https://github.com/mvgijssel/setup/pull/288/files#r1205699420). Also was expecting host.reload_fact(...)
to solve this for me, but digging a bit further I guess it makes sense this method doesn't solve it.
If you are contemplating changing the execution strategy maybe you need to take a look at how Pulumi does it. It's similar to Terraform, but then also writeable in Python 🎉. Basically turning all operations into promises and executing stuff once all dependencies are resolved, giving you a N-phase deployment!
Just filed an issue about the fact not being updated properly by the apt
operation: #987
It seems that fact-gathering makes things more stateful and causes this kind of consistency problems between the "facts" and the real system.
I’m convinced, it has become clear the disadvantages and confusion of the current execution strategy outweigh the advantages.
To remedy this I am proposing that pyinfra v3 will switch to executing operations and facts live. It will also be able to retain diff functionality simply by collecting facts twice if so desired. I’ve started working on this in #996, is working but tests/etc are not.
v3 implements this, beta is out and pending full release imminently.
Is your feature request related to a problem? Please describe
I have been modifying a local version of pyinfra in order to solve different issues (some of them posted as github issues here), and I came to the conclusion that there is not much advantages in gathering facts before operations. Actually, most of the limitations and weird behaviors I see would be solved by gathering facts along the way.
I would be interested to discuss to pros and cons and contribute in modifying the way pyinfra work if necessary.
Describe the solution you'd like
I have a small list of the advantages of gathering facts before operations that requires them:
preserve_loop_order
magic anymore. It was anyway very counter-intuitive to have operations in loops not executed in the expected order.assume_present
arguments becomes unnecessary since facts will reflect the correct state of the machine right before an operation is performed.server.shell
.