spantaleev / matrix-docker-ansible-deploy

🐳 Matrix (An open network for secure, decentralized communication) server setup using Ansible and Docker
GNU Affero General Public License v3.0

Audit results #65

Open jcgruenhage opened 5 years ago

jcgruenhage commented 5 years ago

I'll put small things that I notice doing my audit of this here, which aren't big/important enough to get their own issue.

spantaleev commented 5 years ago

Thanks for your detailed report!

Some of these are already fixed, while others are in progress. I'll go through all of them below, in order.


self_check_dns.yml: maybe use the dig lookup plugin instead of a command, and if you use a command, use a command and not a shell.

Fixed in f92c4d5a27d775.
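For reference, a self-check via the dig lookup plugin might look roughly like this (a sketch, not the playbook's actual code; it assumes dnspython is available on the machine running Ansible, and the `hostname_matrix` variable name is illustrative):

```yaml
# Hypothetical sketch: resolve the Matrix SRV record with the dig lookup
# plugin instead of shelling out to the dig binary.
- name: Self-check the _matrix._tcp SRV record
  set_fact:
    matrix_srv_lookup: "{{ lookup('dig', '_matrix._tcp.' + hostname_matrix + '.', 'qtype=SRV') }}"

- name: Fail if the SRV record does not resolve
  fail:
    msg: "No _matrix._tcp SRV record found for {{ hostname_matrix }}"
  when: matrix_srv_lookup in ['NXDOMAIN', '']
```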

During that, I've found out that we don't do a similar self-check for our mxisd install, so I've added support for that in ef2dc3745aab2d5.


There are a lot of set_fact tasks that don't depend on anything that is only available at runtime; moving those into defaults or vars (depending on what they are exactly) would make sense IMO.

Can you give a specific example?

Maybe this is one?

- set_fact:
    matrix_synapse_app_service_config_file_mautrix_whatsapp: '/app-registration/mautrix-whatsapp.yml'

Defining this one in defaults seems useless however. It's not something we want/intend people to change. Using a variable is a nice way to not duplicate it around the tasks file and risk a typo somewhere.

Putting it in some other variables file which cannot be overridden is a possibility, but it also has downsides. To comment on this specific example, all whatsapp-related "code" is just in the defaults file (where configuration usually goes) and in setup_synapse_ext_mautrix_whatsapp.yml. It's nicely contained.
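For illustration, such a non-overridable constant could live in a role's vars file instead of a set_fact task (the file location is hypothetical):

```yaml
# vars/main.yml (hypothetical location) -- role vars take precedence over
# inventory, so this value effectively cannot be overridden by users.
matrix_synapse_app_service_config_file_mautrix_whatsapp: '/app-registration/mautrix-whatsapp.yml'
```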


There should not be a default UID/GID for the matrix user (if this is even a user that should exist; having dedicated users for components would also be a possibility). When creating a user with Ansible, the result includes the UID/GID, and you can use that.

That's probably nice to do and we should investigate how to do it.

Some components need to share a user (matrix-synapse needs to be able to access its media_store, which may be picky if it's on Amazon S3 using Goofys)
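The no-hardcoded-UID/GID suggestion could be sketched like this (the user module reports the assigned uid and primary group id in its result; the variable names here are illustrative):

```yaml
# Sketch: create the matrix user without a fixed UID/GID and reuse
# whatever the system assigned.
- name: Ensure matrix user exists
  user:
    name: matrix
    system: yes
  register: matrix_user_result

- name: Remember the assigned UID/GID
  set_fact:
    matrix_user_uid: "{{ matrix_user_result.uid }}"
    matrix_user_gid: "{{ matrix_user_result.group }}"
```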


The whole letsencrypt business is a bit weird; I really don't like certbot. Have you thought about maybe using dehydrated here? I think it integrates with Ansible + containers a lot better than certbot.

I've just taken a look at dehydrated and it seems like it requires there to be some HTTP server to actually serve the challenge files.

During playbook execution (initial install, at least), there is no such server, so we rely on certbot to start it for us.

It should be noted that this playbook was using acmetool before, via the willwill/acme-docker Docker image. At some point, that image got deprecated, so we were forced to migrate.

Although I don't like that certbot is somewhat larger (36MB Docker container) compared to acmetool's 8MB, certbot seems like a very maintained and active project. One that will probably not go away on us, like acmetool did.

So far there haven't been any problems with certbot, so it seems like we should only replace it with something else if there's some good reason for it. If you can think of such reasons, let's proceed to discuss it.


setup/ssl/setup_ssl_self_signed_obtain_for_domain.yml: You could generate the SSL cert manually only on hosts which don't provide an up-to-date pyopenssl, and do it the proper way on other hosts.

I agree. We can do that, with the following caveat:

I'm a bit conflicted on how cleanly this can be implemented and how much more complicated it will make setup_ssl_self_signed_obtain_for_domain.yml. Right now, it's very simple/straightforward and seems to work great in various places. If we are to make it much longer and more complicated, we should consider what we'll win by it (besides having a proper Ansible-way example in the code).
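If we ever do go that way, the split might look roughly like this (a sketch only; the `pyopenssl_is_recent` fact and the path variables are illustrative, not existing playbook variables):

```yaml
# Hypothetical sketch: use the Ansible crypto modules where pyopenssl is
# recent enough, falling back to the openssl binary elsewhere.
- name: Generate private key (Ansible-native)
  openssl_privatekey:
    path: "{{ ssl_key_path }}"
  when: pyopenssl_is_recent | bool

- name: Generate CSR (Ansible-native)
  openssl_csr:
    path: "{{ ssl_csr_path }}"
    privatekey_path: "{{ ssl_key_path }}"
    common_name: "{{ domain_name }}"
  when: pyopenssl_is_recent | bool

- name: Generate self-signed certificate (Ansible-native)
  openssl_certificate:
    path: "{{ ssl_cert_path }}"
    csr_path: "{{ ssl_csr_path }}"
    privatekey_path: "{{ ssl_key_path }}"
    provider: selfsigned
  when: pyopenssl_is_recent | bool

- name: Generate self-signed certificate (openssl binary fallback)
  command: >
    openssl req -x509 -nodes -newkey rsa:4096
    -keyout {{ ssl_key_path }} -out {{ ssl_cert_path }}
    -days 3650 -subj "/CN={{ domain_name }}"
  args:
    creates: "{{ ssl_cert_path }}"
  when: not (pyopenssl_is_recent | bool)
```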


periodic nginx restarts could be done with the crontab module instead of templating a cron.d file

I've done some work in #68 and would love to hear some comments about it.
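For comparison, the cron-module variant (as opposed to templating a file into /etc/cron.d) would be something like this; the schedule and job command are illustrative:

```yaml
# Hypothetical sketch: periodic matrix-nginx-proxy restart via the cron
# module instead of a templated cron.d file.
- name: Schedule periodic matrix-nginx-proxy restart
  cron:
    name: "Restart matrix-nginx-proxy to pick up renewed certificates"
    user: root
    minute: "15"
    hour: "4"
    job: "/usr/bin/systemctl restart matrix-nginx-proxy.service"
```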


the start tasks thingy could (and should) be replaced with a loop. Also, daemon_reload is not needed there. We also don't need daemon_reload in a lot of other cases; that should be handled with a handler and not a task.

A loop sounds good!

Getting rid of daemon_reload here may be good as well. At least it can be done just once.

As for using a handler, my experience shows that using handlers is unreliable. Say you run the playbook and it installs a .service file, and then the playbook fails on some later task for some reason. Because handlers are executed at the end of each block of tasks in a play (source), the handler may not get executed. The user would then re-run the playbook, the .service file would not get installed (as it already exists), the handler would not get notified (as nothing changed!), and we would never have called daemon_reload.

Let me know if anything improved in Ansible's handlers situation since I've last tried it. If not, we'd probably better stay away from such fragile features.

To get back to the redoing of start.yml, I've thought about redoing this part many times, but never really got around to it.

We'll need to build a list of services we want to start (some are conditional - e.g. matrix-corporal, matrix-nginx-proxy, matrix-goofys, etc.).
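Building that conditional list could be done with plain list concatenation (a sketch; the `*_enabled` flag names are illustrative):

```yaml
# Hypothetical sketch: assemble the list of services to start, appending
# optional components only when they are enabled.
- set_fact:
    matrix_services_list: >-
      {{
        ['matrix-synapse']
        + (['matrix-corporal'] if matrix_corporal_enabled | default(false) else [])
        + (['matrix-nginx-proxy'] if matrix_nginx_proxy_enabled | default(false) else [])
        + (['matrix-goofys'] if matrix_goofys_enabled | default(false) else [])
      }}

- name: Ensure Matrix services are started
  service:
    name: "{{ item }}"
    state: started
  with_items: "{{ matrix_services_list }}"
```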


going from YAML truth values to JSON truth values via "true if var else false" is ugly; you can use the |to_json filter for that

Fixed in 00ae4350441ea.
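The before/after, with an illustrative variable name:

```yaml
# Before: inline conditional to turn a YAML boolean into a JSON literal
# matrix_some_flag_json: "{{ 'true' if matrix_some_flag else 'false' }}"
# After: the to_json filter does the same conversion
matrix_some_flag_json: "{{ matrix_some_flag | to_json }}"
```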


if the docker_container module of ansible is used for launching the containers, the env file templates are unnecessary, the vars can just be taken from vars directly

Care should be taken, because restarting one service (say matrix-postgres), may pull down another (matrix-mxisd) which depends on it. Our invocations are currently in a good order to prevent that from happening, but it's a possibility.

If we are to do state: restarted in combination with with_items, it will probably go through the list and call systemctl restart <service> once for each one, in that order. If the order is not optimal, we may cause more restarts than necessary.

To remove the necessity to think about order (and leave that to systemd, where dependency configuration exists already), we should probably do it in 2 passes. Something like:

- name: Ensure all services are stopped
  service:
    name: "{{ item }}"
    state: stopped
  with_items: "{{ services_list }}"

- name: Ensure all services are started
  service:
    name: "{{ item }}"
    state: started
  with_items: "{{ services_list }}"

systemd would then start any dependencies it sees for whichever service it's trying to start. And if some dependency is started like that, and we try to start it manually, it will just be a no-op.


if the docker_container module of ansible is used for launching the containers, the env file templates are unnecessary, the vars can just be taken from vars directly

I'm not sure why we manually run the docker command now, instead of using docker_container in more places. It probably has something to do with docker_container not supporting certain features or only supporting them in some newer docker-python library (which is hard to obtain).

It will be nice to revisit those instances and see (test) what can be converted.

Another side-benefit of using docker run is that sometimes what we run is very close to something else used elsewhere. For example, the SSL renewal script that we install on the server performs some docker run command which is very similar to what our playbook uses during execution. Maintaining these 2 related things is much easier when they both use docker run.

Still, not all of the docker run calls have such a benefit holding them back, and where migrating to docker_container is possible, I agree that we should.
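A docker_container-based variant would pass the environment straight from vars, skipping the templated env file. A rough sketch (the environment variable names here are illustrative, not the bridge's actual configuration keys):

```yaml
# Hypothetical sketch: launch a container with docker_container, taking
# environment values directly from Ansible vars instead of an env file.
- name: Ensure matrix-mautrix-whatsapp container is running
  docker_container:
    name: matrix-mautrix-whatsapp
    image: "{{ matrix_mautrix_whatsapp_docker_image }}"
    restart_policy: unless-stopped
    env:
      HOMESERVER_ADDRESS: "{{ matrix_homeserver_url }}"
      HOMESERVER_DOMAIN: "{{ matrix_domain }}"
```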

jcgruenhage commented 5 years ago

Thanks for the detailed feedback. Why doesn't github have threading in comments? This sucks majorly. Anyway, here I go:

Putting it in some other variables file which cannot be overridden is a possibility, but it also has downsides. To comment on this specific example, all whatsapp-related "code" is just in the defaults file (where configuration usually goes) and in setup_synapse_ext_mautrix_whatsapp.yml. It's nicely contained.

Things that shouldn't be modified should be in vars, things that should be modified in defaults. About things being more contained, that's one reason why I want things to be split up into dedicated roles. We can also have a dedicated whatsapp file in the vars directory.


Some components need to share a user (matrix-synapse needs to be able to access its media_store, which may be picky if it's on Amazon S3 using Goofys)

They need to be in a common group, with permissions on those directories being set correctly.
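That shared-group approach could be sketched as follows (user, group, and path names are illustrative):

```yaml
# Hypothetical sketch: per-component users sharing one group, with group
# write access (and the setgid bit) on the shared media store directory.
- name: Ensure shared matrix group exists
  group:
    name: matrix
    system: yes

- name: Ensure matrix-synapse user is in the matrix group
  user:
    name: matrix-synapse
    system: yes
    groups: matrix
    append: yes

- name: Ensure the media store is group-writable
  file:
    path: /matrix/synapse/media-store
    state: directory
    group: matrix
    mode: "2770"
```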


I've just taken a look at dehydrated and it seems like it requires there to be some HTTP server to actually serve the challenge files.

During playbook execution (initial install, at least), there is no such server, so we rely on certbot to start it for us.

If we install nginx anyway, we might as well install it before generating the certificates and just enable the sites needing those certs after the certs were generated. Alternatively, dehydrated supports dns-01, or we could run darkhttpd (an 80 kB HTTP 1.1 server) in the dehydrated container, serving the files.

So far there haven't been any problems with certbot, so it seems like we should only replace it with something else if there's some good reason for it. If you can think of such reasons, let's proceed to discuss it.

Keeping certbot is fine, I'm just not a huge fan of it.


I'm a bit conflicted on how cleanly this can be implemented and how much more complicated it will make setup_ssl_self_signed_obtain_for_domain.yml. Right now, it's very simple/straightforward and seems to work great in various places. If we are to make it much longer and more complicated, we should consider what we'll win by it (besides having a proper Ansible-way example in the code).

I agree. Not all of the things I've found are binary decisions; this one is very much a trade-off. I agree with keeping it how it is right now.


Let me know if anything improved in Ansible's handlers situation since I've last tried it. If not, we'd probably better stay away from such fragile features.

What you describe is very annoying, yes, but there is a solution! You can use a meta task to flush handlers, which will cause all pending handlers to be executed now instead of at the end of the play.
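The meta task in question, placed right after the tasks that notify the handler:

```yaml
# Run any pending handlers (e.g. a daemon_reload handler) immediately,
# instead of at the end of the play.
- meta: flush_handlers
```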


Care should be taken, because restarting one service (say matrix-postgres), may pull down another (matrix-mxisd) which depends on it. Our invocations are currently in a good order to prevent that from happening, but it's a possibility.

That is a reason to use docker_service instead of docker_container, to make sure we get a dependency tree there too. Using docker_service instead of docker_container means that splitting up the things into dedicated roles would become harder.


Another side-benefit of using docker run is that sometimes what we run is very close to something else used elsewhere. For example, the SSL renewal script that we install on the server performs some docker run command which is very similar to what our playbook uses during execution. Maintaining these 2 related things is much easier when they both use docker run.

I sort of agree with that one, but I still don't think that calling the docker binary from systemd units is a good idea.