Work with SRE to write some basic requirements

michaelpj commented 2 years ago

I've reached out.

blaggacao commented 1 year ago

Would it be too much of a plug to paraphrase the 4 layers of packaging (by someone who can explain)? And even recommend in the practices section the use of Nix/Standard?

If you agree on this direction, you can assign me.

angerman commented 1 year ago

From the contention I've seen around nix/std, I'm not sure recommending this in the practice section of the book is a good idea.

I do however agree that we should be clear in what interfaces we want to make operation and deployment easy. Can we structure this from how deployment happens, what is required for it, and why that is important? I think that would make it easier for most engineers to follow that model. If we can motivate the use of nix (and std) from that naturally, that would probably be good.

michaelpj commented 1 year ago

From a policy perspective I think we might say:

A project that needs to be deployed by SRE MUST either have packages buildable with Nix or provide an OCI image (is that right @blaggacao ?), link to the SRE handbook for justification
A Haskell project that builds with Nix MUST use haskell.nix (seems true to me?)

From a practices perspective I think we could say:

One way to comply with the SRE requirements is to build your projects with Nix; this also makes it easy to get images, see examples X, Y, Z
Many projects use std to organize their Nix code, see examples...
- i.e. don't make this even a SHOULD or MAY, just a practice people might want to follow

michaelpj commented 1 year ago

To be clear, the practices section is not intended to be anything like as normative as the policy section, it's much more "here are some ideas you might want to copy if you like them". Part of the point of that is so that we can exhibit a variety of more opinionated approaches that various projects have taken so people know about them and we can grow towards maybe reaching consensus on some of them.

blaggacao commented 1 year ago

Sounds pretty good!

I agree with the cautionary voice of not overclocking on the recommendation (which might backfire, anyways).

But I also think we shouldn't overclock on scepticism of what has proven to be actually quite useful, already (because it reinvigorates what is in considerable portions a sentimental cycle around change).

Fundamentally, the important thinking is (MUST / policy):

package (just the raw binary thing)
operable (package + runtime environment, not an OCI [yet] -- can think of it as an "entrypoint" or a "proto-OCI")
- we need a dedicated middle ground to wrap-fix the binary and convey structured info about config options
OCI made out of an operable
SHOULD: scheduler manifests for a cloud scheduler (such as k8s, etc.)
MAY: operator service (long running control service for the binary)
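
A minimal sketch of how the layers above could be turned into a checklist, assuming the first three items are MUST and using hypothetical artifact names (`package`, `operable`, `oci_image`, `scheduler_manifests`, `operator_service`) that are not taken from any existing tool:

```python
# Hypothetical artifact names paired with the requirement level
# suggested in the list above (assumption: first three are MUST).
LAYERS = [
    ("package", "MUST"),
    ("operable", "MUST"),
    ("oci_image", "MUST"),
    ("scheduler_manifests", "SHOULD"),
    ("operator_service", "MAY"),
]


def missing_by_level(provided):
    """Group the artifacts a project does not provide by requirement level."""
    missing = {"MUST": [], "SHOULD": [], "MAY": []}
    for name, level in LAYERS:
        if name not in provided:
            missing[level].append(name)
    return missing


def is_compliant(provided):
    """A project is compliant when no MUST-level artifact is missing."""
    return not missing_by_level(provided)["MUST"]
```

For example, `is_compliant({"package", "operable", "oci_image"})` holds, while a project that only ships a raw package would be reported as missing `operable` and `oci_image` at the MUST level.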

Because of reproducibility concerns, we shouldn't accept Dockerfile-built containers (MUST NOT): they introduce a whole unnecessary error class into ops for no substantial offsetting benefit.

Sidenote: In other kinds of material, we can even (very rightfully) spin this as a good choice for a blockchain infrastructure for the world.

I think we can allow non-reproducible binaries (SHOULD) if packaging for that language is unnecessarily hard, at least as a temporary workaround until the situation for that language improves.

I have no real opinion about how to package Haskell. haskell.nix is fine (although IFDs have caused headaches at various levels, e.g. CI, but we have more or less learned to live with it).

angerman commented 1 year ago

I'd like us to focus on what our constraints are and why, and then have that lead to the how (as we are currently doing it). I think that argument is much stronger and more persuasive.

SRE operates a container-based platform for service deployment; this means that services to be deployed by SRE need to be packaged as Open Container Initiative (OCI) images. OCI images consist of ...

For auditability of deployment products, a reproducible build for release artefacts is required. Therefore builds need to make sure they do not download unknown data from the internet during the build (e.g. no downloads without ensuring that the same request always yields the same result, by verifying signatures and hashes). Exceptions are made for languages where this is currently impossible. (Which languages do we have that don't allow this right now?)
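
As a rough illustration of the "verify hashes" idea, here is a minimal sketch that only accepts a fetched build input if it matches a pinned SHA-256 digest. The function names are hypothetical, and this is not how Nix or any particular build tool implements it:

```python
import hashlib


def sha256_hex(data):
    """Return the hex-encoded SHA-256 digest of the given bytes."""
    return hashlib.sha256(data).hexdigest()


def verify_pinned(data, expected_sha256):
    """Return data unchanged if it matches the pinned digest, else raise.

    The pinned digest would be committed alongside the build description,
    so the same request must always yield the same bytes.
    """
    actual = sha256_hex(data)
    if actual != expected_sha256.lower():
        raise ValueError(
            f"hash mismatch: expected {expected_sha256}, got {actual}"
        )
    return data
```

A build step would call `verify_pinned` on every downloaded input before using it, so a changed upstream file fails the build instead of silently changing the release artefact.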

We have had good experience using nix for reproducible packaging of build artefacts for deployment and distribution, please see ... for examples.

I think we should be very mindful here that a lot of engineers may not have extensive cloud infrastructure or deployment experience beyond operating a Debian-based VPS/root server with packages from the pre-existing system package repository, and/or deploying their own application by copying a binary, PHP/Ruby/Python files (or some jar) to a server, and maybe writing some systemd/init scripts to launch them. Maybe some cron experience. Or Heroku-like deployment. Maybe some Docker service deployment.

blaggacao commented 1 year ago

Yep, that sounds pretty good! Maybe not too much detail (if we can avoid it) and stick to the high level aspects.

I think such a policy is a good place to nurture the need to broaden one's perspective into the realm of operations.

TxPipe is notably doing an excellent job of informing development based on operational needs when building new solutions around Cardano.

So "blinkers on ops" is not a given, it may just be a local phenomenon to X.

michaelpj commented 1 year ago

> Maybe not too much detail (if we can avoid it) and stick to the high level aspects.

I disagree. If we're going to turn this into policy that can be followed by people who don't have ops experience, then "high-level aspects" are totally useless. We need specific, clear, actionable things that they can DO or NOT DO. Otherwise we don't have a usable policy. Or we have a policy that is only applicable if you have a resident SRE who can actually understand it.

e.g. "operable (package + runtime environment, not an OCI [yet] -- can think of it as an "entrypoint" or a "proto-OCI") " is IMO not something that people can follow, whereas "it must be clearly documented how to build a docker image of your application" is getting close to being specific enough that people might be able to do it (or be aware that they can't do it).

We can be a bit vaguer when writing practices, but I would really like to try and keep things as concrete as possible. No long texts about high-level abstract models of how to think about deployment, please.

angerman commented 1 year ago

Just to make sure I'm not misunderstood. I am absolutely for concretely laying out how to do things. I just also want this to be properly motivated and explicitly stated. I want us to be as autonomous as possible. Following a recipe without understanding why is hard and can easily lead to errors.

we want to do X (deploy a service on IOG's service infrastructure).
our constraints are Y (service infrastructure is OCI container-based)
thus ... (deployable artifacts need to come in an OCI container, that looks like ...)
here's how this can be done (using a Dockerfile, using nix2container, ...)

Also need to address:

service discovery (why, how)
secrets management (why, how)
interactivity with state (database, file storage, ...)
how to test this locally? If this cannot be tested locally, how can one get access to a test environment?

It needs to be actionable for people with little deployment experience, but also motivated well enough that they don't just follow a recipe but know why.

blaggacao commented 1 year ago

Ok, convincing. Let's be super specific then. And decently motivate things (though not from first principles, people better read documentation about first principles than a policy).

michaelpj commented 1 year ago

Yes, we should have a rationale for policies. But if the rationale is too long then maybe we need to refer to another document :)

> Also need to address:

a) we don't have to solve everything at once! b) I suspect many of these things will be sufficiently different between projects that we won't be able to write any sane policy that we can expect everyone to adhere to. Maybe we can write more in Practices.

Let me try to cut down the scope a bit: as a starting point, I would like us to get written down anything that the Haskell packages that make up the Cardano node need to do in order to support SRE.

blaggacao commented 1 year ago

Small semantic improvement:

> to support SRE.

"to become operable".

It's not about SRE, actually.

input-output-hk / cardano-engineering-handbook

Work with SRE to write some basic requirements #15