Demote "specialized_init_system" certification test from "essential" to "normal"

lfn-cnti / certification

📞📱☎️📡🌐 Cloud Native Telecom Initiative (CNTI) Certification program

https://wiki.lfnetworking.org/display/LN/Certification

Other

22 stars 13 forks source link

Demote "specialized_init_system" certification test from "essential" to "normal" #129

Open martin-mat opened 2 months ago

martin-mat commented 2 months ago

Specialized_init_system in my opinion does not deserve to be "essential" for following reasons:

the main reason for having init system inside containers is handling signals and zombies. There are specialized tests already among essential tests for those so this one seem to be redundant. Additionally, specialized init system is just one of ways how to handle signals/zombies.
there is no good universal check for a specialized init system in cnf-testsuite. Current implementation checks for use just of "tini", "dumb-init", "s6-vscan" which is limiting.

I propose to leave it among certification tests, but demote from "essential" to "normal".

taylor commented 2 months ago

☝🏻 👀 @tomkivlin @pensarguet @jeffsaelens @vukg @electrocucaracha @nickolaev @wavell @HashNuke @denverwilliams

wvwatson commented 2 months ago

You can still have zombies with one of the popular, proper init systems like dummy init, just fyi. So another redundant check for zones is warranted.

If you know of well known init systems, feel free to add them.

Outside of that, reducing the amount of tests for microservices this will make passing harder, not easier. This is because you'll have less 'chances' to prove your good microservice practices. For example, if we removed the test you would now have 17 chances to pass 10 tests instead of 18 chances.

This count of tests was done to capture the idea that what is a cloud native good practice, e.g a good microservice practice, is hazy.

Check https://cloud.google.com/blog/topics/developers-practitioners/graceful-shutdowns-cloud-run-deep-dive

https://cloud.google.com/architecture/best-practices-for-building-containers

And https://ahmet.im/blog/minimal-init-process-for-containers/ for some background

agentpoyo commented 2 months ago

Outside of that, reducing the amount of tests for microservices this will make passing harder, not easier. This is because you'll have less 'chances' to prove your good microservice practices. For example, if we removed the test you would now have 17 chances to pass 10 tests instead of 18 chances.

Are we still requiring 10 tests to pass with the most recent addition of 'essential" tests? On a few tests, I was seeing Coredns CNF fail the cert when it passed 14 of 19 tests.

essential_tests

martin-mat commented 2 months ago

@wavell I don't doubt that using a specialized init system is a best practice for many use cases. I'm questioning its categorization to "essential" certification tests. The basic question we should answer: is usage of specialized_init_system test essential best practice?

My view on it is: essential best practice is to handle correctly signals and zombies. We have this covered already in other tests. As best practices describe, vendors have basically 3 options how to handle this, just one of them is specialized init process. And vendors can have their reasons to choose one of 2 different ways how to handle signals/zombies. From that point of view, I don't see this check as "essential". I'm open to be convinced, but I haven't seen a strong reason so far.

As for number of tests: Making certification tests passing easier is not a motivation for initiating this discussion. Keeping a test in certification criterions just because keeping numbers seems to me as a weak reason. However, in general, if number of essential certification test is decreased, certification criterions should be reconsidered/possibly changed.

@wavell @agentpoyo current certification criterion is to have 15 out of 19 essential tests passed.

@agentpoyo I can see you are not using latest main code, pod_dns_error has been removed from certification.

sorry martin, i edited your post by accident when hitting quote reply on my phone. Note to self: that button is dangerous.

agentpoyo commented 2 months ago

@agentpoyo I can see you are not using latest main code, pod_dns_error has been removed from certification.

Yeah, sorry, that was the only recent screenshot I had while doing tests prior to the removal. I was only confirming what Watson stated about 10 essential tests passing would pass the certification.

agentpoyo commented 2 months ago

@wavell @agentpoyo current certification criterion is to have 15 out of 19 essential tests passed.

That answers my question but if a essential test is removed and goes from 19 to 18 or even 17, are we lowering the number of essential "passed" to a lower number as well for one to pass the certification? I think this is what @wavell was detailing about the chances of passing tests to pass certification, not the test itself. If we keep the threshold at 15 but remove a test as "essential", we should consider promoting another test to essential or lower the threshold of 15.

martin-mat commented 2 months ago

@agentpoyo

However, in general, if number of essential certification test is decreased, certification criterions should be reconsidered/possibly changed.

electrocucaracha commented 2 months ago

My view on it is: essential best practice is to handle correctly signals and zombies. We have this covered already in other tests. As best practices describe, vendors have basically 3 options how to handle this, just one of them is specialized init process. And vendors can have their reasons to choose one of 2 different ways how to handle signals/zombies. From that point of view, I don't see this check as "essential". I'm open to be convinced, but I haven't seen a strong reason so far.

@martin-mat even the authors of this article recognized at the beginning that not all the best practices are applicable. Personally, I prefer to use the Do One Thing and Do It Well philosophy here, otherwise you will have to deal with external factors not related to C/VNF. Let's take tini tool as example, it has MIT license, rely on maintainers to fix CVEs, the project has slowed down its contributions, last change date on Jan 30, 2023 and last binaries were released on Apr 19, 2020. Do the C/VNF vendors need to deal with all of this? Shouldn't be better to enforce them to use the existing specialized init process? Who should need to be responsible to handle signals/zombies properly?

wvwatson commented 2 months ago

" And vendors can have their reasons to choose one of 2 different ways how to handle signals/zombies. From that point of view, I don't see this check as "essential"."

You are getting hung up on the word "essential". Essential doesn't mean required. It means heavily weighted in this test suite.

"I'm open to be convinced, but I haven't seen a strong reason so far."

The strong reason is the recommendation from the Google docs et al previously posted. Specialized init system is one of the ways to handle this which is recommended from multiple sources.

For these discussions we try to reference best practices from other upstream projects that have authority in the space, in this case a hyperscaler that hosts containers, when driving the test rationale. If you have some references on this subject feel free to add them to the discussion.

"As for number of tests: Making certification tests passing easier is not a motivation for initiating this discussion. Keeping a test in certification criterions just because keeping numbers seems to me as a weak reason."

There was a drive to reduce the weighting of microservice tests (specifically the single process type test) because it was a too difficult an area to pass. With multiple ways to increase the hygiene of the microservice (i.e. avoiding lift and shift) adding more pathways for participants to pass in a difficult category reduces the weighting. Again the essentials are about weighting not requirement.

kosstennbl commented 2 months ago

Correct me if i'm wrong in my logic, but i think that "specialized_init_system" is not a correct design of a test and it would be better if it was removed (or rather replaced, more on that below).

As i see it - best practice of using specialized init system shouldn't be centered around specific names of init systems, but rather around benefits, safeguards and functionality that they provide.

If that is correct - why do we test for the init system names, rather then for the actual points of the best practice?

By having set list of the "allowed" init systems - we overextend our trust to these systems and possibly undervalue any init system that could be compliant to all the best practices, but is "underweighted" just because it isn't in the list.

If there was a certification of some sort for init systems - this test could make more sense and be renamed to "certified_init_system", but, to my knowledge, there is nothing like that.

P.S. This is not a call to act. CNTI Certification and CNF-Testsuite have more significant issues to attend to, but to me - it seems that issues like this - can hurt CNTI in a long-term: directly and by having a precedent which could be used in other discussions.

wvwatson commented 2 months ago

"As i see it - best practice of using specialized init system shouldn't be centered around specific names of init systems, but rather around benefits, safeguards and functionality that they provide."

I cited multiple references where indeed, using specifc independently and openly developed and maintained "specializeed init systems" is recommended. Eg. Dummy-init is maintained by Yelp. Rolling your own is it's own strategy and not best for most, which is why the cited references call it an option. Writing a sophisticated supervisor, let alone a "lite" one is non-trivial and covered in the references and in the community.

Please cite a reference for why you say relying on specific, openly developed supervisors is not good to support your argument. I have provided multiple, and can provide more, saying it is a best practice.

martin-mat commented 2 months ago

bitnami container images, widely used and well designed, respected cloud-native authority, don't use specialized init system while still handling correctly signals and zombies.

electrocucaracha commented 2 months ago

bitnami container images, widely used and well designed, respected cloud-native authority, don't use specialized init system while still handling correctly signals and zombies.

I just checked their repo, they only have 2 of 335 images (gitlab-runner and jupyter-base-notebook) that uses specialized init system.

kosstennbl commented 2 months ago

@wavell

Please cite a reference for why you say relying on specific, openly developed supervisors is not good

I'm not saying that that is not good, but rather pointing out that it shouldn't be the main goal (and subsequently - best practice). I think that we should test the functionality that init system provides rather then if it's name is in the list.

I have provided multiple, and can provide more, saying it is a best practice

Let's go through the references, that you have provided:

https://cloud.google.com/blog/topics/developers-practitioners/graceful-shutdowns-cloud-run-deep-dive

Post about graceful shutdowns for containers. I agree, that this should be a best practice, and this best practice is not tied to specific init systems, we need to check if signals are being correctly trapped and shutdown is handled.

https://ahmet.im/blog/minimal-init-process-for-containers/

Post that contains list of the init systems, that an engineer have considered and found from Twitter for his needs. It's a nice info, but doesn't contribute much to the idea of having few systems as a best practice.

https://cloud.google.com/architecture/best-practices-for-building-containers

To me - strongest reference, document that contains a list of best practices for building containers:

Package a single app per container
Properly handle PID 1, signal handling, and zombie processes
Optimize for the Docker build cache
Remove unnecessary tools
Build the smallest image possible
Scan images for vulnerabilities
Properly tag your images And these are great practices that are already tested or should be tested for certification. Please notice that using specialized init processes is not one of them. In this page - using specialized init systems is a recommendation that is written as one of the options for solving signal handling.

And that last sentence - totally aligns with my position: I agree with having recommendation to use specific specialized init systems, but i don't agree with having it as a best practice. It should be a way of achieving compliance with the best practices, not a best practice itself.

P.S. I think we should be careful when using references to the blog posts as a ground truth. It surely can be good information from knowledgeable person, but i don't think that it should be weighted significantly more then comments from the people out there in the issue.

wvwatson commented 2 months ago

@wavell

Please cite a reference for why you say relying on specific, openly developed supervisors is not good

I'm not saying that that is not good, but rather pointing out that it shouldn't be the main goal (and subsequently - best practice). I think that we should test the functionality that init system provides rather then if it's name is in the list.

You should provide references for what you think.

Let's go through the references, that you have provided:

https://cloud.google.com/blog/topics/developers-practitioners/graceful-shutdowns-cloud-run-deep-dive

Post about graceful shutdowns for containers. I agree, that this should be a best practice, and this best practice is not tied to specific init systems, we need to check if signals are being correctly trapped and shutdown is handled.

I see that you skipped over this from the reference: "Similarly, if you use an entrypoint script to kick off background processes in your containers, consider using a proper init process that can forward signals to child processes, such as tini, dumb-init or supervisord. (I have compared init process alternatives for the multi-process container use case here.)"

https://ahmet.im/blog/minimal-init-process-for-containers/

Post that contains list of the init systems, that an engineer have considered and found from Twitter for his needs. It's a nice info, but doesn't contribute much to the idea of having few systems as a best practice.

The engineer is the same engineer that writes for, and is employed by, Google cloud on the other post ... He's not "just an engineer".

https://cloud.google.com/architecture/best-practices-for-building-containers

To me - strongest reference, document that contains a list of best practices for building containers:

Package a single app per container

Properly handle PID 1, signal handling, and zombie processes

Optimize for the Docker build cache

Remove unnecessary tools

Build the smallest image possible

Scan images for vulnerabilities

Properly tag your images And these are great practices that are already tested or should be tested for certification. Please notice that using specialized init processes is not one of them. In this page - using specialized init systems is a recommendation that is written as one of the options for solving signal handling.

Let's quote it

"Solution 3: Use a specialized init system

As you would in a more classic Linux environment, you can also use an init system to deal with those problems. "

It's quoted as a solution.

And that last sentence - totally aligns with my position: I agree with having recommendation to use specific specialized init systems, but i don't agree with having it as a best practice. It should be a way of achieving compliance with the best practices, not a best practice itself.

It's time for you to post your references

P.S. I think we should be careful when using references to the blog posts as a ground truth. It surely can be good information from knowledgeable person, but i don't think that it should be weighted significantly more then comments from the people out there in the issue.

See above reference about the "engineer"

kosstennbl commented 2 months ago

@wavell I see that this discussion is turning into the circle.

About the references: I'm saying that your references actually prove my point more than yours. On all pages - usage of specialized init systems is a way to achieve best practice, not a best practice itself.

About the Mr. Ahmet Alp Balkan, author of the posts: I'm not undervaluing him as an engineer, I'm only saying that we should be careful by taking opinions of a single person as a ground truth and as a final argument. (Even through in this case - i agree with correctness and usefulness of the references provided).

I have little interest in continuing this discussion, as i see a need of third party intervention for us to find common ground on this topic and, in my sight, the problem is not important and critical enough to spend more time debating about it there.

taylor commented 2 months ago

From the Google's Best practices for building containers, they list 2 problems that are being addressed with a specialized init system:

Problem 1: How the Linux kernel handles signals
Problem 2: How classic init systems handle orphaned processes

They list 2 other solutions.

Solution 1: Run as PID 1 and register signal handlers
Solution 2: Enable process namespace sharing in Kubernetes

The first solution, handling the signals, does not deal with the second problem of orphaned processes. The special init system handles this issue. The second solution also addresses both problems.

When you enable process namespace sharing for a Pod, Kubernetes uses a single process namespace for all the containers in that Pod. The Kubernetes Pod infrastructure container becomes PID 1 and automatically reaps orphaned processes.

taylor commented 2 months ago

Continuing from the document regarding the specialized init system, they have the following.

As you would in a more classic Linux environment, you can also use an init system to deal with those problems. However, normal init systems such as systemd or SysV are too complex and large for just this purpose, which is why we recommend that you use an init system such as tini, which is created especially for containers.

If you use a specialized init system, the init process has PID 1 and does the following:

Registers the correct signal handlers.

Makes sure that signals work for your app.

Reaps any eventual zombie processes.

Tini is recommended by many larger projects, providers and experts which is why it keeps making it to the list. For example docker mentions it all over including in their discussion about the init system in containers here https://github.com/docker-library/official-images#init

It is also recommended to use an init system if your container is

running multiple processes (eg. sub-processes, children, etc)
single process does not handle signals (eg. SIGTERM is ignored)

It seems like there could be some check for

a single process running (only PID 1)
the process handles signals

in which case an init system would not be needed. That said maybe the test runs when there are no child processes but later in the lifecycle a child process starts... how do we know?

My understanding of the objection to this test is that it is only checking for a list of validated "good" init systems in this test is that the test is not all-inclusive of init systems that meet the recommendations for a minimal init system that meets the requirements listed above.

Sometimes testing functionality, and application qualities/attributes is difficult to implement or the test coverage area is open-ended and ambiguous. That can happen because the current landscape is large or it is actively growing/changing). When that is the case and time/resources/priorties do not support implementing this type of more extensive and open ended testing a short cut is used: find pre-validated options and use those as a reference list of good choices. This is what happened with this test.

A few items that the test coverage to allow any init system would include the items previously listed:

Registers the correct signal handlers.
Makes sure that signals work for your app.
Reaps any eventual zombie processes. but it also needs to validate that the init system is not too complex. What would that testing look like? How would you check for the complexity seen in systemd? You could create a list of "bad init" systems. We are just going the other way though.

I would love to see a set of tests that can let us keep this flexible enforce the best practices for containers.

In the meantime if there is an application that does not pass this test, that we care about for passing, then I would like to examine the application and see how we can improve the test or the application. For example if a CNF uses an unlisted open source init system that could be considered minimal and handles the init requirements for a container then we can add that init system to the list.

If the init system is closed source and unknown... well I do not know what to do regarding this test when we have no other tests validating if it is minimal (not complex) and handling all requirements.

The single process situation should be handled separately from multiple processes IMO

tomkivlin commented 2 months ago

Reading this, it feels that it's not about whether a specialised init system is a best practice or not (I think everyone in the thread agrees it is), but a measure on the importance of that best practice.

It feels like this is a more generic question: how do we guage whether a best practice is essential or not? What are the metrics by which we define a best practice as essential?

Feels like once we have such a definition, we can then discuss the original point which wasn't whether this is a best practice or not, but whether it is essential or not.