operator-framework / operator-lib

This is a library to help Operator developers
Apache License 2.0

Can't run operators locally, when using the conditions package #50

Closed nunnatsa closed 2 years ago

nunnatsa commented 3 years ago

Bug Report

The NewCondition method (and others) indirectly calls the readSAFile method, which uses a hard-coded path for the operator namespace file: https://github.com/operator-framework/operator-lib/blob/c0ba7dcbdce825f752802c871a0b52a37bfbf135/internal/utils/utils.go#L29

This implementation blocks the ability to run the operator locally for development and debugging.

Please consider one of the following, or both (preferred):

  1. Replace the hard-coded path with an environment variable that defaults to the current value.
  2. Use an environment variable to store the namespace, and use the file only if this variable is not set.

Environment

varshaprasad96 commented 3 years ago

Hi @nunnatsa, thanks for looking into this. The PR does solve this issue. However, the main use case of this library is to provide helpers to interact with the Operator Conditions created by OLM. This means the operator is being run on a cluster that has OLM installed. If not, we would not be able to get the name of the OperatorCondition CR, which is set by OLM. There is a warning about this in the GetNamespacedName godoc. Could you share how you are testing this locally? If you are running in a dev environment without OLM, there shouldn't be an OperatorCondition CR created.

estroz commented 3 years ago

Relevant PR comment: https://github.com/operator-framework/operator-lib/pull/42#discussion_r528733678

nunnatsa commented 3 years ago

Thank you @varshaprasad96 for the detailed response.

As an operator developer, I need to be able to test my code and to debug it. I have many options to do that: for example, running against a real cluster, or running a unit test. In a cluster I can, for example, manually deploy the OperatorCondition CRD and then the CR, even without a running OLM.

The thing is that with the current implementation, it is harder to write testable code.

estroz commented 3 years ago

The enhancement for this lib explicitly says that it assumes, and therefore was developed assuming, that OLM is installed. Additionally, the CRD enhancement doesn't make an explicit goal of using it without OLM, although this is possible, I guess. Regardless, the intent of this library is specifically to be used with OLM. Out of curiosity, why would you be using the OperatorCondition CRD without OLM?

If you really want to, you still can use GetNamespacedName() for your use case like so:

// Set in test code.
var (
    testNS       = "foo"
    testCondName = "bar"
)

func (r *MemcachedReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
    ...

    condKey, err := GetNamespacedName()
    if err != nil {
        // Errors pertaining to running outside of a cluster.
        condKey = &types.NamespacedName{Name: testCondName, Namespace: testNS}
    }

    ...

}
nunnatsa commented 3 years ago

The enhancement for this lib explicitly says that it assumes, and therefore was developed assuming, that OLM is installed.

But the code that uses it still needs to be tested and debugged.

Additionally, the CRD enhancement doesn't make an explicit goal of using it without OLM, although this is possible, I guess.

There is no point in using this library without OLM, but one still needs to test and debug the code that uses it.

Regardless, the intent of this library is specifically to be used with OLM. Out of curiosity, why would you be using the OperatorCondition CRD without OLM?

Yes! When I run a unit test of the function that calls NewCondition. I can't do that using the current implementation.

If you really want to, you still can use GetNamespacedName() for your use case like so:

// Set in test code.
var (
  testNS       = "foo"
  testCondName = "bar"
)

func (r *MemcachedReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
  ...

  condKey, err := GetNamespacedName()
  if err != nil {
      // Errors pertaining to running outside of a cluster.
      condKey = &types.NamespacedName{Name: testCondName, Namespace: testNS}
  }

  ...

}

This code is a mix of test code and production code. There is no way to add a call to NewCondition without getting an error, so all tests that reach this point will exit with an expected error, and there will be no way to test what happens next.

estroz commented 3 years ago

But the code that uses it still needs to be tested and debugged.

Yes, and the suggested way to handle running the tests locally is to skip conditions-related code when an error is returned.

Yes! When I run a unit test of the function that calls NewCondition. I can't do that using the current implementation.

What I don’t understand is what you’re trying to test when you’re running your operator locally and trying to write to the conditions CR. In this scenario, OLM either isn’t running or doesn’t know about your operator if it is. Therefore you’re only testing if you can write to the condition CR, which isn’t really useful; in fact this test is effectively already being performed in the library’s unit tests.

This code is a mix of test code and production code. There is no way to add a call to NewCondition without getting an error, so all tests that reach this point will exit with an expected error, and there will be no way to test what happens next.

Yes it is a gross mix, but it does let you set the in-cluster condition CR and namespace if you really want to go down that path. However I don’t actually recommend using the code I suggested for the reason I gave above. You shouldn’t need to test writing operator conditions in non-cluster scenarios.

nunnatsa commented 3 years ago

Let me try to explain the issue again.

The operator we're working on does many things besides setting the operator condition. The code that does that is part of a much larger codebase. In order to run unit tests we must mock the cluster; we do this to mock reading, writing, and updating many types of objects.

But using this package as it is implemented now will break any unit test we try to write. We must simulate the running environment for tests; we can't skip it, because this operator eventually needs to run in a cluster.

Sorry to say it, but this is a real issue that can't be dismissed with "it's written in the godoc". You don't have to use the suggested solution, but please understand that this is not something we can skip.

estroz commented 3 years ago

I understand the issue you’re posing, and I think because the solution isn’t clear we need to improve the UX of this library, in particular handling the local/testing case.

I think the right solution is to create a test Condition that you can substitute in unit tests with a function variable, ex:

// controllers/kind_controller.go

var newCondition = conditions.NewCondition

func (r *MemcachedReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
    ...

    cond, err := newCondition(client, SomeCondType)
    if err != nil {
        // This error will never happen in tests.
    }

    ...
}
// controllers/kind_controller_test.go

func TestReconciler(t *testing.T) {
    newCondition = conditions.NewFakeCondition

    // Run reconciler tests.
}

That would solve your problem, correct?

erdii commented 3 years ago

How would I run this out-of-cluster (e.g. when developing on my machine)?

nunnatsa commented 3 years ago

This is how we can run an operator from the IDE: https://github.com/kubevirt/hyperconverged-cluster-operator/blob/master/docs/run-locally.md

thetechnick commented 3 years ago

The issue is not only with testing, but also with running the operator from a local machine for debugging. Due to the filesystem lookup, an operator always has to run in the container filesystem, so something like Telepresence becomes mandatory for development.

OLM itself is already injecting environment variables for a few things and I don't see a rationale behind reading the namespace from file instead.

Kubernetes even has a downward API to make this task trivial:

        - name: MY_POD_NAMESPACE
          valueFrom:
            fieldRef:
              fieldPath: metadata.namespace

https://kubernetes.io/docs/tasks/inject-data-application/environment-variable-expose-pod-information/

I also don't think that having a FakeCondition to substitute would help; it doesn't address the portability issue during development. There is an argument to be had about not asserting on a change to the OperatorCondition object, but instead just mocking the call and asserting that a call has been made; still, I don't think the API is complicated enough to require this library to ship the stub.

jmrodri commented 3 years ago

@nunnatsa From your original comment, the reason the directory is hard-coded is that this is where it will live in a Kubernetes environment. If you are running outside of the cluster, that location will not exist. This is true when operator-lib is being unit tested outside the cluster. For example, with the leader code, we made readNamespace a variable holding a function: https://github.com/operator-framework/operator-lib/blob/main/leader/leader.go#L41. This allows it to be set for testing, like this: https://github.com/operator-framework/operator-lib/blob/main/leader/leader_test.go#L83-L85

I'm open to other solutions but the hardcoded path is where the cluster will put the namespace information. In the next comment I'll outline a few thoughts.

How are you trying to test your operator? Unit tests? Or are you trying to run it locally, as @thetechnick mentioned?
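
The function-variable pattern referenced above can be sketched as follows (a simplified stand-in, not the actual leader.go code):

```go
package main

import (
	"fmt"
	"os"
)

// readNamespace is a package-level variable holding a function, as in
// leader.go, so tests can replace it without touching the filesystem.
// The path here is the standard service-account mount; the rest is a sketch.
var readNamespace = func() (string, error) {
	b, err := os.ReadFile("/var/run/secrets/kubernetes.io/serviceaccount/namespace")
	if err != nil {
		return "", err
	}
	return string(b), nil
}

func main() {
	// In a test, override the variable with a stub instead of reading a file.
	readNamespace = func() (string, error) { return "testns", nil }
	ns, err := readNamespace()
	if err != nil {
		panic(err)
	}
	fmt.Println(ns) // testns
}
```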

jmrodri commented 3 years ago

@estroz @varshaprasad96 I think we could add an Option to the NewCondition similar to what we did with Become in the leader code: https://github.com/operator-framework/operator-lib/blob/main/leader/leader.go#L88-L97

Basically if we change NewCondition to have an Option then callers can set their own information for things to override and make it testable. Something along the lines of:

func NewCondition(cl client.Client, condType apiv1.ConditionType, opts ...ConditionOption) (Condition, error) {
    ...
    for _, opt := range opts {
        if err := opt(&condition); err != nil {
            return nil, err
        }
    }
    ...
}

Or we could make GetNamespacedName a variable so that it can be set from a test.

var GetNamespacedName = getNamespacedName

func getNamespacedName() (*types.NamespacedName, error) {
...
}

func NewCondition(cl client.Client, condType apiv1.ConditionType) (Condition, error) {
    objKey, err := GetNamespacedName()
    ...
}

From a test I should be able to override that GetNamespacedName.

nunnatsa commented 3 years ago

@nunnatsa From your original comment, the reason the directory is hard-coded is that this is where it will live in a Kubernetes environment. If you are running outside of the cluster, that location will not exist.

And this is the issue, exactly.

This is true when operator-lib is being unit tested outside the cluster. For example, with the leader code, we made readNamespace a variable holding a function: https://github.com/operator-framework/operator-lib/blob/main/leader/leader.go#L41. This allows it to be set for testing, like this: https://github.com/operator-framework/operator-lib/blob/main/leader/leader_test.go#L83-L85

It won't work, because NewCondition calls utils.GetOperatorNamespace in the same way you described, but that is a local (private) variable I can't access.

I'm open to other solutions but the hardcoded path is where the cluster will put the namespace information. In the next comment I'll outline a few thoughts.

Please look at #51 for a suggestion. I don't know why it was closed. Using an environment variable has always been the easiest way to deliver data to pods.

How are you trying to test your operator? unit tests? Or is it trying to run it locally like @thetechnick mentioned.

Both.

nunnatsa commented 3 years ago

@estroz @varshaprasad96 I think we could add an Option to the NewCondition similar to what we did with Become in the leader code: https://github.com/operator-framework/operator-lib/blob/main/leader/leader.go#L88-L97

Basically if we change NewCondition to have an Option then callers can set their own information for things to override and make it testable. Something along the lines of:

func NewCondition(cl client.Client, condType apiv1.ConditionType, opts ...ConditionOption) (Condition, error) {
    ...
    for _, opt := range opts {
        if err := opt(&condition); err != nil {
            return nil, err
        }
    }
    ...
}

Or we could make GetNamespacedName a variable so that it can be set from a test.

var GetNamespacedName = getNamespacedName

func getNamespacedName() (*types.NamespacedName, error) {
...
}

func NewCondition(cl client.Client, condType apiv1.ConditionType) (Condition, error) {
    objKey, err := GetNamespacedName()
    ...
}

From a test I should be able to override that GetNamespacedName.

Why not use the existing namespace environment variable? Or inject it if it's missing? Reading from a hard-coded path seems a very fragile way to do things. I still think the solution in #51 gives both of us what we want.

thetechnick commented 3 years ago

@jmrodri Why would you rather implement override options than just having the environment tell what the environment looks like?

That is also why controller-runtime first checks $KUBECONFIG and ~/.kube/config before falling back to in-cluster configuration when no other clue is available. Portability matters, because we can't develop everything in the exact environment that it will be deployed to in production.

estroz commented 3 years ago

I don't think I've seen a (satisfying) answer to my original question: why do you need to use the condition CRD if you're running locally or testing? "My unit tests will fail if I don't" is a reason to mock these calls with a local-only Condition, not a reason why writing to a live conditions CR in-cluster is necessary in either case.

OLM itself is already injecting environment variables for a few things and I don't see a rationale behind reading the namespace from file instead.

Why not use the existing namespace environment variable? Or inject it if it's missing? Reading from a hard-coded path seems a very fragile way to do things.

@thetechnick @nunnatsa the rationale is that this file is present in every pod's file system, and it is the recommended way to get the pod's namespace for accessing namespaced resources from within the pod. It also avoids having to configure the downward API for every operator Deployment.

I also don't think that having a FakeCondition to substitute would help; it doesn't address the portability issue during development. There is an argument to be had about not asserting on a change to the OperatorCondition object, but instead just mocking the call and asserting that a call has been made; still, I don't think the API is complicated enough to require this library to ship the stub.

@thetechnick can you elaborate on why a fake Condition is not portable? If you depend on an OperatorCondition being updated by OLM (which isn't discussed in the EP so I don't think this happens) then we either need to implement a fake Condition that can replicate that behavior (as I've described below) or you need to write in-cluster tests with OLM running. @awgreene @ecordell could you weigh in here?

@erdii @thetechnick perhaps calling the condition a "fake" is incorrect, since fake implies usage only during testing. You could easily use the fake Condition when running locally for debugging. WDYT about something like NewLocalCondition(), which returns a Condition that tracks all status updates?

nunnatsa commented 3 years ago

I see I failed to explain why this is needed. It's a bit hard to explain the obvious. So let's go the other way around: why is it so problematic to add it? As a user of this package, I am telling you that I need a way to override the default namespace source. What is the reason not to allow that?

estroz commented 3 years ago

why is it so problematic to add it

Because it is a poor design decision. The whole reason a Condition interface exists is so that you and I can implement a fake constructor. Assuming your intended solution is implemented, my unit tests now require that a conditions CR exist in some test namespace, so I have to download the conditions CRD and then install it in my cluster just for unit testing. This workflow doesn't make sense to me, unless your operator uses the CRD for some functional reason other than writing messages to OLM, in which case this library is not meant for your operator.

As a user of this package I am telling you that I need a way to override the default namespace source

Yes, I understand that. I'm just trying to work with you on a compromise that has a better interface and doesn't require external setup. Are you saying your tests cannot, for some good reason, use a fake/local Condition? Can you elaborate on that reason?

what is the reason not to allow that?

See above.

thetechnick commented 3 years ago

Maybe to step a little bit back here and start from the beginning:

I personally prefer to write tests asserting on the desired outcome, so it doesn't matter which custom code or utility library is used in the implementation to achieve the goal. This way it is far more likely that the test can be reused when refactoring the implementation itself.

It is completely valid to stub this library, use a FakeCondition and just assert that functions are called when expected. But there is a catch: with every stubbed/mocked dependency, the test case needs to make more and more rigid assumptions about call order, exact arguments and API usage. In my experience this makes test cases very brittle, as implementation changes during refactoring invalidate most of the test case.

As a software maintainer, I would like to have the choice to either mock, stub and fake my way through tests or run the actual code to verify functionality.

This also extends to running the operator outside of a cluster and without OLM; sometimes I need to verify manually that stuff works as intended, in environments that are not identical to the full-blown production system. I am required to make these trade-offs constantly, or stuff will never get done.

If I need to quickly validate that my operator is working as intended on a duct-taped raspberry pi powered by a box of potato batteries, because that's expedient at the moment, I don't want to be held back by a hard coded file path in 20 lines of library code.

estroz commented 3 years ago

But there is a catch: with every stubbed/mocked dependency, the test case needs to make more and more rigid assumptions about call order, exact arguments and API usage. In my experience this makes test cases very brittle, as implementation changes during refactoring invalidate most of the test case.

In general I agree with you. This particular case is interesting though, because you are effectively writing to a list of objects (a conditions CR's status) that is not written to by any other entity than the operator. The conditions CRD status is that simple. Therefore a mock that has an underlying list of conditions would almost exactly replicate in-cluster behavior. Now you could argue that k8s guarantees don't apply so the mock can introduce bugs, but I counter that with: you should be writing e2e tests too, since unit tests are not intended to extensively test in-cluster scenarios.

As a software maintainer, I would like to have the choice to either mock, stub and fake my way through tests or run the actual code to verify functionality.

Ultimately I agree with this statement, and therefore it is probably ok to expose some way to set a namespace as well as provide a mock Conditions constructor. My main issue with doing both is that providing two very different ways to test something is confusing to people, which can be somewhat remedied by documentation saying "use this namespace override only if you know what you're doing, otherwise use this mocker".

estroz commented 3 years ago

If namespace injection is going to be added, I'd also like to see injection happen only when running locally, i.e.

const OperatorNamespaceLocalEnv = "OPERATOR_NAMESPACE_LOCAL"
const podNamespaceFilePath = "/var/run/secrets/kubernetes.io/serviceaccount/namespace"

if _, err := os.Stat(podNamespaceFilePath); err == nil {
    return readNSFile()
} else if localNSValue, isSet := os.LookupEnv(OperatorNamespaceLocalEnv); isSet {
    return localNSValue, nil
} else {
    return "", ErrNamespaceNotSet
}

This disallows using this environment variable in production, which prevents incorrect usage and confusion IMO.

thetechnick commented 3 years ago

@estroz That would be all I need :)

I already wanted to submit a PR to add this, but falling back to the env lookup on error looks quite fragile to me. Unfortunately, Go does not abstract errors across environments (Linux, OS X, Windows?) very nicely; also, there are probably a few cases that need special handling:

And any other odd error that could come up in this context...

I completely understand, that you don't want anyone ever to use this in prod, so would you mind making this dead simple? Ceph's --yes-i-really-mean-it cli flag springs to mind :D

const OperatorNamespaceLocalEnv = "OPERATOR_NAMESPACE_OVERRIDE_NEVER_USE_IN_PROD"
const podNamespaceFilePath = "/var/run/secrets/kubernetes.io/serviceaccount/namespace"

if localNSValue, isSet := os.LookupEnv(OperatorNamespaceLocalEnv); isSet && len(localNSValue) > 0 {
    return localNSValue, nil
}
return readNSFile()
joelanford commented 3 years ago

I know I'm really late to this, but I'm the one that had opinions about making this interface less "magic". In pre-1.0 versions of operator-sdk, we supported something similar via environment variables. If I remember correctly, we supported:

  1. OPERATOR_NAMESPACE: to lookup the namespace the operator is running in
  2. OPERATOR_SDK_RUN_LOCAL: to tell us whether we were running locally or not.

Both of these struck me as odd and hard for users to get right.

  1. OPERATOR_NAMESPACE needed to be set explicitly on the operator pod even though it is trivial to read this from the cluster (let me hand wave over use cases like testing for a moment). This is easy to miss or get wrong.
  2. OPERATOR_SDK_RUN_LOCAL just seemed bad all around. What if I set that variable when running in a cluster? It seems incongruous to have an environment variable tell the process something that it could very likely determine on its own.

So with that context in mind, I think our goal for this problem should be:

  1. By default this library should read the namespace from the mounted service account metadata. This is what all end-users expect.
  2. When it is necessary for other use cases (e.g. testing), this library should support an injection point that enables the test to provide a namespace rather than read it from the cluster.
  3. There should be no environment variable supported by this library that conveys where an operator is running.

I think 2) is what we're all discussing. To avoid footguns around the "magic" caused by environment variable side effects we experienced with this in SDK pre-v1, I would propose that we make it possible for library users (not end users) to inject custom namespaces via a code API.

The simplest way to do this would be to make this an exported variable. However, since that's a global variable, it would not be thread safe in concurrent testing scenarios, so I would be against that approach.

A more complex, but improved, solution would be for us to add a Factory interface that can build arbitrary Condition objects, and we can provide an InClusterFactory struct that would use the in-cluster implementation.

If code under test is aware only of the Factory abstraction, then authors can swap implementations in and out as necessary to test various aspects of their operator.

These ideas are very similar to the ones proposed by @jmrodri here, and I think something like this is the path we should ultimately take.

Here is a PR I posted with this idea implemented: #58
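
A rough sketch of what that Factory shape could look like (all names here are illustrative, not the API from the PR):

```go
package main

import "fmt"

// Condition is a minimal stand-in for the library's Condition interface.
type Condition interface {
	Set(status string) error
}

// Factory builds Condition objects; tests can supply their own implementation.
type Factory interface {
	NewCondition(condType string) (Condition, error)
}

// fakeCondition records status updates in memory, suitable for unit tests.
type fakeCondition struct {
	condType string
	history  []string
}

func (c *fakeCondition) Set(status string) error {
	c.history = append(c.history, status)
	return nil
}

// FakeFactory builds in-memory conditions with no cluster access.
type FakeFactory struct{}

func (FakeFactory) NewCondition(condType string) (Condition, error) {
	return &fakeCondition{condType: condType}, nil
}

// reconcile depends only on the Factory abstraction, so tests can swap
// in FakeFactory while production wires an in-cluster factory.
func reconcile(f Factory) error {
	cond, err := f.NewCondition("Upgradeable")
	if err != nil {
		return err
	}
	return cond.Set("True")
}

func main() {
	if err := reconcile(FakeFactory{}); err != nil {
		panic(err)
	}
	fmt.Println("reconcile ok")
}
```

Production code would construct an in-cluster implementation of Factory at startup, while unit tests pass FakeFactory; the reconciler never needs to know which one it received.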

openshift-bot commented 3 years ago

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close. Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

openshift-bot commented 3 years ago

Stale issues rot after 30d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle rotten. Rotten issues close after an additional 30d of inactivity. Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle rotten /remove-lifecycle stale

nunnatsa commented 3 years ago

@joelanford - this is still a blocker for the implementation of operator condition.

nunnatsa commented 3 years ago

/remove-lifecycle rotten

jmrodri commented 3 years ago

@nunnatsa thanks for staying on top of this. We'll re-prioritize this issue.