open-policy-agent / conftest

Write tests against structured configuration data using the Open Policy Agent Rego query language
https://conftest.dev
Other
2.86k stars 304 forks source link

multi-document YAML file issue #106

Closed gwkunze closed 4 years ago

gwkunze commented 5 years ago

Since 0.13 conftest has changed how it handles multi-document YAML documents, it behaves somewhat similar to how conftest works with the --combine-config flag, instead of passing each document individually through the polices, it sends a single array of documents through them.

I definitely understand why this functionality was changed as it now allows you to enforce policies across multiple documents in the yaml. It has had the unfortunate effect of breaking all my policies though, instead of testing for input.metadata.label.foo I now have to test input[x].metadata.label.foo but this is surmountable and to be expected for a major version 0 application.

The problem I ran in to was with testing helm charts, one helm chart in particular yields only a single resource in certain cases, and there doesn't create a multi-document yaml file when run through helm template which in turn means input is sometimes an array of Kubernetes objects and sometimes it is a single object.

It's very hard to write policies for these documents as they can arbitrarily change format. Perhaps, YAML files should always yield an array of documents, even if there is only one?

garethr commented 5 years ago

Mmm, that sounds like a bug rather than an intentional change, but probably wants some closer thought about the separate usecases:

jpreese commented 5 years ago

Do you have an example policy and yaml file that broke behavior for you @gwkunze ? I'd like to get a concrete example and toss it in a test to make sure we get that fixed / make a call on how to implement.

If I recall in 0.13, even multi-file yamls still execute the policy against a single file. In other words, if you have a yaml with a --- separator, each separated yaml is validated individually and you should be able to use input.metadata

You can see that scenario here multi-file test which uses a policy that uses a single inputhere

To add onto @garethr's comment, I was in the same boat with how to represent multi-yaml policies. A multi-yaml policy is a single physical document, but really multiple files. So we sort of have to answer, is a multi-yaml file a single input, or multiple inputs.

I would argue the former, and keep input meaning a document on disk. input[x] means multiple documents on disk. Especially having introduced the ability to pass in multiple files at once.

gwkunze commented 5 years ago

Just a simple example with the following policy:

policy/main.rego

package main

warn[msg] {
  msg = json.marshal(input)
}

and the following two documents: single.yaml

foo: bar

multi.yaml

foo: bar
---
baz: ban

gives the following output:

$ conftest test single.yaml
WARN - single.yaml - {"foo":"bar"}
$ conftest test multi.yaml
WARN - multi.yaml - [{"foo":"bar"},{"baz":"ban"}]
gwkunze commented 5 years ago

Forgot to add:

$ conftest --version
Version: 0.13.0
Commit: 8340065
Date: 2019-09-27T22:40:00Z
jpreese commented 5 years ago

@gwkunze thanks for the examples and the version! Could you try with conftest 0.14?

If its still problematic for you, I'll look into this today.

gwkunze commented 5 years ago

Ahh no, 0.14 it behaves as like it did before 0.13 again:

$ conftest test multi.yaml
WARN - multi.yaml - {"foo":"bar"}
WARN - multi.yaml - {"baz":"ban"}
gwkunze commented 5 years ago

but with --combine-config you do get the slightly inconsistent behaviour:

$ conftest test single.yaml --combine-config
WARN - Combined-configs (multi-file) - {"single.yaml":{"foo":"bar"}}
$ conftest test multi.yaml --combine-config
WARN - Combined-configs (multi-file) - {"multi.yaml":[{"foo":"bar"},{"baz":"ban"}]}
jpreese commented 5 years ago

I think the behavior is intentionally inconsistent. I honestly didn't follow much of the discussion around the --combine-config flag, but the README does point out that using this flag is a breaking change to how its used. So to me, difference of behavior would be expected.

For clarity, v0.14 did fix your issue though, yes?

gwkunze commented 5 years ago

v.014 fixed the issue when not using the combine-configs flag, however --combine-config has the (in my opinion, wrong) behaviour I originally described.

When using conftest against a yaml file (for example, one generated by the helm template command) conftest (with --combine-config) parses files differently depending on whether it's a single or multi document yaml, yielding a object or array of objects respectively.

You could technically get around it by having a rule like:

warn[msg] {
  is_array(input[file])
  warning[msg] with input as input[file][_]
}

warn[msg] {
  not is_array(input[file])
  warning[msg] with input as input[file]
}

(and similar rules for deny), but you'd have to give all your rules the different name (warning and whatever you pick for deny rules) and it'd still be somewhat ambiguous as the following two different documents would yield the same input in conftest:

foo: bar
---
baz: ban
- foo: bar
- baz: ban

Although, such wildly different schemas would be a lot less likely scenario and shouldn't really happen in the real world.

What it really comes down to is that the multi-document yaml format doesn't cleanly map to conftest's internal model.

Resolving this is tricky, I think the best solution would probably be to consider a multi-document yaml file as actually being multiple files, so you'd something like the following:

Without --combine-configs the output would look something like:

(without --combine-configs)

[
    {
        "filename": "multi.yaml/doc-1",
        "Warnings": [
            "{\"foo\":\"bar\"}"
        ],
        "Failures": [],
        "Successes": []
    },
    {
        "filename": "multi.yaml/doc-2",
        "Warnings": [
            "{\"baz\":\"ban\"}"
        ],
        "Failures": [],
        "Successes": []
    }
]

(note the suffixes in the filename)

For --combine-configs the input could then look like:

{
    "multi.yaml/doc-1": {"foo": "bar"},
    "multi.yaml/doc-2": {"baz": "ban"}
}
gwkunze commented 5 years ago

This is specific to helm, but the helm template command generates a comment for each sub document, something like # Source: <filename> which is the original filename the document came from. I believe kubeval already respects this as the original filename.

jpreese commented 5 years ago

Yeah I've definitely contemplated whether or not multi-file yamls should be multiple inputs or not. At least for 0.14, I wanted to keep the behavior consistent so things didn't explode.

My gut leans towards multi-file yamls being a single input, and combine-config, obviously, being multiple inputs. I think that is easy to explain and reason about.

I don't really like the idea of having to differentiate in my Rego syntax between single file and multi-file yaml configs, because with things like kustomize you might not know until build time, and the policies won't change.

nicolasbernard commented 5 years ago

I think multi-doc support is mandatory and multi-file is usefull (I want to check that each deployment has a pdb for example).

The easy way is to standardize the worst case: input[file][index]. file is the filename, single or multi files, don't matter, we can even remove the combine flag. index is the document index in multi-doc file, always 0 in mono-doc files.

pro: handle every needs. con: more complicated for beginners and simple needs.

I think it's an UX debate.

gwkunze commented 5 years ago

@nicolasbernard yes, that makes sense, but since conftest has different behaviour for different file formats (hcl, json, yaml) anyway, it might be worth considering only doing this for files that support multi-document, which as far as I know is currently only YAML.

nicolasbernard commented 5 years ago

I don't know, it opens new formats, like jsonl and you can have a static 0 for mono-doc formats.

What about combining multiples files in differents formats? (Seems a nightmare to me and I havn't usecases ;))

garethr commented 5 years ago

I think the problem with:

consider[ing] a multi-document yaml file as actually being multiple files

Is that we lose information that may be relevant in some cases. I'd err on the side of flexibility I think with conftest being such a general purpose tool.

gwkunze commented 5 years ago

An option, which would be quite backwards incompatible is to change the format of combine-config's functionality, turning the input into an array instead of an object gives you more flexibility:

[
  {
    "filename": "foo.json",
    "contents": {"foo": "bar"}
  },
  {
    "filename": "multi.yaml",
    "document": 0,
    "contents": [1,2,3]
  },
  {
    "filename": "multi.yaml",
    "document": 1,
    "contents": {"baz": "ban"}
  }
]

when working with inputs like this you could still do queries for all documents in a file:

deny[msg] {
    input[index].filename == "multi.yaml"
    input[index].contents[1] == 2
}

but the object representing the file could also contain other fields that might be interesting, for example the type of the file (if it's not inferred from the extension), if a file format supports multiple ways of deserialization you might provide both, you could even version deserialization formats and allow the policy to choose either. Other metadata could also be added, either by command line or inferred like last modification dates and other filesystem properties.

This is a rather significant change though...

gwkunze commented 4 years ago

In the interest of reducing the cognitive load of reading this entire thread here is a summary of what's discussed in this rather long thread:

Summary

Conftest currently has inconsistent behaviour dealing with multi-document yaml files which only crops up when using the --combine flag. In non---combine-mode it treats each individual document in a yaml file as if it was a separate single-document file. i.e. conftest test multi-doc.yaml is the same as conftest test doc1.yaml doc2.yaml ... docn.yaml (assuming all the doc#.yaml files make up multi-doc.yaml).

In --combine mode, the input is an object where the keys are the filename and the values are the parsed contents of the file, here is where the problem appears: in the case of a single-document yaml the document is parsed and made the value of the file's key but in a multi-document yaml, the yaml object is suddenly parsed as an array of objects, i.e. the following document:

foo: bar
baz: ban
---
foobar: bazban

results in the following input document:

{
  "multi-doc.yaml": [
    {
      "foo": "bar",
      "baz": "ban"
    },
    {
      "foobar": "bazban"
    }
  ]
}

but if the second document were to be removed from the yaml, suddenly the object is no longer an array and just:

{
  "multi-doc.yaml": {
    "foo": "baz",
    "baz": "ban"
  }
}

It is impossible to write policies that handle either case consistently.

Solutions

Treat all file formats as multi-document (ref)

Treat all file formats regardless of whether they support multi-document content as multi-documents. This would mean the input document is always an object where the keys are filenames and the values are arrays of one or more parsed documents. Probably the simplest solution but requires all (--combine) policies to always take an extra array index operation into account.

Always treat yaml files as multi document (ref)

Here a parsed yaml file would always be treated as a multi-document yaml file and therefore always parsed as an array of objects. This is a relatively simple solution to the problem. It does have the problem of being slightly more complicated for users. Also an example where this can lead to problems is with kubernetes manifests, which can be JSON or YAML and you'd still run into the original problem where JSON is treated as a single document while YAML is an array of documents (Technically you could differentiate based on the extension of the filename but that feels rather anti-patternish).

Alter the input format to support multi-document formats (ref)

Change the input document format to treat documents as the basic element instead of files. By changing it to something like:

[
  {
    "filename": "foo.json",
    "type": "json",
    "contents": {"foo": "bar", "baz": "ban"},
    "document": 0
  },
  {
    "filename": "multi-doc.yaml",
    "type": "yaml",
    "contents": {"foo": "bar", "baz": "ban"},
    "document": 0
  },
  {
    "filename": "multi-doc.yaml",
    "type": "yaml",
    "contents": {"foobar": "bazban"},
    "document": 1
  }
]

Allows policies to ignore the fact whether a document is multi-document or not, you just get documents via input[x].contents instead of input[y] where y is the filename and x is a document-index across all passed-in files, multi-document or not. The filename still being available as input[x].filename. This allows for more meta-data fields like the type in the example above. There is a bunch of potential advantages with this approach as mentioned in the post above. However implementing this would require any policies written for the --combine flag to be altered.

jremond commented 4 years ago

Hello,

This is of high interest for us as we parse a lot of YAML files. We use both combined and non-combined options right now and we maintain policies for both input formats. What would be required to have a decision based on @gwkunze proposals ?

Thanks

mark-rushakoff commented 4 years ago

I'm working around this by always using conftest test --combine even when I am processing only a single YAML file. And then, I collect all my Kubernetes resources in a single array named resources:

resources := [res | fileRes := input[_]; res = fileRes[_]]

And I write all my rules against resources rather than input. If I had existing rules written against input I'm not sure what I would do, as reassigning input as a function of input is recursive and disallowed. Maybe with input as resources would work.

Salamandastron1 commented 4 years ago

So the quick lean and skinny of this is --combine needs more evidence of being used to convince a breaking change across the entire API, as these solutions propose. @mark-rushakoff @jremond @gwkunze Could you all make pull requests against the contrib folder in this repo so there is more details in how and why you using it?

jremond commented 4 years ago

@Salamandastron1 Hello, here is a simplified example https://github.com/instrumenta/conftest/pull/265 conftest test --combine *.yaml allows me to know 2 things :

gwkunze commented 4 years ago

Added example in #278

Another example could be (but would be a bit boring to contribute) is that we run conftest against the output of helm charts and we want to ensure the existence of a specific resource in all our helm charts. I.e. we want every deployed application to contain at least an Application resource describing the deployed application. Without the --combine flag that wouldn't be possible to check.

jcmcken commented 4 years ago

How does --combine affect a mix of YAML files where some are multi-document and some aren't? If multi-document means input[filename] is an array, but single-document means it's an object, that makes writing policies a bit more difficult.

For example let's say I have two Ingress documents in one file and only one in another, and I use --combine and want to write rules to gather the ingresses. Do I now have to do something like this?

ingresses[ingress] {
  input[filename]["kind"] == "Ingress"
  ingress := input[filename]
}

ingresses[ingress] {
  input[filename][index]["kind"] == "Ingress"
  ingress := input[filename][index]
}

(Pardon if this is not correctly written, but you see what I mean)

It seems kind of annoying to have to do something like this

What would make more sense is if input[filename] were always an array of documents, either of length 1 or length n (in the multi-document case). That would simplify the policy

garethr commented 4 years ago

@jcmcken What you propose though would make the common case more complicated.

I definitely agree it's a trade off, consistency here would make the complex case you describe better, but I think has a cost. Pushing that into rego doesn't see too bad to me.

Definitely worth exploring options here though.

jremond commented 4 years ago

If --combine is used, I would expect the input to be different (as it is) but consistent while using single-document or multiple-document files. Always having an array of documents when using --combine looks good to me.

jcmcken commented 4 years ago

I think the duplication is not the biggest issue ever. If the documentation were updated with examples that discuss this issue, I think that would be OK in my book. It was just a confusing behavior to come across.

OmegaVVeapon commented 4 years ago

Has there been any developments on this conversation?

I'm currently at a point where I have 3 k8s policies. 1 of them requires the --combine flag to work, the other 2 only need to check one single k8s manifest, so I'm letting conftest break the YAMLs into different objects like it does by default in conftest test helm-rendered-template.yaml.

If I rewrite my other 2 policies to work with the --combine flag, they will probably continue to work regardless of the decision made in this ticket, right?

Also, just want to add my vote on changing --combine to always create an array of documents for consistency. I think that's a no brainer.

The discussion on weather making the --combine flag be the new default is definitely more complicated though...

jehiah commented 4 years ago

I've also run into this and would like to propose another option which would allow preserving existing functionality in backwards compatable format

A new --multi-document flag that would always present the contents of a document in a "multi-document" format (array) and would apply regardless of --combine.

So given the following two documents

# single.yaml
lorem: ipsum

# multi-doc.yaml
foo: bar
---
baz: ban

--multi-document would present them both as arrays (with or without use of --combine.

[
  {"foo": "bar"},
  {"baz": "ban"},
]
[
  {"lorem": "ipsum"}
]
{
  "multi-doc.yaml": [
    {"foo": "bar"},
    {"baz": "ban"},
  ],
  "single.yaml":[
    {"lorem": "ipsum"}
  ]
}

@jpreese if this approach sounds good i'll be happy to take a pass at implementing it

OmegaVVeapon commented 4 years ago

@jehiah that approach sounds to me like what the --combine flag should have done from the start.

Question though... if you do get the green light to implement that, is my understanding correct that the only difference between --multi-document and --combine will be that you'll provide an array for single.yaml instead of the current implementation for --combine?

For multi-doc.yaml, both flags will treat them the same.

jehiah commented 4 years ago

@OmegaVVeapon here is what i'm proposing more specifically (using the setup pattern from @gwkunze in https://github.com/open-policy-agent/conftest/issues/106#issuecomment-540048796)

Setup

cat - >main.rego <<EOF
package main

warn[msg] {
  msg = json.marshal(input)
}
EOF
cat - >multi-doc.yaml <<EOF
foo: bar
---
baz: ban
EOF
echo "foo: bar" > single.yaml

Current functionality (as of 125160deacb9c02ce3c098bdf1f3ce7df216026a)

$ conftest test -p main.rego *.yaml 
WARN - multi-doc.yaml - {"foo":"bar"}
WARN - multi-doc.yaml - {"baz":"ban"}
WARN - single.yaml - {"foo":"bar"}
$ conftest test --combine -p main.rego *.yaml
WARN - Combined - {"multi-doc.yaml":[{"foo":"bar"},{"baz":"ban"}],"single.yaml":{"foo":"bar"}}

Proposed (with --multi-document)

$ conftest test --multi-document -p main.rego *.yaml
WARN - multi-doc.yaml - [{"foo":"bar"},{"baz":"ban"}]
WARN - single.yaml - [{"foo":"bar"}]

$ conftest test --combine --multi-document -p main.rego *.yaml
WARN - Combined - {"multi-doc.yaml":[{"foo":"bar"},{"baz":"ban"}],"single.yaml":[{"foo":"bar"}]}
OmegaVVeapon commented 4 years ago

@jehiah That made it crystal clear. Thanks!

jehiah commented 4 years ago

I'm not in love with --multi-document naming and i think this situation only applies to yaml files, so other possible naming could be along the lines of --yaml-multi-doc-mode "treat yaml files containing multiple documents as a single input"

jpreese commented 4 years ago

Thanks for keeping the discussion going, everyone! I read through it and I'm not sure we need to introduce another argument.

A .yaml file can either be a single yaml file, or a number of yaml files split by ---. With conftest, when a --- is present, conftest will do some parsing under the hood to consider these independent inputs. In other words, the policies that you write don't need to change regardless of if the .yaml file contains one or many configurations. The JSON will be different, but the policy should not need to change.

Now for combined, if all of the above makes sense, it might just be that we need to run the same parsing that we do for non-combined files, that we do for combined. That sounds like the root problem here.

The logic for that parsing is here: https://github.com/open-policy-agent/conftest/blob/master/policy/engine.go#L35.

Notice that we range over all of the files, and then range again if that file has multiple configurations.

But when running --combined, we don't perform that subconfig check: https://github.com/open-policy-agent/conftest/blob/master/policy/engine.go#L70

Am I understanding all of that correctly?

jehiah commented 4 years ago

@jpreese I have two motivations; 1) a way to write a policy that applies for single and multi-doc yaml files (this isn't possible with --combined currently because the format is sometimes a doc directly, sometimes an array of documents. 2) write a policy that references across yaml documents in the same file (this requires --combined currently).

For a concrete case, i want a policy that checks kubernetes deployment files; sometimes those are single doc files, sometimes they include additional docs for things like autoscalaers etc. I want to be able to check those settings in relation to each other.

When running --combined skipping the subconfig check is what presents an array of documents for a multi-doc file today; if that's omitted what happens? The structure is keyed by file so one alternative could be appending the index to the filename as @gwkunze suggested here https://github.com/open-policy-agent/conftest/issues/106#issuecomment-540438590

That proposal would result in the following output

$ conftest test --combine --multi-document -p main.rego *.yaml
WARN - Combined - {"multi-doc.yaml/1":{"foo":"bar"},"multi-doc.yaml/2":{"baz":"ban"},"single.yaml":{"foo":"bar"}}
jpreese commented 4 years ago

I'll play with some options tonight, but my current thoughts are:

--combine

my:val +++ foo:bar

input[0].my input [1].foo

Disclaimer: on mobile :rofl:

jpreese commented 4 years ago

@jehiah this is along the lines of what I was thinking: https://github.com/open-policy-agent/conftest/pull/388

You can see the examples folder and the acceptance.bats showcasing some of the usages. If you're wanting to take it for a spin, feel free to checkout my branch locally and run your policy tests against that version of conftest.

jpreese commented 4 years ago

@OmegaVVeapon @jehiah @jcmcken @gwkunze @jremond @Blokje5 A pull request/branch is available that implements an approach for multi-document yaml files (or any file type)

https://github.com/open-policy-agent/conftest/pull/388

jpreese commented 4 years ago

Resolved via https://github.com/open-policy-agent/conftest/pull/388. Thanks for the ongoing discussions everyone!

jcmcken commented 4 years ago

Sorry I dropped out of this discussion, haven't had much time to track it.

Honestly, I'm a little surprised in the turn this took. I was hoping that if a breaking change was made, the old multi-document struct would just become the default. That way all I would need to do is delete the single document rules from my policy (or I could even leave it in there and it would still work I think). Now some 'magic keywords' are introduced and the entire struct is different requiring much more effort to convert. I guess I'm not following from this discussion (or in the PR) why the decision to go with the current method was made.