homeport / dyff

/ˈdʏf/ - diff tool for YAML files, and sometimes JSON
MIT License
1.31k stars 64 forks source link

interaction between regular expressions and YAML path segments is unclear #266

Open dhduvall opened 1 year ago

dhduvall commented 1 year ago

This bug has morphed a bit. I had originally tried to do --exclude-regexp metadata\.annotations\.checksum/.*, mistakenly thinking that the dots were part of a string that needed to be escaped in order not to match too many strings, and not the separators between path segments.

But that raises the question of how those two syntaxes interact. For instance, if I want to exclude any item whose path starts with metadata and ends with generation, how do I express that? It seems that metadata.*generation works, but possibly by accident? At least, I wouldn't know how to filter

metadata:
  something:
    something-else:
      generation: 5

distinctly from

metadata.something.something-else.generation: 5

or

metadataXsomethingXsomething-elseXgeneration: 5

There are a similar set of ambiguous patterns using / as the separator, too.

In tracking this down, I noticed that the tests are also missing examples which contain regular expression metacharacters (not that they would have caught this particular problem).

Ambiguity is probably fine 99% of the time, but even having that much documented would be really handy. Unfortunately, that seems like it's getting into manpage territory ....

I'll note with a crazy look in my eye that helm uses (AFAIK completely undocumented) vertical tab (\v) as a separator internally, so if you have, say, a datadog repo configured, which has datadog, datadog-crds, and datadog-operator charts, you can search for \vdatadog/datadog\v to find just the first. Perhaps it might make sense to treat \v as a segment separator here, too, assuming it's not possible to use that character in a YAML key.

vladimir-avinkin commented 1 year ago

I don't quite understand your question and what exactly do you mean by ambiguity, but dyff applies regexps to a go-patch path string. At least i think so

https://github.com/homeport/dyff/blob/86654f8e002a69fdabc87ac954f7d0fa11e6b84b/pkg/dyff/reports.go#L94 (Path.String() returns the string with go-patch style separators) So in your example the correct regexp would be ^/metadata/.+/generation$

The exact behavior should probably be documented somewhere. Unless I'm missing something you literally can't use this feature correctly without perusing the source code

dhduvall commented 1 year ago

So in your example the correct regexp would be ^/metadata/.+/generation$

Yep, that'll work; thanks for helping get me to that point. It should be good enough for my use, since I'm not expecting to run into any ambiguous paths.

The exact behavior should probably be documented somewhere. Unless I'm missing something you literally can't use this feature correctly without perusing the source code

Agreed.

The ambiguity I was talking about is when you have a path which is collapsed into a string with some separator (dots, slashes, it doesn't matter; I spend more time with jq and yq, so I'm used to dots as separators, but also dyff output uses dots by default, so I'd expect that as input by default, too) and a single key that is the same as that path.

Here's a concrete example:

metadata:
  annotations:
    prometheus.io/scrape: true
    prometheus.io:
      scrape: false

(Ignoring the fact that this isn't a valid Kubernetes manifest.) Does metadata/annotations/prometheus.io/scrape, reference the true value or the false value? It looks like it references both (pardon the zsh-ism):

dyff between --exclude-regexp 'annotations/prometheus.io/scrape' \
  =(jq -n '.annotations["prometheus.io/scrape"] |= true | .annotations["prometheus.io"].scrape |= true') \
  =(jq -n '.annotations["prometheus.io/scrape"] |= false | .annotations["prometheus.io"].scrape |= false')

returns no differences. I think I would expect the expression to refer to the false value, but then there needs to be an escaping mechanism for the slash in order to get to the true value.

I had trouble figuring out what "go-patch" means, though. Google got me to RFC6902 (JSON Patch), which relies on RFC6901 (JSON Path), which uses ~1 to represent /, but I tried annotations/prometheus.io~1scrape, and dyff reported both differences.

It looks like dyff uses ytbx to do this processing, and that has no reference to ~ in its source, and doesn't even pretend to implement RFC6901 or RFC6902. This is a touch surprising, since the README explicitly references https://github.com/cppforlife/go-patch (aha, there's "go-patch"), but then never actually uses it. go-patch does implement the RFC6901 escaping mechanism, though it's careful to tell us that it's only "roughly based on" RFC6902. The README does say "output", and not "input", but since there's no mention of input format, maybe someone who read the README would assume it was the same as output?