contenthash: improve the correctness of needsScan

cyphar commented 1 week ago

Commit f724d6fb0504 ("contenthash: implement proper Linux symlink semantics for needsScan") fixed issues with needScan's handling of symlinks, but the logic used to figure out if a parent path is in the cache was incorrect in a couple of ways:

The optimisation in getFollowSymlinksCallback to avoid looking up / in the cache lead to the callback not being called when we go through /. The upshot is that needsScan(/non-existent-path) would always return true because we didn't check if / has been scanned.
Similarly, the logic of skipping a cache lookup if the path is the same as the current path meant that we skipped the first / lookup (because we start with currentPath=/).
Because needsScan would only store the last good path, cases with symlink jumps to non-existent paths within directories already scanned would result in a re-scan that isn't necessary. Fix this by saving a set of prefix paths we have seen. Note that in combination with (1) and (2), if / has been scanned then needsScan will always return false now (because the / prefix is always checked against).

Fixes: f724d6fb0504 ("contenthash: implement proper Linux symlink semantics for needsScan") Fixes #5042 Signed-off-by: Aleksa Sarai cyphar@cyphar.com

cyphar commented 1 week ago

@tonistiigi Based on my testing, this fixes the issue you described as well as having a few other fixes that should improve other possible performance regressions (all mainly dealing with non-existent files).

I can cook up a few tests that call needsScan and scanPath directly if those are the kind of tests you'd like to add? (It's not really possible to test this issue with Checksum directly.)

cyphar commented 1 week ago

pathSet is the simplest way of implementing the structure needed for the prefix checks. If you feel a more complex structure is needed, let me know -- but because we do path lookups from left-to-right now, in practice the very early ancestors will be at the head of the array of prefixes so I don't think it needs anything more complicated than a simple array.

cyphar commented 1 week ago

I guess another radix tree would work well for this but assuming the length of prefixes array is always expected to be small, the potentially inefficient lookup in includes shouldn't matter for practical cases.

Ah yeah, go-immutable-radix has LongestPrefix which would work for this. But yeah, I think in practice it won't matter (and I'm not sure that it would be better for most cases anyway because this use is not write-few-read-many, it's write-many-read-few, so the copies in each Insert are probably not worth it in practice).

That sgtm, but we could also add some private counters logic using private variables that can be turned on by the tests so we can see how many times the scanning/walking happens for certain conditions and detect if some future change causes the more expensive scanning part happen more often than expected.

Already working on the needsScan tests (hence the draft). I'll add some counters as well while I'm at it, though I suspect that testing needsScan should be sufficient.

moby / buildkit

contenthash: improve the correctness of needsScan #5060