moby / buildkit

concurrent, cache-efficient, and Dockerfile-agnostic builder toolkit
https://github.com/moby/moby/issues/34227
Apache License 2.0
8.21k stars 1.16k forks source link

Cache Manifest: duplicate entries in a manifest ? Is a sha an identifier ? #4398

Open azr opened 1 year ago

azr commented 1 year ago

Hello there, I'm currently trying to deeply understand what's in a cache manifest and I have a bunch of questions, feel free to just redirect me to a blog post or something, I couldn't find any that would answer these.

1/ In some builds a cache manifest can have more than one layer with the same digest, but I see that the code treats those as a single, same thing; for example in here : https://github.com/moby/buildkit/blob/36c5550a84d5ef558c1b9ac733cba128feb37929/cache/remotecache/import.go#L76-L79

Here are some questions:

2/ Same paragraph and same questions but for cache records instead of layers. A cache manifest can have more than one records with a similar digest.

3/ cacheManager.Query

https://github.com/moby/buildkit/blob/36c5550a84d5ef558c1b9ac733cba128feb37929/cache/remotecache/v1/cachestorage.go#L47-L52

When the cache item has links. But, would it be possible to rename things before saving the manifest ? So as to have each entry be a unique one ?

Thanks :)

tonistiigi commented 1 year ago

Start from https://github.com/moby/buildkit/blob/master/cache/remotecache/v1/doc.go if you haven't already.

Duplicate digests may appear but duplicate chains (meaning elements where all the other properties are same as well) should not. Unfortunately, I think I've seen some (unverified) reports of instances with duplictes so it is possible there is a bug somewhere. In that case feel free to tackle it or if you can put together a reproducer that creates duplicates then make sure to post it.

Layer digest is not a cache digest. Chains of cache digests point to some layer chain.

Would it be a correct thing to sort of rewrite a manifest and merge the duplicate entries together ?

This should already happen. For example look at CacheChains.normalize()

I also saw that upon loading a cache manifest, the loaded digests are randomized here :

This is an internal representation for testing uniqueness. These IDs do not end up in JSON manifest.

azr commented 1 year ago

Hey @tonistiigi, thanks for all these, and sorry for not replying sooner, the notification somehow slipped through, although I was looking forward 👀.

I managed to have a chain that had a part repeating itself, but it repeated only once (not infinitely), and it made sense then looking at the Dockerfile, I think. 🤔 Nothing weird since.

I did build a thing that would generate a dot file from a cache Manifest, to get a visual graph of the manifest, which helped a lot + a few other things that helped.

I managed to answer my questions, to build the thing I wanted to build, a global s3 cache as described in here #4295 would you be interested in me opening a PR with it ? Would you also be interested in seeing the dot generator ?

Thanks !

tonistiigi commented 1 year ago

would you be interested in me opening a PR with it ? Would you also be interested in seeing the dot generator ?

You can open a PR and we can discuss there. Make it a draft if you are not sure what direction to take it. For "dot generator" as well I need to see more details, like can it accept the name of cache image or what is the interface. Generally, we are of course interested in more debugging tools.

jedevc commented 11 months ago

Would you also be interested in seeing the dot generator ?

100%. I've been doing some not very fun cache debugging recently, and any kind of tooling around this would be great.

Even if you don't have the time to do a PR right now, if you have anything at all that's public, I'd potentially be happy to collaborate and help upstream it?

azr commented 11 months ago

I have pretty handy tooling around here, let me quickly PR it.

azr commented 11 months ago

Feel free to use https://github.com/moby/buildkit/pull/4470