buildah-remote task: concurrency problems with source workspace

owtaylor commented 3 months ago

With the buildah-remote task, there are problems if several are run at the same time with the same workspace. For example, I saw:

arch1 completes, $(workspaces.source.path)/rhtap-final-image is rsynced back from the arch1 remote
arch2 starts, rhtap-final-image is rsynced to the arch2 remote
The arch2 image is added to rhtap-final-image
arch1 completes, rhtap-final-image is rsynced back from the arch2 remote - with both architectures
buildah pull oci:rhtap-final-image fails because the directory now has two images and buildah doesn't know which to use

But there are all sorts of other failures and corruptions that can occur. The workaround people have used for this is to use separate workspaces and duplicate the prefetch-dependencies steps. That works, with the downsides:

You have to know to use that workaround and why
A more complex, harder to maintain pipeline
Right now, cachi2 can't prefetch for only one architecture, making this approach much less efficient.

As far as I can tell, this could be fixed pretty simply by making the buildah/buildah-remote tasks align more closely with the buildah-remote-oci-tasks - use an emptydir workdir for temporary files rather than doing that in the workspace.

brianwcook commented 3 months ago

is this somehow specific to using prefetch? I haven't had any of these concurrency problems in my multi-arch pipelines, but I usually do not have prefetch turned there.

owtaylor commented 3 months ago

is this somehow specific to using prefetch? I haven't had any of these concurrency problems in my multi-arch pipelines, but I usually do not have prefetch turned there.

My guess is that your multi-arch pipelines use separate workspaces. I only hit this because I wasn't following a well tested example. The downside of using separate workspaces (other than complexity) is prefetch specific, and especially when the prefetch is large.

lcarva commented 3 months ago

Maybe the fix is to move to using Trusted Artifacts?

owtaylor commented 3 months ago

Maybe the fix is to move to using Trusted Artifacts?

From discussion with @arewm and @brianwcook yesterday, it sounded like they thought keeping the non-oci-ta variants around was going to be necessary for projects with "very large" artifacts.

Certainly for the flatpak-runtime / flatpak-sdk images I'm working on, I have some doubts about the performance and storage usage of putting the prefetched RPMs (> 10GB for all architectures + source) into a registry artifact, though I haven't tried it yet.

brianwcook commented 3 months ago

Ah, I am pretty much exclusively using trusted artifacts so that makes sense.

brianwcook commented 3 months ago

Maybe the fix is to move to using Trusted Artifacts?

From discussion with @arewm and @brianwcook yesterday, it sounded like they thought keeping the non-oci-ta variants around was going to be necessary for projects with "very large" artifacts.

Certainly for the flatpak-runtime / flatpak-sdk images I'm working on, I have some doubts about the performance and storage usage of putting the prefetched RPMs (> 10GB for all architectures + source) into a registry artifact, though I haven't tried it yet.

so 10GB / 4 Arch = 2.5GB per arch ... I think there is a reasonable chance you will have a better experience using trusteed artifacts instead of PVC workspaces.

brianwcook commented 3 months ago

@arewm will the Matrix feature be able to reduce the complexity of using multiple prefetch tasks & workspaces?

owtaylor commented 3 months ago

so 10GB / 4 Arch = 2.5GB per arch ... I think there is a reasonable chance you will have a better experience using trusteed artifacts instead of PVC workspaces.

If we can't split the prefetch[^1] by arch and source then a single PVC workspace is clearly way better than downloading 10GB 4 times to use 2.5G of it. If we can split it, then both the Trusted Artifact and PVC versions benefit from that - the PVC version will still avoid pushing and pulling 10GB to the registry and leaving it there indefinitely.

Anyways, I filed an issue here to publicly record the issue and a possible fix. Could also do something like:

diff --git a/task/buildah-remote/0.2/buildah-remote.yaml b/task/buildah-remote/0.2/buildah-remote.yaml
index 2c5ae78..7c793c5 100644
--- a/task/buildah-remote/0.2/buildah-remote.yaml
+++ b/task/buildah-remote/0.2/buildah-remote.yaml
@@ -447,6 +447,10 @@ spec:
        -v "$BUILD_DIR/tekton-results/:/tekton/results:Z" \
        -v $BUILD_DIR/scripts:/script:Z \
       --user=0  --rm  "$BUILDER_IMAGE" /script/script-build.sh
+      if [ -d rhtap-final-image ] ; then
+          echo "rhtap-final-image already exists, multiple buildah-remote tasks must use separate workspaces"
+          exit 1
+      fi
       rsync -ra "$SSH_HOST:$BUILD_DIR/workspaces/source/" "$(workspaces.source.path)/"
       rsync -ra "$SSH_HOST:$BUILD_DIR/volumes/shared/" /shared/
       rsync -ra "$SSH_HOST:$BUILD_DIR/tekton-results/" "/tekton/results/"

[^1]: Splitting a RPM prefetch should be doable without any cachi2 changes, though clumsy, by using separate rpms.lock.yaml in separate subdirectories of the source. Haven't tested that yet.

arewm commented 3 months ago

Matrix builds are defined by setting parameters for task runs so I don't think that it would support associating specific workspaces for each of the runs. The challenge here is that the workspace exists outside of just the taskrun as you need to specify each of the shared workspaces for the entire pipeline run.

It seems like it could be reasonable to use a matrix for prefetching as well as builds using TA. If the OCI trusted artifact is pushed with some architecture specification in the prefetch then it should be possible for the consumer to identify the proper arch match for its needs.

arewm commented 3 months ago

I have heard that it was an intentional decision to have a single prefetch task in the pipeline and not farm it out to multiple arch-specific ones. I am not familiar with the specifics of that decision, however.

@brunoapimentel, are you aware or do you know who would be able to shed light on the current implementation?

brianwcook commented 3 months ago

I don't know why it is the way it is either, but I am really curious to see how much stuff breaks if we did do per-arch prefetch so I added it to a fork of cachi2

https://github.com/brianwcook/cachi2/commit/707a7d053ac6c28627fd55e0c475058572802ec6

chmeliik commented 3 months ago

I don't know why it is the way it is either, but I am really curious to see how much stuff breaks if we did do per-arch prefetch so I added it to a fork of cachi2

brianwcook/cachi2@707a7d0

Probably just source containers (they won't break, but they'll potentially be missing sources)

And for everything except RPMs, running prefetch more than once is a waste

owtaylor commented 3 months ago

I don't know why it is the way it is either, but I am really curious to see how much stuff breaks if we did do per-arch prefetch so I added it to a fork of cachi2

https://github.com/brianwcook/cachi2/commit/707a7d053ac6c28627fd55e0c475058572802ec6

Cool. I think it might make more sense to do it as a command line option, so the input definition can be shared among tasks. When I prototyped implementing split prefetch using separate rpms.lock.in in subdirs, it was quite awkward to use a different input per prefetch - I could no longer just define the prefetch input for the entire pipeline - there's no way I know to say "Use params.prefetch-input subsituting @ARCH@ with $arch".

Probably just source containers (they won't break, but they'll potentially be missing sources)

I think you'd do the source prefetch separately - either by having cachi2 merge all arches[].source or by changing the rpms.lock.yaml format so that source packages were listed separately from the architectures.

And for everything except RPMs, running prefetch more than once is a waste

There are probably sticky cases where you want to download a lot of RPMs, and a lot of something else, but the worst case for RPM prefetch is something like a base image where you have gigabytes of RPMs and nothing else. I'll collect stats for size and performance when I finish Flatpak runtime and SDK and post that somewhere - may give some idea of when split prefetch might be useful, if at all.

brianwcook commented 3 months ago

I did add it as a command line option (cachi2 taks a json payload on the cli).

I'm not sure if it makes more sense to have 4 prefetch tasks, or to have 1 prefetch task that produces a cache which is well separated so that it can be transferred separately (maybe this wouldn't break source containers, but it is definitely much more complicated to implement).

--

BRIAN COOK (he/him/his)

Sr. Principal Product Manager

Konflux & RHT Product Build Systems

On Mon, Aug 12, 2024 at 9:22 AM Owen Taylor @.***> wrote:

I don't know why it is the way it is either, but I am really curious to see how much stuff breaks if we did do per-arch prefetch so I added it to a fork of cachi2

@.*** https://github.com/brianwcook/cachi2/commit/707a7d053ac6c28627fd55e0c475058572802ec6

Cool. I think it might make more sense to do it as a command line option, so the input definition can be shared among tasks. When I prototyped implementing split prefetch using separate rpms.lock.in in subdirs, it was quite awkward to use a different input per prefetch - I could no longer just define the prefetch input for the entire pipeline - there's no way I know to say "Use params.prefetch-input subsituting @arch https://github.com/arch@ with $arch".

Probably just source containers (they won't break, but they'll potentially be missing sources)

I think you'd do the source prefetch separately - either by having cachi2 merge all arches[].source or by changing the rpms.lock.yaml format so that source packages were listed separately from the architectures.

And for everything except RPMs, running prefetch more than once is a waste

There are probably sticky cases where you want to download a lot of RPMs, and a lot of something else, but the worst case for RPM prefetch is something like a base image where you have gigabytes of RPMs and nothing else. I'll collect stats for size and performance when I finish Flatpak runtime and SDK and post that somewhere - may give some idea of when split prefetch might be useful, if at all.

— Reply to this email directly, view it on GitHub https://github.com/konflux-ci/build-definitions/issues/1275#issuecomment-2283980010, or unsubscribe https://github.com/notifications/unsubscribe-auth/AA4DQJ3PT6V3EHQ47JUL37DZRCZI5AVCNFSM6AAAAABMIT5UTSVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDEOBTHE4DAMBRGA . You are receiving this because you were mentioned.Message ID: @.***>

owtaylor commented 3 months ago

Here's some timings for the two Flatpak images we produce the runtime (common dependencies) and SDK (extended runtime with compilers, headers, etc.)

	Size (MiB)	Prefetch (s)	Create/Upload (s)	Restore (s)	rsync (s)
flatpak-runtime - amd64	295	40	35	9	4
flatpak-runtime - arm64	271	40	30	7	2
flatpak-runtime - ppc64le	281	44	31	7	6
flatpak-runtime - s390x	285	33	24	9	6
flatpak-runtime - source	2719	130	257	42	---
Total:	3851	287	377	74	18

flatpak-sdk - amd64	572	545	61	18	4
flatpak-sdk - arm64	536	528	50	20	3
flatpak-sdk - ppc64le	546	528	52	25	11
flatpak-sdk - s390x	544	530	48	13	11
flatpak-sdk - source	3552	888	336	159	---
Total:	5750	3019	547	235	29

The timings are a bit all over the place - especially for the prefetch step. (The network connection between the Konflux cluster and the internal repositories prefetch is pulling from seems to be the bottleneck.) But it gives a a general idea.

My observations:

For these images, splitting the prefetch is quite useful, because prefetch and creating the artifacts is slow. Restoring the RPMs from the registry on the other hand is pretty fast - it took almost 3 minute for the SDK sources, but that was a bit of an outlier.
For these images, duplicated prefetch like is done by other RHEL containers wouldn't work at all.
For images with less RPM content, splitting the prefetch doesn't matter as much - wouldn't be bad to save a few minutes per build, though.

owtaylor commented 3 months ago

Cool. I think it might make more sense to do it as a command line option, so the input definition can be shared among tasks. When I prototyped implementing split prefetch using separate rpms.lock.in in subdirs, it was quite awkward to use a different input per prefetch - I could no longer just define the prefetch input for the entire pipeline - there's no way I know to say "Use params.prefetch-input subsituting https://github.com/arch@ with $arch". I did add it as a command line option (cachi2 taks a json payload on the cli).

Well, OK :-) , but what I mean is a command line option that's separate from the input specification, since the input specification is something that the component should be able to control without modifying individual pipeline taskruns.

arewm commented 3 months ago

I'm not sure if it makes more sense to have 4 prefetch tasks, or to have 1 prefetch task that produces a cache which is well separated so that it can be transferred separately (maybe this wouldn't break source containers, but it is definitely much more complicated to implement).

There are two goals that are at odds with each other:

Reducing the amount of content that is pulled by the build container if prefetches are arch-specific (i.e. multiple artifacts)
Ensuring that all prefetched content is present for generating the source container (i.e. a single artifact)

I haven't looked into how the directory is structured when prefetching content, but a potential high-level overview of the required changes is:

Define a parameter for prefetch which would separate out the platform-specific dependencies into their own subdirectories
Teach the TA pushing to create a composite artifact (i.e. an image manifest with one layer for each architecture) upon request or based on the directory structure in the file system. If there is no arch-specific content, a single layer would be pushed.
Teach the TA pulling to be able to inspect the image manifest and only pull a specific layer if all content isn't needed. If there isn't any arch specific content, the single layer would be pulled.
Enable the buildah task to modify the TA pull to provide an architecture (i.e. the last slug of the PLATFORM), pulling only that content (we will need to ensure that each of the configured platforms support this mode of operation)
Tasks like the source build wouldn't know/care about the arch-specific content, so they would just pull all layers. These tasks would, however, need to be able to handle the modified directory structure (if present).
- NOTE: An alternative to this could just be to generate the source containers at the same time as the prefetch. The challenge here is that we wouldn't be able to use the referrer's API directly because the subject doesn't exist. There might need to be a process in the build task (or elsewhere in the pipeline) that attaches artifacts which are unattached.

@mmorhun @zregvart , fyi as well to bring your attention to this discussion.

zregvart commented 3 months ago

Reducing the amount of content that is pulled by the build container if prefetches are arch-specific (i.e. multiple artifacts)

Ensuring that all prefetched content is present for generating the source container (i.e. a single artifact)

Not sure why we could not generate a prefetch trusted artifact per-platform? A single source and platform specific trusted artifact can be used in per-platform build step, and restoring the single source and multiple per-platform trusted artifacts when building the source container. This all relies on destinations for per-platform prefetch trusted artifacts not overwriting each other, i.e. being created and restored to platform-specific directories.

chmeliik commented 3 months ago

Reducing the amount of content that is pulled by the build container if prefetches are arch-specific (i.e. multiple artifacts)

Ensuring that all prefetched content is present for generating the source container (i.e. a single artifact)

Not sure why we could not generate a prefetch trusted artifact per-platform? A single source and platform specific trusted artifact can be used in per-platform build step, and restoring the single source and multiple per-platform trusted artifacts when building the source container. This all relies on destinations for per-platform prefetch trusted artifacts not overwriting each other, i.e. being created and restored to platform-specific directories.

That does sound like a better idea. But if it also needs to work for the non-TA pipeline, it complicates things.

Most multiarch pipelines I've seen use one workspace per arch and run one prefetch per arch (I don't fully understand why, but I think it was due to the volumes being ReadWriteOnce - not possible to run builds in parallel with a single workspace). Which would mean the source-container task needs to mount a variable number of workspaces.

In practice, the way to achieve this would be to declare a hardcoded number of workspaces in the source task (let's say 10) and Konflux users would have to pass their per-arch workspaces explicitly, something like this

- name: workspace-0
  workspace: workspace-x86_64
- name: workspace-1
  workspace: workspace-aarch64
- name: workspace-2
  workspace: workspace-ppc64le
- name: workspace-3
  workspace: workspace-s390x

Side note: the only reason why this is currently not necessary is that it's almost impossible to make the prefetch task behave differently per arch, except for Owen's workaround:

Splitting a RPM prefetch should be doable without any cachi2 changes, though clumsy, by using separate rpms.lock.yaml in separate subdirectories of the source.

arewm commented 3 months ago

Most of the multi-arch pipelines have a single PVC per workspace to prevent the current issue ... preventing the unnecessary copying/syncing of data to remote workers for the other architectures.

In my proposed pipeline for multi-arch with matrix (and OCI trusted artifacts), none of these workspaces exist: https://github.com/konflux-ci/build-definitions/pull/1236.

While the requested feature to split the prefetch can be done when the data is stored in PVC, there is a more pressing need for it just in the OCI trusted artifacts because of the additional data transfer to/from the registry (this cost is not realized with the PVC except for when we rsync data to the remote host).

chmeliik commented 3 months ago

I kind of lost track of what we're discussing in this issue

lkolacek commented 3 weeks ago

I also lost track of the issue we're discussing here and the discussion itself seems to be quiet for a long time. Is it possible that the issue is not relevant anymore with moving to Trusted artifacts pipeline? I'll close it in a week if there won't be any response.

arewm commented 3 weeks ago

I think that the original issue that was described (i.e. not having uniqueness in the images when synced back to a shared PVC) might have been incidentally/unintentionally resolved with https://github.com/konflux-ci/build-definitions/pull/1216.

We now push and pull the image which automatically has the platform appended to it.

The resulting conversation is unresolved ... i.e. how should we properly handle/design prefetching large payloads for multiple architectures.

mmorhun commented 2 weeks ago

Should we have a separate Feature / design to solve the large payloads issue? It's confusing from the issue title.

konflux-ci / build-definitions

buildah-remote task: concurrency problems with source workspace #1275