konflux-ci / build-definitions

Apache License 2.0
23 stars 134 forks source link

buildah-remote task: concurrency problems with source workspace #1275

Open owtaylor opened 3 months ago

owtaylor commented 3 months ago

With the buildah-remote task, there are problems if several are run at the same time with the same workspace. For example, I saw:

But there are all sorts of other failures and corruptions that can occur. The workaround people have used for this is to use separate workspaces and duplicate the prefetch-dependencies steps. That works, with the downsides:

As far as I can tell, this could be fixed pretty simply by making the buildah/buildah-remote tasks align more closely with the buildah-remote-oci-tasks - use an emptydir workdir for temporary files rather than doing that in the workspace.

brianwcook commented 3 months ago

is this somehow specific to using prefetch? I haven't had any of these concurrency problems in my multi-arch pipelines, but I usually do not have prefetch turned there.

owtaylor commented 3 months ago

is this somehow specific to using prefetch? I haven't had any of these concurrency problems in my multi-arch pipelines, but I usually do not have prefetch turned there.

My guess is that your multi-arch pipelines use separate workspaces. I only hit this because I wasn't following a well tested example. The downside of using separate workspaces (other than complexity) is prefetch specific, and especially when the prefetch is large.

lcarva commented 3 months ago

Maybe the fix is to move to using Trusted Artifacts?

owtaylor commented 3 months ago

Maybe the fix is to move to using Trusted Artifacts?

From discussion with @arewm and @brianwcook yesterday, it sounded like they thought keeping the non-oci-ta variants around was going to be necessary for projects with "very large" artifacts.

Certainly for the flatpak-runtime / flatpak-sdk images I'm working on, I have some doubts about the performance and storage usage of putting the prefetched RPMs (> 10GB for all architectures + source) into a registry artifact, though I haven't tried it yet.

brianwcook commented 3 months ago

Ah, I am pretty much exclusively using trusted artifacts so that makes sense.

brianwcook commented 3 months ago

Maybe the fix is to move to using Trusted Artifacts?

From discussion with @arewm and @brianwcook yesterday, it sounded like they thought keeping the non-oci-ta variants around was going to be necessary for projects with "very large" artifacts.

Certainly for the flatpak-runtime / flatpak-sdk images I'm working on, I have some doubts about the performance and storage usage of putting the prefetched RPMs (> 10GB for all architectures + source) into a registry artifact, though I haven't tried it yet.

so 10GB / 4 Arch = 2.5GB per arch ... I think there is a reasonable chance you will have a better experience using trusteed artifacts instead of PVC workspaces.

brianwcook commented 3 months ago

@arewm will the Matrix feature be able to reduce the complexity of using multiple prefetch tasks & workspaces?

owtaylor commented 3 months ago

so 10GB / 4 Arch = 2.5GB per arch ... I think there is a reasonable chance you will have a better experience using trusteed artifacts instead of PVC workspaces.

If we can't split the prefetch[^1] by arch and source then a single PVC workspace is clearly way better than downloading 10GB 4 times to use 2.5G of it. If we can split it, then both the Trusted Artifact and PVC versions benefit from that - the PVC version will still avoid pushing and pulling 10GB to the registry and leaving it there indefinitely.

Anyways, I filed an issue here to publicly record the issue and a possible fix. Could also do something like:

diff --git a/task/buildah-remote/0.2/buildah-remote.yaml b/task/buildah-remote/0.2/buildah-remote.yaml
index 2c5ae78..7c793c5 100644
--- a/task/buildah-remote/0.2/buildah-remote.yaml
+++ b/task/buildah-remote/0.2/buildah-remote.yaml
@@ -447,6 +447,10 @@ spec:
        -v "$BUILD_DIR/tekton-results/:/tekton/results:Z" \
        -v $BUILD_DIR/scripts:/script:Z \
       --user=0  --rm  "$BUILDER_IMAGE" /script/script-build.sh
+      if [ -d rhtap-final-image ] ; then
+          echo "rhtap-final-image already exists, multiple buildah-remote tasks must use separate workspaces"
+          exit 1
+      fi
       rsync -ra "$SSH_HOST:$BUILD_DIR/workspaces/source/" "$(workspaces.source.path)/"
       rsync -ra "$SSH_HOST:$BUILD_DIR/volumes/shared/" /shared/
       rsync -ra "$SSH_HOST:$BUILD_DIR/tekton-results/" "/tekton/results/"

[^1]: Splitting a RPM prefetch should be doable without any cachi2 changes, though clumsy, by using separate rpms.lock.yaml in separate subdirectories of the source. Haven't tested that yet.

arewm commented 3 months ago

Matrix builds are defined by setting parameters for task runs so I don't think that it would support associating specific workspaces for each of the runs. The challenge here is that the workspace exists outside of just the taskrun as you need to specify each of the shared workspaces for the entire pipeline run.

It seems like it could be reasonable to use a matrix for prefetching as well as builds using TA. If the OCI trusted artifact is pushed with some architecture specification in the prefetch then it should be possible for the consumer to identify the proper arch match for its needs.

arewm commented 3 months ago

I have heard that it was an intentional decision to have a single prefetch task in the pipeline and not farm it out to multiple arch-specific ones. I am not familiar with the specifics of that decision, however.

@brunoapimentel, are you aware or do you know who would be able to shed light on the current implementation?

brianwcook commented 3 months ago

I don't know why it is the way it is either, but I am really curious to see how much stuff breaks if we did do per-arch prefetch so I added it to a fork of cachi2

https://github.com/brianwcook/cachi2/commit/707a7d053ac6c28627fd55e0c475058572802ec6

chmeliik commented 3 months ago

I don't know why it is the way it is either, but I am really curious to see how much stuff breaks if we did do per-arch prefetch so I added it to a fork of cachi2

brianwcook/cachi2@707a7d0

Probably just source containers (they won't break, but they'll potentially be missing sources)

And for everything except RPMs, running prefetch more than once is a waste

owtaylor commented 3 months ago

I don't know why it is the way it is either, but I am really curious to see how much stuff breaks if we did do per-arch prefetch so I added it to a fork of cachi2

https://github.com/brianwcook/cachi2/commit/707a7d053ac6c28627fd55e0c475058572802ec6

Cool. I think it might make more sense to do it as a command line option, so the input definition can be shared among tasks. When I prototyped implementing split prefetch using separate rpms.lock.in in subdirs, it was quite awkward to use a different input per prefetch - I could no longer just define the prefetch input for the entire pipeline - there's no way I know to say "Use params.prefetch-input subsituting @ARCH@ with $arch".

Probably just source containers (they won't break, but they'll potentially be missing sources)

I think you'd do the source prefetch separately - either by having cachi2 merge all arches[].source or by changing the rpms.lock.yaml format so that source packages were listed separately from the architectures.

And for everything except RPMs, running prefetch more than once is a waste

There are probably sticky cases where you want to download a lot of RPMs, and a lot of something else, but the worst case for RPM prefetch is something like a base image where you have gigabytes of RPMs and nothing else. I'll collect stats for size and performance when I finish Flatpak runtime and SDK and post that somewhere - may give some idea of when split prefetch might be useful, if at all.

brianwcook commented 3 months ago

I did add it as a command line option (cachi2 taks a json payload on the cli).

I'm not sure if it makes more sense to have 4 prefetch tasks, or to have 1 prefetch task that produces a cache which is well separated so that it can be transferred separately (maybe this wouldn't break source containers, but it is definitely much more complicated to implement).

--

BRIAN COOK (he/him/his)

Sr. Principal Product Manager

Konflux & RHT Product Build Systems

On Mon, Aug 12, 2024 at 9:22 AM Owen Taylor @.***> wrote:

I don't know why it is the way it is either, but I am really curious to see how much stuff breaks if we did do per-arch prefetch so I added it to a fork of cachi2

@.*** https://github.com/brianwcook/cachi2/commit/707a7d053ac6c28627fd55e0c475058572802ec6

Cool. I think it might make more sense to do it as a command line option, so the input definition can be shared among tasks. When I prototyped implementing split prefetch using separate rpms.lock.in in subdirs, it was quite awkward to use a different input per prefetch - I could no longer just define the prefetch input for the entire pipeline - there's no way I know to say "Use params.prefetch-input subsituting @arch https://github.com/arch@ with $arch".

Probably just source containers (they won't break, but they'll potentially be missing sources)

I think you'd do the source prefetch separately - either by having cachi2 merge all arches[].source or by changing the rpms.lock.yaml format so that source packages were listed separately from the architectures.

And for everything except RPMs, running prefetch more than once is a waste

There are probably sticky cases where you want to download a lot of RPMs, and a lot of something else, but the worst case for RPM prefetch is something like a base image where you have gigabytes of RPMs and nothing else. I'll collect stats for size and performance when I finish Flatpak runtime and SDK and post that somewhere - may give some idea of when split prefetch might be useful, if at all.

— Reply to this email directly, view it on GitHub https://github.com/konflux-ci/build-definitions/issues/1275#issuecomment-2283980010, or unsubscribe https://github.com/notifications/unsubscribe-auth/AA4DQJ3PT6V3EHQ47JUL37DZRCZI5AVCNFSM6AAAAABMIT5UTSVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDEOBTHE4DAMBRGA . You are receiving this because you were mentioned.Message ID: @.***>

owtaylor commented 3 months ago

Here's some timings for the two Flatpak images we produce the runtime (common dependencies) and SDK (extended runtime with compilers, headers, etc.)

  Size (MiB) Prefetch (s) Create/Upload (s) Restore (s) rsync (s)
flatpak-runtime - amd64 295 40 35 9 4
flatpak-runtime - arm64 271 40 30 7 2
flatpak-runtime - ppc64le 281 44 31 7 6
flatpak-runtime - s390x 285 33 24 9 6
flatpak-runtime - source 2719 130 257 42 ---
 Total: 3851 287 377 74 18
           
flatpak-sdk - amd64 572 545 61 18 4
flatpak-sdk - arm64 536 528 50 20 3
flatpak-sdk - ppc64le 546 528 52 25 11
flatpak-sdk - s390x 544 530 48 13 11
flatpak-sdk - source 3552 888 336 159 ---
Total:  5750 3019 547 235 29

The timings are a bit all over the place - especially for the prefetch step. (The network connection between the Konflux cluster and the internal repositories prefetch is pulling from seems to be the bottleneck.) But it gives a a general idea.

My observations:

owtaylor commented 3 months ago

Cool. I think it might make more sense to do it as a command line option, so the input definition can be shared among tasks. When I prototyped implementing split prefetch using separate rpms.lock.in in subdirs, it was quite awkward to use a different input per prefetch - I could no longer just define the prefetch input for the entire pipeline - there's no way I know to say "Use params.prefetch-input subsituting https://github.com/arch@ with $arch". I did add it as a command line option (cachi2 taks a json payload on the cli).

Well, OK :-) , but what I mean is a command line option that's separate from the input specification, since the input specification is something that the component should be able to control without modifying individual pipeline taskruns.

arewm commented 3 months ago

I'm not sure if it makes more sense to have 4 prefetch tasks, or to have 1 prefetch task that produces a cache which is well separated so that it can be transferred separately (maybe this wouldn't break source containers, but it is definitely much more complicated to implement).

There are two goals that are at odds with each other:

  1. Reducing the amount of content that is pulled by the build container if prefetches are arch-specific (i.e. multiple artifacts)
  2. Ensuring that all prefetched content is present for generating the source container (i.e. a single artifact)

I haven't looked into how the directory is structured when prefetching content, but a potential high-level overview of the required changes is:

@mmorhun @zregvart , fyi as well to bring your attention to this discussion.

zregvart commented 3 months ago
  1. Reducing the amount of content that is pulled by the build container if prefetches are arch-specific (i.e. multiple artifacts)

    1. Ensuring that all prefetched content is present for generating the source container (i.e. a single artifact)

Not sure why we could not generate a prefetch trusted artifact per-platform? A single source and platform specific trusted artifact can be used in per-platform build step, and restoring the single source and multiple per-platform trusted artifacts when building the source container. This all relies on destinations for per-platform prefetch trusted artifacts not overwriting each other, i.e. being created and restored to platform-specific directories.

chmeliik commented 3 months ago
  1. Reducing the amount of content that is pulled by the build container if prefetches are arch-specific (i.e. multiple artifacts)

    1. Ensuring that all prefetched content is present for generating the source container (i.e. a single artifact)

Not sure why we could not generate a prefetch trusted artifact per-platform? A single source and platform specific trusted artifact can be used in per-platform build step, and restoring the single source and multiple per-platform trusted artifacts when building the source container. This all relies on destinations for per-platform prefetch trusted artifacts not overwriting each other, i.e. being created and restored to platform-specific directories.

That does sound like a better idea. But if it also needs to work for the non-TA pipeline, it complicates things.

Most multiarch pipelines I've seen use one workspace per arch and run one prefetch per arch (I don't fully understand why, but I think it was due to the volumes being ReadWriteOnce - not possible to run builds in parallel with a single workspace). Which would mean the source-container task needs to mount a variable number of workspaces.

In practice, the way to achieve this would be to declare a hardcoded number of workspaces in the source task (let's say 10) and Konflux users would have to pass their per-arch workspaces explicitly, something like this

- name: workspace-0
  workspace: workspace-x86_64
- name: workspace-1
  workspace: workspace-aarch64
- name: workspace-2
  workspace: workspace-ppc64le
- name: workspace-3
  workspace: workspace-s390x

Side note: the only reason why this is currently not necessary is that it's almost impossible to make the prefetch task behave differently per arch, except for Owen's workaround:

Splitting a RPM prefetch should be doable without any cachi2 changes, though clumsy, by using separate rpms.lock.yaml in separate subdirectories of the source.

arewm commented 3 months ago

Most of the multi-arch pipelines have a single PVC per workspace to prevent the current issue ... preventing the unnecessary copying/syncing of data to remote workers for the other architectures.

In my proposed pipeline for multi-arch with matrix (and OCI trusted artifacts), none of these workspaces exist: https://github.com/konflux-ci/build-definitions/pull/1236.

While the requested feature to split the prefetch can be done when the data is stored in PVC, there is a more pressing need for it just in the OCI trusted artifacts because of the additional data transfer to/from the registry (this cost is not realized with the PVC except for when we rsync data to the remote host).

chmeliik commented 3 months ago

I kind of lost track of what we're discussing in this issue

lkolacek commented 3 weeks ago

I also lost track of the issue we're discussing here and the discussion itself seems to be quiet for a long time. Is it possible that the issue is not relevant anymore with moving to Trusted artifacts pipeline? I'll close it in a week if there won't be any response.

arewm commented 3 weeks ago

I think that the original issue that was described (i.e. not having uniqueness in the images when synced back to a shared PVC) might have been incidentally/unintentionally resolved with https://github.com/konflux-ci/build-definitions/pull/1216.

We now push and pull the image which automatically has the platform appended to it.

The resulting conversation is unresolved ... i.e. how should we properly handle/design prefetching large payloads for multiple architectures.

mmorhun commented 2 weeks ago

Should we have a separate Feature / design to solve the large payloads issue? It's confusing from the issue title.