recast-hep / recast-atlas

CLI for ATLAS RECAST contributors
https://recast.docs.cern.ch/
Apache License 2.0

Recast on ARM Architectures #117

Closed Nollde closed 3 months ago

Nollde commented 9 months ago

Problem

Developing Recast workflows on ARM chips (e.g. Apple M1/M2) is currently limited because most Docker images only support x86 architectures.

Description

Docker images are tied to a processor architecture, so images built for x86 cannot run natively on ARM. When using the Recast docker backend, containers are started from within a container, which adds an extra layer of complexity.

Solutions

I currently see two solutions to the issue:

  1. Multi-platform images: Docker images can be built for multiple platforms. In this case the user would need to make sure that every image used is correctly built for the architecture they develop on. This is rather complicated for, e.g., ATLAS-based analysis images. Cross-compilation can be achieved via QEMU.
  2. Use Docker's native emulation at runtime: Docker provides emulation of the x86 architecture at runtime via QEMU (it can be enabled with docker run --platform linux/amd64 ... or by setting the environment variable with export DOCKER_DEFAULT_PLATFORM=linux/amd64). While this can easily be enabled for the docker image of Recast itself (currently recast/recastatlas:v0.3.0), enabling it for the docker images that Recast starts would require changes in the Recast code. My temporary local solution can be found here. Using emulation at runtime is easy but can be slow.
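
As a rough illustration of when the second option applies, the sketch below maps the host machine type to a Docker platform string and reports whether running a given image platform would need emulation. The mapping table and the function names `docker_platform_for` and `needs_emulation` are hypothetical helpers for this example, not part of Recast or Docker.

```python
import platform
from typing import Optional

# Hypothetical mapping from platform.machine() values to Docker platform strings.
_MACHINE_TO_DOCKER = {
    "x86_64": "linux/amd64",
    "amd64": "linux/amd64",
    "arm64": "linux/arm64",
    "aarch64": "linux/arm64",
}


def docker_platform_for(machine: str) -> Optional[str]:
    """Return the Docker platform string for a machine type, or None if unknown."""
    return _MACHINE_TO_DOCKER.get(machine.lower())


def needs_emulation(image_platform: str, machine: Optional[str] = None) -> bool:
    """Return True if the image platform differs from the (known) host platform.

    If the host machine type is not in the mapping, this returns False,
    because the sketch cannot tell either way.
    """
    host = docker_platform_for(machine or platform.machine())
    return host is not None and host != image_platform


if __name__ == "__main__":
    # On an Apple M1/M2 host this prints True for x86 images.
    print(needs_emulation("linux/amd64"))
```

On an ARM host, a `True` result for `linux/amd64` means the image would only run through the QEMU-based emulation described above.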

I am happy to discuss further steps to enable the development of Recast workflows on ARM architectures.

Comments

Nollde commented 9 months ago

This is a minimal example of the issue:

git clone git@github.com:Nollde/recast_example_docker_x86.git
cd recast_example_docker_x86
conda create -n recast-tmp python
conda activate recast-tmp
pip install recast-atlas
$(recast catalogue add $PWD)
recast run examples/helloworld

matthewfeickert commented 4 months ago

@Nollde PR https://github.com/recast-hep/recast-atlas/pull/129 went into v0.4.0 and may address some of this: setting DOCKER_DEFAULT_PLATFORM in the environment now sets the --platform flag in docker run.
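
To make the described behaviour concrete, here is a minimal sketch of forwarding `DOCKER_DEFAULT_PLATFORM` to the `--platform` flag of `docker run`. It is an illustration under assumptions, not the code from the PR: the function name `docker_run_command` and its arguments are hypothetical, and only the `docker run`, `--rm` and `--platform` parts are real Docker CLI.

```python
import os
from typing import List, Mapping, Optional, Sequence


def docker_run_command(
    image: str,
    args: Sequence[str],
    env: Optional[Mapping[str, str]] = None,
) -> List[str]:
    """Build a `docker run` argument list, adding --platform if requested.

    The platform is read from DOCKER_DEFAULT_PLATFORM in `env`
    (defaults to the process environment).
    """
    env = os.environ if env is None else env
    command = ["docker", "run", "--rm"]
    requested = env.get("DOCKER_DEFAULT_PLATFORM")
    if requested:
        # Only pass the flag when the variable is set and non-empty.
        command += ["--platform", requested]
    command.append(image)
    command.extend(args)
    return command


if __name__ == "__main__":
    print(docker_run_command(
        "recast/recastatlas:v0.4.0",
        ["echo", "hello"],
        env={"DOCKER_DEFAULT_PLATFORM": "linux/amd64"},
    ))
```

The returned list could be handed to `subprocess.run`. Passing the flag explicitly makes the platform choice visible in the command itself rather than leaving it implicit in the environment.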

There are now also linux/arm64 platform Docker images in the recast/recastatlas container image manifest for tag v0.4.0 and later, following PR https://github.com/recast-hep/recast-atlas/pull/134. This of course doesn't really help unless the other containers in your workflow also have linux/arm64 images in their manifests, but it at least provides native containers for local validation checks.
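
One way to check whether the other images in a workflow ship linux/arm64 variants is to look at the platform entries of their manifest lists. The sketch below assumes the JSON has the usual manifest-list shape (a top-level `manifests` array whose entries carry a `platform` object), as printed for multi-platform images by `docker manifest inspect <image>`. The function name and the sample data are hypothetical and are not taken from the real recast/recastatlas manifest.

```python
from typing import Any, Dict, Set


def platforms_in_manifest(manifest: Dict[str, Any]) -> Set[str]:
    """Collect "os/architecture" strings from a manifest-list dictionary."""
    found: Set[str] = set()
    for entry in manifest.get("manifests", []):
        plat = entry.get("platform", {})
        os_name = plat.get("os")
        arch = plat.get("architecture")
        # Skip incomplete entries and "unknown" placeholders (e.g. attestations).
        if os_name and arch and "unknown" not in (os_name, arch):
            found.add(f"{os_name}/{arch}")
    return found


if __name__ == "__main__":
    # Hypothetical sample data with the assumed manifest-list layout.
    sample = {
        "manifests": [
            {"platform": {"os": "linux", "architecture": "amd64"}},
            {"platform": {"os": "linux", "architecture": "arm64", "variant": "v8"}},
            {"platform": {"os": "unknown", "architecture": "unknown"}},
        ]
    }
    print("linux/arm64" in platforms_in_manifest(sample))
```

Running this check over every image a workflow references would show which steps can run natively on ARM and which still need emulation.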

Your minimal failing example in https://github.com/recast-hep/recast-atlas/issues/117#issuecomment-1777745063 should now pass with

$ export DOCKER_DEFAULT_PLATFORM=linux/amd64  # For this example linux/arm64 will work too now
$ recast run --backend docker <workflow name>

Can you give feedback on whether this helps for the time being, and whether there are additional UX changes that would help here?

Nollde commented 3 months ago

Hi Matthew, thank you very much! I can confirm that my example now works with the suggested changes. I think it is awesome that you can now use the --platform argument.

Just for completeness I attach an example of two executions, first on my native architecture, second emulated:

Native
```
(recast-tmp) ➜ recast_example_docker_x86 git:(main) ✗ export DOCKER_DEFAULT_PLATFORM=linux/arm64/v8
(recast-tmp) ➜ recast_example_docker_x86 git:(main) ✗ time recast run --backend docker examples/helloworld
2024-03-26 22:33:17,946 | packtivity.asyncback | INFO | configured pool size to 10
2024-03-26 22:33:18,020 | yadage.creators | INFO | initializing workflow with initdata: {'input_name': 'standard model'} discover: True relative: True
2024-03-26 22:33:18,021 | adage.pollingexec | INFO | preparing adage coroutine.
2024-03-26 22:33:18,021 | adage | INFO | starting state loop.
2024-03-26 22:33:18,067 | yadage.wflowview | INFO | added
2024-03-26 22:33:18,139 | yadage.wflowview | INFO | added
2024-03-26 22:33:18,194 | adage.pollingexec | INFO | submitting nodes []
2024-03-26 22:33:18,223 | pack.init.step | INFO | publishing data:
2024-03-26 22:33:18,223 | adage | INFO | unsubmittable: 0 | submitted: 0 | successful: 0 | failed: 0 | total: 2 | open rules: 0 | applied rules: 2
2024-03-26 22:33:18,303 | adage.node | INFO | node ready
2024-03-26 22:33:18,304 | adage.pollingexec | INFO | submitting nodes []
2024-03-26 22:33:18,307 | pack.hello_world_sta | INFO | starting file logging for topic: step
2024-03-26 22:33:19,680 | adage.node | INFO | node ready
2024-03-26 22:33:19,702 | adage.controllerutil | INFO | no nodes can be run anymore and no rules are applicable
2024-03-26 22:33:19,702 | adage.controllerutil | INFO | no nodes can be run anymore and no rules are applicable
2024-03-26 22:33:19,702 | adage | INFO | unsubmittable: 0 | submitted: 0 | successful: 2 | failed: 0 | total: 2 | open rules: 0 | applied rules: 2
2024-03-26 22:33:24,702 | adage | INFO | adage state loop done.
2024-03-26 22:33:24,702 | adage | INFO | execution valid. (in terms of execution order)
2024-03-26 22:33:24,702 | adage | INFO | workflow completed successfully.
2024-03-26 22:33:24,702 | yadage.steering_api | INFO | done. dumping workflow to disk.
2024-03-26 22:33:24,704 | yadage.steering_api | INFO | visualizing workflow.
2024-03-26 15:33:24,912 | recastatlas.subcomma | INFO | RECAST run finished.
RECAST result examples/helloworld recast-d73315e8:
--------------
- name: My Result
  value: Hello my Name is standard model
recast run --backend docker examples/helloworld 0.16s user 0.05s system 2% cpu 7.265 total
```
Emulated
```
(recast-tmp) ➜ recast_example_docker_x86 git:(main) ✗ export DOCKER_DEFAULT_PLATFORM=linux/amd64
(recast-tmp) ➜ recast_example_docker_x86 git:(main) ✗ time recast run --backend docker examples/helloworld
Unable to find image 'recast/recastatlas:v0.4.0' locally
v0.4.0: Pulling from recast/recastatlas
Digest: sha256:c9b90515779c10beb7db3780c8d01f21de6c908573dd91cddc9031efca22d9cf
Status: Downloaded newer image for recast/recastatlas:v0.4.0
2024-03-26 22:33:40,671 | packtivity.asyncback | INFO | configured pool size to 10
2024-03-26 22:33:40,715 | yadage.creators | INFO | initializing workflow with initdata: {'input_name': 'standard model'} discover: True relative: True
2024-03-26 22:33:40,716 | adage.pollingexec | INFO | preparing adage coroutine.
2024-03-26 22:33:40,716 | adage | INFO | starting state loop.
2024-03-26 22:33:40,770 | yadage.wflowview | INFO | added
2024-03-26 22:33:40,864 | yadage.wflowview | INFO | added
2024-03-26 22:33:40,936 | adage.pollingexec | INFO | submitting nodes []
2024-03-26 22:33:40,976 | pack.init.step | INFO | publishing data:
2024-03-26 22:33:40,976 | adage | INFO | unsubmittable: 0 | submitted: 0 | successful: 0 | failed: 0 | total: 2 | open rules: 0 | applied rules: 2
2024-03-26 22:33:41,078 | adage.node | INFO | node ready
2024-03-26 22:33:41,078 | adage.pollingexec | INFO | submitting nodes []
2024-03-26 22:33:41,083 | pack.hello_world_sta | INFO | starting file logging for topic: step
2024-03-26 22:33:42,533 | adage.node | INFO | node ready
2024-03-26 22:33:42,555 | adage.controllerutil | INFO | no nodes can be run anymore and no rules are applicable
2024-03-26 22:33:42,555 | adage.controllerutil | INFO | no nodes can be run anymore and no rules are applicable
2024-03-26 22:33:42,555 | adage | INFO | unsubmittable: 0 | submitted: 0 | successful: 2 | failed: 0 | total: 2 | open rules: 0 | applied rules: 2
2024-03-26 22:33:54,092 | adage | INFO | adage state loop done.
2024-03-26 22:33:54,092 | adage | INFO | execution valid. (in terms of execution order)
2024-03-26 22:33:54,092 | adage | INFO | workflow completed successfully.
2024-03-26 22:33:54,092 | yadage.steering_api | INFO | done. dumping workflow to disk.
2024-03-26 22:33:54,095 | yadage.steering_api | INFO | visualizing workflow.
2024-03-26 15:33:54,476 | recastatlas.subcomma | INFO | RECAST run finished.
RECAST result examples/helloworld recast-e40d5aee:
--------------
- name: My Result
  value: Hello my Name is standard model
recast run --backend docker examples/helloworld 0.19s user 0.13s system 2% cpu 15.537 total
```
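
For a quick comparison of the two runs, the wall-clock totals can be pulled from the final `time` summary lines and divided. This is only a sketch: `total_seconds` is a hypothetical helper that assumes the zsh-style summary format shown in the logs, and the emulated run above also pulled an image, so the ratio overstates the pure emulation overhead.

```python
import re

# Matches the trailing "<[[h:]m:]s.fraction> total" field of a zsh `time` summary.
_TOTAL_RE = re.compile(r"((?:\d+:)*\d+(?:\.\d+)?) total\s*$")


def total_seconds(summary_line: str) -> float:
    """Extract the wall-clock total, in seconds, from a zsh-style `time` line."""
    match = _TOTAL_RE.search(summary_line)
    if match is None:
        raise ValueError(f"no 'total' field found in: {summary_line!r}")
    seconds = 0.0
    # Fold "h:m:s" style fields into seconds, left to right.
    for part in match.group(1).split(":"):
        seconds = seconds * 60 + float(part)
    return seconds


if __name__ == "__main__":
    native = total_seconds(
        "recast run --backend docker examples/helloworld "
        "0.16s user 0.05s system 2% cpu 7.265 total"
    )
    emulated = total_seconds(
        "recast run --backend docker examples/helloworld "
        "0.19s user 0.13s system 2% cpu 15.537 total"
    )
    print(f"emulated / native = {emulated / native:.2f}")
```

For the two runs above this gives a factor of roughly 2.1 between the emulated and native executions of the hello-world example.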

It will probably take some time until whole production workflows can be tested this way (if one wants that at all, as it is potentially much slower), since, as you say, linux/arm64 images are also needed for all of them. However, I believe this is not something that can be addressed within recast-atlas.

I greatly appreciate your efforts with the mentioned PR! Since it fully addresses the immediate issue, I think we can consider it resolved and close it.

matthewfeickert commented 3 months ago

Thanks, @Nollde! This was a very helpful Issue, so thank you for taking the time to report it and to debug with me!