helmholtz-analytics / heat

Distributed tensors and Machine Learning framework with GPU and MPI acceleration in Python
https://heat.readthedocs.io/
MIT License

Extend CI to Arm64-CPU #1249

Closed mrfh92 closed 6 months ago

mrfh92 commented 1 year ago

Helmholtz Codebase offers a runner with a 12-core Cavium ThunderX 88XX CPU (AArch64 architecture). Although there is no GPU available on this runner, it might make sense to run at least the CPU tests on it as well, in particular because the Arm architecture is currently on the rise in European HPC.

github-actions[bot] commented 10 months ago

This issue is stale because it has been open for 60 days with no activity.

mrfh92 commented 6 months ago

I have set up a partially working CI on Codebase (branch: arm), using either of the following container images:

image: armswdev/pytorch-arm-neoverse:r24.02-torch-2.2.0-rc8-openblas
image: armswdev/pytorch-arm-neoverse:r24.02-torch-2.2.0-rc8-onednn-acl

Current results (using the second container):

on 1 process:

FAILED heat/core/linalg/tests/test_solver.py::TestSolver::test_lanczos - AssertionError: False is not true
FAILED heat/sparse/tests/test_arithmetics.py::TestArithmetics::test_add - RuntimeError: Calling add on a sparse CPU tensor requires compiling PyTorch with MKL. Please use PyTorch built MKL support.
====== 2 failed, 450 passed, 6 skipped, 11 warnings in 381.38s (0:06:21) =======

on 2 processes:

=========================== short test summary info ============================
FAILED heat/core/linalg/tests/test_solver.py::TestSolver::test_lanczos - AssertionError: False is not true
FAILED heat/sparse/tests/test_arithmetics.py::TestArithmetics::test_add - RuntimeError: Calling add on a sparse CPU tensor requires compiling PyTorch with MKL. Please use PyTorch built MKL support.
====== 2 failed, 453 passed, 6 skipped, 15 warnings in 181.75s (0:03:01) =======

On 8 processes we get a very strange error; see the job log: https://codebase.helmholtz.cloud/helmholtz-analytics/ci/-/jobs/1582022
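The `test_add` failure reported above is raised by PyTorch itself, which refuses sparse CPU addition in a build without MKL. One possible way to keep such a run readable is to skip that case when MKL is unavailable. The following is only a minimal sketch under assumptions: `torch_has_mkl` and `TestSparseAddSketch` are hypothetical names, the test body is a stand-in rather than heat's actual `test_add`, and the only PyTorch calls used are `torch.backends.mkl.is_available()` and basic sparse CSR conversion.

```python
import importlib
import platform
import unittest


def torch_has_mkl():
    """Return True only if PyTorch can be imported and reports MKL support."""
    try:
        torch = importlib.import_module("torch")
    except ImportError:
        # No PyTorch at all: treat this the same as "no MKL".
        return False
    return bool(torch.backends.mkl.is_available())


# Reason shown in the test report, including the CPU architecture of the runner.
SKIP_REASON = (
    "sparse CPU add requires PyTorch built with MKL "
    f"(machine: {platform.machine()})"
)


class TestSparseAddSketch(unittest.TestCase):
    # Hypothetical stand-in for the failing sparse test, not heat's own code.
    @unittest.skipUnless(torch_has_mkl(), SKIP_REASON)
    def test_add(self):
        import torch

        a = torch.eye(3).to_sparse_csr()
        result = (a + a).to_dense()
        self.assertTrue(torch.equal(result, 2 * torch.eye(3)))


if __name__ == "__main__":
    unittest.main()
```

With a guard like this, the sparse case would show up as skipped instead of failed on the Arm containers, which would leave `test_lanczos` as the only failure still to be investigated. Whether skipping is preferable to finding an MKL-free code path is a separate design decision.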

github-actions[bot] commented 6 months ago

Branch features/1249-Extend_CI_to_Arm64-CPU created!

mrfh92 commented 6 months ago

Closed: this is not of highest priority, and it is not really possible to determine whether the errors above are due to the outdated hardware (the CPU dates from about 2014).