spack / spack

A flexible package manager that supports multiple versions, configurations, platforms, and compilers.
https://spack.io

provider conditionals with "when: '%oneapi'" break spack - not all root specs are concretized, and no error messages are shown. #43350

Open pbisbal1 opened 8 months ago

pbisbal1 commented 8 months ago

Steps to reproduce

I'm trying to have spack use intel-oneapi-mkl as the provider for BLAS and LAPACK when the compiler is oneapi, and amdblis and amdlibflame when using aocc or gcc. With @becker33's help I arrived at the following solution (only the BLAS section is shown; the LAPACK section uses the same arrangement):

  packages:
    blas:
      require:
      - spec: 'amdblis'
        when: '%aocc'
      - spec: 'amdblis'
        when: '%gcc'
      - spec: 'intel-oneapi-mkl'
        when: '%oneapi'

When I do spack concretize -f, the concretizer appears to complete w/o error, but when I look at the packages concretized, a number of them are missing. If I run spack install, the installation doesn't complete, but no errors are shown. I just notice that if I expected, say, 100 packages to be installed, the last package installed will say something like [77/100], indicating that spack stopped before all 100 packages were installed, with no obvious error message. If I comment out the lines above pertaining to intel-oneapi like this:

  packages:
    blas:
      require:
      - spec: 'amdblis'
        when: '%aocc'
      - spec: 'amdblis'
        when: '%gcc'
#     - spec: 'intel-oneapi-mkl'
#       when: '%oneapi'

The concretize/install process works as expected, but the packages compiled with %oneapi aren't using the desired BLAS provider (openblas is used instead, most likely because that's the default provider in etc/spack/defaults/packages.yaml). Here is the minimal spack.yaml I'm using to reproduce the problem:

spack:

  definitions:
  - serial_packages:
    - amdblis
    - amdlibm
  - mpi_packages:
    - hpl
    - intel-oneapi-mkl

  specs:
  - matrix:
    - ["$mpi_packages"]
    - ["%aocc@4.1.0"]
    - ["^openmpi%aocc@4.1.0"]
    exclude:
    - intel-oneapi-mkl
  - matrix:
    - ["$mpi_packages"]
    - ["%gcc@13.1.0"]
    - ["^openmpi%gcc@13.1.0"]
    exclude:
    - intel-oneapi-mkl
  - matrix:
    - ["$mpi_packages"]
    - ["%oneapi@2023.2.0"]
    - ["^openmpi%oneapi@2023.2.0"]
  view: false

  concretizer:
    unify: when_possible
    reuse: dependencies

  packages:
    blas:
      require:
      - spec: 'amdblis'
        when: '%aocc'
      - spec: 'amdblis'
        when: '%gcc'
      - spec: 'intel-oneapi-mkl'
        when: '%oneapi'
    hwloc:
      require:
      - ~netloc
      - ~rocm
    hpl: 
      require: 
      - '@2.3'
    mpi:
      require: openmpi
    openmpi:
      require:
      - '@4.1.6'
      - fabrics=hcoll,ucx
      - ~internal-hwloc
      - ~internal-pmix
      - ~rsh
      - schedulers=slurm

  modules:
    default:
      enable:
      - lmod
      roots:
        lmod: modules
      lmod:
        hierarchy:
        - mpi
        - lapack
        hash_length: 0
        include:
        - gcc
        - aocc
        - intel-oneapi
        exclude:
        - '%gcc@11.3.1'
        all:
          environment:
            set:
              '{name}_ROOT': '{prefix}'
        projections:
          all: '{name}/{version}'
        core_compilers:
        - gcc@=11.3.1
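
To tell which root specs went missing after concretization, it helps to first list what the three matrices in this spack.yaml should expand to. The following is a minimal sketch, not Spack's own implementation: `expand_matrix` is a hypothetical helper, it assumes a matrix is a plain cross product of its rows, and it assumes an `exclude` entry drops every combination whose package name equals it (Spack's real exclude matching works on whole specs and is more general).

```python
from itertools import product


def expand_matrix(rows, exclude=()):
    """Cross product of matrix rows, joined into spec strings.

    Assumption: an exclude entry removes every combination whose first
    component (the package name) equals it. This is a simplification of
    how Spack evaluates excludes.
    """
    specs = []
    for combo in product(*rows):
        if combo[0] in exclude:
            continue
        specs.append(" ".join(combo))
    return specs


# Values copied from the spack.yaml above.
mpi_packages = ["hpl", "intel-oneapi-mkl"]
matrices = [
    ("aocc@4.1.0", ["intel-oneapi-mkl"]),
    ("gcc@13.1.0", ["intel-oneapi-mkl"]),
    ("oneapi@2023.2.0", []),
]

expected_roots = []
for compiler, exclude in matrices:
    rows = [mpi_packages, ["%" + compiler], ["^openmpi%" + compiler]]
    expected_roots += expand_matrix(rows, exclude)

for spec in expected_roots:
    print(spec)
```

Under those assumptions the environment has four roots: hpl for each of the three compilers, plus intel-oneapi-mkl built with %oneapi. Any of these absent from the concretizer output is one of the silently dropped specs.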

Error message

Without spack -d, there are no obvious error messages from the concretizer. The only way to detect the failure is to look at the concretizer output and check which packages were concretized. When I run the concretizer with -d I see these messages:

==> [2024-03-25-11:44:45.542524] UNKNOWN SYMBOL: attr("virtual_on_incoming_edges", NodeArgument(id='0', pkg='libiconv'), iconv)
==> [2024-03-25-11:44:45.542797] UNKNOWN SYMBOL: attr("external_conditions_hold", NodeArgument(id='0', pkg='ucx'), 0)
==> [2024-03-25-11:44:45.542864] UNKNOWN SYMBOL: attr("external_conditions_hold", NodeArgument(id='0', pkg='slurm'), 0)
==> [2024-03-25-11:44:45.542899] UNKNOWN SYMBOL: attr("virtual_on_incoming_edges", NodeArgument(id='0', pkg='zlib-ng'), zlib-api)
==> [2024-03-25-11:44:45.543262] UNKNOWN SYMBOL: attr("external_conditions_hold", NodeArgument(id='0', pkg='openssl'), 0)
==> [2024-03-25-11:44:45.543322] UNKNOWN SYMBOL: attr("virtual_on_incoming_edges", NodeArgument(id='0', pkg='intel-tbb'), tbb)
==> [2024-03-25-11:44:45.543889] UNKNOWN SYMBOL: attr("virtual_on_incoming_edges", NodeArgument(id='0', pkg='openmpi'), mpi)
==> [2024-03-25-11:44:45.543979] UNKNOWN SYMBOL: attr("external_conditions_hold", NodeArgument(id='0', pkg='hcoll'), 0)

But I see similar messages even when I comment out the %oneapi lines, so I don't think they're related to this problem. I've attached the output of running spack -d concretize -f in both cases for you to look at: concretizer_debug_output_w_oneapi.txt concretizer_debug_output_wo_oneapi.txt
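
One way to confirm that the UNKNOWN SYMBOL messages are the same in both runs is to compare the two debug logs with the timestamps ignored. The sketch below is only an illustration: `unknown_symbols` and `only_in_first` are hypothetical helper names, and the two sample strings are a made-up split of lines quoted above (with an invented second timestamp), not the contents of the attached files.

```python
def unknown_symbols(log_text):
    """Return the set of UNKNOWN SYMBOL payloads, dropping the timestamp prefix."""
    found = set()
    for line in log_text.splitlines():
        if "UNKNOWN SYMBOL:" in line:
            # Everything before the marker is "==> [timestamp]", which differs
            # between runs and is irrelevant for the comparison.
            found.add(line.split("UNKNOWN SYMBOL:", 1)[1].strip())
    return found


def only_in_first(first_log, second_log):
    """Messages present in the first log but absent from the second."""
    return sorted(unknown_symbols(first_log) - unknown_symbols(second_log))


# Illustrative samples only; real input would be the two attached files, e.g.
# open("concretizer_debug_output_w_oneapi.txt").read()
sample_a = """\
==> [2024-03-25-11:44:45.542524] UNKNOWN SYMBOL: attr("virtual_on_incoming_edges", NodeArgument(id='0', pkg='libiconv'), iconv)
==> [2024-03-25-11:44:45.542797] UNKNOWN SYMBOL: attr("external_conditions_hold", NodeArgument(id='0', pkg='ucx'), 0)
"""
sample_b = """\
==> [2024-03-25-11:50:00.000000] UNKNOWN SYMBOL: attr("virtual_on_incoming_edges", NodeArgument(id='0', pkg='libiconv'), iconv)
"""

for message in only_in_first(sample_a, sample_b):
    print(message)
```

If the list is empty in both directions for the real logs, the messages are common to both runs and unlikely to explain the missing roots.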

Information on your system

General information

pbisbal1 commented 8 months ago

As a workaround, I've done the following, which seems to work. In etc/spack/packages.yaml I added these lines to make intel-oneapi-mkl the default for blas and lapack:

packages:
  all:
    providers:
      blas: [intel-oneapi-mkl, amdblis]
      lapack: [intel-oneapi-mkl, amdlibflame]

This works in my test environment (the spack.yaml shown above). I haven't tested in my production environment yet. In addition to providing a usable workaround, this also seems to confirm that the problem is in using 'intel-oneapi-mkl' in a when: statement.

Since this is an AMD-based cluster, I'd prefer being able to make amdblis/amdlibflame the defaults and make intel-oneapi-mkl the exception when compiling with oneapi.

Prentice

pbisbal1 commented 8 months ago

Correction to that workaround... The concretizer seems to concretize everything when using that workaround, but things are NOT being concretized as desired. hpl%oneapi is being concretized with amdblis as the blas provider instead of intel-oneapi-mkl.
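
A check like this can be scripted against the environment's spack.lock. The sketch below rests on an assumption about the lockfile layout that should be verified against the Spack version in use: a top-level `roots` list of `{hash, spec}` entries, a `concrete_specs` mapping keyed by hash, and per-dependency `parameters.virtuals` lists. `virtual_providers` is a hypothetical helper, it only inspects direct dependencies of each root, and `sample_lock` is a hand-written stand-in that mirrors the outcome described above, not real concretizer output.

```python
def virtual_providers(lock, virtual):
    """Map each root spec string to the direct dependency providing `virtual`.

    Assumes the lockfile layout described in the text; the value is None
    when no direct dependency is recorded as providing the virtual.
    """
    specs = lock["concrete_specs"]
    result = {}
    for root in lock["roots"]:
        node = specs[root["hash"]]
        provider = None
        for dep in node.get("dependencies", []):
            if virtual in dep.get("parameters", {}).get("virtuals", []):
                provider = dep["name"]
        result[root["spec"]] = provider
    return result


# Hand-written stand-in; with a real environment one would instead use
# json.load(open("spack.lock")) after importing json.
sample_lock = {
    "roots": [{"hash": "aaa", "spec": "hpl %oneapi@2023.2.0"}],
    "concrete_specs": {
        "aaa": {
            "name": "hpl",
            "dependencies": [
                {"name": "amdblis", "hash": "bbb",
                 "parameters": {"deptypes": ["build", "link"], "virtuals": ["blas"]}},
                {"name": "openmpi", "hash": "ccc",
                 "parameters": {"deptypes": ["build", "link"], "virtuals": ["mpi"]}},
            ],
        },
    },
}

for spec, provider in virtual_providers(sample_lock, "blas").items():
    print(spec, "->", provider)
```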

scheibelp commented 7 months ago

> If I run spack install, the installation doesn't complete, but no errors are shown.

If you update to https://github.com/spack/spack/commit/e78484f501178cb71be363a78762c237240816aa, the errors will no longer be silent.

(that commit won't actually make it so the concretization succeeds, but it will prevent the failure from being silent)

https://github.com/spack/spack/issues/43475 (once a PR is created for it) should make use cases like this easier.