Closed UweSauter closed 2 years ago
Don't underestimate the amount of RAM you need to build rocblas :sweat_smile:
As mentioned, the server has 64GiB… and rocblas should also build on workstations equipped with less RAM.
I mean, if you insist on building in `/tmp`, can't you make it larger? Otherwise just change your stage directory in the config, or play with `build_jobs`, etc.
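Both of those knobs live in Spack's `config.yaml`; a sketch, where the stage path is only an example of a roomier filesystem:

```yaml
# ~/.spack/config.yaml (user scope); values here are illustrative
config:
  build_stage:
    - /var/tmp/spack-stage   # stage builds somewhere larger than /tmp
  build_jobs: 8              # cap the default compile parallelism
```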
I've been monitoring the usage of `/tmp` (which is on disk, not in RAM) and it never used more than 20% of the available space. Sure, it might be that I missed the point in time when the usage exploded, but on the other hand I cannot conclude from the logs that `/tmp` is the problem here.
Also, Spack seems to be cleaning up after the failure, so I cannot see in retrospect where it hit the limitation.
On failure it doesn't clean the stage dir; if for some reason it does, run `spack install --keep-stage ...`
Also, `Cannot allocate memory` implies problems allocating RAM, not problems writing to disk. And as I said, `/tmp` is on a disk.
Can you try `rocblas tensile_architecture=gfx908` or so? Also please consider opening an issue upstream (rocblas or Tensile).
Once the build host runs into the next issue with ROCm 4.3.1 I'll try to build `rocblas@4.2.0` with only one process (`-j1`), but I suspect that this won't help.
Would you mind telling me the whole Spack command? I'm not yet familiar with all the possible arguments.
I have 2 types of GPUs: `gfx900` and `gfx906`.
Ah, try `spack install rocblas tensile_architecture=gfx906`; it definitely limits the number of things being built. If this succeeds, it's worth opening an issue upstream asking about the RAM requirements and whether they can be brought down somehow.
See `spack info rocblas` for the accepted values of the variants.
@srekolam it seems like `tensile_architecture=gfx900,gfx906` does not work; should it accept multiple values?
So, `rocblas@4.3.1%gcc@7.5.0` shows the same `Cannot allocate memory` when Spack is called with `-j20`.
`spack install -j1 --keep-stage rocblas@4.2.0%gcc@10.3.0 tensile_architecture=gfx906` is currently running.
Hm, searching for `OSError: [Errno 12] Cannot allocate memory` I found this article:
[…]
The gist was that `os.fork()` duplicates the parent process, including its memory footprint. If that footprint exceeds the available system memory, the fork will fail when your memory overcommit setting is restricted.
[…]
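A minimal sketch of the mechanism the article describes (function name is mine; this only demonstrates the failure mode, it is not Tensile's code):

```python
import os

def try_fork():
    """Fork a child. Under strict overcommit (vm.overcommit_memory = 2)
    the fork itself can raise OSError [Errno 12], because the kernel must
    be able to commit a full copy of the parent's virtual address space
    for the child, even though copy-on-write means the child would touch
    almost none of those pages."""
    try:
        pid = os.fork()
    except OSError as err:
        return f"fork failed: {err}"   # errno 12 == ENOMEM
    if pid == 0:
        os._exit(0)        # child: exit immediately, touching no pages
    os.waitpid(pid, 0)     # parent: reap the child
    return "fork succeeded"
```

On a machine with enough headroom (or permissive overcommit) the call simply succeeds; the error path only triggers when the parent's footprint no longer fits under the commit limit.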
My overcommit settings are:
Again, it might be worth filing this issue upstream, since it has to do with a Python script written by AMD: https://github.com/ROCmSoftwarePlatform/Tensile
AMD is already aware of this issue but the devs I have contact with are based in California.
After calling `spack clean -a`, the run of `spack install -j1 --keep-stage rocblas@4.2.0%gcc@10.3.0 tensile_architecture=gfx906` succeeded.
When building in combination with the other packages of the Spack environment (`spack install -j1`), it fails again.
When changing `self.define('BUILD_WITH_TENSILE', 'ON')` to `'OFF'` in the `package.py`, the build succeeds inside the Spack environment.
`rocblas@4.3.1%gcc@7.5.0` also builds successfully when `BUILD_WITH_TENSILE` is set to `'OFF'`.
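For context, `self.define` in a Spack `package.py` just renders a CMake `-D` flag. A simplified stand-in (not Spack's actual helper, whose type handling is richer) shows what the workaround above hands to CMake:

```python
def define(cmake_var, value):
    # Simplified model of Spack's CMakePackage.define helper:
    # Python booleans become ON/OFF, everything else passes through.
    if isinstance(value, bool):
        value = "ON" if value else "OFF"
    return f"-D{cmake_var}={value}"

# The workaround from this thread, as the flag CMake ends up seeing:
print(define("BUILD_WITH_TENSILE", "OFF"))  # -DBUILD_WITH_TENSILE=OFF
```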
Ensure you don't have the `HIPCC_COMPILE_FLAGS_APPEND` environment variable set to include `-parallel-jobs=X` with X > 1; set it to one. It should default to one if the flag isn't there (this enables parallel compilation of a single unit, but it forks memory). Otherwise, rocblas with Tensile will require the full 64GB of RAM for building by itself, so can you set your Spack build to build only rocblas on its own and resume parallel package building after rocblas is done?
If you are parallel building, I would drop the ratio `vm.overcommit_ratio = 5` (you don't really want to swap) and remove `vm.overcommit_kbytes`, since you are using the ratio, to see if it can still build rocblas by itself. You could also test building all packages without parallelism; you may even build faster, because if you were building rocblas alongside others you were swapping memory even when you succeeded. Anyway, `cat` your `/proc/meminfo` to ensure those values look as expected, and check that `ulimit -a` doesn't show a memory limit.
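A quick way to pull out the figures mentioned above (Linux-only sketch; function name is mine, and `/proc/meminfo` reports most values in kB):

```python
def read_meminfo(path="/proc/meminfo"):
    """Parse /proc/meminfo into a {field: integer value} dict.
    Most fields are in kB; a few (e.g. HugePages_Total) are counts."""
    info = {}
    with open(path) as f:
        for line in f:
            key, _, rest = line.partition(":")
            info[key.strip()] = int(rest.split()[0])
    return info

# Fields worth eyeballing before a big build:
# MemTotal, MemAvailable, SwapTotal, CommitLimit, Committed_AS
```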
@TorreZuk I'm not actively setting `HIPCC_COMPILE_FLAGS_APPEND`, so I don't know where this would come from or how to check whether it is used while Spack tries to compile.
Also, Spack is not building packages in parallel. Yes, with `-j20` it tells the compiler to build in parallel, but Spack itself builds packages sequentially (as far as I can tell from watching `top` during builds).
And even with `-j1` the build fails (both when running `spack install` inside an environment and when running `spack install rocblas@4.2.0%gcc@10.3.0` without an environment).
Only when I modified rocblas' `package.py` to `self.define('BUILD_WITH_TENSILE', 'OFF')` was I able to build rocblas.
I don't see a way to set the global variable for Tensile's CpuCount, as we don't pass `--jobs` to the Tensile build step, so it takes the full affinity core count. The Python fork clones the memory space per available processor but wouldn't touch most of the memory pages, so I would try to allow overcommit: `vm.overcommit_memory = 0`, `vm.overcommit_ratio = 97`, as 64GB should squeak by. The `HIPCC_COMPILE_FLAGS_APPEND` environment variable can be set to control the compilation, but if you don't set it, the default is empty.
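For reference, the ratio only matters under strict accounting (`vm.overcommit_memory = 2`), where the kernel caps committed memory at swap plus that percentage of RAM (ignoring hugepages); with mode 0 the kernel uses a heuristic instead. The arithmetic for the strict case (helper name is mine):

```python
def commit_limit_kib(mem_total_kib, swap_total_kib, ratio_percent):
    """CommitLimit under vm.overcommit_memory=2, ignoring
    vm.overcommit_kbytes (which, if set, overrides the ratio)."""
    return swap_total_kib + mem_total_kib * ratio_percent // 100

GIB = 1024 * 1024  # KiB per GiB
# 64 GiB of RAM, no swap, ratio 97: just over 62 GiB may be committed
print(commit_limit_kib(64 * GIB, 0, 97) / GIB)
```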
So I uninstalled `rocblas@4.2.0%gcc@10.3.0` and dependent packages, modified `vm.overcommit_memory` to `0`, and set `BUILD_WITH_TENSILE` back to `ON`. With this setting the installation of this combination succeeded.
Good to hear. We will look at hooks for better control of CPU load and memory, and at generally reducing peak memory usage, in future releases, but that likely won't help until around the version ~5 timeframe, so hopefully the build wasn't too slow with a little bit of swap.
What amount of time would you expect the build to take? The host I'm using is a KVM VM with 20 cores (the host has 2x Intel Gold 6138 @ 2.0 GHz) and 64GB of memory…
I was watching memory and CPU usage with `htop`, and as far as I can tell real memory usage never went above 8GB. But CPU usage was also very low; many of the packages (especially the Python-based ones) use only one core most of the time.
Well, if you build rocblas for all GPU architectures (the default) it can take roughly 2 to 5 hours, but I am not familiar with your CPU or disk. A new AMD EPYC CPU would provide much more parallelism and so could build much faster. If you build rocblas for a single GPU architecture with the `-a` flag, then around 1 hour or less.
The peak memory usage may spike for short periods, but as you found, I expect not all of the cloned virtual allocations will be touched. I can't comment on the other packages you are building, as I don't build them. From what you report, it looks like they don't do parallel Python.
@TorreZuk, thanks for your input on this issue. @UweSauter, good to hear that you got rocblas installed.
Steps to reproduce the issue

@arjun-raj-kuppala @haampie @srekolam

Trying to install ROCm 4.2.0 inside a Spack environment on a 24-core server with 64GiB RAM and a 16 GiB `/tmp`.

`spack.yaml`

results in

Information on your system

spack-build-out.txt
spack-build-env.txt

Additional information

No response

General information

- [x] I have run `spack debug report` and reported the version of Spack/Python/Platform
- [x] I have run `spack maintainers <name-of-the-package>` and @mentioned any maintainers