Closed UweSauter closed 2 years ago
Don't underestimate the amount of RAM you need to build rocblas :sweat_smile:
As mentioned, the server has 64GiB… and rocblas should also build on workstations equipped with less RAM.
I mean, if you insist on building in `/tmp`, can't you make it larger? Otherwise just change your stage directory in the config, or play with `build_jobs`, etc.
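Both of those knobs live in Spack's `config.yaml`; a sketch, where the stage path is only an example of a roomier filesystem:

```yaml
# ~/.spack/config.yaml (user scope); values here are illustrative
config:
  build_stage:
    - /var/tmp/spack-stage   # stage builds somewhere larger than /tmp
  build_jobs: 8              # cap the default compile parallelism
```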
I've been monitoring the usage of `/tmp` (which is on disk, not in RAM) and it never used more than 20% of the available space. Sure, it might be that I missed the point in time when the usage exploded, but on the other hand I cannot conclude from the logs that `/tmp` is the problem here.
Also, Spack seems to be cleaning up after the failure, so I cannot see in retrospect where it hit the limitation.
On failure it doesn't clean the stage dir; if for some reason it does, run `spack install --keep-stage ...`
Also, `Cannot allocate memory` implies problems allocating RAM, not problems writing to disk. And as I said, `/tmp` is on a disk.
Can you try `rocblas tensile_architecture=gfx908` or so? Also please consider opening an issue upstream (rocblas or Tensile).
Once the build host runs into the next issue with ROCm 4.3.1 I'll try to build `rocblas@4.2.0` with only one process (`-j1`), but I suspect that this won't help.
Would you mind telling me the whole Spack command? I'm not yet familiar with all the possible arguments.
I have 2 types of GPUs: `gfx900` and `gfx906`.
Ah, try `spack install rocblas tensile_architecture=gfx906`; it definitely limits the number of things being built. If this succeeds, it's worth opening an issue upstream asking about the RAM requirements and whether they can be brought down somehow.
See `spack info rocblas` for the accepted values of the variants.
@srekolam it seems like `tensile_architecture=gfx900,gfx906` does not work; should it accept multiple values?
So, `rocblas@4.3.1%gcc@7.5.0` shows the same `Cannot allocate memory` when Spack is called with `-j20`.
`spack install -j1 --keep-stage rocblas@4.2.0%gcc@10.3.0 tensile_architecture=gfx906` is currently running.
Hm, searching for `OSError: [Errno 12] Cannot allocate memory` I found this article:
[…]
The gist was that `os.fork()` duplicates the parent process, including its memory footprint. If that footprint exceeds the available system memory, the fork will fail when your memory overcommit setting is restricted.
[…]
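A minimal sketch of the mechanism the article describes (function name is mine; this only demonstrates the failure mode, it is not Tensile's code):

```python
import os

def try_fork():
    """Fork a child. Under strict overcommit (vm.overcommit_memory = 2)
    the fork itself can raise OSError [Errno 12], because the kernel must
    be able to commit a full copy of the parent's virtual address space
    for the child, even though copy-on-write means the child would touch
    almost none of those pages."""
    try:
        pid = os.fork()
    except OSError as err:
        return f"fork failed: {err}"   # errno 12 == ENOMEM
    if pid == 0:
        os._exit(0)        # child: exit immediately, touching no pages
    os.waitpid(pid, 0)     # parent: reap the child
    return "fork succeeded"
```

On a machine with enough headroom (or permissive overcommit) the call simply succeeds; the error path only triggers when the parent's footprint no longer fits under the commit limit.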
My overcommit settings are:
Again, it might be worth filing this issue upstream, since it has to do with a Python script written by AMD: https://github.com/ROCmSoftwarePlatform/Tensile
AMD is already aware of this issue but the devs I have contact with are based in California.
After calling `spack clean -a`, the run of `spack install -j1 --keep-stage rocblas@4.2.0%gcc@10.3.0 tensile_architecture=gfx906` succeeded.
When building in combination with the other packages of the Spack environment (`spack install -j1`), it fails again.
When changing `self.define('BUILD_WITH_TENSILE', 'ON')` to `'OFF'` in the `package.py`, the build succeeds inside the Spack environment.
`rocblas@4.3.1%gcc@7.5.0` also builds successfully when `BUILD_WITH_TENSILE` is set to `'OFF'`.
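For context, `self.define` in a Spack `package.py` just renders a CMake `-D` flag. A simplified stand-in (not Spack's actual helper, whose type handling is richer) shows what the workaround above hands to CMake:

```python
def define(cmake_var, value):
    # Simplified model of Spack's CMakePackage.define helper:
    # Python booleans become ON/OFF, everything else passes through.
    if isinstance(value, bool):
        value = "ON" if value else "OFF"
    return f"-D{cmake_var}={value}"

# The workaround from this thread, as the flag CMake ends up seeing:
print(define("BUILD_WITH_TENSILE", "OFF"))  # -DBUILD_WITH_TENSILE=OFF
```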
Ensure you don't have the `HIPCC_COMPILE_FLAGS_APPEND` environment variable set to include `-parallel-jobs=X` with X > 1; set it to one. It should default to one if the flag isn't there (this enables parallel compilation of a single unit, but it forks memory). Otherwise, rocblas with Tensile will require the full 64GB of RAM for building by itself, so can you set your Spack build to build only rocblas on its own and resume parallel package building after rocblas is done?
If you are parallel building, I would drop the ratio `vm.overcommit_ratio = 5` (you don't really want to swap) and remove `vm.overcommit_kbytes`, since you are using the ratio, to see if it can still build rocblas by itself. You could also test building all packages without parallelism; you may even build faster, because if you were building rocblas alongside others you were swapping memory even when you succeeded. Anyway, `cat` your `/proc/meminfo` to ensure those values look as expected, and check that `ulimit -a` doesn't show a memory limit.
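A quick way to pull out the figures mentioned above (Linux-only sketch; function name is mine, and `/proc/meminfo` reports most values in kB):

```python
def read_meminfo(path="/proc/meminfo"):
    """Parse /proc/meminfo into a {field: integer value} dict.
    Most fields are in kB; a few (e.g. HugePages_Total) are counts."""
    info = {}
    with open(path) as f:
        for line in f:
            key, _, rest = line.partition(":")
            info[key.strip()] = int(rest.split()[0])
    return info

# Fields worth eyeballing before a big build:
# MemTotal, MemAvailable, SwapTotal, CommitLimit, Committed_AS
```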
@TorreZuk I'm not actively setting `HIPCC_COMPILE_FLAGS_APPEND`, so I don't know where this would come from or how to check whether it is used while Spack tries to compile.
Also, Spack is not building packages in parallel. Yes, with `-j20` it tells the compiler to build in parallel, but Spack itself builds packages sequentially (as far as I can tell from watching `top` during builds).
And even with `-j1` the build fails (both when running `spack install` inside an environment and when running `spack install rocblas@4.2.0%gcc@10.3.0` without an environment).
Only when I modified rocblas' `package.py` to `self.define('BUILD_WITH_TENSILE', 'OFF')` was I able to build rocblas.
I don't see a way to set the global variable for Tensile's CpuCount, as we don't pass `--jobs` to the Tensile build step, so it takes the full affinity core count. The Python fork clones the memory space per available processor but wouldn't touch most of the memory pages, so I would try to allow overcommit: `vm.overcommit_memory = 0`, `vm.overcommit_ratio = 97`, as 64GB should squeak by. The `HIPCC_COMPILE_FLAGS_APPEND` environment variable can be set to control the compilation, but if you don't set it, the default is empty.
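For reference, the ratio only matters under strict accounting (`vm.overcommit_memory = 2`), where the kernel caps committed memory at swap plus that percentage of RAM (ignoring hugepages); with mode 0 the kernel uses a heuristic instead. The arithmetic for the strict case (helper name is mine):

```python
def commit_limit_kib(mem_total_kib, swap_total_kib, ratio_percent):
    """CommitLimit under vm.overcommit_memory=2, ignoring
    vm.overcommit_kbytes (which, if set, overrides the ratio)."""
    return swap_total_kib + mem_total_kib * ratio_percent // 100

GIB = 1024 * 1024  # KiB per GiB
# 64 GiB of RAM, no swap, ratio 97: just over 62 GiB may be committed
print(commit_limit_kib(64 * GIB, 0, 97) / GIB)
```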
So I uninstalled `rocblas@4.2.0%gcc@10.3.0` and dependent packages, modified `vm.overcommit_memory` to `0`, and set `BUILD_WITH_TENSILE` back to `ON`. With this setting the installation of this combination succeeded.
Good to hear. We will look at hooks for better control of CPU load and memory, and at generally reducing peak memory usage, in future releases, but that likely won't help until around the version ~5 timeframe, so hopefully the build wasn't too slow with a little bit of swap.
What amount of time would you expect the build to take? The host I'm using is a KVM VM with 20 cores (the host has 2x Intel Gold 6138 @ 2.0 GHz) and 64GB of memory…
I was watching memory and CPU usage with `htop`, and as far as I can tell real memory usage never went above 8GB. But CPU usage was also very low; many of the packages (especially the Python-based ones) use only one core most of the time.
Well, if you build rocblas for all GPU architectures (the default) it can take roughly 2 to 5 hours, but I am not familiar with your CPU or disk. A new AMD EPYC CPU would provide much more parallelism and so could build much faster. If you build rocblas for a single GPU architecture with the `-a` flag, then around 1 hour or less.
The peak memory usage may spike for short periods, but as you found, I expect not all of the cloned virtual allocations will be touched. I can't comment on the other packages you are building, as I don't build them. From what you report, it looks like they don't do parallel Python.
@TorreZuk, thanks for your input on this issue. @UweSauter, good to hear that you got rocblas installed.
Steps to reproduce the issue

@arjun-raj-kuppala @haampie @srekolam

Trying to install ROCm 4.2.0 inside a Spack environment on a 24-core server with 64GiB RAM and a 16 GiB `/tmp`.

`spack.yaml`

results in

Information on your system

spack-build-out.txt
spack-build-env.txt

Additional information

No response

General information

- [x] I have run `spack debug report` and reported the version of Spack/Python/Platform
- [x] I have run `spack maintainers <name-of-the-package>` and @mentioned any maintainers