vscentrum / vsc-software-stack

Central repository of easyconfigs used in the software installations on VSC clusters.

spektral (+ horovod) #390

Open laraPPr opened 2 months ago

laraPPr commented 2 months ago
boegel commented 2 months ago

I'm looking into this myself, seems to be pretty trivial

boegel commented 2 months ago

I'm also looking into updating Horovod for TensorFlow 2.15.1 with foss/2023a, but it's being a PITA:

  /apps/gent/RHEL8/cascadelake-ampere-ib/software/TensorFlow/2.15.1-foss-2023a-CUDA-12.1.1/lib/python3.11/site-packages/tensorflow/include/tensorflow/compiler/tf2xla/host_compute_metadata.pb.h:17:2: error: #error This file was generated by an older version of protoc which is
     17 | #error This file was generated by an older version of protoc which is
        |  ^~~~~
  /apps/gent/RHEL8/cascadelake-ampere-ib/software/TensorFlow/2.15.1-foss-2023a-CUDA-12.1.1/lib/python3.11/site-packages/tensorflow/include/tensorflow/compiler/tf2xla/host_compute_metadata.pb.h:18:2: error: #error incompatible with your Protocol Buffer headers. Please
     18 | #error incompatible with your Protocol Buffer headers. Please
        |  ^~~~~
  /apps/gent/RHEL8/cascadelake-ampere-ib/software/TensorFlow/2.15.1-foss-2023a-CUDA-12.1.1/lib/python3.11/site-packages/tensorflow/include/tensorflow/compiler/tf2xla/host_compute_metadata.pb.h:19:2: error: #error regenerate this file with a newer version of protoc.
     19 | #error regenerate this file with a newer version of protoc.
        |  ^~~~~
boegel commented 2 months ago

PR for spektral with fosscuda/2020b (since getting Horovod working with foss/2023a is proving to be difficult):

Flamefire commented 2 months ago

I'm also looking into updating Horovod for TensorFlow 2.15.1 with foss/2023a, but it's being a PITA:

  /apps/gent/RHEL8/cascadelake-ampere-ib/software/TensorFlow/2.15.1-foss-2023a-CUDA-12.1.1/lib/python3.11/site-packages/tensorflow/include/tensorflow/compiler/tf2xla/host_compute_metadata.pb.h:17:2: error
     17 | #error This file was generated by an older version of protoc which is
     18 | #error incompatible with your Protocol Buffer headers. Please
     19 | #error regenerate this file with a newer version of protoc.

This header is generated by the protobuf compiler (protoc) during the TF build using this rule: https://github.com/tensorflow/tensorflow/blob/c16161a1cb6ecdef55bf8fc4a2074a5aa8bd4ed0/tensorflow/compiler/tf2xla/BUILD#L106-L114

Protobuf (like too many other dependencies) is downloaded during the build of TF but not installed: only the generated files are required. Possibly there is also a runtime library, but I'm not sure.

According to this they use protobuf 3.21.9

Hence I guess the solution to the error is to use the same protobuf version as a build dependency in the EC that causes the above error (which one was that?)

It might also be worth looking into using our protobuf as a "SYSTEM_LIB" during the TF build but I expect some patch to be required. At least we can update the issue if we still run into "File already exists in database" and hope they finally answer it.

boegel commented 2 months ago

Keeping this open for now, would like to look into the Horovod issue again...

boegel commented 2 months ago

I looked into trying with protobuf 3.21.9 as build dependency for Horovod, but didn't get very far since something already depends on a different version of protobuf, leading to:

A different version of the 'protobuf' module is already loaded (see output of 'ml').
You should load another 'protobuf-python' module for that is compatible with the currently loaded version of 'protobuf'.
Use 'ml spider protobuf-python' to get an overview of the available versions.

If you don't understand the warning or error, contact the helpdesk at hpc@ugent.be
While processing the following module(s):
    Module fullname                           Module Filename
    ---------------                           ---------------
    protobuf-python/4.24.0-GCCcore-12.3.0     /modules/all/protobuf-python/4.24.0-GCCcore-12.3.0.lua
    grpcio/1.57.0-GCCcore-12.3.0              /modules/all/grpcio/1.57.0-GCCcore-12.3.0.lua
    TensorFlow/2.15.1-foss-2023a-CUDA-12.1.1  /modules/all/TensorFlow/2.15.1-foss-2023a-CUDA-12.1.1.lua

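That's because we have Lmod configured with LMOD_DISABLE_SAME_NAME_AUTOSWAP, which makes Lmod raise an error instead of automatically swapping out an already loaded module when another version of the same module is loaded.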

boegel commented 2 months ago

I got the installation of Horovod working, but only through a nasty hack in the easyconfig, since listing an alternative protobuf version in builddependencies doesn't work:

easyblock = 'PythonBundle'

name = 'Horovod'
version = '0.28.1'
local_tf_version = '2.15.1'
local_cuda_suffix = '-CUDA-%(cudaver)s'
versionsuffix = local_cuda_suffix + '-TensorFlow-%s' % local_tf_version

homepage = 'https://github.com/uber/horovod'
description = "Horovod is a distributed training framework for TensorFlow."

toolchain = {'name': 'foss', 'version': '2023a'}

builddependencies = [
    ('CMake', '3.26.3'),
    # ('protobuf', '3.21.9'),
]
dependencies = [
    ('Python', '3.11.3'),
    ('PyYAML', '6.0'),
    ('CUDA', '12.1.1', '', SYSTEM),
    ('NCCL', '2.18.3', local_cuda_suffix),
    ('TensorFlow', local_tf_version, local_cuda_suffix),
]

use_pip = True
sanity_pip_check = True

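# hack: swap in protobuf 3.21.9 right before installing, since listing it in
# builddependencies fails because a different protobuf version is already loaded
preinstallopts = 'module swap protobuf/3.21.9-GCCcore-12.3.0 && HOROVOD_WITH_MPI=1 HOROVOD_GPU_ALLREDUCE=NCCL HOROVOD_GPU_BROADCAST=NCCL '
preinstallopts += 'HOROVOD_WITH_TENSORFLOW=1 HOROVOD_WITHOUT_PYTORCH=1 HOROVOD_WITHOUT_MXNET=1 '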

exts_list = [
    ('cloudpickle', '2.2.1', {
        'checksums': ['d89684b8de9e34a2a43b3460fbca07d09d6e25ce858df4d5a44240403b6178f5'],
    }),
    ('horovod', version, {
        'patches': ['Horovod-0.28.1_support_flatbuffers_2.0.6.patch'],
        'checksums': [
            '92a43f5a94c43907a56805bad15f19700c62ffc83b7ca483f9e104e229f67ef0',
            '9696ffb3b2bad1d6dd5a9f37bc58078ca7c585f933bcbec037036ad9fc0b297d',
        ],
    }),
]

sanity_check_paths = {
    'files': ['bin/horovodrun'],
    'dirs': ['lib/python%(pyshortver)s/site-packages'],
}

sanity_check_commands = ["horovodrun --help"]

moduleclass = 'tools'

See the module swap command in preinstallopts.

In order to do this properly, we would need to add support to EasyBuild framework to swap in a particular module for a (build) dependency rather than just loading it...
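
To make the idea concrete, here is a minimal Python sketch of the swap-or-load decision. It is not EasyBuild framework code: the function name `load_or_swap_command` is hypothetical, and it assumes the loaded modules are taken from the colon-separated `LOADEDMODULES` environment variable that Lmod maintains.

```python
import os


def load_or_swap_command(wanted, loaded_modules):
    """Pick the module command needed to make `wanted` available.

    Hypothetical helper for illustration only, not part of EasyBuild.
    Returns None if `wanted` is already loaded, a 'module swap' command
    if another version of the same module name is loaded, and a plain
    'module load' command otherwise.
    """
    if wanted in loaded_modules:
        return None
    wanted_name = wanted.split('/', 1)[0]
    for mod in loaded_modules:
        # compare only the part before the first '/', so that
        # 'protobuf-python/...' is not mistaken for 'protobuf/...'
        if mod.split('/', 1)[0] == wanted_name:
            return 'module swap %s %s' % (mod, wanted)
    return 'module load %s' % wanted


if __name__ == '__main__':
    # LOADEDMODULES holds the full names of all currently loaded modules
    loaded = [m for m in os.environ.get('LOADEDMODULES', '').split(':') if m]
    print(load_or_swap_command('protobuf/3.21.9-GCCcore-12.3.0', loaded))
```

A real implementation would also have to decide what happens to modules that depend on the one being swapped out, which this sketch ignores.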

Flamefire commented 2 months ago

I'm wondering if this can cause further problems, as there is a runtime library for protobuf. Let's hope they are "compatible enough".

I.e. the TF header expects protobuf 3.21.0-3.21.9 but we use 4.24.0 (at runtime). And I found:

The runtime library must have the same version with the protocol compiler you use.

So I guess we are actually confined by the protobuf version used by TensorFlow and need to use the same one in all dependents and dependencies. The best approach is likely to fix and use our protobuf when building TF. Otherwise we need to always check the version in the TF sources before deciding on one for the toolchain.
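
Checking the version in a generated header can be scripted. The sketch below assumes the header contains the two version guards that protoc 3.x normally writes near the top of a `.pb.h` file (`#if PROTOBUF_VERSION < N` and `#if N < PROTOBUF_MIN_PROTOC_VERSION`), with versions encoded as major*1000000 + minor*1000 + patch. The function names are made up for this example.

```python
import re


def decode_version(number):
    """Convert protobuf's integer version encoding to a dotted string,
    e.g. 3021009 -> '3.21.9'."""
    major, rest = divmod(number, 1000000)
    minor, patch = divmod(rest, 1000)
    return '%d.%d.%d' % (major, minor, patch)


def protoc_guard_versions(header_text):
    """Return (minimum headers version, protoc version that generated the file).

    Assumes the guard layout written by protoc 3.x; raises ValueError
    if the guards are not found.
    """
    low = re.search(r'#if\s+PROTOBUF_VERSION\s*<\s*(\d+)', header_text)
    high = re.search(r'#if\s+(\d+)\s*<\s*PROTOBUF_MIN_PROTOC_VERSION', header_text)
    if not (low and high):
        raise ValueError('protobuf version guards not found')
    return decode_version(int(low.group(1))), decode_version(int(high.group(1)))
```

Run against the text of `host_compute_metadata.pb.h`, this would report which protobuf headers the file accepts, without having to dig through the TF sources.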

boegel commented 2 months ago

Isn't the protobuf used by TensorFlow "baked in", so that whatever protobuf module is loaded doesn't really matter for TensorFlow itself?

Flamefire commented 2 months ago

Not sure how it could be. Possibly it is statically linked, can't remember. But even when linking statically there can be symbol clashes if a shared protobuf is loaded (e.g. by grpcio), can't there?
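
One way to get a first hint is to look at what `ldd` reports for the shared libraries involved. The sketch below only parses `ldd` output for `libprotobuf` entries; the helper names are made up, and which `.so` files to inspect is left open. An empty result only means no dynamic dependency on libprotobuf is reported; it does not prove that protobuf was linked statically, and it says nothing about symbol clashes.

```python
import subprocess


def shared_protobuf_libs(ldd_output):
    """Return (soname, path) pairs for libprotobuf entries in `ldd` output.

    The path is None when ldd reports the library as 'not found'.
    """
    found = []
    for line in ldd_output.splitlines():
        parts = line.split()
        # typical line: libfoo.so.1 => /path/to/libfoo.so.1 (0x...)
        if len(parts) >= 3 and parts[1] == '=>' and parts[0].startswith('libprotobuf'):
            path = None if parts[2:4] == ['not', 'found'] else parts[2]
            found.append((parts[0], path))
    return found


def check_library(lib_path):
    """Run ldd on lib_path and return its libprotobuf entries."""
    result = subprocess.run(['ldd', lib_path], capture_output=True, text=True, check=False)
    return shared_protobuf_libs(result.stdout)
```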