Open laraPPr opened 2 months ago
I'm looking into this myself, seems to be pretty trivial
I'm also looking into updating Horovod for TensorFlow 2.15.1 with foss/2023a, but it's being a PITA:
/apps/gent/RHEL8/cascadelake-ampere-ib/software/TensorFlow/2.15.1-foss-2023a-CUDA-12.1.1/lib/python3.11/site-packages/tensorflow/include/tensorflow/compiler/tf2xla/host_compute_metadata.pb.h:17:2: error: #error This file was generated by an older version of protoc which is
   17 | #error This file was generated by an older version of protoc which is
      |  ^~~~~
/apps/gent/RHEL8/cascadelake-ampere-ib/software/TensorFlow/2.15.1-foss-2023a-CUDA-12.1.1/lib/python3.11/site-packages/tensorflow/include/tensorflow/compiler/tf2xla/host_compute_metadata.pb.h:18:2: error: #error incompatible with your Protocol Buffer headers. Please
   18 | #error incompatible with your Protocol Buffer headers. Please
      |  ^~~~~
/apps/gent/RHEL8/cascadelake-ampere-ib/software/TensorFlow/2.15.1-foss-2023a-CUDA-12.1.1/lib/python3.11/site-packages/tensorflow/include/tensorflow/compiler/tf2xla/host_compute_metadata.pb.h:19:2: error: #error regenerate this file with a newer version of protoc.
   19 | #error regenerate this file with a newer version of protoc.
      |  ^~~~~
PR for spektral with fosscuda/2020b (since getting Horovod working with foss/2023a is proving to be difficult):
This header is generated by the protobuf compiler (protoc) during the TF build using this rule: https://github.com/tensorflow/tensorflow/blob/c16161a1cb6ecdef55bf8fc4a2074a5aa8bd4ed0/tensorflow/compiler/tf2xla/BUILD#L106-L114
Protobuf (like too many other dependencies) is downloaded during the TF build but not installed; only the generated files are required. There may also be a runtime library, but I'm not sure.
According to this they use protobuf 3.21.9
Hence I guess the solution to the error is to use the same protobuf version as a build dependency in the EC that causes the above error (which one was that?)
It might also be worth looking into using our protobuf as a "SYSTEM_LIB" during the TF build but I expect some patch to be required. At least we can update the issue if we still run into "File already exists in database" and hope they finally answer it.
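To check which protobuf release a generated header was built against, the version guard near the top of the `.pb.h` file can be decoded. The following is a minimal sketch, assuming the guard layout that protoc 3.x emits (`#if PROTOBUF_VERSION < N` and `#if N < PROTOBUF_MIN_PROTOC_VERSION`) and the encoding major*1000000 + minor*1000 + patch. The sample header text and the function names are illustrative, not taken from the TF build.

```python
import re

# Illustrative sample of the guard assumed to sit at the top of a
# protoc 3.x generated header (not copied from the TensorFlow install).
SAMPLE_HEADER = """\
#include <google/protobuf/port_def.inc>
#if PROTOBUF_VERSION < 3021000
#error This file was generated by a newer version of protoc which is
#endif
#if 3021009 < PROTOBUF_MIN_PROTOC_VERSION
#error This file was generated by an older version of protoc which is
#endif
"""


def decode_version(number):
    """Turn an integer such as 3021009 into the string '3.21.9'."""
    major, rest = divmod(number, 1_000_000)
    minor, patch = divmod(rest, 1_000)
    return f"{major}.{minor}.{patch}"


def header_protobuf_range(header_text):
    """Return (oldest accepted headers, protoc version that generated the file)."""
    low = re.search(r"#if\s+PROTOBUF_VERSION\s*<\s*(\d+)", header_text)
    high = re.search(r"#if\s+(\d+)\s*<\s*PROTOBUF_MIN_PROTOC_VERSION", header_text)
    if low is None or high is None:
        raise ValueError("no protobuf version guard found")
    return decode_version(int(low.group(1))), decode_version(int(high.group(1)))


if __name__ == "__main__":
    print(header_protobuf_range(SAMPLE_HEADER))
```

Running the same function on the text of the real `host_compute_metadata.pb.h` should show which protobuf version to pick as build dependency, provided the guard has the assumed shape.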
Keeping this open for now, would like to look into the Horovod issue again...
I looked into trying with protobuf 3.21.9 as build dependency for Horovod, but didn't get very far since something already depends on a different version of protobuf, leading to:
A different version of the 'protobuf' module is already loaded (see output of 'ml').
You should load another 'protobuf-python' module for that is compatible with the currently loaded version of 'protobuf'.
Use 'ml spider protobuf-python' to get an overview of the available versions.
If you don't understand the warning or error, contact the helpdesk at hpc@ugent.be
While processing the following module(s):
Module fullname Module Filename
--------------- ---------------
protobuf-python/4.24.0-GCCcore-12.3.0 /modules/all/protobuf-python/4.24.0-GCCcore-12.3.0.lua
grpcio/1.57.0-GCCcore-12.3.0 /modules/all/grpcio/1.57.0-GCCcore-12.3.0.lua
TensorFlow/2.15.1-foss-2023a-CUDA-12.1.1 /modules/all/TensorFlow/2.15.1-foss-2023a-CUDA-12.1.1.lua
That's because we have Lmod configured with LMOD_DISABLE_SAME_NAME_AUTOSWAP
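The effect of that setting can be illustrated with a toy model. This is not Lmod's implementation; `load_module` and its arguments are hypothetical, and the second protobuf version is a placeholder for whatever protobuf-python pulls in. With same-name autoswap disabled, loading a module whose name is already loaded in a different version is an error rather than a silent swap.

```python
def load_module(loaded, name, version, disable_same_name_autoswap=True):
    """Toy model: load `name/version` into `loaded` (a dict of name -> version)."""
    current = loaded.get(name)
    if current is not None and current != version:
        if disable_same_name_autoswap:
            raise RuntimeError(
                f"A different version of the '{name}' module is already loaded "
                f"({current}); refusing to swap in {version}"
            )
        # Autoswap allowed: fall through and replace the loaded version.
    loaded[name] = version
    return loaded


if __name__ == "__main__":
    env = {}
    # The build dependency is loaded first.
    load_module(env, "protobuf", "3.21.9")
    try:
        # Placeholder version, standing in for the protobuf required via
        # TensorFlow -> grpcio -> protobuf-python.
        load_module(env, "protobuf", "other-version")
    except RuntimeError as err:
        print(err)
```

In this model the error disappears only if the swap is requested explicitly, which is what the workaround further down does.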
I got the installation of Horovod working, but only through a nasty hack in the easyconfig, since listing an alternative protobuf version in builddependencies doesn't work:
easyblock = 'PythonBundle'

name = 'Horovod'
version = '0.28.1'
local_tf_version = '2.15.1'
local_cuda_suffix = '-CUDA-%(cudaver)s'
versionsuffix = local_cuda_suffix + '-TensorFlow-%s' % local_tf_version

homepage = 'https://github.com/uber/horovod'
description = "Horovod is a distributed training framework for TensorFlow."

toolchain = {'name': 'foss', 'version': '2023a'}

builddependencies = [
    ('CMake', '3.26.3'),
    # ('protobuf', '3.21.9'),
]

dependencies = [
    ('Python', '3.11.3'),
    ('PyYAML', '6.0'),
    ('CUDA', '12.1.1', '', SYSTEM),
    ('NCCL', '2.18.3', local_cuda_suffix),
    ('TensorFlow', local_tf_version, local_cuda_suffix),
]

use_pip = True
sanity_pip_check = True

preinstallopts = 'module swap protobuf/3.21.9-GCCcore-12.3.0 && HOROVOD_WITH_MPI=1 HOROVOD_GPU_ALLREDUCE=NCCL HOROVOD_GPU_BROADCAST=NCCL '
preinstallopts += 'HOROVOD_WITH_TENSORFLOW=1 HOROVOD_WITHOUT_PYTORCH=1 HOROVOD_WITHOUT_MXNET=1 '

exts_list = [
    ('cloudpickle', '2.2.1', {
        'checksums': ['d89684b8de9e34a2a43b3460fbca07d09d6e25ce858df4d5a44240403b6178f5'],
    }),
    ('horovod', version, {
        'patches': ['Horovod-0.28.1_support_flatbuffers_2.0.6.patch'],
        'checksums': [
            '92a43f5a94c43907a56805bad15f19700c62ffc83b7ca483f9e104e229f67ef0',
            '9696ffb3b2bad1d6dd5a9f37bc58078ca7c585f933bcbec037036ad9fc0b297d',
        ],
    }),
]

sanity_check_paths = {
    'files': ['bin/horovodrun'],
    'dirs': ['lib/python%(pyshortver)s/site-packages'],
}

sanity_check_commands = ["horovodrun --help"]

moduleclass = 'tools'
See the module swap command in preinstallopts.
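To make the hack explicit: as far as I understand, EasyBuild prepends preinstallopts to the install command, so the module swap runs in the same shell right before the build, and the HOROVOD_* assignments after `&&` are set as environment variables for the install command only. The sketch below only concatenates strings to show the resulting command line; the `pip install .` command is a placeholder, not the exact command EasyBuild generates.

```python
# Same value as in the easyconfig.
preinstallopts = (
    "module swap protobuf/3.21.9-GCCcore-12.3.0 && "
    "HOROVOD_WITH_MPI=1 HOROVOD_GPU_ALLREDUCE=NCCL HOROVOD_GPU_BROADCAST=NCCL "
)
preinstallopts += "HOROVOD_WITH_TENSORFLOW=1 HOROVOD_WITHOUT_PYTORCH=1 HOROVOD_WITHOUT_MXNET=1 "

# Placeholder install command (assumption, for illustration only).
install_cmd = "pip install ."

# The options are prepended to the install command.
full_cmd = preinstallopts + install_cmd

# The part before '&&' swaps the module; the part after it is the build
# command, prefixed by per-command environment variable assignments.
swap_step, build_step = [part.strip() for part in full_cmd.split("&&")]
env_vars = [tok for tok in build_step.split() if tok.startswith("HOROVOD_")]

if __name__ == "__main__":
    print(swap_step)
    print(env_vars)
    print(build_step)
```

Because the swap happens inside the install command, EasyBuild itself never sees protobuf 3.21.9 as a dependency, which is why this counts as a hack rather than a proper solution.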
In order to do this properly, we would need to add support to EasyBuild framework to swap in a particular module for a (build) dependency rather than just loading it...
I'm wondering if this can cause further problems as there is a runtime library for protobuf. Let's hope they are "compatible enough"
I.e. the TF header expects protobuf 3.21.0-3.21.9 but we use 4.24.0 (at runtime). And I found:
"The runtime library must have the same version with the protocol compiler you use."
So I guess we are actually confined by the protobuf version used by TensorFlow and need to use the same one in all dependents and dependencies. The best approach is likely to fix and use our protobuf when building TF. Otherwise we need to always check the version in the TF sources before deciding on one for the toolchain.
Isn't the protobuf used by TensorFlow "baked in", so that whatever protobuf module is loaded doesn't really matter for TensorFlow itself?
Not sure how it could be. Possibly it is statically linked, can't remember. But even when linking statically there can be symbol clashes if a shared protobuf is loaded (e.g. by grpcio), can't there?