compilation error with GCC 11.3.0 + CUDA 11.7.0

boegel commented 1 year ago

I'm trying to build dorado 0.3.4 from source using GCC 11.3.0 + CUDA 11.7.0, and I'm hitting the following compilation error:

/tmp/easybuild_build/dorado/0.3.4/foss-2022a-CUDA-11.7.0/dorado/dorado/nn/CudaCRFModel.cpp: In member function void dorado::CudaCaller::cuda_thread_fn():
/tmp/easybuild_build/dorado/0.3.4/foss-2022a-CUDA-11.7.0/dorado/dorado/nn/CudaCRFModel.cpp:283:45: error: struct c10::cuda::CUDACachingAllocator::DeviceStats has no member named requested_bytes; did you mean reserved_bytes?
  283 |                     print_stat(device_stats.requested_bytes), device_stats.num_alloc_retries,
      |                                             ^~~~~~~~~~~~~~~
      |                                             reserved_bytes
make[2]: *** [CMakeFiles/dorado_lib.dir/build.make:800: CMakeFiles/dorado_lib.dir/dorado/nn/CudaCRFModel.cpp.o] Error 1

Is this a known problem, should I use a different CUDA (or GCC) version, or am I overlooking something else?

malton-ont commented 1 year ago

DeviceStats::requested_bytes was introduced in libtorch 2.0. Are you linking against your own version of libtorch rather than the version that the dorado configuration process downloads?
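
For anyone who needs the same source to compile against both an older and a newer libtorch, one possible approach is to detect the member at compile time. The following is a minimal sketch, not dorado's actual code: the two structs are hypothetical stand-ins for the real `DeviceStats` (whose members are not plain integers), and `has_requested_bytes` and `requested_or_reserved` are names invented for illustration. It assumes a C++17 compiler.

```cpp
#include <cstdint>
#include <type_traits>
#include <utility>

// Hypothetical stand-ins for the two DeviceStats layouts. The real struct
// lives in libtorch and its members are not plain integers.
struct StatsWithoutRequested {  // shaped like the struct in the error above
    std::int64_t reserved_bytes = 0;
};
struct StatsWithRequested {     // shaped like the libtorch 2.0 struct
    std::int64_t reserved_bytes = 0;
    std::int64_t requested_bytes = 0;
};

// Primary template: assume the member is absent.
template <typename T, typename = void>
struct has_requested_bytes : std::false_type {};

// Specialization selected only when `t.requested_bytes` is a valid expression.
template <typename T>
struct has_requested_bytes<
    T, std::void_t<decltype(std::declval<T&>().requested_bytes)>>
    : std::true_type {};

// Returns requested_bytes when the struct has it; otherwise falls back to
// reserved_bytes, the member GCC suggested in the error message.
template <typename Stats>
std::int64_t requested_or_reserved(const Stats& s) {
    if constexpr (has_requested_bytes<Stats>::value) {
        return s.requested_bytes;
    } else {
        return s.reserved_bytes;
    }
}

// Compile-time checks that the detection behaves as described.
static_assert(!has_requested_bytes<StatsWithoutRequested>::value);
static_assert(has_requested_bytes<StatsWithRequested>::value);
```

Note that reserved bytes and requested bytes measure different things, so a fallback like this would only be reasonable for a diagnostic log line, not for any logic that depends on the value.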

boegel commented 1 year ago

@malton-ont Thanks for the feedback!

Yes, we are installing dorado on top of a PyTorch 1.12.0 that we installed ourselves, since we prefer to control which version is used and how it gets built. We also try hard to avoid anything being downloaded on the fly during an installation, because that complicates reproducing the same installation later.

Is there an overview of which PyTorch versions dorado 0.3.4 is compatible with?

tijyojwad commented 1 year ago

Hi @boegel - dorado depends on a custom build of PyTorch 2.0 that we host on our CDN because we need static libraries. This custom build gets downloaded when the dorado build setup runs.

We still support building from the PyTorch hosted package, but it needs to be enabled with -D TRY_USING_STATIC_TORCH_LIB=0 when setting up cmake. The exact supported PyTorch version is specified here.

If your aim is to have reproducible builds, I would recommend using one of the pre-built dorado releases since all the dependencies (other than standard host libs) are packaged together. So that build is fixed and will be reproducible.

Is there a reason you're doing custom builds?

boegel commented 10 months ago

@tijyojwad The main reason we're doing custom builds of Dorado and its dependencies is performance: we use compiler options like -march=native so that the binaries obtained are optimized for the CPUs on which they will be used, which can result in significant performance improvements.
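
To illustrate what `-march=native` changes: GCC and Clang predefine macros such as `__AVX2__` for each instruction set extension they are allowed to target, so a small probe can show the difference between a generic and a host-tuned build. This is only a sketch; the function name `enabled_simd_features` is made up for this example, and the list of macros checked is a small, non-exhaustive selection.

```cpp
#include <string>
#include <vector>

// Lists the SIMD extensions the compiler was allowed to target for this
// translation unit, based on macros that GCC and Clang predefine.
std::vector<std::string> enabled_simd_features() {
    std::vector<std::string> features;
#if defined(__SSE2__)
    features.push_back("sse2");
#endif
#if defined(__AVX__)
    features.push_back("avx");
#endif
#if defined(__AVX2__)
    features.push_back("avx2");
#endif
#if defined(__AVX512F__)
    features.push_back("avx512f");
#endif
#if defined(__ARM_NEON)
    features.push_back("neon");
#endif
    return features;
}
```

Calling this from a small `main` that prints the returned names, once compiled with plain `g++ -O2` and once with `g++ -O2 -march=native`, shows which extensions the host-tuned build may use. The result depends entirely on the CPU of the build machine.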

In addition, especially for PyTorch, the EasyBuild community puts significant effort into getting the (massive) PyTorch test suite to pass on our custom PyTorch installation, so we're very reluctant to use a different PyTorch.

Thanks for the pointers on the requirement for -DTRY_USING_STATIC_TORCH_LIB=0, that's very helpful.

Can you elaborate why you prefer using static libraries? Does that just make things easier w.r.t. packaging for Dorado?

tijyojwad commented 10 months ago

Yes, static libraries are primarily for reducing the size of the distributed build, since we minimize which torch libraries are packaged. It also reduces the likelihood of interfering with an existing torch installation. And for a given pre-built version of Dorado, all dependencies are fixed (i.e. we don't download any on the fly).

From a user's perspective, I think dorado's dependencies should be treated as a black box (whether it depends on torch or not is a dorado implementation detail). Going down the path of making dorado use your own version of torch would be a non-trivial undertaking, and I think it might be better to depend on specific dorado versions (and have a process to validate upgrades) rather than lock down dorado's dependencies.

nanoporetech / dorado

compilation error with GCC 11.3.0 + CUDA 11.7.0 #364