pytorch / tensorpipe

A tensor-aware point-to-point communication primitive for machine learning

Support for 3.0 Linux kernels. #327

Open bryan-lunt opened 3 years ago

bryan-lunt commented 3 years ago

Supercomputers, in this case NCSA Blue Waters, often have older Linux kernels. I was unable to compile tensorpipe on Blue Waters.

Linux YYYYYYY 3.0.101-0.46.1_1.0502.8871-cray_gem_s #1 SMP Thu Aug 29 16:06:17 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux

It would be nice if the build system could detect the kernel version and fall back if necessary. Thanks.

I haven't (yet) been able to compile pytorch 1.8.0 because of this.
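
Something along these lines at configure time is roughly the fallback I have in mind (an untested sketch; the "3.11" cutoff and the option being toggled are placeholders, not actual TensorPipe options):

# Untested sketch: probe the running kernel at configure time and switch off
# features that need a newer kernel.
execute_process(COMMAND uname -r
                OUTPUT_VARIABLE KERNEL_RELEASE
                OUTPUT_STRIP_TRAILING_WHITESPACE)
string(REGEX MATCH "^[0-9]+\\.[0-9]+" KERNEL_VERSION "${KERNEL_RELEASE}")
if(KERNEL_VERSION VERSION_LESS "3.11")
  message(STATUS "Old kernel ${KERNEL_RELEASE}: disabling kernel-dependent features")
  # set(SOME_KERNEL_DEPENDENT_FEATURE OFF)
endif()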

lw commented 3 years ago

TensorPipe tries to stay aligned with the OSes and compilers that PyTorch supports. Currently PyTorch targets Ubuntu 16.04 as its "oldest" system, and that's the reference we've been using too. We're not opposed to supporting even older systems, as long as the changes needed aren't too invasive. (We're a bit less eager to support older compilers, because we'd really like to adopt some of the newer C++ features.)

If you could tell me exactly what build issues you're experiencing we could look into them. Are you building PyTorch or TensorPipe directly? Could you try building the master version of PyTorch? Since version 1.8 we have already fixed a few issues that may be at play here (e.g., https://github.com/pytorch/tensorpipe/pull/305).

bryan-lunt commented 3 years ago

Yes, that's the specific problem I've been encountering. I was trying to compile from a pytorch source tree, but when that failed on a tensorpipe source file, I switched to trying to compile tensorpipe directly.

My toolchain is

- gcc/5.3.0
- cmake/3.9.4
- cudatoolkit/9.1.85_3.10-1.0502.df1cc54.3.1
- cray-libsci/18.12.1

This is left over from the last time I compiled pytorch, v1.4.0. I will have to check the dependencies on the system, but there was some reason for these choices previously... For example, I might try a newer compiler, but then I can't link to CUDA, and so on.

I usually try to build against system-installed libraries as much as possible, and I'm not sure if the old cards support CUDA newer than 9.1. I may be revealing great ignorance, but dependencies on supercomputers are usually painful. In any case, the admins have not installed a cudatoolkit newer than 9.1; whether that's because of card support or not, I can't say.

lw commented 3 years ago

I'm not sure if your toolchain is officially supported; in any case it's not currently tested: the PyTorch CI (and ours) uses GCC 5.4 and CUDA 9.2 as the oldest versions. It's possible that things might still work, but there's no guarantee. Note that things will definitely stop working very soon, as PyTorch is planning to start using C++17, which is not supported by GCC 5.3. Hence v1.8 might be the last version you're able to build with that toolchain. (Although perhaps you can use Conda to get a newer user-space toolchain? It depends how your system is set up, I guess.)

Anyway, have you been able to build TensorPipe and/or PyTorch from their master branches (where the above issue is fixed)? If so, I can backport that fix to v1.8.1.

bryan-lunt commented 3 years ago

I was able to build TensorPipe, but I had to further modify the code. One of the changes is straightforward and can be included, but the other was just a hack to make it compile immediately and will create problems with temporary files.

diff --git a/tensorpipe/common/system.cc b/tensorpipe/common/system.cc
index d2bd388..914fc3d 100644
--- a/tensorpipe/common/system.cc
+++ b/tensorpipe/common/system.cc
@@ -262,7 +262,11 @@ optional<std::string> getPermittedCapabilitiesID() {

 void setThreadName(std::string name) {
 #ifdef __linux__
+  #ifdef _GNU_SOURCE
+  #if ((__GLIBC__ > 2) || ((__GLIBC__ == 2) && (__GLIBC_MINOR__ >= 12))) 
   pthread_setname_np(pthread_self(), name.c_str());
+  #endif
+  #endif
 #endif
 }

The other was that I modified tensorpipe/util/shm/segment.cc because O_TMPFILE isn't available. But it needs a complete overhaul to create temp files in the temp directory so that they don't just build up.
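
For the record, a cleaner fallback would probably look something like this (just a sketch of the idea, with an illustrative helper name and no real error handling), rather than the quick hack I actually made:

// Sketch: prefer O_TMPFILE where the kernel/libc provide it, otherwise fall
// back to mkstemp() and unlink the file immediately so it can't accumulate.
#include <fcntl.h>
#include <unistd.h>
#include <cstdlib>
#include <string>
#include <vector>

int createTmpFd(const std::string& dir) {
#ifdef O_TMPFILE
  // Anonymous file, never visible in the directory (needs Linux >= 3.11).
  int fd = ::open(dir.c_str(), O_TMPFILE | O_RDWR, 0600);
  if (fd >= 0) {
    return fd;
  }
#endif
  // Older kernels: create a named temp file and unlink it right away; the
  // inode is reclaimed once the last file descriptor to it is closed.
  std::string templ = dir + "/tensorpipe-shm-XXXXXX";
  std::vector<char> buf(templ.begin(), templ.end());
  buf.push_back('\0');
  int fd = ::mkstemp(buf.data());
  if (fd >= 0) {
    ::unlink(buf.data());
  }
  return fd;
}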

bryan-lunt commented 3 years ago

Yes, we could build an entire user-space installation to handle compiler and libc differences, though then we might have trouble linking to the Cray-provided MPI etc. But what can't be handled that way are syscalls that don't exist on the older kernel.


lw commented 3 years ago

We can add that check around the pthread_setname_np call and try to backport it to PyTorch 1.8.1, no problem. However, I think the changes to open(..., O_TMPFILE) would be too invasive and wouldn't be accepted into PyTorch 1.8.1. You should be able to sidestep that issue by setting the TP_ENABLE_SHM variable to OFF in CMake (this also works when done from PyTorch). This would mean you don't build the shared-memory transport, but everything else should build and work fine. Is this something you can do (in your own local checkout), or do you strictly need to build "vanilla" PyTorch, with no local changes?
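
For example, when configuring TensorPipe directly (illustrative only; adapt the paths to your setup):

# Either pass it on the command line when configuring:
#   cmake -DTP_ENABLE_SHM=OFF <path-to-source>
# or force it in a CMakeLists.txt before the transports are configured:
set(TP_ENABLE_SHM OFF CACHE BOOL "Disable the shared-memory transport" FORCE)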

lw commented 3 years ago

I implemented your change above differently, to ensure it doesn't affect other toolchains. Could you check out https://github.com/pytorch/tensorpipe/pull/328 and confirm it still works for you?

bryan-lunt commented 3 years ago

There's a logic problem with your macro: it falls through to the alternative branch (which still calls the offending function) when the GLIBC version is too old.

This is ugly, but works:

#ifdef __linux__
// In glibc this non-standard call was added in version 2.12, hence we guard it.
#if defined(__GLIBC__)
#if ((__GLIBC__ > 2) || ((__GLIBC__ == 2) && (__GLIBC_MINOR__ >= 12)))
  pthread_setname_np(pthread_self(), name.c_str());
#else
  // Old glibc: pthread_setname_np is not available, so do nothing.
#endif
#else
  // In other standard libraries we didn't check yet, hence we always enable it.
  pthread_setname_np(pthread_self(), name.c_str());
#endif
#endif

bryan-lunt commented 3 years ago

Thanks for doing this for me, BTW.

bryan-lunt commented 3 years ago

It seems to build when I turn shared-memory transport off. What are the implications of that though? Yes, I just want to get pytorch to build at all first, but ultimately maybe it would be nice to do distributed training.

MrChill commented 3 years ago

Thanks, I have also been struggling with the same problem for the last few days. I want to compile from source to use MPI support. Kernel version: Linux 3.10.0-1127.19.1.el7.x86_64

It would be really nice if there were a solution.

With set(TP_ENABLE_SHM OFF) in CMakeLists.txt I still get:

Scanning dependencies of target tensorpipe
[ 40%] Building CXX object tensorpipe/CMakeFiles/tensorpipe.dir/channel/error.cc.o
[ 41%] Building CXX object tensorpipe/CMakeFiles/tensorpipe.dir/channel/helpers.cc.o
In file included from /home/mos7rng/Projects/KGModel_ddp/pytorch/third_party/tensorpipe/third_party/libnop/include/nop/status.h:22:0,
                 from /home/mos7rng/Projects/KGModel_ddp/pytorch/third_party/tensorpipe/third_party/libnop/include/nop/base/encoding.h:30,
                 from /home/mos7rng/Projects/KGModel_ddp/pytorch/third_party/tensorpipe/third_party/libnop/include/nop/base/array.h:22,
                 from /home/mos7rng/Projects/KGModel_ddp/pytorch/third_party/tensorpipe/third_party/libnop/include/nop/serializer.h:20,
                 from /home/mos7rng/Projects/KGModel_ddp/pytorch/third_party/tensorpipe/tensorpipe/common/nop.h:11,
                 from /home/mos7rng/Projects/KGModel_ddp/pytorch/third_party/tensorpipe/tensorpipe/channel/helpers.h:14,
                 from /home/mos7rng/Projects/KGModel_ddp/pytorch/third_party/tensorpipe/tensorpipe/channel/helpers.cc:9:
/home/mos7rng/Projects/KGModel_ddp/pytorch/third_party/tensorpipe/third_party/libnop/include/nop/types/result.h:68:31: error: ‘decay_t’ is not a member of ‘std’
 static_assert(!std::is_same<std::decay_t<ErrorEnum>, std::decay_t<T>>::value,

lw commented 3 years ago

There's a logic problem with your macro: it falls through to the alternative branch (which still calls the offending function) when the GLIBC version is too old.

Right, that was a gross oversight, sorry. I fixed it now and am landing the change. It may take a bit for it to land in PyTorch; I hope we can make it into v1.8.1.

It seems to build when I turn shared-memory transport off. What are the implications of that though?

It means that when using PyTorch's RPC package to communicate between processes on the same machine you wouldn't be able to use the low-latency shared-memory-based backend, and would instead fall back onto a TCP-based one, which is slightly less performant. Hence you wouldn't experience functional differences, just a possible performance drop in some specific cases.

With set(TP_ENABLE_SHM OFF) in CMakeLists.txt I still get:

The most likely explanation is that your compiler doesn't support C++14. What compiler are you using? If that's the case, I suspect you will hit many more compilation errors after this one, as we've been using C++14 extensively throughout our codebase, and so has PyTorch. However, updating your compiler should be easy.
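
For reference, what it's choking on is the C++14 alias template; under C++11 only the long spelling exists (a minimal illustration, not our actual code):

#include <type_traits>

// std::decay_t<T> (C++14) is just an alias for the C++11 spelling below, so a
// compiler stuck in C++11 mode reports "'decay_t' is not a member of 'std'".
template <typename T>
using DecayedCpp14 = std::decay_t<T>; // needs -std=c++14 or newer
template <typename T>
using DecayedCpp11 = typename std::decay<T>::type; // works in C++11

static_assert(std::is_same<DecayedCpp14<const int&>, int>::value, "same result");
static_assert(std::is_same<DecayedCpp11<const int&>, int>::value, "same result");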

bryan-lunt commented 3 years ago

@MrChill, what machine are you on? I'm jealous of your new kernel...

MrChill commented 3 years ago

Thank you! Sometimes it can be so easy...

GCC-7.4.0 did the trick 👍

@bryan-lunt I am on a computing cluster ;-)

bryan-lunt commented 3 years ago

In the end, I can build it in a Docker container that is old enough that its glibc still works with our 3.0 kernel, but new enough that the glibc can support pytorch.

Lots of hours spent on this project.

lw commented 3 years ago

Glad you managed to find a solution, and thanks for sharing it for the benefit of other users who may be hitting it. And sorry this took a lot of your time.

In the end the release for PyTorch v1.8.1 was cut before I could upstream that fix, hence a vanilla v1.8.1 will probably still not compile on Linux 3.0 and/or old glibc. In addition to using Docker, the two tweaks above (applying that fix and disabling the SHM transport) are also viable options to get it to work again.

bryan-lunt commented 3 years ago

Well, at least I have a general solution that can mostly continue to be used in the future by other users on our system. I basically took the Dockerfiles for building the pytorch Docker images and edited them to base themselves on older base images from nvidia/cuda (Ubuntu 16.04 instead of 18.xx), then made some other tweaks, software installations, and tuning specific to our system.

It sounds and looks easy now that it's done. It was not insanely difficult, but it was very time-consuming.

https://github.com/bryan-lunt-supercomputing/blue-waters-pytorch

https://hub.docker.com/repository/docker/luntlab/bw-pytorch/general