habitat-sh / core-plans

Core Habitat Plan definitions

[cuda] requires gcc7 #2325

Open bdangit opened 5 years ago

bdangit commented 5 years ago

I’d like to have a core/gcc7 so that my core/cuda plan can depend on it. Gcc8 is not compatible with CUDA, which means I can’t compile any CUDA things until CUDA adds support for gcc8 or greater.

Three pieces of evidence that prove we need gcc7:

  1. Documentation from nvidia shows what is supported: https://docs.nvidia.com/cuda/cuda-installation-guide-linux/index.html#system-requirements

  2. If users try to compile cuda apps they will see this error:

    [ 50%] Building NVCC (Device) object src.gpu/CMakeFiles/gpuSquareDemo.dir/gpuSquareDemo_generated_main.cu.o
    In file included from /hab/pkgs/core/cuda/9.2.148/20190117185451/include/host_config.h:50,
                 from /hab/pkgs/core/cuda/9.2.148/20190117185451/include/cuda_runtime.h:78,
                 from <command-line>:
    /hab/pkgs/core/cuda/9.2.148/20190117185451/include/crt/host_config.h:119:2: error: #error -- unsupported GNU version! gcc versions later than 7 are not supported!
    #error -- unsupported GNU version! gcc versions later than 7 are not supported!
    ^~~~~
    CMake Error at gpuSquareDemo_generated_main.cu.o.cmake:219 (message):
    Error generating
    /src/cuda-cmake-example/build/src.gpu/CMakeFiles/gpuSquareDemo.dir//./gpuSquareDemo_generated_main.cu.o

Even if you modify the header to ignore this check, you will then hit the following error:

[ 50%] Building NVCC (Device) object src.gpu/CMakeFiles/gpuSquareDemo.dir/gpuSquareDemo_generated_main.cu.o
/hab/pkgs/core/gcc/8.2.0/20190115004042/include/c++/8.2.0/type_traits(1049): error: type name is not allowed

/hab/pkgs/core/gcc/8.2.0/20190115004042/include/c++/8.2.0/type_traits(1049): error: type name is not allowed

/hab/pkgs/core/gcc/8.2.0/20190115004042/include/c++/8.2.0/type_traits(1049): error: identifier "__is_assignable" is undefined

3 errors detected in the compilation of "/tmp/tmpxft_0000106e_00000000-6_main.cpp1.ii".
CMake Error at gpuSquareDemo_generated_main.cu.o.cmake:279 (message):
  Error generating file
  /src/cuda-cmake-example/build/src.gpu/CMakeFiles/gpuSquareDemo.dir//./gpuSquareDemo_generated_main.cu.o

  3. Archlinux maintains a gcc7 package which is used by their cuda package: https://www.archlinux.org/packages/community/x86_64/cuda/

Is there something that prevents us from having a core/gcc7?
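
For illustration only, the change being asked for in the core/cuda plan would look roughly like this (a sketch; the other dependencies shown are placeholders, not the actual plan's lists):

# plan.sh for core/cuda (sketch): build against gcc7 instead of the default gcc
pkg_name=cuda
pkg_origin=core

# Hypothetical dependency lists; the real plan's lists will differ
pkg_build_deps=(
  core/gcc7       # the requested compiler package, used only at build time
  core/make
  core/patchelf
)
pkg_deps=(
  core/glibc
  core/gcc7-libs  # runtime libraries matching the compiler used at build time
)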

bdangit commented 5 years ago

I created a gcc7 and a gcc7-libs plan, using core/gcc to build them. I then rebuilt cuda with gcc7, and built a CUDA app against it.

This was successful, and there do not appear to be any weird side effects.
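
For reference, a minimal sketch of what such a gcc7 plan could look like; the source URL, configure flags, and dependency lists here are illustrative, and the real gcc plan is considerably more involved:

pkg_name=gcc7
pkg_origin=bdangit
pkg_version=7.3.0
pkg_source="https://ftp.gnu.org/gnu/gcc/gcc-${pkg_version}/gcc-${pkg_version}.tar.xz"
pkg_deps=(core/glibc core/zlib)
pkg_build_deps=(core/gcc core/make core/binutils)  # bootstrap with the existing core/gcc

do_build() {
  # gcc recommends building out of tree
  mkdir -p ../gcc-build
  cd ../gcc-build
  "../gcc-${pkg_version}/configure" \
    --prefix="${pkg_prefix}" \
    --disable-multilib \
    --enable-languages=c,c++
  make -j "$(nproc)"
}

do_install() {
  cd ../gcc-build
  make install
}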

bdangit commented 5 years ago

submitted #2338 to get the ball rolling on this.

smacfarlane commented 5 years ago

This is something that was considered as part of the previous base-plans refresh, but we decided that we cannot have a gcc7(-libs) package in core for runtime safety reasons.

Adding a gcc7 plan would be a relatively safe operation as it would be primarily used at build time, but would necessitate the addition of a gcc7-libs package. With gcc7-libs, you could potentially end up in a state where you have two different versions of gcc libraries that your software would consume at runtime leading to unexpected behavior.

Consider two fictional packages, A and B. A has a runtime dependency on gcc-libs and B. B has a runtime dependency on gcc7-libs. When you execute a command from A, it will start to load the libraries required. Depending on the load order, it will load libraries from one of the gcc-libs packages first. When it tries to load libraries from the other gcc-libs package, it will see that the libraries with that name are already loaded. At this point, your software or some of its library dependencies are utilizing libraries of different versions than they were built against.

Since gcc-libs and gcc7-libs don't conflict with each other from a package naming perspective, we are unable to guard against that behavior.

Note I haven't read up on Arch's package policies around versions and differences between core/community/extra, so there could be something I'm missing. ArchLinux is able to do this by publishing gcc7 in its community repository, rather than core. My guess is that in their distro core packages can only depend on core packages, which is the same policy we have. Community gives them an extra 'onion layer' (community can depend on core, but not the other way) for things like CUDA and gcc7.

I'm not sure what the 'hab' solution is in this case though, as we haven't had to consider a case like this before. For now, would it make sense to update the README for CUDA, recommending the use of the bdangit/cuda packages with some context of why?

bdangit commented 5 years ago

@smacfarlane, in the A and B scenario, I thought the RPATH that is set in the binary determines the library loading.

So if I hardcode the exact paths of gcc-libs in A and the exact paths of gcc7-libs in B, then there should not be any runtime conflicts. In many plans I have touched or seen, I have occasionally used patchelf --print-rpath to show the exact hardcoded lib paths that are going to be loaded.
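
For example (the package paths below are hypothetical), you can confirm what each binary will actually search by printing its RPATH:

# Print the RPATH baked into each binary (hypothetical paths)
patchelf --print-rpath /hab/pkgs/myorigin/A/1.0.0/20190101000000/bin/a-tool
patchelf --print-rpath /hab/pkgs/myorigin/B/1.0.0/20190101000000/bin/b-tool

# readelf shows the same information without needing patchelf
readelf -d /hab/pkgs/myorigin/A/1.0.0/20190101000000/bin/a-tool | grep -E 'RPATH|RUNPATH'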

smacfarlane commented 5 years ago

Sort of. The RPATH tells the loader where to look for libraries; however (to the best of my knowledge), within the scope of a process you can only have one copy of a given library loaded in memory at a time.

Looking at the output of readelf -d $(hab pkg path core/postgresql)/bin/psql, we can see that it lists the required libraries, followed by the RPATH, which gives the predetermined locations to look for them. Note that the NEEDED entries don't contain paths, only the name of the library.

[24][default:/src:0]# readelf -d /hab/pkgs/core/postgresql/9.6.11/20190115202549/bin/psql

Dynamic section at offset 0x88dc8 contains 29 entries:
  Tag        Type                         Name/Value
 0x0000000000000001 (NEEDED)             Shared library: [libpq.so.5]
 0x0000000000000001 (NEEDED)             Shared library: [libreadline.so.7]
 0x0000000000000001 (NEEDED)             Shared library: [libm.so.6]
 0x0000000000000001 (NEEDED)             Shared library: [libc.so.6]
 0x000000000000000f (RPATH)              Library rpath: [/hab/pkgs/core/postgresql/9.6.11/20190115202549/lib:/hab/pkgs/core/glibc/2.27/20190115002733/lib:/hab/pkgs/core/openssl/1.0.2q/20190115014220/lib:/hab/pkgs/core/perl/5.28.0/20190115013014/lib:/hab/pkgs/core/readline/7.0.3/20190115012607/lib:/hab/pkgs/core/zlib/1.2.11/20190115003728/lib:/hab/pkgs/core/libossp-uuid/1.6.2/20190115171615/lib:/hab/pkgs/core/libxml2/2.9.8/20190115154829/lib:/hab/pkgs/core/geos/3.6.2/20190115171623/lib:/hab/pkgs/core/proj/4.9.3/20190115171931/lib:/hab/pkgs/core/gdal/2.2.1/20190115172005/lib]

When a piece of software tries to load a library, it will first check to see if that library has already been loaded. In the above A & B example, since B is loading in the context of A, whichever requests a library from its respective gcc package first will load it, and subsequent calls will see it already loaded. On top of that, I think there is some caching of library paths as it loads, so what it will find first is difficult to predict.

Taking a look at core/gcc/7.3.0 (pre-refresh) and core/gcc/8.2.0 (stable), you can see that the library names are identical, as far as they are listed in the NEEDED section (using qemu as an example):

[25][default:/src:0]# ls /hab/pkgs/core/gcc/*/*/lib/libstdc++*
/hab/pkgs/core/gcc/7.3.0/20180608051919/lib/libstdc++.a
/hab/pkgs/core/gcc/7.3.0/20180608051919/lib/libstdc++.la
/hab/pkgs/core/gcc/7.3.0/20180608051919/lib/libstdc++.so
/hab/pkgs/core/gcc/7.3.0/20180608051919/lib/libstdc++.so.6
/hab/pkgs/core/gcc/7.3.0/20180608051919/lib/libstdc++.so.6.0.24
/hab/pkgs/core/gcc/7.3.0/20180608051919/lib/libstdc++.so.6.0.24-gdb.py
/hab/pkgs/core/gcc/7.3.0/20180608051919/lib/libstdc++fs.a
/hab/pkgs/core/gcc/7.3.0/20180608051919/lib/libstdc++fs.la
/hab/pkgs/core/gcc/8.2.0/20190115004042/lib/libstdc++.a
/hab/pkgs/core/gcc/8.2.0/20190115004042/lib/libstdc++.la
/hab/pkgs/core/gcc/8.2.0/20190115004042/lib/libstdc++.so
/hab/pkgs/core/gcc/8.2.0/20190115004042/lib/libstdc++.so.6
/hab/pkgs/core/gcc/8.2.0/20190115004042/lib/libstdc++.so.6.0.25
/hab/pkgs/core/gcc/8.2.0/20190115004042/lib/libstdc++.so.6.0.25-gdb.py
/hab/pkgs/core/gcc/8.2.0/20190115004042/lib/libstdc++fs.a
/hab/pkgs/core/gcc/8.2.0/20190115004042/lib/libstdc++fs.la
[26][default:/src:0]# readelf -d /hab/pkgs/core/qemu/2.11.1/20190302015747/bin/qemu-system-x86_64

Dynamic section at offset 0xb719b8 contains 48 entries:
  Tag        Type                         Name/Value
...
 0x0000000000000001 (NEEDED)             Shared library: [libstdc++.so.6]

Either version of gcc above would, I believe, satisfy the libstdc++.so.6 requirement for qemu-system-x86_64. The symbols defined in each are likely slightly different, so it might work, or you could get missing-symbol errors at run time, or segfaults. If this came from a package in core, a user would have a broken experience with no (easy) recourse to resolve it.
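
One rough way to see how far apart the two copies are (paths taken from the listing above; the exact output is not guaranteed) is to compare the GLIBCXX symbol-version tags each one defines:

# List the GLIBCXX version tags each libstdc++ copy defines; a binary built
# against the newer copy may reference tags the older copy lacks, which is when
# you see "version GLIBCXX_x.y.z not found" style errors at run time.
strings /hab/pkgs/core/gcc/7.3.0/20180608051919/lib/libstdc++.so.6 | grep '^GLIBCXX_' | sort -u
strings /hab/pkgs/core/gcc/8.2.0/20190115004042/lib/libstdc++.so.6 | grep '^GLIBCXX_' | sort -u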

I need to do some additional reading on how the Arch community is using the community gcc7-libs package to see if I'm missing something, but my suspicion is that there isn't an easy solve to this.

bdangit commented 5 years ago

@smacfarlane, I took a deeper dive into understanding the loader. I disagree with your assessment, because it would mean my habitatized bins would not operate on bare metal or anywhere but a "HabOS". It would also mean that none of the binaries in the entire world would be able to run on any single OS (forget Habitat).

I ran 2 bins in a Linux VM (not in a studio):

[vagrant@localhost bin]$ ldd /hab/pkgs/core/cppcheck/1.86/20190116223357/bin/cppcheck
        linux-vdso.so.1 =>  (0x00007ffe8c3b9000)
        libpcre.so.1 => /hab/pkgs/core/pcre/8.42/20190115012526/lib/libpcre.so.1 (0x00007fe8e9271000)
        libstdc++.so.6 => /hab/pkgs/core/gcc-libs/8.2.0/20190115011926/lib/libstdc++.so.6 (0x00007fe8e8ed2000)
        libm.so.6 => /hab/pkgs/core/glibc/2.27/20190115002733/lib/libm.so.6 (0x00007fe8e8d3f000)
        libgcc_s.so.1 => /hab/pkgs/core/gcc-libs/8.2.0/20190115011926/lib/libgcc_s.so.1 (0x00007fe8e9256000)
        libc.so.6 => /hab/pkgs/core/glibc/2.27/20190115002733/lib/libc.so.6 (0x00007fe8e8b87000)
        libpthread.so.0 => /hab/pkgs/core/glibc/2.27/20190115002733/lib/libpthread.so.0 (0x00007fe8e9235000)
        /hab/pkgs/core/glibc/2.27/20190115002733/lib/ld-linux-x86-64.so.2 => /lib64/ld-linux-x86-64.so.2 (0x00007fe8e90da000)
[vagrant@localhost bin]$ LD_DEBUG=libs /hab/pkgs/core/cppcheck/1.86/20190116223357/bin/cppcheck
      7846:     find library=libpcre.so.1 [0]; searching
      7846:      search path=/hab/pkgs/core/glibc/2.27/20190115002733/lib/tls/x86_64/x86_64:/hab/pkgs/core/glibc/2.27/20190115002733/lib/tls/x86_64:/hab/pkgs/core/glibc/2.27/20190115002733/lib/tls/x86_64:/hab/pkgs/core/glibc/2.27/20190115002733/lib/tls:/hab/pkgs/core/glibc/2.27/20190115002733/lib/x86_64/x86_64:/hab/pkgs/core/glibc/2.27/20190115002733/lib/x86_64:/hab/pkgs/core/glibc/2.27/20190115002733/lib/x86_64:/hab/pkgs/core/glibc/2.27/20190115002733/lib          (system search path)
      7846:       trying file=/hab/pkgs/core/glibc/2.27/20190115002733/lib/tls/x86_64/x86_64/libpcre.so.1
      7846:       trying file=/hab/pkgs/core/glibc/2.27/20190115002733/lib/tls/x86_64/libpcre.so.1
      7846:       trying file=/hab/pkgs/core/glibc/2.27/20190115002733/lib/tls/x86_64/libpcre.so.1
      7846:       trying file=/hab/pkgs/core/glibc/2.27/20190115002733/lib/tls/libpcre.so.1
      7846:       trying file=/hab/pkgs/core/glibc/2.27/20190115002733/lib/x86_64/x86_64/libpcre.so.1
      7846:       trying file=/hab/pkgs/core/glibc/2.27/20190115002733/lib/x86_64/libpcre.so.1
      7846:       trying file=/hab/pkgs/core/glibc/2.27/20190115002733/lib/x86_64/libpcre.so.1
      7846:       trying file=/hab/pkgs/core/glibc/2.27/20190115002733/lib/libpcre.so.1
      7846:      search path=/hab/pkgs/core/gcc-libs/8.2.0/20190115011926/lib/tls/x86_64/x86_64:/hab/pkgs/core/gcc-libs/8.2.0/20190115011926/lib/tls/x86_64:/hab/pkgs/core/gcc-libs/8.2.0/20190115011926/lib/tls/x86_64:/hab/pkgs/core/gcc-libs/8.2.0/20190115011926/lib/tls:/hab/pkgs/core/gcc-libs/8.2.0/20190115011926/lib/x86_64/x86_64:/hab/pkgs/core/gcc-libs/8.2.0/20190115011926/lib/x86_64:/hab/pkgs/core/gcc-libs/8.2.0/20190115011926/lib/x86_64:/hab/pkgs/core/gcc-libs/8.2.0/20190115011926/lib:/hab/pkgs/core/pcre/8.42/20190115012526/lib/tls/x86_64/x86_64:/hab/pkgs/core/pcre/8.42/20190115012526/lib/tls/x86_64:/hab/pkgs/core/pcre/8.42/20190115012526/lib/tls/x86_64:/hab/pkgs/core/pcre/8.42/20190115012526/lib/tls:/hab/pkgs/core/pcre/8.42/20190115012526/lib/x86_64/x86_64:/hab/pkgs/core/pcre/8.42/20190115012526/lib/x86_64:/hab/pkgs/core/pcre/8.42/20190115012526/lib/x86_64:/hab/pkgs/core/pcre/8.42/20190115012526/lib                (RPATH from file /hab/pkgs/core/cppcheck/1.86/20190116223357/bin/cppcheck)
...
[vagrant@localhost bin]$ LD_DEBUG=libs ./gpuSquareDemo
      7848:     find library=libpthread.so.0 [0]; searching
      7848:      search path=/hab/pkgs/bdangit/gcc7-libs/7.3.0/20190302224838/lib/tls/x86_64/x86_64:/hab/pkgs/bdangit/gcc7-libs/7.3.0/20190302224838/lib/tls/x86_64:/hab/pkgs/bdangit/gcc7-libs/7.3.0/20190302224838/lib/tls/x86_64:/hab/pkgs/bdangit/gcc7-libs/7.3.0/20190302224838/lib/tls:/hab/pkgs/bdangit/gcc7-libs/7.3.0/20190302224838/lib/x86_64/x86_64:/hab/pkgs/bdangit/gcc7-libs/7.3.0/20190302224838/lib/x86_64:/hab/pkgs/bdangit/gcc7-libs/7.3.0/20190302224838/lib/x86_64:/hab/pkgs/bdangit/gcc7-libs/7.3.0/20190302224838/lib          (RPATH from file ./gpuSquareDemo)
      7848:       trying file=/hab/pkgs/bdangit/gcc7-libs/7.3.0/20190302224838/lib/tls/x86_64/x86_64/libpthread.so.0
      7848:       trying file=/hab/pkgs/bdangit/gcc7-libs/7.3.0/20190302224838/lib/tls/x86_64/libpthread.so.0
      7848:       trying file=/hab/pkgs/bdangit/gcc7-libs/7.3.0/20190302224838/lib/tls/x86_64/libpthread.so.0
      7848:       trying file=/hab/pkgs/bdangit/gcc7-libs/7.3.0/20190302224838/lib/tls/libpthread.so.0
      7848:       trying file=/hab/pkgs/bdangit/gcc7-libs/7.3.0/20190302224838/lib/x86_64/x86_64/libpthread.so.0
      7848:       trying file=/hab/pkgs/bdangit/gcc7-libs/7.3.0/20190302224838/lib/x86_64/libpthread.so.0
      7848:       trying file=/hab/pkgs/bdangit/gcc7-libs/7.3.0/20190302224838/lib/x86_64/libpthread.so.0
      7848:       trying file=/hab/pkgs/bdangit/gcc7-libs/7.3.0/20190302224838/lib/libpthread.so.0
      7848:      search path=/hab/pkgs/core/glibc/2.27/20190115002733/lib/tls/x86_64/x86_64:/hab/pkgs/core/glibc/2.27/20190115002733/lib/tls/x86_64:/hab/pkgs/core/glibc/2.27/20190115002733/lib/tls/x86_64:/hab/pkgs/core/glibc/2.27/20190115002733/lib/tls:/hab/pkgs/core/glibc/2.27/20190115002733/lib/x86_64/x86_64:/hab/pkgs/core/glibc/2.27/20190115002733/lib/x86_64:/hab/pkgs/core/glibc/2.27/20190115002733/lib/x86_64:/hab/pkgs/core/glibc/2.27/20190115002733/lib          (system search path)
      7848:       trying file=/hab/pkgs/core/glibc/2.27/20190115002733/lib/tls/x86_64/x86_64/libpthread.so.0
      7848:       trying file=/hab/pkgs/core/glibc/2.27/20190115002733/lib/tls/x86_64/libpthread.so.0
      7848:       trying file=/hab/pkgs/core/glibc/2.27/20190115002733/lib/tls/x86_64/libpthread.so.0
      7848:       trying file=/hab/pkgs/core/glibc/2.27/20190115002733/lib/tls/libpthread.so.0
      7848:       trying file=/hab/pkgs/core/glibc/2.27/20190115002733/lib/x86_64/x86_64/libpthread.so.0
      7848:       trying file=/hab/pkgs/core/glibc/2.27/20190115002733/lib/x86_64/libpthread.so.0
      7848:       trying file=/hab/pkgs/core/glibc/2.27/20190115002733/lib/x86_64/libpthread.so.0
      7848:       trying file=/hab/pkgs/core/glibc/2.27/20190115002733/lib/libpthread.so.0

You can see that the RPATH does indeed determine where the required lib is searched for, even though core/gcc-libs and bdangit/gcc7-libs are both installed in the /hab filesystem. Moreover, the Linux VM I am running has none of the gcc-libs preloaded:

[vagrant@localhost vagrant]$ cat /etc/ld.so.conf.d/*.conf
# Placeholder file, no vDSO hwcap entries used in this kernel.
/usr/lib64/mysql
/usr/lib/vmware-tools/lib32/libvmGuestLib.so
/usr/lib/vmware-tools/lib64/libvmGuestLib.so
/usr/lib/vmware-tools/lib32/libvmGuestLibJava.so
/usr/lib/vmware-tools/lib64/libvmGuestLibJava.so
/usr/lib/vmware-tools/lib32/libDeployPkg.so
/usr/lib/vmware-tools/lib64/libDeployPkg.so

However, I do believe there is a problem if I build a binary that also requires another library which itself depends on gcc8-libs; then I will definitely run into issues.

Scenario:

mybin depends on
- bdangit/gcc7-libs
- bdangit/cuda-libs
- core/zeromq (shared lib in this depends on core/gcc-libs)

I have not been able to validate this, but I'm pretty sure that within this exec context I will most likely run into segfaults because of conflicting gcc7 and gcc8 libs being loaded at the same time.
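
A quick way to check for that kind of mixed closure (package names here are the hypothetical ones from the scenario above, and this assumes the TDEPS metadata file that Habitat packages ship) would be:

# Look for more than one gcc-libs flavor in the transitive dependency list
grep -E 'gcc-libs|gcc7-libs' /hab/pkgs/myorigin/mybin/*/*/TDEPS

# And check what the loader actually resolves for the binary itself
ldd /hab/pkgs/myorigin/mybin/*/*/bin/mybin | grep -E 'gcc-libs|gcc7-libs'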

With that, I agree with you that we should move the cuda stuff and gcc7 things over into bdangit origin or any other origin other than core.

bdangit commented 5 years ago

BTW, this is a very good resource to read: https://cseweb.ucsd.edu/~gbournou/CSE131/the_inside_story_on_shared_libraries_and_dynamic_loading.pdf

I recommend the section on "Library Loading"

smacfarlane commented 5 years ago

@bdangit Thanks for having this discussion, I'm sorry we're unable to support CUDA in core at this time.

I'd like to leave this issue open so we can track any updates to supporting CUDA again in core with the historical context around the problem space.

I also think we should disconnect the CUDA plan from Builder for the time being. I don't want to deprecate it at this time, but I think we should update its readme with a link to this issue, links to the gcc7 and cuda packages in your origin, and a brief description of the steps needed to build them in a user's own origin, if they want to.

What do you think?

bdangit commented 5 years ago

Yep, I agree on this plan to move forward. It will be a while until I can get some time on this; most likely expect something next week.

bdangit commented 5 years ago

Just in, Cuda 10.1 now supports gcc 8.x! https://docs.nvidia.com/cuda/cuda-toolkit-release-notes/index.html#cuda-compiler-new-features

However, we have two options:

  1. Have a core/cuda. If gcc gets to 9.x but cuda does not support it yet, then we would be left with a broken package.

  2. Have a myorigin/cuda that specifically tracks gcc8. Newer versions of gcc won’t impact this package, but we would have to maintain gcc8 and gcc8-libs.

I’m still inclined to go with option 2, despite option 1's benefit of not having to maintain more packages.

@smacfarlane what do you think?

bdangit commented 5 years ago

Thinking about this some more: I'm planning on making an update to core/cuda for 10.1. If gcc 9.x comes out and hab updates to that, then we should re-evaluate or do the work at that time to get cuda working. I'm deferring the work right now because even getting from 9.x to 10.1 is going to take some time.

smacfarlane commented 5 years ago

I'm inclined to agree that option 2 may be better, as CUDA releases tend to be infrequent and probably undergo a long period of testing/stabilization, so they will always be a bit behind major gcc version updates out of necessity.

I've got exploring gcc9 in my backlog, so we should be able to answer that question soon-ish though.

bdangit commented 5 years ago

Ack. Yes, I also got hit with a notice that gcc 9.1 has been released. Alright, let’s go with plan 2. I’ll need to fork the current gcc plan, and then I can continue exploring making the cuda 10.1 package.

bdangit commented 5 years ago

started the work here: https://github.com/bdangit/cuda-plans