If we do get CUDA 9.2 into Pop, we will need to package it on our end.
We'd like to keep this open to track CUDA and nvidia 396. nvidia 396 is in validation and if testing passes, it will be released next week. Once that's available, we can look at packaging CUDA 9.2.
(updated 20180528 - FINAL)
NVIDIA's CUDA 9.2 offers many features that enhance compatibility with Ubuntu 18.04 and, although 18.04 is not yet officially supported, there are many reasons CUDA programmers and related developers may want to start testing it. Due to a late-breaking security patch, the 9.2.88 CUDA release now depends on NVIDIA's proprietary driver version 396.26. This unplanned change made 9.2.88 incompatible with the 396.24 drivers that the Ubuntu proprietary GPU drivers ppa had prepared. Since then, pre-release 396.26 ppa driver versions have been in testing and have been shown to work with Pop!_OS 18.04.
Starting point: Clean install of Pop!_OS 18.04 (the following used build 23)
These instructions use a pre-release copy stored on an unofficial / unsafe ppa. Proceed at your own risk.
There is an unofficial / untrusted ppa hosting a binary deb installer built from a recent staging build created in the official ppa as part of pre-release development. It is available here: https://launchpad.net/~bstudent/+archive/ubuntu/nvidia-graphics-drivers-396.26-copy-of-staging-ppa-20180522 Before using it you should check the main drivers ppa repos for current versions.
Even though this should be a smooth upgrade (see note), I'm going to add/update the repo at the CLI but then tweak the dependencies in synaptic:
sudo add-apt-repository\
ppa:bstudent/nvidia-graphics-drivers-396.26-copy-of-staging-ppa-20180522
sudo apt-get update
Normally one would run sudo apt-get install <package> as the next step, but that was not working for me on early tries for some reason (PEBKAC?). Odd, because I did not have such problems going from 390.48 to 396.24. As a result, I do this part in Synaptic:
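For reference, the CLI route would look something like the sketch below; the package name nvidia-396 is an assumption, so list what the ppa actually provides first:

```shell
# List driver packages the ppa provides; the 396-series name below is
# an assumption and may differ in the staging copy:
apt-cache search '^nvidia-' | grep 396
# Then, if the names line up, something like this (the step that failed for me):
sudo apt-get install nvidia-396
```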
Open synaptic
On the category selector bar at left, click the bottom entry Architecture, and at the top of the category selector bar, click All (no filtering, show all packages)
____ Mark the group of packages to install ____
Sort the package list by Latest Version: the packages of interest have a Latest Version field that begins with 396.26, hence grouping the items we'll add as well as those to drop. The text below assumes you are replacing drivers version 390.48, but the procedure is the same for 396.24.
____ Propagate to dependent or conflicting packages ____
____ Gather and review all packages marked ____
[ ] 4. On the bottom-half of the Category Selector, click Custom Filters
[ ] 5. On the top-half of the Category Selector, click Marked Changes, showing packages we marked or which were marked because they are dependent on marked packages.
[ ] 6. Items to be installed are marked green, ideally including all version 396.26 packages
[ ] 7. To-be-removed items are orange / red; Note that some appear to be things we want to keep. We'll come back to them.
[ ] 8. Clicking Broken in the upper Category Selector shows conflicted packages: some 396.26 packages may appear here pending their dependencies' resolution.
____ Regroup, make sure all to-be-removed packages are marked ____
[ ] 9. Next, again sort All packages by Latest Version to group them, and check that all 396.26 and 390.48 packages are marked, even if incorrectly.
[ ] 10. Mark 390.48 components for removal (NOT complete removal), and 396.26 for installation, and return to the Marked Changes set.
_ Undo "collateral damage" to Pop!OS and System76 components ___
[ ] 11. Critical exceptions: a few pop-os / system76 components are marked for removal - right-click each and select Mark For Reinstallation
[ ] 12. The system76-power package may be broken or resistant to marking for reinstall
[ ] 13. Mark the (new) component nvidia-prime for removal - it is replaced by system76-power.
[ ] 14. Finally, there may be some unmarked or unremoved "stragglers" among the 390.48 components, which is acceptable if they are not blocking installation of a corresponding 396.26 component.
__ Cleanup and Convergence ___
[ ] 15. Note that stragglers are particularly likely among 32-bit components, which are no longer supported.
[ ] 16. As flags are changed and dependencies align, try Mark All Upgrades again to verify that the new plan is stable and not creating any more new dependencies.
There should be no circular dependencies or deadlocks preventing quick convergence of packages to an acceptable plan. At that point, hit the Apply button and then reboot.
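Once rebooted, a quick sanity check from the terminal (a sketch; package names on your system may differ slightly):

```shell
# The driver should report version 396.26 once the new stack is live:
nvidia-smi | head -n 4
# List installed nvidia packages with their versions:
dpkg -l 'nvidia*' | awk '/^ii/ {print $2, $3}'
```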
Note that this build is not using the same CUDA packaging previously adopted, first because a CUDA 9.2 version doesn't exist, and second, for other reasons. We'll do two things differently:
First, we're going to base the installation of CUDA directly on the proprietary CUDA Toolkit deb package built for the Ubuntu platform by NVIDIA:
Click the [Download] button for:
Linux → x86_64 → Ubuntu → 17.10 → deb[network]
Make sure the deb[network] installer above is selected; to be doubly sure, simply click the following link, and click the [Download] button when presented.
Second, by choosing to work with the proprietary installer we have stepped into the land of the non-free: as is customary in this place, we must seek compensation. It comes in the form of extensive docs.
Find the [Documentation > ] button, leftmost in a row of four large buttons located well beneath the [Download] button clicked previously (on a small screen you may need to scroll down). For those who enjoy video-learning, below the [Documentation > ] button you will see the words "Get Started" writ large above an array of colorful video tutorials that offer to show how to install CUDA, how to learn a bit of CUDA programming (some are quite good), and how to desire the features of hardware you do not yet own. NOW click the [Documentation > ] button.
The landing page is an index to about fifty volumes of professionally written software documentation comprising on the order of 10,000 total pages (most available as pdf). To offer some perspective, this collection contains only documentation immediately pertinent to the technical details of CUDA; it does not include material specific to other business lines such as Deep Learning or Computer Vision, nor NVIDIA's principal business of computer graphics.
Fortunately, the only thing we need be concerned with now is the link below to the pdf version of the NVIDIA CUDA Installation Guide for Linux: Installation and Verification on Linux Systems.
This is the primary source for CUDA installation and it may prove useful to highlight and annotate it in one's pdf viewer of choice - it runs forty pages and covers critical topics including:
Naming and homing conventions for lib, bin, hdr, src asset, and toolchain elements, so that their status and location are available to authorized people and software.
How installer parameters can be used to alter resource homes globally, consistently, and safely.
Side-by-side homing of multiple CUDA releases under version-bound and/or Debug/Release-based location names, for users or applications (as with many learning frameworks) that are anchored to a specific CUDA release.
Recommended and best-practice approaches for working with open-source libraries for X, GL, or both.
[ ] Also keep a pdf copy of the CUDA Quick Start Guide handy - note this document covers all platforms, and has two separate Linux sections (a separate one for Power8/Power9): https://docs.nvidia.com/cuda/pdf/CUDA_Quick_Start_Guide.pdf
The table below, excerpted from the NVIDIA CUDA Installation Guide for Linux, partially documents the deb packages available via the repo we connected to above (including the cuda package we'll try to install).
Also note that if we give up on trying to install cuda bundled with its differently-packaged driver, the packages marked with a single + sign can be used together to install all CUDA tools sans bundled graphics driver, reducing the chance of conflict with the driver installed earlier.
Table 4. Meta Packages Available for CUDA 9.2

| Meta Package | Purpose |
|---|---|
| ++ cuda | Installs all CUDA Toolkit and Driver packages. Handles upgrading to the next version of the cuda package when it's released. |
| cuda-9-2 | Installs all CUDA Toolkit and Driver packages. Remains at version 9.2 until an additional version of CUDA is installed. |
| + cuda-toolkit-9-2 | Installs all CUDA Toolkit packages required to develop CUDA applications. Does not include the driver. |
| + cuda-tools-9-2 | Installs all CUDA command line and visual tools. |
| cuda-runtime-9-2 | Installs all CUDA Toolkit packages required to run CUDA applications, as well as the Driver packages. |
| + cuda-compiler-9-2 | Installs all CUDA compiler packages. |
| + cuda-libraries-9-2 | Installs all runtime CUDA Library packages. |
| + cuda-libraries-dev-9-2 | Installs all development CUDA Library packages. |
| cuda-drivers | Installs all Driver packages. Handles upgrading to the next version of the Driver packages when they're released. |
In the table above, the cuda meta-package, decorated with a ++ prefix, is the target we seek to build, and it is comprehensive. However, if there is an unresolvable conflict between the driver bundled with the CUDA package and that of the machine we are attempting to install on, the five targets together marked with a single + can be installed without attempting to change the current driver.
Note that a metapackage is a package that consists of nothing but dependencies, so it is of no use once the build is done.
For the following exercise, make sure your terminal commands are issued from the same directory to which the cuda deb[network] installer file was downloaded in the steps above - usually that's ~/Downloads
...
[ ] THIS DOESN'T COUNT AS ONE OF THE FOUR STEPS - JUST A REMINDER TO cd TO THE DOWNLOAD AREA OF THE DEB FILE:
cd ~/Downloads
[ ] Return your browser to the page from which we downloaded the deb[network] cuda installer; there you will find instructions consisting of four installation steps.
[ ] Follow the first three (of four):
sudo dpkg -i cuda-repo-ubuntu1710_9.2.88-1_amd64.deb
sudo apt-key adv --fetch-keys https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1710/x86_64/7fa2af80.pub
sudo apt-get update
sudo apt-get install cuda
Complete the cuda installation using Synaptic
NOTE: This process is very similar to what we did for installing the 396.26 graphics driver above, except there will be fewer conflicting files, and a couple of (former) "bafflers" not encountered before to which we have already worked out the solution as shown below.
[ ] 1. In synaptic, search for cuda - NOTE: It's helpful at this point to search only [NAME] rather than [DESCRIPTION AND NAME] - if forced to look for dependencies later on it will be helpful to look beyond the name of the module itself.
[ ] 2. The only CUDA (cuda) modules of interest at this point are version 9.2.88 (not concerned with 9.1.xx at this step of the process)
[ ] 3. There is one package called just cuda
[ ] 4. Unmark it, and preferably remove it (I did not know how at the time)
[ ] 5. Conflicts remain: Among the dependencies there are two conflicts:
nvidia-settings
libxnvctrl0
[ ] 6. Select nvidia-settings and libxnvctrl0, individually, one after the other, and for each one force the version to 396.26. This is accomplished by selecting the package in the main window, opening the Package menu from the main menu bar, and selecting Force Version, which brings up the conflicting choices; click the one you wish to favor, in this case 396.26.
[ ] 7. Apply the changes: hit the Apply button
DONE, INSTALLING.
Above, I said "the only ones that succeed are where NVIDIA-supplied CUDA deb packages install on top of ubuntu-ppa-supplied nvidia graphics driver packages"
In the following I'm going to refer to these two packages generically as NVIDIA-CUDA.deb and ubuntu-ppa-nvid-graphics.deb, respectively, and I'm going to refer to the non-NVIDIA cuda package by the name "nvidia"-cuda-toolkit, where the quotes indicate that it's somewhat of a misnomer.
[ ] #1: On Page 26 of the installation manual, the NVIDIA doc is very clear that I need to set my path like so, and why. This step can be automated if I'm willing to accept defaults. This is how I and my applications find the cuda toolchain and pals:
[ ] 1. export PATH=/usr/local/cuda-9.2/bin${PATH:+:${PATH}}
[ ] #2: On Page 26 of the installation manual, to the surprise of nobody, the doc is also very clear that I need to set the lib path, though there's an alternative method and, again, it could be automated. And again, when my toolchain is looking for stuff to link with, if it's CUDA-related then it's reachable via this path:
[ ] 2. export LD_LIBRARY_PATH=/usr/local/cuda-9.2/lib64${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}
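If the exports in bullets 1. and 2. should survive logouts and reboots, one option (a sketch - the /etc/profile.d location is conventional, not mandated by the manual) is a small profile fragment:

```shell
# Write a login-shell fragment so the exports persist across sessions;
# assumes the default /usr/local/cuda-9.2 install prefix.
sudo tee /etc/profile.d/cuda-9-2.sh >/dev/null <<'EOF'
export PATH=/usr/local/cuda-9.2/bin${PATH:+:${PATH}}
export LD_LIBRARY_PATH=/usr/local/cuda-9.2/lib64${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}
EOF
```

The `${VAR:+:${VAR}}` idiom appends the old value only if it was non-empty, so the variable never ends up with a stray trailing colon.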
[ ] #3 On Page 31 of the installation manual, the doc hedges this one, saying maybe I should set it, maybe not, but the installer might do it for me under the right conditions. TIL that statement is actually true.
[ ] 3. sudo apt-get install g++ freeglut3-dev build-essential libx11-dev libxmu-dev libxi-dev libglu1-mesa libglu1-mesa-dev
[ ] Regarding item 3, above: what I've observed when installing NVIDIA-CUDA.deb over ubuntu-ppa-nvid-graphics.deb is that the NVIDIA-CUDA.deb package calls exactly line 3. above in the process of the CUDA install, refreshing freeglut3, mesa, etc. Line 3. is in fact excerpted (along with lines 1. and 2.) from the Post-Installation Instructions of the NVIDIA Installation and Validation manual.
[ ] Similarly, on RHEL 7.4, for which I happened to have a pure proprietary-driver ISO that I burned a couple of months ago (no nouveau, slightly dated nvidia drivers, no cuda): when I run the proprietary CUDA installer on a freshly scrubbed box built from that ISO, the NVIDIA-CUDA.rpm package refreshes the proprietary driver set installed by the (proprietary) nvidia graphics install. So the NVIDIA-CUDA package appears to follow a when-in-Rome policy.
[ ] Now, here's a funny thing: reading the manual, one can infer that for CUDA release X.Y, the existence of /usr/local/cuda-X.Y - specifically /usr/local/cuda-9.2 above - is non-negotiable. The exported paths set in bullets 1. and 2. above are for my benefit as I sit at the console, running programs - so how does CUDA know where CUDA is?
CUDA uses only its version - set in its header files - to determine its home in the system.
Where's CUDA?
[ ] By default, NVIDIA-CUDA.deb for version X.Y (e.g. 9.2) does the following:
[ ] NVIDIA-CUDA.deb creates /usr/local/cuda-X.Y, which may be a symlink if the user specified a custom directory, for example /local/UDAC, as the place to home the cuda-X.Y files - but that symlink /usr/local/cuda-X.Y must still exist with exactly the expected version-based name so that cuda can identify /local/UDAC as its home. This is what CUDA does instead of, for example, depending on a persistent system environment variable.
[ ] NVIDIA-CUDA.deb creates /usr/local/cuda if it does not exist (unless instructed otherwise); it is always a symlink so that it can be pointed at the default install of cuda. My reading of the current manual says it's always pointed at the most recent install. So in our running example: /usr/local/cuda → /usr/local/cuda-X.Y → /local/UDAC
[ ] NVIDIA-CUDA.deb puts all cuda libs in /usr/local/cuda-X.Y/lib64 (64-bit case), which in this example means creating a directory in /local/UDAC.
[ ] NVIDIA-CUDA.deb puts all relevant CUDA binaries in /usr/local/cuda-X.Y/bin - same deal as lib64.
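The symlink chain above can be reproduced in a scratch directory to see how the name-based resolution works; /tmp paths stand in for the real locations, and UDAC is the hypothetical custom prefix from the text:

```shell
# Recreate the documented symlink chain under /tmp; only the names matter.
mkdir -p /tmp/UDAC/bin /tmp/UDAC/lib64   # the custom home
ln -sfn /tmp/UDAC /tmp/cuda-9.2          # stands in for /usr/local/cuda-9.2
ln -sfn /tmp/cuda-9.2 /tmp/cuda          # stands in for /usr/local/cuda
readlink -f /tmp/cuda                    # resolves through the chain to the real home
```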
[ ] What's strange is that "nvidia"-cuda-toolkit does not create or maintain a /usr/local/cuda or /usr/local/cuda-X.Y directory, which are required for CUDA and the applications that use it to function. Likewise, those CUDA libraries, binaries, and source files which can be found are not stored in a manner consistent with the way the CUDA system currently operates. Which, in turn, explains a great deal about the test results.
[ ] Let's talk about validation:
sudo reboot now
Benchmark 01: (score from 0 → 3)
[ ] Do the sample files exist anywhere on the system?
[ ] Does the builder script exist?
[ ] Is the script on path?
Benchmark 02: (score from 0 → 10) All-or-nothing:
[ ] Can the sample files be installed? Or are paths and libraries so broken that the installer itself can't successfully run to completion (e.g. can't find libGLU.o)?
cuda-install-samples-9.2.sh ~
cd NVIDIA_CUDA-9.2_Samples
Benchmark 03: (score from 0 → 50) Will it build?:
[ ] What fraction of the projects will build? Will the makefile run to completion or will it encounter an exception it cannot handle?
make
find . -type f | sed 's/.*\.//' | sort | uniq -c
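One hedged way to turn benchmark 03 into a number (assuming the default samples tree layout, where each project has its own Makefile):

```shell
# From the top of NVIDIA_CUDA-9.2_Samples, after running make:
built=$(find . -type f -executable ! -name '*.sh' | wc -l)
total=$(find . -mindepth 2 -name Makefile | wc -l)
echo "$built of $total sample projects produced a binary"
```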
[ ] BURN ALL THE CARDS:
cd 5_Simulations/nbody
./nbody
... CUDA installs, the makefile runs end-to-end. I've done the whole thing on two machines.
"nvidia"-cuda-toolkit
I like free and open software, but this story makes me sad:
- Everything is homed under /usr/lib/nvidia-cuda-toolkit; /usr/local/cuda-9.1 points to /usr/lib/nvidia-cuda-toolkit, meaning that CUDA has no way of definitively knowing where its home is, so it's not a viable package.
- The "nvidia"-cuda-toolkit package does not seem to include the sample code, so I downloaded a deb package containing the samples and an installation script.
- The samples installer, run against "nvidia"-cuda-toolkit, issued a variety of warnings and ultimately finished with severe warnings when its dependency checks showed that libGLU.so, libXi.so, and libXmu.so could not be found.
- The "nvidia"-cuda-toolkit.deb package did not install the free and open software packages required to run cuda, nor the non-free alternatives. Key dependencies on free software are not being checked.
- Getting "nvidia"-cuda-toolkit to install components properly would take more effort than doing it from scratch. Except that there's already a package that comes pretty close to doing a proper install by itself.
---END

In trying to make a 9.1 CUDA install, which would play nice with the current driver, I tracked down (most of) the post-install idiosyncrasies in the latest nvidia-cuda-toolkit deb / ppa package - which wraps the CUDA 9.1 build.
The most telling thing (to me) is that I've never heard of this debian packaging of CUDA, not from friends and colleagues - including those who are very free-software-centric, not from message boards, not at AI / Machine Learning conferences and events. There's nobody complaining about it on AskUbuntu.
So, given that strange observation, I go in and do what I or anyone in AI/ML/HPC would do on a new install to check that everything works - i.e. install and run the CUDA samples. And not only does it not make, it starts blowing up on install. And everything that blows up would have blowed up real good at least back to version 6.
Either I'm the first person who's tried making a serious stab at using the debian cuda package, or I'm the first person who's tried it and didn't just immediately move on. Actually I guess that's the same thing.
Point being, it is not what I'd consider a starting point.
For the "nvidia build" of Pop, I think it best to start over with a CUDA installer that is a minimal shell on top of the nvidia CUDA deb installer, modified to do three key things:
Bonus: Ideally, as time rolls forward, maintain the ability to have side-by-side builds of different CUDA releases, because many frameworks like Tensorflow and Keras want to use a build from one or two releases back, the same way servers have an LTS.
The ability to do side-by-side installs is baked in to the CUDA toolchain and library structure, so that update-alternatives can be used to swap between them. That capability is discarded by the deb-ubuntu package: whatever version is being installed gets stuffed into /usr/bin and /usr/lib/x86_64_whatever, flat.
In other words, CUDA is treated like a graphics driver (there can be only one!), when it's really somewhere between a compiler / interpreter / OS that can be running different versions on different GPUs sharing the same bus.
Another reason to roll your own installer for CUDA is that, despite being far more massive, it has far fewer interdependencies with other libraries than the graphics drivers do - way less headache. Usually the only pain point with CUDA is GL, and in that case the NVIDIA documentation provides the package list and install command for the open-source GL libs that can be substituted for NVIDIA's own, as well as installer flags that tell it explicitly not to mess with the existing GL installation.
Finally, and perhaps most important: it would be the easiest way of doing things correctly. You've already got a build that assumes an NVIDIA card, and NVIDIA's installers and methods are very consistent. I don't see a custom CUDA installer as being more than a thin wrapper that uses the deb-ubu graphics driver, and only needs that to the degree that you've got other components coupled to those drivers.
@BStudent I am just about to install CUDA on my popos 18.04 - Is the howto in here still the way to go?
Pop!_OS now has nvidia-396 drivers. We have not yet packaged CUDA 9.2; however, 9.1 is available in the repository. If you don't need 9.2, follow the instructions here: https://support.system76.com/articles/cuda/
@compiaffe under any circumstances I would only try this on a clean install: p(brick) is high.
The problem is not so much the CUDA version as the installers. What I am personally doing now is waiting: the fix is for someone to make a deb installer that follows the conventions of the CUDA Linux Installation Guide - like, using a particular scheme for directories, paths, and symlinks.
I don't know whether @WatchMkr is referring to a rebuilt version of the cuda 9.1 deb - but I would not use the one from the ubuntu-related launchpad ppa as of a few weeks ago. If the system76 team has a fixed version of the 9.1 installer (i.e. that follows CUDA Linux Installation Guide) made in the last few weeks, then I'd be confident in it for sure.
@WatchMkr did s76 make new packages for cuda 9.1?? The 9.1 deb I tried three weeks ago (which appeared to be from a non-system76 ppa) was broken but looked like it could be fixed easily by putting components where CUDA expects them to be. I've gotten 9.1 to install and build/run the sample files correctly, but not from the ppa installer. The problem isn't with drivers, it's with directories and paths:
For 9.1 there must be a directory or link named:
/usr/local/cuda-9.1
if /usr/local/cuda-9.1 is a directory then the following must exist
/usr/local/cuda-9.1/bin
/usr/local/cuda-9.1/lib
Cuda 9.1 (or any other version) knows that its "namesake" path exists, and knows where all of its components are relative to that, as a result of knowing its compiled-in version number.
and if /usr/local/cuda-9.1 is a link, e.g. to user-specified directory:
/usr/what/ever
/usr/what/ever/bin
/usr/what/ever/lib
... same deal. The user-specified directory has the same structure and a symlink with the expected name points to it. That's what's broken in the ppa cuda installers I've used so far.
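A hedged repair sketch for exactly that breakage - assuming, as observed, that the ppa package homes everything under /usr/lib/nvidia-cuda-toolkit; verify the layout before linking:

```shell
# Give the toolkit the version-named home CUDA expects; -n keeps ln from
# descending into an existing link target, so this is safe to re-run.
test -d /usr/lib/nvidia-cuda-toolkit && \
  sudo ln -sfn /usr/lib/nvidia-cuda-toolkit /usr/local/cuda-9.1
# Confirm the expected structure is now reachable through the link:
ls -l /usr/local/cuda-9.1/bin /usr/local/cuda-9.1/lib
```

Whether the package's internal bin/ and lib/ layout actually matches what CUDA expects still has to be checked by hand; the link alone fixes only the naming half of the problem.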
For Ubuntu, there is a post-installation step performed by the NVIDIA-supplied installers that loads the correct open-source mesa, libGLU, etc. Again, it's in the manual as an optional post-installation step.
There are some other pre- and post-installation steps, but that's the nature of the issue; it applies to all CUDA versions I've used on Linux.
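The compiled-in-version mechanism described above can be made concrete. CUDA_VERSION is the macro cuda.h defines (9020 for release 9.2); it is hard-coded below as an assumption so the sketch runs without the header:

```shell
# Derive the expected home directory name purely from the version number,
# the way the text says CUDA does. 9020 stands in for cuda.h's value.
CUDA_VERSION=9020
major=$((CUDA_VERSION / 1000))
minor=$((CUDA_VERSION % 1000 / 10))
echo "/usr/local/cuda-${major}.${minor}"   # prints /usr/local/cuda-9.2
```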
Is there a workaround or a fix on the horizon?
@compiaffe Is there a workaround or a fix on the horizon?
I'm not a s76 employee so I cannot speak for them. And it's unlikely they would comment on specifics until / unless they are looking for beta testers or actually released it ...
However @WatchMkr said s76 has a 396 driver in place. That was not the case a few weeks ago. Given that they have more than enough work to keep them busy, and also given that (I'm fairly certain) CUDA 9.2 is the only thing in Nvidia-land and Ubuntu-land that has a hard dependence on 396 drivers - specifically 396.26 - it would make little sense to mess with 396 if there weren't a CUDA task in the pipeline.
I also assume that some person at s76 has a market research task of scouting the NVidia support boards and various deep learning forums to figure out what the demand is.
@compiaffe , I'm sure you have a sense like I do of what will happen when people start going to those forums and saying "PopOS has CUDA support that just works" ...
@Rapha Sorry, when I looked at your message before I only saw the part that said "Is there a workaround or a fix on the horizon?"
The behavior you experienced is consistent with the problems I found.
nvidia-docker expects to find /usr/local/cuda-9.1 etc ...
On Mon, Jun 11, 2018 at 5:43 AM, Rapha notifications@github.com wrote:
The system76 deb https://support.system76.com/articles/cuda/ does not seem to install CUDA or at least it is not visible to i.e. nvidia-docker.
@BStudent https://github.com/BStudent I assume that is what you were referring to in your last message? @jackpot51 https://github.com/jackpot51 Is there a workaround or a fix on the horizon?
@Bstudent that one solved itself. I'm working with an external graphics card which needed to be plugged in. Once the graphics drivers are loaded nvidia-docker runs smoothly. I also noted that the newest system76-nvidia driver is a version 375 if I remember correctly.
Electricity helps! I haven't played with nvidia-docker much, but that makes sense. Have you run many packages, frameworks, or samples in the container yet?
Today NVIDIA sent out a general email blast to their gpu- and dev-centric mailing lists inviting everybody on linux* with a GPU to join the NGC and use its container registry.
Anyone can now sign up. Originally, containers were only for use in cloud services, but about 6 mos ago NVidia announced that anyone with a Titan V or Titan Xp could download the containers and use them locally as well. At the time, they did NOT announce the fact that actually anybody with a Pascal or higher architecture card (e.g. 10x0 and 10x0Ti) could also download and use the containers.
As @compiaffe discovered, the containers only really need to do an lspci to figure out what GPUs are on your PCIe bus. And they just need the graphics driver to ... drive graphics. All the other crap is taken care of.
This assumes your computer is having a "docker works again" day.
The array of containers is impressive, and exposes full features of the card.
Sign up here: https://ngc.nvidia.com/
B
* In this context, linux means prepackaged for Ubuntu and CentOS 7 / RHEL 7; ymmv elsewhere.
Note also that Anaconda cloud has a variety of official and unofficial CUDA implementations and related GPU-friendly packages.
Was able to get CUDA 9.2 installed, compile all the samples, and run them with the following Bash script:
#!/bin/bash
DEB="cuda-repo-ubuntu1710_9.2.88-1_amd64.deb"
KEY="https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1710/x86_64/7fa2af80.pub"
NET_INSTALLER="https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1710/x86_64/$DEB"
set -ex
wget $NET_INSTALLER
sudo dpkg -i $DEB
sudo apt-key adv --fetch-keys $KEY
sudo apt update
sudo apt install -y cuda-{toolkit,tools,compiler,libraries,libraries-dev}-9.2 \
g++ freeglut3-dev build-essential libx11-dev libxmu-dev \
libxi-dev libglu1-mesa libglu1-mesa-dev
No reboot required. Had system76-driver-nvidia installed beforehand.
Now I just need to look into getting it packaged locally, so as not to rely on NVIDIA's debian repo.
@mmstick that's fantastic.
Packaging it yourself and using system76-driver-nvidia is a great way to leave nothing to chance.
Along that same line, unless there's some practical reason it cannot be done, I recommend setting the key environment variables:
export PATH=/usr/local/cuda-9.2/bin${PATH:+:${PATH}}
export LD_LIBRARY_PATH=/usr/local/cuda-9.2/lib64${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}
... because the user doesn't have a choice to put the bins and libs anyplace else; no point inviting the cost of a support request from someone who didn't read the NVIDIA manual.
As a matter of fact, for the NVIDIA version of Pop!OS I say just preinstall CUDA.
I will try to replicate: I've got an Oryx Pro fresh out of the box and also a clean install of the latest Pop on an Acer Nitro AN515-51.
@mmstick are you using the available system76 driver on the ppa?
@BStudent Yes, I have system76-driver-nvidia installed, which is there by default on our hardware. I am in the process of packaging it from the runfile, and already have a profile script in place to set the environment variables. Just need to deal with some conflicts between path-handling and packaging issues with the installer (the runfile does not expect to be run in an environment that builds packages).
Yup. Are you also installing the samples and the script that transfers them? The reason I ask is that NV used to make them separately downloadable - and the links on the NV site might lead you to believe they still are. But no, the only way to install them is via the installer. That's a good thing to include but I'm guessing you already thought of it. Anyway: congratulations, @mmstick, you have beat the game.
With that done, It's probably a very short distance to offering a "deep learning stack," which, when announced on hacker news, nvidia cuda dev forums, and deep learning subreddits will cause a traffic jam on the pop_os download servers and hopefully back up the hardware supply chain as well:
1) (the easy part) Add current versions of TensorFlow, CNTK, KERAS, and (Real) Torch (the latter probably bundled with Lua) to the Pop Shop. I don't think there's been an ML framework that just runs on install, other than the $50K+ DGX workstation.
Side note: might want to also toss MS-R-OPEN, Intel Python, and RStudio into the "Deep Learning Department" of the Pop Shop as well, although those last three items are not CUDA.
2) (this has a potentially tricky part: it might require an alternate gcc - not sure): There are (or were) a few popular learning frameworks that run a version or two behind on CUDA. I'm fairly sure that older versions of CUDA usually work with later versions of the NVIDIA graphics driver, and CUDA is architected for side-by-side installs via the alternatives system. So creating backports of the 9.1 and 9.0 installers, in their own packages installable side-by-side, would get maximum user coverage. Of the four most-likely-to-stick-around packages (TensorFlow, CNTK, Keras, Torch), I don't think the current version of any of them uses a CUDA version lower than 9.0; however, 8.0 and 7.5 were popular in their day.
Getting CUDA and a couple of frameworks going out of the box without hassle will be a thing worth shouting from the rooftops.
B
@BStudent So, while I originally had the samples included as a separate package, they're not going to be included because the samples include artwork assets which only NVIDIA has the rights to distribute.
However, I do see that NVIDIA has a new samples project hosted on GitHub starting with the CUDA 9.2 Toolkit release, which does seem to be redistributable, so I will be including that.
So creating backports of the installers 9.1 and 9.0, in their own packages installable side-by side would get maximum user coverage.
The packaging project that I'm working on does currently use `update-alternatives`, so it should be possible to get multiple toolkits installed at the same time, and simply alternate between them.
@BStudent TensorFlow seems to depend on NVIDIA's cuDNN library. The license agreement doesn't clearly allow us to redistribute it (it seems to both forbid and allow it at the same time), and it opens System76 up to being regularly audited by NVIDIA if we distribute it. So, I don't think that will be possible.
TensorFlow should move towards using an open source DNN library, or NVIDIA should open source / relax their restrictions on distributors.
@BStudent So, while I originally had the samples included as a separate package, they're not going to be included because the samples include artwork assets which only NVIDIA has the rights to distribute.
Yeah, I think someone dropped the ball on that one - some of those go way back to the bad old days. Probably a good compromise would be to include in the README.md instructions for downloading the runfile and executing it with the flag that only installs the demo files. To a first approximation the samples don't change much. I can volunteer to write the three-line blurb myself :-). BTW, if you haven't played with nbody (esp. on a multi-GPU machine), you should check it out ...
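For the blurb, something along these lines should do it (the filename is from the 9.2.88 release, and the flags are from the 9.x runfile's `--help`, so verify against the actual installer before shipping):

```shell
# Download the runfile from NVIDIA, then install only the samples.
# --silent skips the interactive prompts; --samplespath is where they land.
sh cuda_9.2.88_396.26_linux.run --silent --samples \
    --samplespath="$HOME/NVIDIA_CUDA-9.2_Samples"
```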
However, I do see that NVIDIA has a new samples project hosted on GitHub starting with the CUDA 9.2 Toolkit release, which does seem to be redistributable, so I will be including that.
Yeah they hit the high points on the github version, but it's got a long way to go to be comprehensive.
Totally agree on cuDNN; it might even be worth asking NV to open it up. Probably the issue is that they wanted to get to market fast and used code from some third party who wants to get paid.
Meanwhile, people just gotta learn Lua. Also, CNTK is supposed to be very good. I'm sure it integrates well w/vscode.
Actually, a cool project would be to port Torch to Rust. Sadly I am already so overcommitted it ain't funny.
@mmstick I created a basic document for sample library download / install instructions and pushed you an invite to the project (CUDAPOP_production), so you can fork it and hand it off to whoever is in charge of that. It is probably better than starting from a blank sheet of paper.
@BStudent Thanks! You may be happy to know that I've been able to get the CUDA 9.0 / 9.1 / 9.2 toolkits installed alongside each other (each fully patched) with the packaging project that I'm working on. Changing between toolkits is as simple as:
sudo update-alternatives --config cuda
We might also end up packaging cuDNN as well. So Tensorflow and other projects might still be a possibility.
Proof:
Seriously, if you've ever checked out the stress levels on AskUbuntu and the NVIDIA dev forums over people trying to get CUDA working when a new NVIDIA driver comes out ... NV should be very motivated to open up and/or let you package cuDNN. Totally different economic pressures than with graphics drivers.
@BStudent I've developed tooling so that anyone can easily (ish) build, maintain, and host their own CUDA toolkit repo locally, if desired. This is likely useful in environments with many machines, or where there are strict requirements in place on how software can be obtained (locally, not externally).
The process effectively goes along the lines of:

1. `cargo install` the tool (though it requires nightly at the moment).
2. Run the `acquire_assets` script from that directory to pull in all the needed assets. It uses checksums to verify that assets are valid, and can be re-run at a later time as this repo config is updated.
3. Set `origin`, `label`, `email`, etc. in the `sources.toml` to the values you want.
4. Run `debrepbuild` from the `repo` directory, and wait for it to finish. It can be resumed if it fails on a certain package, without needing to rebuild packages that were already built. It may take a very long time, due to the size of the toolkit (1.2 GB per deb package). You may comment out the toolkits which you don't wish to build / download.
5. You end up with a `dists` and `pool` directory generated, which can then be publicly hosted on your package server.

The `debrepbuild` utility is also what we use for generating our Launchpad-less proprietary repo.
That would address issues I have personally experienced. -- B. Student "There's no such thing as spare time, no such thing as free time, no such thing as downtime, all you've got is lifetime. Go.”
If you do the build, then add the following in your sources list:
deb file:/<PATH_TO_DIR_WITH_DISTS_AND_POOL> bionic main
Then after an `apt update`, you should get the following:
yes, I'm definitely going to take it from the top with my oryx pro
cuDNN is packaging -- working on TensorFlow, at the moment.
cool - when I pm'ed you earlier I hadn't yet carefully scrubbed the instructions in your past few posts, which was on this afternoon's task list. Now that I have, I see that my questions about custom configs are indeed answered. Although it looks like the first thing I need to do is set up a separate build server.
Prior to making any changes to my oryx pro, I decided to use my old beater laptop (Acer Nitro AN515-15 w/ 1050Ti) as a guinea-pig for test installing the latest publicly-released code. Obviously this might not work for arbitrary non-System76 hardware, but it's a good sign. Steps as follows:
~1. clean-installed the latest nvidia build of Pop from ISO on website~ not sure it was a clean install
I'm optimistic that this works on a random gaming laptop.
Once I get the repos up (after getting TensorFlow packaged), it will just be
sudo apt install tensorflow-1.8-cuda-9.2
Which will pull in `bazel` & `system76-cudnn-9.2`, which pulls in `system76-cuda-9.2`, which pulls in `system76-cuda`. My major changes to `debrep` are finished as of yesterday, so it's mostly a matter of waiting for TensorFlow to compile (100 minutes per run) and seeing if everything's in order to ship.
OK say the word when it's ready and I'll try that one-liner on my oryxpro2. I've held it back from installing cuda or tf-type frameworks but otherwise subjected it to normal use. Good test case.
Packaging TensorFlow with Bazel has been a major PITA. It's hard to imagine that these are industry standards, given how ill-equipped their compiling process is. So I'm going to push the repo with just CUDA + cuDNN + Bazel for now. I'll need a bit more time to figure out how I can get TensorFlow to build within an sbuild chroot, and I can add that to the repo later.
Certainly everyone I know who works at google is smart, and the ones who are pure devs indicate that there are well-honed procedures for keeping ops tidy, so the lack of discipline in such a popular piece of code is a mystery when we look at it from the technical side. But ...
Bear in mind that Google makes no money off TensorFlow ... EXCEPT ... when you run it on Google Cloud Compute servers. Google sells time at hourly rates, and it is against NVIDIA policy to allow consumer GPU cards to be used in datacenters. Here is Google's lowest advertised rate for their lowest-end (and lowest $/FLOP) GPU:
1 x Tesla K80 Non-Preemptable Execution = $230/mo ~ 7TF for 32bit FMA
(TF = trillion FLOPs (floating-point ops) per second, which must always be qualified; in this case the FLOP is an FMA = floating-point multiply-accumulate, the standard vector operation used for inner products and solving linear systems, compared using 32-bit floats, for which no NVIDIA processors have artificially impaired performance.)
The above price (pulled off Google just now) is probably for a 2-GPU K80, which is the standard config (i.e. two chips per card, sublinear speedup for most applications), with 24GB GDDR5 vs 11GB on the 1080Ti. I use 32-bit because the consumer GTX 1080Ti is artificially impaired so that its 64-bit FMA performance is an order of magnitude slower than its 32-bit FMA. If I recall correctly, the K80 is also impaired for 64-bit, but not in a comparable way.
Regardless, according to a series of molecular dynamics simulations run in 32-bit Float by:
http://ambermd.org/gpus/benchmarks.htm#Benchmarks
... we can see that across the board the GTX 1080Ti delivers about double the performance of a "half K80," i.e. one chip running 12GB of memory. And assuming Google is giving you the full K80, note in the benchmark that the second chip only adds between 10% - 20% speedup.
Splitting the difference, we can assume that nominally the GTX 1080Ti is about 75% faster, i.e. 1.75x the speed of a K80, for 32-bit FMA. In the best case for efficiency, if TensorFlow can get linear speedup from the dual-chip K80, then the K80 and the GTX 1080Ti are about the same speed. So let's give the K80 the benefit of the doubt and make the initial math easier:
Currently, I can purchase a genuine 1080Ti from NVIDIA's website for $699 (limit 10 per customer, free shipping) plus tax ~- so about $1400 if you live in California -~ let's just ignore tax. So $699/$230 ~= 3, meaning you can buy a GTX 1080Ti for the price of three months' cloud time (AWS costs more!).
But that's calendar time. We really want to know the payback in terms of how much work gets done by the GPUs. In that calculation, what I called the "best case for efficiency" above is the "worst case for payback time" if we consider buying a 1080Ti vs "renting" cloud time from GOOG:
Worst Case Scenario: GTX 1080Ti purchase is break-even wrt cloud computing in 3 months.
Nominal Case: GTX 1080Ti break-even takes 3/1.75 = 1.7 months = 7 weeks + 3 days.
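Spelling out the arithmetic behind those two cases:

```latex
% Payback time for a $699 GTX 1080Ti vs $230/month of K80 cloud time
\[
  t_{\text{worst}} = \frac{\$699}{\$230/\text{mo}} \approx 3.0\ \text{months},
  \qquad
  t_{\text{nominal}} = \frac{t_{\text{worst}}}{1.75} \approx 1.7\ \text{months}
\]
```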
Now consider that AWS is Amazon's single biggest source of profit. And Google's annual reports explicitly say that "catching up" with Amazon in terms of earnings from Cloud Computing is one of their primary corporate goals.
So, why is it so hard to build TensorFlow? Why don't they put in the time and effort to make it easier for people to deploy on their own computers?
One thing for sure: nobody in the executive suite is pushing hard on the devs to make it a priority ...
:-)
I wouldn't say that building it is hard. The issue lies more in how it is built, where it builds to, how it installs, and how you configure the build. It's standard practice for all software projects to use either GNU Make, CMake, Meson, etc. Yet they are using quite a different setup, and they seem to have designed their build system with minimal regard to packaging.
- You run `bash compile.sh` to compile Bazel, and it pulls dependencies from the Internet as it does this.
- It needs `export HOME=build` set in the build script.

Eventually I got them to compile on my development system, but I have yet to figure out why it fails on our Jenkins-managed build server.
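A sketch of what that `HOME=build` workaround looks like in practice (the directory name is from the comment above; the compile step is commented out since it needs the Bazel sources):

```shell
# Redirect HOME into the build tree so bazel's cache (~/.cache/bazel)
# stays inside the package build directory instead of the real home.
export HOME="$PWD/build"
mkdir -p "$HOME"
# bash compile.sh   # would now cache under $PWD/build/.cache/bazel
echo "$HOME"
```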
@BStudent Okay, CUDA 9.0, 9.1, and 9.2 are now in our proprietary repo, which is enabled by default on all new installs of Pop. I'll have cudnn-9.1 in there soon to go alongside cudnn-9.0 and cudnn-9.2. Resuming my packaging efforts for Bazel + TensorFlow.
~/S/debrepbuild (master:971ed2|+0 -0) # apt search system76-cud
Sorting... Done
Full Text Search... Done
system76-cuda/bionic,now 0pop2 amd64 [installed,automatic]
NVIDIA CUDA Compiler / Libraries / Toolkit Metapackage
system76-cuda-9.0/bionic,now 0pop1 amd64 [installed,automatic]
NVIDIA CUDA 9.0 Compiler / Libraries / Toolkit
system76-cuda-9.0-dbgsym/bionic 0pop1 amd64
debug symbols for system76-cuda-9.0
system76-cuda-9.1/bionic 0pop1 amd64
NVIDIA CUDA 9.1 Compiler / Libraries / Toolkit
system76-cuda-9.1-dbgsym/bionic 0pop1 amd64
debug symbols for system76-cuda-9.1
system76-cuda-9.2/bionic,now 0pop1 amd64 [installed,automatic]
NVIDIA CUDA 9.2 Compiler / Libraries / Toolkit
system76-cuda-9.2-dbgsym/bionic 0pop1 amd64
debug symbols for system76-cuda-9.2
system76-cudnn-9.0/bionic,now 7.1.4~0pop1 amd64 [installed]
NVIDIA CUDA Deep Neural Network library (cuDNN) for CUDA 9.0
system76-cudnn-9.2/bionic,now 7.1.4~0pop1 amd64 [installed]
NVIDIA CUDA Deep Neural Network library (cuDNN) for CUDA 9.2
~/S/debrepbuild (master:971ed2|+0 -0) # apt policy system76-cuda-9.2
system76-cuda-9.2:
Installed: 0pop1
Candidate: 0pop1
Version table:
*** 0pop1 500
500 http://apt.pop-os.org/proprietary bionic/main amd64 Packages
100 /var/lib/dpkg/status
I noticed one of my PoP machines just did an update that looks like it included half the internet, plus a new 396.37 driver.
BTW, what you described in your previous message reeeeaaallllyy sounds like confirming evidence. Most big AI / analytics & modeling projects I've worked on that are genuine research have the exact opposite of systems-programming goals: a given script or snippet you're trying afresh has about a 50% - 75% chance of being thrown out because it just wasn't a good idea, or the right idea, and it's part of a larger project that itself has maybe a 30% - 50% chance of NOT being scrapped. Astro at Google X does a great job of explaining why that's actually a smart thing, and there are a couple of good YouTube videos where he talks about the need to "fail fast." So people are writing code like it has a 60% - 90% chance of being thrown out in the long run, because that's the actual case.

Given TF's dataflow structure and the heavy presence of Stanford Engineering (a long-time Matlab shop) at Google, TF probably started out mostly as individual neural net systems for specific individual projects, implemented as chunks of Matlab code connected by Simulink blocks - everybody doing this realizes it scales well and they need to pool resources and standardize to create the first prototype version of what will eventually be TF.
As the idea pans out, what typically happens is that the research team starts getting a little more careful, but their goal is to get to a feature complete prototype. At that point they write a spec with the algorithms and equations and hand it over to systems / devops teams without handing over the code: this is a sanity check, and a helpful analytic technique. Why? Sometimes AI/ML/analytics programs give weird unexpected results, and while it might be a bug, it can also be that the code is written as intended, but there is an unanticipated way in which the algorithm, i.e. the idea itself, breaks down and must be fixed - and finally, everything might be working perfectly and the weird result is a genuine anomaly. A good first thing to check is whether the same result is happening in both the prototype and the production models, which have no common source code.
So generally, research and dev will maintain two independent code bases, at least for a couple of years. It's a common practice - I've done this working for multiple companies.
So, as you've probably guessed, your description of the way the TF code base is thrown together sounds exactly like what we'd expect to see from the research prototype, rather than what we'd expect to see in systems code written by grownups....
Still working on TensorFlow. Gradually fixing build error after build error, and I'm just about there to getting it packaged for 9.2. I can certainly see that a lot of people have issues with building TensorFlow, judging by all of the issue reports on the project page, and even the continuing comments on issues that have long been closed.
Also fixed a `postinst` issue with the CUDA packages that would make them fail on sudo-less systems.

Things are shaping up here: my Oryx is doing CUDA/TF smoothly, except it's got a quirk where it will only run NVIDIA mode on the external monitor and will only run Intel mode on the built-in screen. AFAIK this is not a CUDA thing, but rather some combination of System76 driver configs and PEBKAC on my part. It seems to improve with time if I avoid trying to mess with settings and apply updates as they come out.
But I assume you've got a pretty clear picture of the value your work is providing: for someone whose job is to make forecasts, control processes, or do similar adaptive computations, it's a huge hit to spend weeks getting this stuff set up -- and having it blow up whenever a new video driver is released. There's a lot of people - including students - who would get immediate payback from not having to deal with that.
It's probably prudent to also start thinking about freezing on LTS tuples of TF, CUDA, and cuDNN. Fortunately you've gotten started at the end of an in-between phase: TF 1.9 just came out, with a big push to make it more accessible via deeper Keras integration. So far post-beta, TF has only released painful breaking changes on a full-number release, but one of those is coming up; my prediction for 2.0 would be closer integration between the main Python branch of TF and TFJS, the latter being a separate codebase that's been rebranded and given an API makeover. Meanwhile, CUDA has yet to officially support 18.04, and I'm guessing at this point they would save a raft of major changes for 10.0. And that, in turn, is likely to be linked to the release of the 20-series GPUs in a couple of months.
In addition to legit issues that are a PITA for you, I think a lot of what you see on the TensorFlow boards is that for better or worse, users are being forced to learn new things: a lot of the people who are building from source never built a package from source before. And are perhaps new to linux as well. The people who say "SVD runs slower on TF than it does on my CPU" might be good at Linear Algebra, but have never been forced to think of their system as anything but an abstraction - now they're dealing with a piece of metal that wants to be fed slabs of data of a certain shape and size, grouped in a particular way, and placed into special areas of memory with carefully managed timing.
@BStudent From what I've read on their issue board, I don't think it is just people being forced to learn new things. Especially what I read from this comment, and a few related issues on packaging and compiling.
I've read that there are teams working for three-letter agencies that have even been tasked with attempting to package it as well, and TF has formed a dedicated group to bring everyone together towards the goal of making TF packageable. No one has succeeded, though, and while the community wants and demands CMake as the much simpler alternative to Bazel, the TF authors are only interested in using their Bazel build system, and state that anyone using the CMake build system hidden within `contrib` will have to maintain the CMake file themselves (which seems like a full-time job in itself to keep it updated with all the Bazel changes). It doesn't help that the existing CMake is designed specifically for compiling on Windows and running tests.
To sum it all up, TF will continue to use Bazel and be fundamentally incompatible with how we package software on Linux. They recommend that everyone use Bazel instead of CMake as otherwise you will only end up with a subset of TF. They are open to the idea of someone creating a tool that can convert Bazel BUILD configuration files into CMake files, and providing distribution-compatible tarballs with the generated CMake files so that we would not need to have a Java-based Bazel build system. But ultimately, the state of TF packaging is fundamentally flawed.
I do have TF building in a schroot now. The last remaining issue for me is just figuring out what files I need and where to put them, as well as how to handle C / C++ / Java / Go / Python / etc. packaging, and how to bring it altogether in a system-wide setup. A bit difficult when they've configured Bazel to be compiled and used locally, rather than doing the right thing for packaging.
I'm doing the best I can at the moment to figure out how they are distributing TF. There isn't any documentation on how to install it into the system, or which files need to be packaged after building. They seem to be content with requiring that everyone who builds from source must add the generated build directories to their system paths and building TF applications from the TF sources.
As for performance issues, it does pain me to see that they adamantly propose that everyone use their Python API via Pip, and put most of their work into that, and only briefly mention that they have a C API in a different section. Then completely leave out that they do have a C++ API. I would think that Python is unsuitable for the kinds of things that you might use TF for, as is commonly found when scientific Python applications get rewritten in C / C++ / Rust and suddenly become 100-1000x faster due to now being able to feed more data to the CPU & GPU in less time.
O.M.F.G. To be honest, I am such a major non-fan of Java that I never looked at what a trainwreck Bazel is. But again, a quick check of Wikipedia points out that Bazel is a reduced-functionality version of Blaze, Google's proprietary internal build tool, which in turn mainly exists in order to keep mobile devices in lockstep with their cloud service platforms. I think there's already a consumer revolt of people doing their own builds of TF - certainly RStudio has. I think you're right on all counts; it is record-settingly bad. Although the one guy complaining about the expense of GPUs hasn't been shopping lately: you can get an overclocked EVGA 1080Ti with a waterblock for less than the lowest-end i9, and the high-end i9 is 2/3 of the way to a TITAN V.

The good news is that it's probably not necessary to officially support any languages for TF other than Python and C++. Python is really TF's target language, and for the parts of TF that need to go fast, the Python code is a very thin layer that wraps BLAS and LAPACK libraries (written in Fortran and highly optimized for CPU / GPU). The big slowdown in TF, at least at one time, was that they used a hodgepodge of different math libraries that can't be easily recompiled together - decision making right up there with choosing Bazel. But beyond basic Python and C++, there's more bang for the data science buck to be had with Microsoft R Open (the MKL version) and CNTK. And then, of course, there's this guy: https://github.com/termoshtt/accel.
Currently working on the C++ portion. C was easy. They just have a lot of C / C++ headers scattered throughout different areas, and expect them to be in certain locations. The script that generates the Python wheel seems to have most of the things in the right places, but it lacks quite a few headers. I hope to have some working examples compiling today.
It seems that packaging C + Python + C++ all in one go is effectively impossible. The way they have C++ set up requires that you effectively copy your project into the Tensorflow source code and use Bazel to build it. Figuring out how to decouple it would take a long time. So I will probably make C++ a separate package that builds from FloopCZ's CMake files, which have already done the work.
I don't see a lot of mixed-mode programming on TF (maybe it exists) - I think cloning or forking something like FloopCZ is way more sensible.

And despite all the ink in the TF docs about pip and virtualenv being "preferred" (and prominently featuring instructions for Python 2.7), I think most people - including me - use Anaconda3 / conda, which is also pushed by most of the MOOCs these days. (A most unfortunate name for RHEL users.) Using Anaconda3 consists of signing up for a free anaconda.org account, which then lets you search for packages on different "channels" - the main ones being anaconda and conda-forge - and then using conda at the command line to create / install / update envs with the packages and versions you want; conda solves the dependency graph to ensure consistency. The packages and other environment parameters are stored as YAML. For $5/mo you can use the Anaconda cloud service to host your own builds and environment configs.

The anaconda channel already has builds of TF (and everything else) that target CUDA, MKL, and system-native libraries, work pretty much the same across win / mac / linux, and are kept up to date. The packages are complete, so, for example, tensorflow-gpu includes not only TF 1.8 but also cudnn and other supporting packages. And pip can run within a conda environment if one needs to put less-mainstream layers on top of that.

I think the easiest and most supportable thing to do for Python is to just install Anaconda3 / conda together with some pre-built environments - e.g. dependent on whether your laptop currently has the GPU turned on. Either just use anaconda binaries or build your own identical binaries and use anaconda as a benchmark.
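For comparison, the conda route I'm describing is roughly this (assuming a stock Anaconda3 install; the env name is mine, and in conda of this era activation is via `source activate`):

```shell
# Create an isolated env with the anaconda channel's GPU build of TF;
# conda resolves cudnn and the other supporting packages itself.
conda create -y -n tf-gpu python=3.6 tensorflow-gpu
source activate tf-gpu
python -c "import tensorflow as tf; print(tf.__version__)"
```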
@BStudent I'm pretty close to getting this done at the moment. FloopCZ's CMakefiles have been very useful. As it turns out, I can use it to build the C++ libraries & headers, and then build the C & Python API off the copy of TF that it downloaded and configured. So it can be a part of the same package.
Nice.
If you feel obligated to add the python package to a virtual environment manager, I think the first choice would be anaconda - but it's also good to draw the line somewhere and work on other things while waiting for customer feedback.
So far as additional GPU-related stuff, it seems like you've taken down the two major barriers:
There's definitely high interest in Rust on the part of robotics people, which includes algo / high-frequency traders as well as researchers working with agent-based / population-based massive-concurrency models. Given System76's commitment to Rust, making a thin Rust wrapper for CUDA - or some portions of CUDA - might be a good way to go. I'm far from an expert, but it looks like that was tried with Leaf and it collapsed under the weight of ambition, twice: a big roadmap, extra work due to being early in the life cycle of Rust, and not a lot of people on the project. Accel seems to be plugging along, partly by proceeding in manageable increments. Working on something like that is a sort of bet on how popular Rust will become, and how quickly. But it's also a bet that can affect the adoption rate of Rust...
@BStudent I remember Leaf. The progress made cannot be undone, so if anyone truly wants to, they could volunteer to maintain & continue it. As Rust becomes more commonplace, I'm sure the demand for frameworks like Leaf & Accel will increase.
So, I finally have TensorFlow 1.9 building within sbuild with a static shared library of the C++ API, as well as providing the C & Python libs. Seems to be working well. I'm going to do another build or two as I shuffle some files around and get everything ready to release into the wild. It will use the alternatives feature that CUDA is also using.
The C++ lib is rather huge, though. It's 514 MiB. Not much can be done about that, though, as otherwise you'd need to build C++ projects from within the tensorflow source code with Bazel.
@mmstick I was just looking at Leaf again ... I thought it was just a wrapper for CUDA etc. with some linear algebra functions, but they were basically trying to recreate a TF/CNTK-type full framework with compute graph, multiple solvers (optimizers), etc. - and do it in a more elegant, better-organized way. Maybe Mozilla could do something on that scale, but that's about the smallest org that can sustain a monster project like that. Most of the original frameworks of that crop are slowly dying off - Theano most recently.

The 514 MiB doesn't sound that large given the size of TF, actually. I don't know if you've ever spent time taking a class or going through some tutorials in TF; the more I think about it, the more impressed I am that it runs at all. I think these frameworks basically offer an almost infinite number of opportunities for "justifiable" scope creep - more so than an OS, for example, because at some point the OS is done enough that applications collectively take up way more programmer hours than the kernel. I think there's some principle of design or software engineering that needs to be applied (perhaps invented) to manage learning frameworks. For now, there's probably only a few where API users collectively put in more time than the dev team.
Overview: The Current State