If we do get CUDA 9.2 into Pop, we will need to package it on our end.
We'd like to keep this open to track CUDA and nvidia 396. nvidia 396 is in validation and if testing passes, it will be released next week. Once that's available, we can look at packaging CUDA 9.2.
(updated 20180528 - FINAL)
NVIDIA's CUDA 9.2 offers many features that enhance compatibility with Ubuntu 18.04 and, although 18.04 is not yet officially supported, there are many reasons CUDA programmers and related developers may want to start testing it. Due to a late-breaking security patch, the 9.2.88 CUDA release now depends on NVIDIA's proprietary driver version 396.26. This unplanned change made 9.2.88 incompatible with the 396.24 drivers that the Ubuntu proprietary GPU drivers ppa had prepared. Since then, pre-release 396.26 ppa driver versions have been in testing and have been shown to work with Pop!_OS 18.04.
Starting point: Clean install of Pop!_OS 18.04 (the following used build 23)
These instructions use a pre-release copy stored on an unofficial / unsafe ppa. Proceed at your own risk.
There is an unofficial / untrusted ppa hosting a binary deb installer built from a recent staging build created in the official ppa as part of pre-release development. It is available here: https://launchpad.net/~bstudent/+archive/ubuntu/nvidia-graphics-drivers-396.26-copy-of-staging-ppa-20180522 Before using it you should check the main drivers ppa repos for current versions.
Even though this should be a smooth upgrade (see note), I'm going to add/update the repo at the CLI but then tweak the dependencies in synaptic:
sudo add-apt-repository\
ppa:bstudent/nvidia-graphics-drivers-396.26-copy-of-staging-ppa-20180522
sudo apt-get update
Normally one would run sudo apt-get install <package> as the next step, but that was not working for me on early tries for some reason (PEBKAC?). Odd, because I did not have such problems going from 390.48 to 396.24. As a result, I do this part in Synaptic:
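For reference, the CLI route would look something like the sketch below; the package name nvidia-396 is an assumption, so list what the ppa actually provides first:

```shell
# List driver packages the ppa provides; the 396-series name below is
# an assumption and may differ in the staging copy:
apt-cache search '^nvidia-' | grep 396
# Then, if the names line up, something like this (the step that failed for me):
sudo apt-get install nvidia-396
```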
Open synaptic
On the category selector bar at left, click the bottom entry Architecture, and at the top of the category selector bar, click All (no filtering, show all packages)
____ Mark the group of packages to install ____
Sort the package list by Latest Version: the packages of interest have a Latest Version field that begins with 396.26, hence grouping the items we'll add as well as those to drop. The text below assumes you are replacing drivers version 390.48, but the procedure is the same for 396.24.
____ Propagate to dependent or conflicting packages ____
____ Gather and review all packages marked ____
[ ] 4. On the bottom-half of the Category Selector, click Custom Filters
[ ] 5. On the top-half of the Category Selector, click Marked Changes, showing packages we marked or which were marked because they are dependent on marked packages.
[ ] 6. Items to be installed are marked green, ideally including all version 396.26 packages
[ ] 7. To-be-removed items are orange / red; Note that some appear to be things we want to keep. We'll come back to them.
[ ] 8. Clicking Broken in the upper Category Selector shows conflicted packages: some 396.26 packages may appear here pending their dependencies' resolution.
____ Regroup, make sure all to-be-removed packages are marked ____
[ ] 9. Next, again sort All packages by Latest Version to group them, and check that all 396.26 and 390.48 packages are marked, even if incorrectly.
[ ] 10. Mark 390.48 components for removal (NOT complete removal), and 396.26 for installation, and return to the Marked Changes set.
_ Undo "collateral damage" to Pop!OS and System76 components ___
[ ] 11. Critical exceptions: a few pop-os / system76 components are marked for removal - right-click each and select Mark For Reinstallation
[ ] 12. The system76-power package may be broken or resistant to marking for reinstall
[ ] 13. Mark the (new) component nvidia-prime for removal - it is replaced by system76-power.
[ ] 14. Finally, there may be some unmarked or unremoved "stragglers" among the 390.48 components, which is acceptable if they are not blocking installation of a corresponding 396.26 component.
__ Cleanup and Convergence ___
[ ] 15. Note that stragglers are particularly likely among 32-bit components, which are no longer supported.
[ ] 16. As flags are changed and dependencies align, try Mark All Upgrades again to verify that the new plan is stable and not creating any more new dependencies.
There should be no circular dependencies or deadlocks preventing quick convergence of packages to an acceptable plan. At that point, hit the Apply button and then reboot.
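Once rebooted, a quick sanity check from the terminal (a sketch; package names on your system may differ slightly):

```shell
# The driver should report version 396.26 once the new stack is live:
nvidia-smi | head -n 4
# List installed nvidia packages with their versions:
dpkg -l 'nvidia*' | awk '/^ii/ {print $2, $3}'
```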
Note that this build is not using the same CUDA packaging previously adopted, first because a CUDA 9.2 version doesn't exist, and second, for other reasons. We'll do two things differently:
First, we're going to base the installation of CUDA directly on the proprietary CUDA Toolkit deb package built for the Ubuntu platform by NVIDIA:
Click the [Download] button for:
Linux → x86_64 → Ubuntu → 17.10 → deb[network]
Make sure the deb[network] installer above is selected; to be doubly sure, simply click the following link, and click the [Download] button when presented.
Second, by choosing to work with the proprietary installer we have stepped into the land of the non-free: as is customary in this place, we must seek compensation. It comes in the form of extensive docs.
Find the [Documentation > ] button, leftmost in a row of four large buttons located well beneath the [Download] button clicked previously (on a small screen you may need to scroll down). For those who enjoy video-learning, below the [Documentation > ] button you will see the words "Get Started" writ large above an array of colorful video tutorials that offer to show how to install CUDA, how to learn a bit of CUDA programming (some are quite good), and how to desire the features of hardware you do not yet own. NOW click the [Documentation > ] button.
The landing page is an index to about fifty volumes of professionally written software documentation comprising on the order of 10,000 total pages (most available as pdf). To offer some perspective, this collection contains only documentation immediately pertinent to the technical details of CUDA; it does not include material specific to other business lines such as Deep Learning or Computer Vision, nor NVIDIA's principal business of computer graphics.
Fortunately, the only thing we need be concerned with now is the link below to the pdf version of the NVIDIA CUDA Installation Guide for Linux: Installation and Verification on Linux Systems.
This is the primary source for CUDA installation and it may prove useful to highlight and annotate it in one's pdf viewer of choice - it runs forty pages and covers critical topics including:
Naming and homing conventions for lib, bin, hdr, src asset, and toolchain elements, so that their status and location are available to authorized people and software.
How installer parameters can be used to alter resource homes globally, consistently, and safely.
Side-by-side homing of multiple CUDA releases under version-bound and/or Debug/Release-based location names, for users or applications (as with many learning frameworks) that are anchored to a specific CUDA release.
Recommended and best-practice approaches for working with open-source libraries for X, GL, or both.
[ ] Also keep a pdf copy of the CUDA Quick Start Guide handy - note this document covers all platforms, and has two separate Linux sections (a separate one for Power8/Power9): https://docs.nvidia.com/cuda/pdf/CUDA_Quick_Start_Guide.pdf
The table below, excerpted from the NVIDIA CUDA Installation Guide for Linux, partially documents the deb packages available via the repo we connected to above (including the cuda package we'll try to install).
Also note that if we give up on trying to install cuda bundled with its differently-packaged driver, the packages marked with a single + sign can be used together to install all CUDA tools sans bundled graphics driver, reducing the chance of conflict with the driver installed earlier.
Table 4. Meta Packages Available for CUDA 9.2

| Meta Package | Purpose |
|---|---|
| ++ cuda | Installs all CUDA Toolkit and Driver packages. Handles upgrading to the next version of the cuda package when it's released. |
| cuda-9-2 | Installs all CUDA Toolkit and Driver packages. Remains at version 9.2 until an additional version of CUDA is installed. |
| + cuda-toolkit-9-2 | Installs all CUDA Toolkit packages required to develop CUDA applications. Does not include the driver. |
| + cuda-tools-9-2 | Installs all CUDA command line and visual tools. |
| cuda-runtime-9-2 | Installs all CUDA Toolkit packages required to run CUDA applications, as well as the Driver packages. |
| + cuda-compiler-9-2 | Installs all CUDA compiler packages. |
| + cuda-libraries-9-2 | Installs all runtime CUDA Library packages. |
| + cuda-libraries-dev-9-2 | Installs all development CUDA Library packages. |
| cuda-drivers | Installs all Driver packages. Handles upgrading to the next version of the Driver packages when they're released. |
In the table above, the cuda meta-package, decorated with a ++ prefix, is the target we seek to build, and it is comprehensive. However, if there is an unresolvable conflict between the driver bundled with the CUDA package and that of the machine we are attempting to install on, the five targets together marked with a single + can be installed without attempting to change the current driver.
Note that a metapackage is a package that consists of nothing but dependencies, so it is of no use once the build is done.
For the following exercise, make sure your terminal commands are issued from the same directory to which the cuda deb[network] installer file was downloaded in the steps above - usually that's ~/Downloads
...
[ ] THIS DOESN'T COUNT AS ONE OF THE FOUR STEPS - JUST A REMINDER TO cd TO THE DOWNLOAD AREA OF THE DEB FILE:
cd ~/Downloads
[ ] Return your browser to the page from which we downloaded the deb[network] cuda installer; there you will find instructions consisting of four installation steps.
[ ] Follow the first three (of four):
sudo dpkg -i cuda-repo-ubuntu1710_9.2.88-1_amd64.deb
sudo apt-key adv --fetch-keys https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1710/x86_64/7fa2af80.pub
sudo apt-get update
sudo apt-get install cuda
Complete the cuda installation using Synaptic
NOTE: This process is very similar to what we did for installing the 396.26 graphics driver above, except there will be fewer conflicting files, and a couple of (former) "bafflers" not encountered before to which we have already worked out the solution as shown below.
[ ] 1. In synaptic, search for cuda - NOTE: It's helpful at this point to search only [NAME] rather than [DESCRIPTION AND NAME] - if forced to look for dependencies later on it will be helpful to look beyond the name of the module itself.
[ ] 2. The only CUDA (cuda) modules of interest at this point are version 9.2.88 (not concerned with 9.1.xx at this step of the process)
[ ] 3. There is one package called just cuda
[ ] 4. Unmark it, and preferably remove it (I did not know how at the time)
[ ] 5. Conflicts remain: Among the dependencies there are two conflicts:
nvidia-settings
libxnvctrl0
[ ] 6. Select nvidia-settings and libxnvctrl0, individually, one after the other, and for each one force the version to 396.26. This is accomplished by selecting the package in the main window, opening the Package menu from the main menu bar, and selecting Force Version, which brings up the conflicting choices; click the one you wish to favor, in this case 396.26.
[ ] 7. Apply the changes: hit the Apply button
DONE, INSTALLING.
Above, I said "the only ones that succeed are where NVIDIA-supplied CUDA deb packages install on top of ubuntu-ppa-supplied nvidia graphics driver packages"
In the following I'm going to refer to these two packages generically as NVIDIA-CUDA.deb and ubuntu-ppa-nvid-graphics.deb, respectively, and I'm going to refer to the non-NVIDIA cuda package by the name "nvidia"-cuda-toolkit, where the quotes indicate that it's somewhat of a misnomer.
[ ] #1: On Page 26 of the installation manual, the NVIDIA doc is very clear that I need to set my path like so, and why. This step can be automated if I'm willing to accept defaults. This is how I and my applications find the cuda toolchain and pals:
[ ] 1. export PATH=/usr/local/cuda-9.2/bin${PATH:+:${PATH}}
[ ] #2: On Page 26 of the installation manual, to the surprise of nobody, the doc is also very clear that I need to set the lib path, though there's an alternative method and, again, it could be automated. And again, when my toolchain is looking for stuff to link with, if it's CUDA-related then it's reachable via this path:
[ ] 2. export LD_LIBRARY_PATH=/usr/local/cuda-9.2/lib64${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}
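If the exports in bullets 1. and 2. should survive logouts and reboots, one option (a sketch - the /etc/profile.d location is conventional, not mandated by the manual) is a small profile fragment:

```shell
# Write a login-shell fragment so the exports persist across sessions;
# assumes the default /usr/local/cuda-9.2 install prefix.
sudo tee /etc/profile.d/cuda-9-2.sh >/dev/null <<'EOF'
export PATH=/usr/local/cuda-9.2/bin${PATH:+:${PATH}}
export LD_LIBRARY_PATH=/usr/local/cuda-9.2/lib64${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}
EOF
```

The `${VAR:+:${VAR}}` idiom appends the old value only if it was non-empty, so the variable never ends up with a stray trailing colon.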
[ ] #3 On Page 31 of the installation manual, the doc hedges this one, saying maybe I should set it, maybe not, but the installer might do it for me under the right conditions. TIL that statement is actually true.
[ ] 3. sudo apt-get install g++ freeglut3-dev build-essential libx11-dev libxmu-dev libxi-dev libglu1-mesa libglu1-mesa-dev
[ ] Regarding item 3, above: what I've observed when installing NVIDIA-CUDA.deb over ubuntu-ppa-nvid-graphics.deb is that the NVIDIA-CUDA.deb package calls exactly line 3. above in the process of the CUDA install, refreshing freeglut3, mesa, etc. Line 3. is in fact excerpted (along with lines 1. and 2.) from the Post-Installation Instructions of the NVIDIA Installation and Validation manual.
[ ] Similarly, on RHEL 7.4, for which I happened to have a pure proprietary-driver ISO that I burned a couple of months ago (no nouveau, slightly dated nvidia drivers, no cuda): when I run the proprietary CUDA installer on a freshly scrubbed box built from that ISO, the NVIDIA-CUDA.rpm package refreshes the proprietary driver set installed by the (proprietary) nvidia graphics install. So the NVIDIA-CUDA package appears to follow a when-in-Rome policy.
[ ] Now, here's a funny thing: reading the manual, one can infer that for CUDA release X.Y, the existence of /usr/local/cuda-X.Y - specifically /usr/local/cuda-9.2 above - is non-negotiable. The exported paths set in bullets 1. and 2. above are for my benefit as I sit at the console, running programs - so how does CUDA know where CUDA is?
CUDA uses only its version - set in its header files - to determine its home in the system.
Where's CUDA?
[ ] By default, NVIDIA-CUDA.deb for version X.Y (e.g. 9.2) does the following:
[ ] NVIDIA-CUDA.deb creates /usr/local/cuda-X.Y, which may be a symlink if the user specified a custom directory, for example /local/UDAC, as the place to home the cuda-X.Y files - but that symlink /usr/local/cuda-X.Y must still exist with exactly the expected version-based name so that cuda can identify /local/UDAC as its home. This is what CUDA does instead of, for example, depending on a persistent system environment variable.
[ ] NVIDIA-CUDA.deb creates /usr/local/cuda if it does not exist (unless instructed otherwise); it is always a symlink so that it can be pointed at the default install of cuda. My reading of the current manual says it's always pointed at the most recent install. So in our running example: /usr/local/cuda → /usr/local/cuda-X.Y → /local/UDAC
[ ] NVIDIA-CUDA.deb puts all cuda libs in /usr/local/cuda-X.Y/lib64 (64-bit case), which in this example means creating a directory in /local/UDAC.
[ ] NVIDIA-CUDA.deb puts all relevant CUDA binaries in /usr/local/cuda-X.Y/bin - same deal as lib64.
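The symlink chain above can be reproduced in a scratch directory to see how the name-based resolution works; /tmp paths stand in for the real locations, and UDAC is the hypothetical custom prefix from the text:

```shell
# Recreate the documented symlink chain under /tmp; only the names matter.
mkdir -p /tmp/UDAC/bin /tmp/UDAC/lib64   # the custom home
ln -sfn /tmp/UDAC /tmp/cuda-9.2          # stands in for /usr/local/cuda-9.2
ln -sfn /tmp/cuda-9.2 /tmp/cuda          # stands in for /usr/local/cuda
readlink -f /tmp/cuda                    # resolves through the chain to the real home
```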
[ ] What's strange is that "nvidia"-cuda-toolkit does not create or maintain a /usr/local/cuda or /usr/local/cuda-X.Y directory, which are required for CUDA and the applications that use it to function. Likewise, those CUDA libraries, binaries, and source files which can be found are not stored in a manner consistent with the way the CUDA system currently operates. Which, in turn, explains a great deal about the test results.
[ ] Let's talk about validation:
sudo reboot now
Benchmark 01: (score from 0 → 3)
[ ] Do the sample files exist anywhere on the system?
[ ] Does the builder script exist?
[ ] Is the script on path?
Benchmark 02: (score from 0 → 10) All-or-nothing:
[ ] Can the sample files be installed? Or are paths and libraries so broken that the installer itself can't successfully run to completion (e.g. can't find libGLU.o)?
cuda-install-samples-9.2.sh ~
cd NVIDIA_CUDA-9.2_Samples
Benchmark 03: (score from 0 → 50) Will it build?:
[ ] What fraction of the projects will build? Will the makefile run to completion or will it encounter an exception it cannot handle?
make
find . -type f | sed 's/.*\.//' | sort | uniq -c
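One hedged way to turn benchmark 03 into a number (assuming the default samples tree layout, where each project has its own Makefile):

```shell
# From the top of NVIDIA_CUDA-9.2_Samples, after running make:
built=$(find . -type f -executable ! -name '*.sh' | wc -l)
total=$(find . -mindepth 2 -name Makefile | wc -l)
echo "$built of $total sample projects produced a binary"
```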
[ ] BURN ALL THE CARDS:
cd 5_Simulations/nbody
./nbody
... CUDA installs, the makefile runs end-to-end. I've done the whole thing on two machines.
"nvidia"-cuda-toolkit
I like free and open software, but this story makes me sad:
- Everything is homed under /usr/lib/nvidia-cuda-toolkit; /usr/local/cuda-9.1 points to /usr/lib/nvidia-cuda-toolkit, meaning that CUDA has no way of definitively knowing where its home is, so it's not a viable package.
- The "nvidia"-cuda-toolkit package does not seem to include the sample code, so I downloaded a deb package containing the samples and an installation script.
- The samples installer, run against "nvidia"-cuda-toolkit, issued a variety of warnings and ultimately finished with severe warnings when its dependency checks showed that libGLU.so, libXi.so, and libXmu.so could not be found.
- The "nvidia"-cuda-toolkit.deb package did not install the free and open software packages required to run cuda, nor the non-free alternatives. Key dependencies on free software are not being checked.
- Getting "nvidia"-cuda-toolkit to install components properly would take more effort than doing it from scratch. Except that there's already a package that comes pretty close to doing a proper install by itself.
---END

In trying to make a 9.1 CUDA install, which would play nice with the current driver, I tracked down (most of) the post-install idiosyncrasies in the latest nvidia-cuda-toolkit deb / ppa package - which wraps the CUDA 9.1 build.
The most telling thing (to me) is that I've never heard of this debian packaging of CUDA, not from friends and colleagues - including those who are very free-software-centric, not from message boards, not at AI / Machine Learning conferences and events. There's nobody complaining about it on AskUbuntu.
So, given that strange observation, I go in and do what I or anyone in AI/ML/HPC would do on a new install to check that everything works - i.e. install and run the CUDA samples. And not only does it not make, it starts blowing up on install. And everything that blows up would have blowed up real good at least back to version 6.
Either I'm the first person who's tried making a serious stab at using the debian cuda package, or I'm the first person who's tried it and didn't just immediately move on. Actually I guess that's the same thing.
Point being, it is not what I'd consider a starting point.
For the "nvidia build" of Pop, I think it best to start over with a CUDA installer that is a minimal shell on top of the nvidia CUDA deb installer, modified to do three key things:
Bonus: Ideally, as time rolls forward, maintain the ability to have side-by-side builds of different CUDA releases, because many frameworks like Tensorflow and Keras want to use a build from one or two releases back, the same way servers have an LTS.
The ability to do side-by-side installs is baked in to the CUDA toolchain and library structure, so that update-alternatives can be used to swap between them. That capability is discarded by the deb-ubuntu package: whatever version is being installed gets stuffed into /usr/bin and /usr/lib/x86_64_whatever, flat.
In other words, CUDA is treated like a graphics driver (there can be only one!), when it's really somewhere between a compiler / interpreter / OS that can be running different versions on different GPUs sharing the same bus.
Another reason to roll your own installer for CUDA is that, despite being far more massive, it has far fewer interdependencies with other libraries than the graphics drivers do - way less headache. Usually the only pain point with CUDA is GL, and in that case the NVIDIA documentation provides the package list and install command for the open-source GL libs that can be substituted for NVIDIA's own, as well as installer flags that tell it explicitly not to mess with the existing GL installation.
Finally, and perhaps most important: it would be the easiest way of doing things correctly. You've already got a build that assumes an NVIDIA card, and NVIDIA's installers and methods are very consistent. I don't see a custom CUDA installer as being more than a thin wrapper that uses the deb-ubu graphics driver, and only needs that to the degree that you've got other components coupled to those drivers.
@BStudent I am just about to install CUDA on my popos 18.04 - Is the howto in here still the way to go?
Pop!_OS now has nvidia-396 drivers. We have not yet packaged CUDA 9.2; however, 9.1 is available in the repository. If you don't need 9.2, follow the instructions here: https://support.system76.com/articles/cuda/
@compiaffe under any circumstances I would only try this on a clean install: p(brick) is high.
The problem is not so much the CUDA version as the installers. What I am personally doing now is waiting: the fix is for someone to make a deb installer that follows the conventions of the CUDA Linux Installation Guide - like, using a particular scheme for directories, paths, and symlinks.
I don't know whether @WatchMkr is referring to a rebuilt version of the cuda 9.1 deb - but I would not use the one from the ubuntu-related launchpad ppa as of a few weeks ago. If the system76 team has a fixed version of the 9.1 installer (i.e. that follows CUDA Linux Installation Guide) made in the last few weeks, then I'd be confident in it for sure.
@WatchMkr did s76 make new packages for cuda 9.1?? The 9.1 deb I tried three weeks ago (which appeared to be from a non-system76 ppa) was broken but looked like it could be fixed easily by putting components where CUDA expects them to be. I've gotten 9.1 to install and build/run the sample files correctly, but not from the ppa installer. The problem isn't with drivers, it's with directories and paths:
For 9.1 there must be a directory or link named:
/usr/local/cuda-9.1
if /usr/local/cuda-9.1 is a directory then the following must exist
/usr/local/cuda-9.1/bin
/usr/local/cuda-9.1/lib
Cuda 9.1 (or any other version) knows that its "namesake" path exists, and knows where all of its components are relative to that, as a result of knowing its compiled-in version number.
and if /usr/local/cuda-9.1 is a link, e.g. to user-specified directory:
/usr/what/ever
/usr/what/ever/bin
/usr/what/ever/lib
... same deal. The user-specified directory has the same structure and a symlink with the expected name points to it. That's what's broken in the ppa cuda installers I've used so far.
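A hedged repair sketch for exactly that breakage - assuming, as observed, that the ppa package homes everything under /usr/lib/nvidia-cuda-toolkit; verify the layout before linking:

```shell
# Give the toolkit the version-named home CUDA expects; -n keeps ln from
# descending into an existing link target, so this is safe to re-run.
test -d /usr/lib/nvidia-cuda-toolkit && \
  sudo ln -sfn /usr/lib/nvidia-cuda-toolkit /usr/local/cuda-9.1
# Confirm the expected structure is now reachable through the link:
ls -l /usr/local/cuda-9.1/bin /usr/local/cuda-9.1/lib
```

Whether the package's internal bin/ and lib/ layout actually matches what CUDA expects still has to be checked by hand; the link alone fixes only the naming half of the problem.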
For Ubuntu, there is a post-installation step performed by the NVIDIA-supplied installers that loads the correct open-source mesa, libGLU, etc. Again, it's in the manual as an optional post-installation step.
There are some other pre- and post-installation steps, but that's the nature of the issue; it applies to all CUDA versions I've used on Linux.
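The compiled-in-version mechanism described above can be made concrete. CUDA_VERSION is the macro cuda.h defines (9020 for release 9.2); it is hard-coded below as an assumption so the sketch runs without the header:

```shell
# Derive the expected home directory name purely from the version number,
# the way the text says CUDA does. 9020 stands in for cuda.h's value.
CUDA_VERSION=9020
major=$((CUDA_VERSION / 1000))
minor=$((CUDA_VERSION % 1000 / 10))
echo "/usr/local/cuda-${major}.${minor}"   # prints /usr/local/cuda-9.2
```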
Is there a workaround or a fix on the horizon?
@compiaffe Is there a workaround or a fix on the horizon?
I'm not a s76 employee so I cannot speak for them. And it's unlikely they would comment on specifics until / unless they are looking for beta testers or actually released it ...
However @WatchMkr said s76 has a 396 driver in place. That was not the case a few weeks ago. Given that they have more than enough work to keep them busy, and also given that (I'm fairly certain) CUDA 9.2 is the only thing in Nvidia-land and Ubuntu-land that has a hard dependence on 396 drivers - specifically 396.26 - it would make little sense to mess with 396 if there weren't a CUDA task in the pipeline.
I also assume that some person at s76 has a market research task of scouting the NVidia support boards and various deep learning forums to figure out what the demand is.
@compiaffe , I'm sure you have a sense like I do of what will happen when people start going to those forums and saying "PopOS has CUDA support that just works" ...
@Rapha Sorry, when I looked at your message before I only saw the part that said "Is there a workaround or a fix on the horizon?"
The behavior you experienced is consistent with the problems I found.
nvidia-docker expects to find /usr/local/cuda-9.1 etc ...
On Mon, Jun 11, 2018 at 5:43 AM, Rapha notifications@github.com wrote:
The system76 deb https://support.system76.com/articles/cuda/ does not seem to install CUDA or at least it is not visible to i.e. nvidia-docker.
@BStudent https://github.com/BStudent I assume that is what you were referring to in your last message? @jackpot51 https://github.com/jackpot51 Is there a workaround or a fix on the horizon?
@Bstudent that one solved itself. I'm working with an external graphics card which needed to be plugged in. Once the graphics drivers are loaded nvidia-docker runs smoothly. I also noted that the newest system76-nvidia driver is a version 375 if I remember correctly.
Electricity helps! I haven't played with nvidia-docker much, but that makes sense. Have you run many packages, frameworks, or samples in the container yet?
Today NVIDIA sent out a general email blast to their gpu- and dev-centric mailing lists inviting everybody on linux* with a GPU to join the NGC and use its container registry.
Anyone can now sign up. Originally, containers were only for use in cloud services, but about 6 mos ago NVidia announced that anyone with a Titan V or Titan Xp could download the containers and use them locally as well. At the time, they did NOT announce the fact that actually anybody with a Pascal or higher architecture card (e.g. 10x0 and 10x0Ti) could also download and use the containers.
As @compiaffe discovered, the containers only really need to do an lspci to figure out what GPUs are on your PCIe bus. And they just need the graphics driver to ... drive graphics. All the other crap is taken care of.
This assumes your computer is having a "docker works again" day.
The array of containers is impressive, and exposes full features of the card.
Sign up here: https://ngc.nvidia.com/
B
* In this context, linux means prepackaged for Ubuntu and CentOS 7 / RHEL 7; ymmv elsewhere.
Note also that Anaconda cloud has a variety of official and unofficial CUDA implementations and related GPU-friendly packages.
Was able to get CUDA 9.2 installed, compile all the samples, and run them with the following Bash script:
#!/bin/bash
DEB="cuda-repo-ubuntu1710_9.2.88-1_amd64.deb"
KEY="https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1710/x86_64/7fa2af80.pub"
NET_INSTALLER="https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1710/x86_64/$DEB"
set -ex
wget $NET_INSTALLER
sudo dpkg -i $DEB
sudo apt-key adv --fetch-keys $KEY
sudo apt update
sudo apt install -y cuda-{toolkit,tools,compiler,libraries,libraries-dev}-9.2 \
g++ freeglut3-dev build-essential libx11-dev libxmu-dev \
libxi-dev libglu1-mesa libglu1-mesa-dev
No reboot required. Had system76-driver-nvidia installed beforehand.
Now I just need to look into getting it packaged locally, so as not to rely on NVIDIA's debian repo.
@mmstick that's fantastic.
Packaging it yourself and using system76-driver-nvidia is a great way to leave nothing to chance.
Along that same line, unless there's some practical reason it cannot be done, I recommend setting the key environment variables:
export PATH=/usr/local/cuda-9.2/bin${PATH:+:${PATH}}
export LD_LIBRARY_PATH=/usr/local/cuda-9.2/lib64${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}
... because the user doesn't have a choice to put the bins and libs anyplace else; no point inviting the cost of a support request from someone who didn't read the NVIDIA manual.
As a matter of fact, for the NVIDIA version of Pop!OS I say just preinstall CUDA.
I will try to replicate: I've got an Oryx Pro fresh out of the box and also a clean install of the latest Pop on an Acer Nitro AN515-51.
@mmstick are you using the available system76 driver on the ppa?
@BStudent Yes, I have system76-driver-nvidia installed, which is there by default on our hardware. I am in the process of packaging it from the runfile, and already have a profile script in place to set the environment variables. Just need to deal with some conflicts between path-handling and packaging issues with the installer (the runfile does not expect to be run in an environment that builds packages).
Yup. Are you also installing the samples and the script that transfers them? The reason I ask is that NV used to make them separately downloadable - and the links on the NV site might lead you to believe they still are. But no, the only way to install them is via the installer. That's a good thing to include but I'm guessing you already thought of it. Anyway: congratulations, @mmstick, you have beat the game.
With that done, It's probably a very short distance to offering a "deep learning stack," which, when announced on hacker news, nvidia cuda dev forums, and deep learning subreddits will cause a traffic jam on the pop_os download servers and hopefully back up the hardware supply chain as well:
1) (the easy part) Add current versions of TensorFlow, CNTK, KERAS, and (Real) Torch (the latter probably bundled with Lua) to the Pop Shop. I don't think there's been an ML framework that just runs on install, other than the $50K+ DGX workstation.
Side note: might want to also toss MS-R-OPEN, Intel Python, and RStudio into the "Deep Learning Department" of the Pop Shop as well, although those last three items are not CUDA.
2) (this has a potentially tricky part: it might require an alternate gcc - not sure): There are (or were) a few popular learning frameworks that run a version or two behind on CUDA. I'm fairly sure that older versions of CUDA usually work with later versions of the NVIDIA graphics driver, and CUDA is architected for side-by-side installs via the alternatives system. So creating backports of the 9.1 and 9.0 installers, in their own packages installable side-by-side, would get maximum user coverage. Of the four most-likely-to-stick-around packages (TensorFlow, CNTK, Keras, Torch), I don't think the current version of any of them uses a CUDA version lower than 9.0; however, 8.0 and 7.5 were popular in their day.
Getting CUDA and a couple of frameworks going out of the box without hassle will be a thing worth shouting from the rooftops.
B
@BStudent So, while I originally had the samples included as a separate package, they're not going to be included because the samples include artwork assets which only NVIDIA has the rights to distribute.
However, I do see that NVIDIA has a new samples project hosted on GitHub starting with the CUDA 9.2 Toolkit release, which does seem to be redistributable, so I will be including that.
So creating backports of the installers 9.1 and 9.0, in their own packages installable side-by side would get maximum user coverage.
The packaging project that I'm working on does currently use `update-alternatives`, so it should be possible to get multiple toolkits installed at the same time, and simply alternate between them.
@BStudent TensorFlow seems to depend on NVIDIA's cuDNN library. The license agreement doesn't clearly allow us to redistribute it (it seems to both forbid and allow it at the same time), and it opens System76 up to being regularly audited by NVIDIA if we distribute it. So, I don't think that will be possible.
TensorFlow should move towards using an open source DNN library, or NVIDIA should open source / relax their restrictions on distributors.
@BStudent So, while I originally had the samples included as a separate package, they're not going to be included because the samples include artwork assets which only NVIDIA has the rights to distribute.
Yeah, I think someone dropped the ball on that one - some of those go way back to the bad old days. Probably a good compromise would be to include in the README.md instructions for downloading the runfile and executing it with the flag that only installs the demo files. To a first approximation the samples don't change much. I can volunteer to write the three-line blurb myself :-). BTW, if you haven't played with nbody (esp. on a multi-GPU machine), you should check it out ...
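For the blurb, something along these lines should do it (the filename is from the 9.2.88 release, and the flags are from the 9.x runfile's `--help`, so verify against the actual installer before shipping):

```shell
# Download the runfile from NVIDIA, then install only the samples.
# --silent skips the interactive prompts; --samplespath is where they land.
sh cuda_9.2.88_396.26_linux.run --silent --samples \
    --samplespath="$HOME/NVIDIA_CUDA-9.2_Samples"
```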
However, I do see that NVIDIA has a new samples project hosted on GitHub starting with the CUDA 9.2 Toolkit release, which does seem to be redistributable, so I will be including that.
Yeah they hit the high points on the github version, but it's got a long way to go to be comprehensive.
Totally agree on cuDNN; it might even be worth asking NV to open it up. Probably the issue is that they wanted to get to market fast and used code from some third party who wants to get paid.
Meanwhile, people just gotta learn Lua. Also, CNTK is supposed to be very good. I'm sure it integrates well w/vscode.
Actually, a cool project would be to port Torch to Rust. Sadly I am already so overcommitted it ain't funny.
@mmstick I created a basic document for sample library download / install instructions and pushed you an invite to the project (CUDAPOP_production), so you can fork it and hand it off to whoever is in charge of that. It is probably better than starting from a blank sheet of paper.
@BStudent Thanks! You may be happy to know that I've been able to get the CUDA 9.0 / 9.1 / 9.2 toolkits installed alongside each other (each fully patched) with the packaging project that I'm working on. Changing between toolkits is as simple as:
sudo update-alternatives --config cuda
We might also end up packaging cuDNN as well. So Tensorflow and other projects might still be a possibility.
Proof:
Seriously, if you've ever checked out the stress levels on AskUbuntu and the NVIDIA dev forums over people trying to get CUDA working when a new NVIDIA driver comes out ... NV should be very motivated to open up and/or let you package cuDNN. Totally different economic pressures than with graphics drivers.
@BStudent I've developed tooling so that anyone can easily (ish) build, maintain, and host their own CUDA toolkit repo locally, if desired. This is likely useful in environments with many machines, or where there are strict requirements in place on how software can be obtained (locally, not externally).
The process effectively goes along the lines of:

1. `cargo install` the tool (though it requires nightly at the moment).
2. Run the `acquire_assets` script from that directory to pull in all the needed assets. It uses checksums to verify that assets are valid, and can be re-run at a later time as this repo config is updated.
3. Set `origin`, `label`, `email`, etc. in the `sources.toml` to the values you want.
4. Run `debrepbuild` from the `repo` directory, and wait for it to finish. It can be resumed if it fails on a certain package, without needing to rebuild packages that were already built. It may take a very long time, due to the size of the toolkit (1.2 GB per deb package). You may comment out the toolkits which you don't wish to build / download.
5. You end up with a `dists` and `pool` directory generated, which can then be publicly hosted on your package server.

The `debrepbuild` utility is also what we use for generating our Launchpad-less proprietary repo.
That would address issues I have personally experienced. -- B. Student "There's no such thing as spare time, no such thing as free time, no such thing as downtime, all you've got is lifetime. Go.”
If you do the build, then add the following in your sources list:
deb file:/<PATH_TO_DIR_WITH_DISTS_AND_POOL> bionic main
Then after an `apt update`, you should get the following:
yes, I'm definitely going to take it from the top with my oryx pro
cuDNN is packaging -- working on TensorFlow, at the moment.
cool - when I pm'ed you earlier I hadn't yet carefully scrubbed the instructions in your past few posts, which was on this afternoon's task list. Now that I have, I see that my questions about custom configs are indeed answered. Although it looks like the first thing I need to do is set up a separate build server.
Prior to making any changes to my oryx pro, I decided to use my old beater laptop (Acer Nitro AN515-15 w/ 1050Ti) as a guinea-pig for test installing the latest publicly-released code. Obviously this might not work for arbitrary non-System76 hardware, but it's a good sign. Steps as follows:
~1. clean-installed the latest nvidia build of Pop from ISO on website~ not sure it was a clean install
I'm optimistic that this works on a random gaming laptop.
Once I get the repos up (after getting TensorFlow packaged), it will just be
sudo apt install tensorflow-1.8-cuda-9.2
Which will pull in `bazel` & `system76-cudnn-9.2`, which pulls in `system76-cuda-9.2`, which pulls in `system76-cuda`. My major changes to `debrep` are finished as of yesterday, so it's mostly a matter of waiting for TensorFlow to compile (100 minutes per run) and seeing if everything's in order to ship.
OK say the word when it's ready and I'll try that one-liner on my oryxpro2. I've held it back from installing cuda or tf-type frameworks but otherwise subjected it to normal use. Good test case.
Packaging TensorFlow with Bazel has been a major PITA. It's hard to imagine that these are industry standards, given how ill-equipped their compiling process is. So I'm going to push the repo with just CUDA + cuDNN + Bazel for now. I'll need a bit more time to figure out how I can get TensorFlow to build within an sbuild chroot, and I can add that to the repo later.
Certainly everyone I know who works at google is smart, and the ones who are pure devs indicate that there are well-honed procedures for keeping ops tidy, so the lack of discipline in such a popular piece of code is a mystery when we look at it from the technical side. But ...
Bear in mind that Google makes no money off TensorFlow ... EXCEPT ... when you run it on Google Cloud Compute servers. Google sells time at hourly rates, and it is against NVIDIA policy to allow consumer GPU cards to be used in datacenters. Here is Google's lowest advertised rate for their lowest-end (and lowest $/FLOP) GPU:
1 x Tesla K80 Non-Preemptable Execution = $230/mo ~ 7TF for 32bit FMA
(TF = trillion FLOPs (floating-point ops) per second, which must always be qualified; in this case the FLOP is an FMA = floating-point multiply-accumulate, the standard vector operation used for inner products and solving linear systems, compared using 32-bit floats, for which no NVIDIA processors have artificially impaired performance.)
The above price (pulled off Google just now) is probably for a 2-GPU K80, which is the standard config (i.e. two chips per card, sublinear speedup for most applications), with 24GB GDDR5 vs 11GB on the 1080Ti. I use 32-bit because the consumer GTX 1080Ti is artificially impaired so that its 64-bit FMA performance is an order of magnitude slower than its 32-bit FMA. If I recall correctly, the K80 is also impaired for 64-bit, but not in a comparable way.
Regardless, according to a series of molecular dynamics simulations run in 32-bit Float by:
http://ambermd.org/gpus/benchmarks.htm#Benchmarks
... we can see that across the board the GTX 1080Ti delivers about double the performance of a "half K80," i.e. one chip running 12GB of memory. And assuming Google is giving you the full K80, note in the benchmark that the second chip only adds between 10% - 20% speedup.
Splitting the difference, we can assume that nominally the GTX 1080Ti is about 75% faster, i.e. 1.75x the speed of a K80, for 32-bit FMA. In the best case for efficiency, if TensorFlow can get linear speedup from the dual-chip K80, then the K80 and the GTX 1080Ti are about the same speed. So let's give the K80 the benefit of the doubt and make the initial math easier:
Currently, I can purchase a genuine 1080Ti from NVIDIA's website for $699 (limit 10 per customer, free shipping) plus tax ~- so about $1400 if you live in California -~ let's just ignore tax. So $699/$230 ~= 3, meaning you can buy a GTX 1080Ti for the price of three months' cloud time (AWS costs more!).
But that's calendar time. We really want to know the payback in terms of how much work gets done by the GPUs. In that calculation, what I called the "best case for efficiency" above is the "worst case for payback time" if we consider buying a 1080Ti vs "renting" cloud time from GOOG:
Worst Case Scenario: GTX 1080Ti purchase is break-even wrt cloud computing in 3 months.
Nominal Case: GTX 1080Ti break-even takes 3/1.75 = 1.7 months = 7 weeks + 3 days.
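Spelling out the arithmetic behind those two cases:

```latex
% Payback time for a $699 GTX 1080Ti vs $230/month of K80 cloud time
\[
  t_{\text{worst}} = \frac{\$699}{\$230/\text{mo}} \approx 3.0\ \text{months},
  \qquad
  t_{\text{nominal}} = \frac{t_{\text{worst}}}{1.75} \approx 1.7\ \text{months}
\]
```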
Now consider that AWS is Amazon's single biggest source of profit. And Google's annual reports explicitly say that "catching up" with Amazon in terms of earnings from Cloud Computing is one of their primary corporate goals.
So, why is it so hard to build TensorFlow? Why don't they put in the time and effort to make it easier for people to deploy on their own computers?
One thing for sure: nobody in the executive suite is pushing hard on the devs to make it a priority ...
:-)
I wouldn't say that building it is hard. The issue lies more in how it is built, where it builds to, how it installs, and how you configure the build. It's standard practice for all software projects to use either GNU Make, CMake, Meson, etc. Yet they are using quite a different setup, and they seem to have designed their build system with minimal regard to packaging.
- You run `bash compile.sh` to compile Bazel, and it pulls dependencies from the Internet as it does this.
- It needs `export HOME=build` set in the build script.

Eventually I got them to compile on my development system, but I have yet to figure out why it fails on our Jenkins-managed build server.
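A sketch of what that `HOME=build` workaround looks like in practice (the directory name is from the comment above; the compile step is commented out since it needs the Bazel sources):

```shell
# Redirect HOME into the build tree so bazel's cache (~/.cache/bazel)
# stays inside the package build directory instead of the real home.
export HOME="$PWD/build"
mkdir -p "$HOME"
# bash compile.sh   # would now cache under $PWD/build/.cache/bazel
echo "$HOME"
```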
@BStudent Okay, CUDA 9.0, 9.1, and 9.2 are now in our proprietary repo, which is enabled by default on all new installs of Pop. I'll have cudnn-9.1 in there soon to go alongside cudnn-9.0 and cudnn-9.2. Resuming my packaging efforts for Bazel + TensorFlow.
~/S/debrepbuild (master:971ed2|+0 -0) # apt search system76-cud
Sorting... Done
Full Text Search... Done
system76-cuda/bionic,now 0pop2 amd64 [installed,automatic]
NVIDIA CUDA Compiler / Libraries / Toolkit Metapackage
system76-cuda-9.0/bionic,now 0pop1 amd64 [installed,automatic]
NVIDIA CUDA 9.0 Compiler / Libraries / Toolkit
system76-cuda-9.0-dbgsym/bionic 0pop1 amd64
debug symbols for system76-cuda-9.0
system76-cuda-9.1/bionic 0pop1 amd64
NVIDIA CUDA 9.1 Compiler / Libraries / Toolkit
system76-cuda-9.1-dbgsym/bionic 0pop1 amd64
debug symbols for system76-cuda-9.1
system76-cuda-9.2/bionic,now 0pop1 amd64 [installed,automatic]
NVIDIA CUDA 9.2 Compiler / Libraries / Toolkit
system76-cuda-9.2-dbgsym/bionic 0pop1 amd64
debug symbols for system76-cuda-9.2
system76-cudnn-9.0/bionic,now 7.1.4~0pop1 amd64 [installed]
NVIDIA CUDA Deep Neural Network library (cuDNN) for CUDA 9.0
system76-cudnn-9.2/bionic,now 7.1.4~0pop1 amd64 [installed]
NVIDIA CUDA Deep Neural Network library (cuDNN) for CUDA 9.2
~/S/debrepbuild (master:971ed2|+0 -0) # apt policy system76-cuda-9.2
system76-cuda-9.2:
Installed: 0pop1
Candidate: 0pop1
Version table:
*** 0pop1 500
500 http://apt.pop-os.org/proprietary bionic/main amd64 Packages
100 /var/lib/dpkg/status
I noticed one of my PoP machines just did an update that looks like it included half the internet, plus a new 396.37 driver.
BTW, what you described in your previous message reeeeaaallllyy sounds like confirming evidence. Most big AI / analytics & modeling projects I've worked on that are genuine research have the exact opposite of systems-programming goals: a given script or snippet you're trying afresh has about a 50% - 75% chance of being thrown out because it just wasn't a good idea, or the right idea, and it's part of a larger project that itself has maybe a 30% - 50% chance of NOT being scrapped. Astro at Google X does a great job of explaining why that's actually a smart thing, and there are a couple of good YouTube videos where he talks about the need to "fail fast." So people are writing code like it has a 60% - 90% chance of being thrown out in the long run, because that's the actual case.

Given TF's dataflow structure and the heavy presence of Stanford Engineering (a long-time Matlab shop) at Google, TF probably started out mostly as individual neural net systems for specific individual projects, implemented as chunks of Matlab code connected by Simulink blocks - everybody doing this realizes it scales well and they need to pool resources and standardize to create the first prototype version of what will eventually be TF.
As the idea pans out, what typically happens is that the research team starts getting a little more careful, but their goal is to get to a feature complete prototype. At that point they write a spec with the algorithms and equations and hand it over to systems / devops teams without handing over the code: this is a sanity check, and a helpful analytic technique. Why? Sometimes AI/ML/analytics programs give weird unexpected results, and while it might be a bug, it can also be that the code is written as intended, but there is an unanticipated way in which the algorithm, i.e. the idea itself, breaks down and must be fixed - and finally, everything might be working perfectly and the weird result is a genuine anomaly. A good first thing to check is whether the same result is happening in both the prototype and the production models, which have no common source code.
So generally, research and dev will maintain two independent code bases, at least for a couple of years. It's a common practice - I've done this working for multiple companies.
So, as you've probably guessed, your description of the way the TF code base is thrown together sounds exactly like what we'd expect to see from the research prototype, rather than what we'd expect to see in systems code written by grownups....
Still working on TensorFlow. Gradually fixing build error after build error, and I'm just about there to getting it packaged for 9.2. I can certainly see that a lot of people have issues with building TensorFlow, judging by all of the issue reports on the project page, and even the continuing comments on issues that have long been closed.
Also fixed a `postinst` issue with the CUDA packages that would make them fail on sudo-less systems.

Things are shaping up here: my Oryx is doing CUDA/TF smoothly, except it's got a quirk where it will only run NVIDIA mode on the external monitor and will only run Intel mode on the built-in screen. AFAIK this is not a CUDA thing, but rather some combination of System76 driver configs and PEBKAC on my part. It seems to improve with time if I avoid trying to mess with settings and apply updates as they come out.
But I assume you've got a pretty clear picture of the value your work is providing: for someone whose job is to make forecasts, control processes, or do similar adaptive computations, it's a huge hit to spend weeks getting this stuff set up -- and having it blow up whenever a new video driver is released. There's a lot of people - including students - who would get immediate payback from not having to deal with that.
It's probably prudent to also start thinking about freezing on LTS tuples of TF, CUDA, and cuDNN. Fortunately you've gotten started at the end of an in-between phase: TF 1.9 just came out, with a big push to make it more accessible via deeper Keras integration. So far post-beta, TF has only released painful breaking changes on a full-number release, but one of those is coming up; my prediction for 2.0 would be closer integration between the main Python branch of TF and TFJS, the latter being a separate codebase that's been rebranded and given an API makeover. Meanwhile, CUDA has yet to officially support 18.04, and I'm guessing at this point they would save a raft of major changes for 10.0. And that, in turn, is likely to be linked to the release of the 20-series GPUs in a couple of months.
In addition to legit issues that are a PITA for you, I think a lot of what you see on the TensorFlow boards is that for better or worse, users are being forced to learn new things: a lot of the people who are building from source never built a package from source before. And are perhaps new to linux as well. The people who say "SVD runs slower on TF than it does on my CPU" might be good at Linear Algebra, but have never been forced to think of their system as anything but an abstraction - now they're dealing with a piece of metal that wants to be fed slabs of data of a certain shape and size, grouped in a particular way, and placed into special areas of memory with carefully managed timing.
@BStudent From what I've read on their issue board, I don't think it is just people being forced to learn new things. Especially what I read from this comment, and a few related issues on packaging and compiling.
I've read that there are teams working for three-letter agencies that have even been tasked with attempting to package it as well, and TF has formed a dedicated group to bring everyone together towards the goal of making TF packageable. No one has succeeded, though, and while the community wants and demands CMake as the much simpler alternative to Bazel, the TF authors are only interested in using their Bazel build system, and state that anyone using the CMake build system hidden within `contrib` will have to maintain the CMake file themselves (which seems like a full-time job in itself to keep it updated with all the Bazel changes). It doesn't help that the existing CMake is designed specifically for compiling on Windows and running tests.
To sum it all up, TF will continue to use Bazel and be fundamentally incompatible with how we package software on Linux. They recommend that everyone use Bazel instead of CMake as otherwise you will only end up with a subset of TF. They are open to the idea of someone creating a tool that can convert Bazel BUILD configuration files into CMake files, and providing distribution-compatible tarballs with the generated CMake files so that we would not need to have a Java-based Bazel build system. But ultimately, the state of TF packaging is fundamentally flawed.
I do have TF building in a schroot now. The last remaining issue for me is just figuring out what files I need and where to put them, as well as how to handle C / C++ / Java / Go / Python / etc. packaging, and how to bring it altogether in a system-wide setup. A bit difficult when they've configured Bazel to be compiled and used locally, rather than doing the right thing for packaging.
I'm doing the best I can at the moment to figure out how they are distributing TF. There isn't any documentation on how to install it into the system, or which files need to be packaged after building. They seem to be content with requiring that everyone who builds from source must add the generated build directories to their system paths and building TF applications from the TF sources.
As for performance issues, it does pain me to see that they adamantly propose that everyone use their Python API via Pip, and put most of their work into that, and only briefly mention that they have a C API in a different section. Then completely leave out that they do have a C++ API. I would think that Python is unsuitable for the kinds of things that you might use TF for, as is commonly found when scientific Python applications get rewritten in C / C++ / Rust and suddenly become 100-1000x faster due to now being able to feed more data to the CPU & GPU in less time.
O.M.F.G. To be honest, I am such a major non-fan of Java that I never looked at what a trainwreck Bazel is. But again, a quick check of Wikipedia points out that Bazel is a reduced-functionality version of Blaze, Google's proprietary internal build tool, which in turn mainly exists in order to keep mobile devices in lockstep with their cloud service platforms. I think there's already a consumer revolt of people doing their own builds of TF - certainly RStudio has. I think you're right on all counts; it is record-settingly bad. Although the one guy complaining about the expense of GPUs hasn't been shopping lately: you can get an overclocked EVGA 1080Ti with a waterblock for less than the lowest-end i9, and the high-end i9 is 2/3 of the way to a TITAN V.

The good news is that it's probably not necessary to officially support any languages for TF other than Python and C++. Python is really TF's target language, and for the parts of TF that need to go fast, the Python code is a very thin layer that wraps BLAS and LAPACK libraries (written in Fortran and highly optimized for CPU / GPU). The big slowdown in TF, at least at one time, was that they used a hodgepodge of different math libraries that can't be easily recompiled together - decision making right up there with choosing Bazel. But beyond basic Python and C++, there's more bang for the data science buck to be had with Microsoft R Open (the MKL version) and CNTK. And then, of course, there's this guy: https://github.com/termoshtt/accel.
Currently working on the C++ portion. C was easy. They just have a lot of C / C++ headers scattered throughout different areas, and expect them to be in certain locations. The script that generates the Python wheel seems to have most of the things in the right places, but it lacks quite a few headers. I hope to have some working examples compiling today.
It seems that packaging C + Python + C++ all in one go is effectively impossible. The way they have C++ set up requires that you effectively copy your project into the Tensorflow source code and use Bazel to build it. Figuring out how to decouple it would take a long time. So I will probably make C++ a separate package that builds from FloopCZ's CMake files, which have already done the work.
I don't see a lot of mixed-mode programming on TF (maybe it exists) - I think cloning or forking something like FloopCZ is way more sensible.

And despite all the ink in the TF docs about pip and virtualenv being "preferred" (and prominently featuring instructions for Python 2.7), I think most people - including me - use Anaconda3 / conda, which is also pushed by most of the MOOCs these days. (A most unfortunate name for RHEL users.) Using Anaconda3 consists of signing up for a free anaconda.org account, which then lets you search for packages on different "channels" - the main ones being anaconda and conda-forge - and then using conda at the command line to create / install / update envs with the packages and versions you want; conda solves the dependency graph to ensure consistency. The packages and other environment parameters are stored as YAML. For $5/mo you can use the Anaconda cloud service to host your own builds and environment configs.

The anaconda channel already has builds of TF (and everything else) that target CUDA, MKL, and system-native libraries, work pretty much the same across win / mac / linux, and are kept up to date. The packages are complete, so, for example, tensorflow-gpu includes not only TF 1.8 but also cudnn and other supporting packages. And pip can run within a conda environment if one needs to put less-mainstream layers on top of that.

I think the easiest and most supportable thing to do for Python is to just install Anaconda3 / conda together with some pre-built environments - e.g. dependent on whether your laptop currently has the GPU turned on. Either just use anaconda binaries or build your own identical binaries and use anaconda as a benchmark.
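For comparison, the conda route I'm describing is roughly this (assuming a stock Anaconda3 install; the env name is mine, and in conda of this era activation is via `source activate`):

```shell
# Create an isolated env with the anaconda channel's GPU build of TF;
# conda resolves cudnn and the other supporting packages itself.
conda create -y -n tf-gpu python=3.6 tensorflow-gpu
source activate tf-gpu
python -c "import tensorflow as tf; print(tf.__version__)"
```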
@BStudent I'm pretty close to getting this done at the moment. FloopCZ's CMakefiles have been very useful. As it turns out, I can use it to build the C++ libraries & headers, and then build the C & Python API off the copy of TF that it downloaded and configured. So it can be a part of the same package.
Nice.
If you feel obligated to add the python package to a virtual environment manager, I think the first choice would be anaconda - but it's also good to draw the line somewhere and work on other things while waiting for customer feedback.
So far as additional GPU-related stuff, it seems like you've taken down the two major barriers:
There's definitely high interest in Rust on the part of robotics people, which includes algo / high-frequency traders as well as researchers working with agent-based / population-based massive-concurrency models. Given System76's commitment to Rust, making a thin Rust wrapper for CUDA - or some portions of CUDA - might be a good way to go. I'm far from an expert, but it looks like that was tried with Leaf and it collapsed under the weight of ambition, twice: a big roadmap, extra work due to being early in the life cycle of Rust, and not a lot of people on the project. Accel seems to be plugging along, partly by proceeding in manageable increments. Working on something like that is a sort of bet on how popular Rust will become, and how quickly. But it's also a bet that can affect the adoption rate of Rust...
@BStudent I remember Leaf. The progress made cannot be undone, so if anyone truly wants to, they could volunteer to maintain & continue it. As Rust becomes more commonplace, I'm sure the demand for frameworks like Leaf & Accel will increase.
So, I finally have TensorFlow 1.9 building within sbuild with a static shared library of the C++ API, as well as providing the C & Python libs. Seems to be working well. I'm going to do another build or two as I shuffle some files around and get everything ready to release into the wild. It will use the alternatives feature that CUDA is also using.
The C++ lib is rather huge, though. It's 514 MiB. Not much can be done about that, though, as otherwise you'd need to build C++ projects from within the tensorflow source code with Bazel.
@mmstick I was just looking at Leaf again ... I thought it was just a wrapper for CUDA etc. with some linear algebra functions, but they were basically trying to recreate a TF/CNTK-type full framework with compute graph, multiple solvers (optimizers), etc. - and do it in a more elegant, better-organized way. Maybe Mozilla could do something on that scale, but that's about the smallest org that can sustain a monster project like that. Most of the original frameworks of that crop are slowly dying off - Theano most recently.

The 514 MiB doesn't sound that large given the size of TF, actually. I don't know if you've ever spent time taking a class or going through some tutorials in TF; the more I think about it, the more impressed I am that it runs at all. I think these frameworks basically offer an almost infinite number of opportunities for "justifiable" scope creep - more so than an OS, for example, because at some point the OS is done enough that applications collectively take up way more programmer hours than the kernel. I think there's some principle of design or software engineering that needs to be applied (perhaps invented) to manage learning frameworks. For now, there's probably only a few where API users collectively put in more time than the dev team.
Overview: The Current State