nbcrrolls / cuda

Cuda Roll

CUDA roll on fresh Rocks 7 install #6

Status: Closed (monadnoc closed 6 years ago)

monadnoc commented 6 years ago

@nadyawilliams

For a fresh install of Rocks 7.0

make 2>&1 | tee build.log yields first argument to 'word' function must be greater than '0'

make --debug 2>&1 | tee build.log identified that make is trying to remake target 'dump-version' and is invoking Rules.mk:611 to do so.

This is a pretty early break in the make process--is Rocks 7 (CentOS 7.4) not supported? Is Nvidia Toolkit 10.0 the problem?

nadyawilliams commented 6 years ago

Toolkit 10 has not been tested yet. You will have to try building the RPMs separately and, depending on how the packaging of toolkit 10 has changed relative to 8, make changes to the Makefile and version.mk in the corresponding src/nvidia-toolkit and src/nvidia-driver directories. Basically, cd src/nvidia-toolkit, run make rpm, and see what breaks. Every time there is a major nvidia toolkit release, the build process needs changes.

monadnoc commented 6 years ago

I changed to toolkit 8.0 (https://developer.nvidia.com/cuda-80-ga2-download-archive).

The same issue occurs (first argument to 'word' function must be greater than '0'), though it seems this might just be because 'dump-version' is echoing 7.0. The full output of this error line is actually

/opt/rocks/share/devel/src/roll/../../etc/Rules.mk:622: *** first argument to `word' function must be greater than 0. Stop.

but it is unclear what part of Rules.mk at line 622 requires the input for 'word' to be greater than 0.
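For context, GNU Make raises this message whenever the first argument of $(word n,text) evaluates to 0, for example when the index comes from $(words ...) on an empty list. The sketch below is a hypothetical reproduction under that assumption, not the contents of Rules.mk: the variable names LIST and LAST are made up, and GNU Make is assumed to be on the PATH.

```shell
#!/usr/bin/env bash
# Hypothetical reproduction; LIST and LAST are made-up names, and this is
# NOT what Rules.mk line 622 actually contains.
repro_word_error() {
    local tmp status
    tmp=$(mktemp -d) || return 1
    # LIST is empty, so $(words $(LIST)) expands to 0 and $(word 0,...) is fatal.
    cat > "$tmp/Makefile" <<'EOF'
LIST :=
LAST := $(word $(words $(LIST)),$(LIST))
all: ; @echo "last=$(LAST)"
EOF
    make -f "$tmp/Makefile" 2>&1
    status=$?
    rm -rf "$tmp"
    return "$status"
}

# Expected to stop with the "first argument to 'word' function" error.
repro_word_error || echo "make exited with status $?"
```

If the real failure follows the same pattern, the thing to look for at line 622 is a variable that is expected to hold a non-empty list or a positive index but expands to nothing on this system.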

build.log from this new attempt with Toolkit 8.0 is attached

I also tried make rpm as suggested, but that halts on trying to tar a file/folder that's not there:

tar: /opt/cuda/SOURCES/roll-cuda-7.0.tar: Cannot open: No such file or directory
tar: Error is not recoverable: exiting now

Everything leading up to this error in the output of make --debug rpm is attached in build_rpm.log.

build.log build_rpm.log

nadyawilliams commented 6 years ago
  1. In what directory did you check out the git repo? If it is /opt/cuda, that is wrong. /opt/cuda is supposed to be clean and will be used for the install.
  2. It looks like you are running "make rpm" at the top level of the repo directory. This command needs to be run in src/nvidia-toolkit.
  3. Try to check out a clean repo and follow the instructions for the basic build, where the first step is to execute the bootstrap.sh command. This will download a specific cuda toolkit and driver. Then try to build the rpms in each of src/nvidia-toolkit and src/nvidia-driver.
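
The three points above can be sketched as a small script. This is only an illustration: REPO_DIR, DRY_RUN, run_in and build_roll_rpms are names invented for the example, the default checkout path is an assumption, and invoking the bootstrap step as ./bootstrap.sh from the top of the clone is also an assumption. By default it only prints the commands it would run.

```shell
#!/usr/bin/env bash
# Sketch of the build order from the list above. REPO_DIR and DRY_RUN are
# assumptions made for this example; set DRY_RUN=0 to actually run the steps.
REPO_DIR="${REPO_DIR:-$HOME/cuda}"
DRY_RUN="${DRY_RUN:-1}"

# run_in DIR CMD...: run CMD inside DIR, or just print it in dry-run mode.
run_in() {
    local dir=$1
    shift
    if [ "$DRY_RUN" = "1" ]; then
        echo "(cd $dir && $*)"
    else
        (cd "$dir" && "$@")
    fi
}

build_roll_rpms() {
    # Point 1: /opt/cuda is the install target, so the clone must live elsewhere.
    case "$REPO_DIR" in
        /opt/cuda|/opt/cuda/*)
            echo "error: clone the repo outside /opt/cuda" >&2
            return 1
            ;;
    esac
    # Point 3: bootstrap first, then (point 2) build each RPM in its own directory.
    run_in "$REPO_DIR" ./bootstrap.sh || return 1
    run_in "$REPO_DIR/src/nvidia-toolkit" make rpm || return 1
    run_in "$REPO_DIR/src/nvidia-driver" make rpm || return 1
}

build_roll_rpms
```

Stopping at the first failing step keeps the error output close to the command that caused it.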
monadnoc commented 6 years ago

Both the nvidia toolkit and driver exited with a successfully remade target file 'rpm' (see attached), but the make 2>&1 command to build the .iso still exits with the same /opt/rocks/share/devel/src/roll/../../etc/Rules.mk:622: *** first argument to 'word' function must be greater than 0. Stop. when trying to remake the dump-name target.

Any ideas?

build_rpm.log build_rpm_driver.log

nadyawilliams commented 6 years ago

what is the output of rocks list roll

and the output of make -n 2>&1 > out

and the output of make preroll

monadnoc commented 6 years ago

rocks list roll:

NAME                     VERSION    ARCH   ENABLED
base:                    7.0        x86_64 yes    
CentOS:                  7.4.1708   x86_64 yes    
core:                    7.0        x86_64 yes    
kernel:                  7.0        x86_64 yes    
Updates-CentOS-7.4.1708: 2017-12-01 x86_64 yes 
sge:                     7.0        x86_64 yes    
hpc:                     7.0        x86_64 yes    
ganglia:                 7.0        x86_64 yes 

make -n 2>&1 > out

/opt/rocks/share/devel/src/roll/../../etc/Rules.mk:622: *** first argument to `word' function must be greater than 0.  Stop.

and 'out' reads echo 7.0
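
The split between what shows up on the terminal and what lands in 'out' follows from the order of the redirections: with 2>&1 > out, stderr is pointed at the terminal before stdout is sent to the file, so only stdout is captured. A minimal sketch of the difference, where emit is a made-up stand-in for make and the function names are hypothetical:

```shell
#!/usr/bin/env bash
# emit is a stand-in for make: one line on stdout, one on stderr.
emit() { echo "echo 7.0"; echo "Rules.mk error" >&2; }

# Order used above: stderr is duplicated onto the current stdout (the
# terminal) first, then stdout alone is redirected to the file.
stdout_only() { emit 2>&1 > "$1"; }

# Reversed order: stdout goes to the file first, then stderr follows it.
both_streams() { emit > "$1" 2>&1; }

tmp=$(mktemp -d)
stdout_only "$tmp/out"        # the stderr line still appears on the terminal
both_streams "$tmp/out_all"   # both lines end up in the file
```

To capture the error message together with the dry-run output, the reversed form (make -n > out 2>&1) is the one to use.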

lastly, make preroll:

for i in `ls nodes/*.xml.in`; do \
    export o=`echo $i | sed 's/\.in//'`; \
    cp $i $o; \
    sed -i -e "s/TOOLKIT_SHORT/80/g"  $o; \
done

Thank you very much for your time working through this

nadyawilliams commented 6 years ago

You have a slightly older Updates roll, but this should not really matter for make.
What is the output of rpm -qf /opt/rocks/share/devel/src/roll/../../etc/Rules.mk and of rpm -V rocks-devel, and what does "pwd" print at the top level of your cloned repo?

monadnoc commented 6 years ago

rpm -qf /opt/rocks/share/devel/src/roll/../../etc/Rules.mk returns rocks-devel-7.0-9.x86_64

rpm -V rocks-devel has no output

pwd in cloned repo is /root/cuda

nadyawilliams commented 6 years ago

Everything so far looks ok, so I am not sure what is causing the problem. Have you changed any files after cloning the repo, or have you made any updates to the system outside of rocks commands, or to your root user environment?

monadnoc commented 6 years ago

The only change made to the cloned repo was to substitute the driver for a Tesla K40c (and changed the version.mk file to correspond), but even with the driver provided with the bootstrap.sh, the same error occurs.

The rocks install is fresh and follows the installation guidelines--nothing has been modified so far.

Since this error seems related to my system rather than the repo, I have modified the title of the issue, and I will close it for now until I become more familiar with the Rocks configuration.

Thank you very much for the help (some of it nearly in real-time!)

freecurve commented 6 years ago

Hi - I know this is closed, but indeed 'make rpm' in src/nvidia-toolkit fails for cuda-linux_10.0.130-linux.run, which is cuda-10 (I know cuda 8 is expected). I am not an expert, but the DISTRO variable in src/cuda-toolkit/version.mk may need to be changed to the cuda 10 version. Also, there's the usual headache with how nvidia names their *run packages. The exact error is:

make ROOT=/share/apps/cuda_rocks_roll2/cuda/src/nvidia-toolkit/cuda-toolkit10.buildroot install
make[2]: Entering directory `/share/apps/cuda_rocks_roll2/cuda/BUILD/cuda-toolkit10-10.0.130'
///from nvidia distro install toolkit and samples in /opt
mkdir -p distro
/bin/bash cuda_10.0.130_linux-run -extract=`pwd`/distro
/bin/bash: cuda_10.0.130_linux-run: No such file or directory

whereas the file created is cuda-linux_10.0.130-linux.run

so if you put a 'cuda_10.0.130_linux-run' link to whatever cuda calls their *run file in your src/nvidia-driver (and change the DISTRO variable in version.mk) then 'make rpm' completes successfully.
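
A minimal sketch of the link step, assuming the downloaded installer and the name the Makefile expects sit in the same directory. link_installer is a hypothetical helper written for this example, and the directory is left as a parameter because it depends on which src subdirectory the build reads from. The DISTRO edit in version.mk is not covered here.

```shell
#!/usr/bin/env bash
# link_installer DIR ACTUAL EXPECTED
# Creates DIR/EXPECTED as a symlink to DIR/ACTUAL so a build that looks for
# EXPECTED finds the installer that was actually downloaded.
link_installer() {
    local dir=$1 actual=$2 expected=$3
    if [ ! -f "$dir/$actual" ]; then
        echo "error: $dir/$actual not found" >&2
        return 1
    fi
    # -s: symbolic link; -f: replace a stale link of the same name.
    # The relative target keeps the link valid if the checkout is moved.
    (cd "$dir" && ln -sf "$actual" "$expected")
}

# Example with the file names quoted above (directory is an assumption):
# link_installer src/nvidia-toolkit cuda-linux_10.0.130-linux.run cuda_10.0.130_linux-run
```

A symlink leaves the original download untouched, so the same file can still be used under its real name.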

As always, thank you for this roll Nadya Williams!

nadyawilliams commented 6 years ago

You are right about the DISTRO variable in the src/nvidia-toolkit/version.mk file. Looking back, it may have been an accidental commit while playing with one of the driver versions. It should be a variable that is derived from the toolkit and version numbers recorded in the top-level cuda.mk file (fixed now). As far as nvidia toolkit and driver file naming conventions go, they are never consistent, and one MUST edit the version.mk file and make the proper adjustments. Currently, there are lines there for versions 7 and 8. I have not played with 10 so far, but the approach should be for the most part the same.

freecurve commented 6 years ago

I'm glad that was helpful. Maybe given nvidia's penchant for creative naming just put the necessary logic in bootstrap.sh or tell the user... Anyhow, all is well, thank you very much Nadya!