Closed sandipnetcom closed 5 years ago
This problem will be addressed by changes currently residing in the APS-1030-update-mlnx-ofed-roll-to-4.6-1.0.1.1 branch which targets CentOS 7.6 and the MLNX_OFED_LINUX 4.6-1.0.1.1.
To implement the fix for rocks run roll behavior you can change the ROLLNAME value in the top level version.mk to match the kernel version of your build host and MLNX_OFED_LINUX version (which appear to be 4.3-1.0.1.0-3.10.0-693.5.2.el7) and then, assuming all intermediate RPMs are still in your repository clone, rebuild the roll profile and ISO with the following two commands...
% make profile
% make reroll
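As a rough illustration of how such a ROLLNAME value could be composed, the sketch below joins an MLNX_OFED_LINUX version with a kernel release stripped of its architecture suffix. The function name make_rollname is hypothetical and is not part of the roll's build system; the naming pattern is only inferred from the roll name quoted above.

```shell
#!/bin/sh
# Hypothetical helper: build a roll name of the form
#   mlnx-ofed-<OFED version>-<kernel release without arch>
# The pattern is inferred from the roll names in this thread.
make_rollname() {
    ofed_version=$1      # e.g. 4.3-1.0.1.0
    kernel_release=$2    # e.g. the output of `uname -r`
    # Strip a trailing .x86_64 architecture suffix, if present.
    kernel_noarch=${kernel_release%.x86_64}
    printf 'mlnx-ofed-%s-%s\n' "$ofed_version" "$kernel_noarch"
}

make_rollname "4.3-1.0.1.0" "3.10.0-693.5.2.el7.x86_64"
# prints: mlnx-ofed-4.3-1.0.1.0-3.10.0-693.5.2.el7
```

On a build host the second argument could be "$(uname -r)", so the printed value can be compared with what is set in version.mk before rebuilding.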
Then, copy the roll ISO to your frontend and remove/replace the current mlnx-ofed roll and rebuild your distribution with...
% rocks remove roll mlnx-ofed-4.3-1.0.1.0-3.10.0-693.5.2.el7
% rocks add roll mlnx-ofed-4.3-1.0.1.0-3.10.0-693.5.2.el7*.iso
% rocks enable roll mlnx-ofed-4.3-1.0.1.0-3.10.0-693.5.2.el7
% cd /export/rocks/install
% rocks create distro
% rocks run roll mlnx-ofed-4.3-1.0.1.0-3.10.0-693.5.2.el7 | bash
Thanks for the report and sorry for the problems.
Hello,
Thank you very much for the help.
I am using Rocks 7 with CentOS 7.4. Can you please explain how to build the roll for my environment?
I also downloaded the latest version, 4.6, but after building the roll it still shows 4.3.
Please help me with the build process.
Sorry, I am new to HPC rolls. Your document is very easy to understand; kindly share the same kind of document for my environment.
Thanks
Please follow the instructions in the updated README.md file. If you are going to change to a different version of MLNX_OFED_LINUX then you should probably re-clone the repository source to ensure your build environment is clean.
Hello,
I don't know what I am missing. Today I removed the roll with rocks remove roll mlnx-ofed-4.3-1.0.1.0-3.10.0-693.5.2.el7. After that I deleted the whole folder (export/site-roll/rocks/src/mlnx-ofed-roll), cloned the git repository again, downloaded the OFED file (MLNX_OFED_LINUX-4.6-1.0.1.1-rhel7.4-x86_64.tgz) and placed it in /export/site-roll/rocks/src/mlnx-ofed-roll/src/mlnx-ofed-linux. Then I changed the version.mk file as below:
[root@mnode mlnx-ofed-linux]# cat version.mk
NAME = mlnx-ofed-linux
VERSION = 4.6
RELEASE = 1.0.1.1
DISTRO = rhel7.4
EXTRA = $(DISTRO)-x86_64
PKGROOT = /opt/mlnx-ofed-linux
SRC_SUBDIR = mlnx-ofed-linux
SOURCE_NAME = MLNX_OFED_LINUX
SOURCE_SUFFIX = tgz
SOURCE_VERSION = $(VERSION)-$(RELEASE)-$(EXTRA)
SOURCE_PKG = $(SOURCE_NAME)-$(SOURCE_VERSION).$(SOURCE_SUFFIX)
SOURCE_PKG_EXT = $(SOURCE_NAME)-$(SOURCE_VERSION)-ext.$(SOURCE_SUFFIX)
SOURCE_DIR = $(SOURCE_NAME)-$(SOURCE_VERSION)
SOURCE_DIR_EXT = $(SOURCE_NAME)-$(SOURCE_VERSION)-ext
TGZ_PKGS = $(SOURCE_PKG)
RPM.EXTRAS = AutoReq:No
After that I ran the build with make default 2>&1 | tee build.log and this error appeared:
/state/partition1/site-roll/rocks/src/mlnx-ofed-roll/BUILD/mlnx-ofed-linux-4.6'
make ROOT=/state/partition1/site-roll/rocks/src/mlnx-ofed-roll/src/mlnx-ofed-linux/mlnx-ofed-linux.buildroot build
make[4]: Entering directory `/state/partition1/site-roll/rocks/src/mlnx-ofed-roll/BUILD/mlnx-ofed-linux-4.6'
::: Downloading http://forge.sdsc.edu/triton/mlnx-ofed/src/mlnx-ofed-linux/MLNX_OFED_LINUX-4.6-1.0.1.1-rhel7.4-x86_64.tgz :::
::: MLNX_OFED_LINUX-4.6-1.0.1.1-rhel7.4-x86_64.tgz already exists, skipping :::
::: Verifying size of MLNX_OFED_LINUX-4.6-1.0.1.1-rhel7.4-x86_64.tgz :::
make[4]: [MLNX_OFED_LINUX-4.6-1.0.1.1-rhel7.4-x86_64.tgz] Error 1
make[4]: Leaving directory `/state/partition1/site-roll/rocks/src/mlnx-ofed-roll/BUILD/mlnx-ofed-linux-4.6'
make[3]: [build] Error 2
make[3]: Leaving directory `/state/partition1/site-roll/rocks/src/mlnx-ofed-roll/BUILD/mlnx-ofed-linux-4.6'
error: Bad exit status from /var/tmp/rpm-tmp.fezdtw (%build)
RPM build errors:
    Bad exit status from /var/tmp/rpm-tmp.fezdtw (%build)
make[2]: [rpm] Error 1
make[2]: Leaving directory `/state/partition1/site-roll/rocks/src/mlnx-ofed-roll/src/mlnx-ofed-linux'
make[1]: [rpm] Error 2
make[1]: Leaving directory `/state/partition1/site-roll/rocks/src/mlnx-ofed-roll/src'
make: *** [rpms] Error 2
Please help me
thanks Sandip
Please look closely at the steps in the Alternate Versions section of the updated README.md.
It appears you may have missed the step where the binary_hashes file is updated with file size, hash and filename of your selected MLNX_OFED_LINUX source tar archive using the gen_hash.sh script.
The link to download that script from the sdsc/skeleton-roll on Github is provided in that section.
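To see the three values mentioned above (file size, hash and filename) for a downloaded archive, something like the following sketch may help. It is not the gen_hash.sh script and does not write binary_hashes; the function name describe_archive, the use of sha256sum and the field order are all assumptions made for illustration, so the real script from sdsc/skeleton-roll should still be used to update the file.

```shell
#!/bin/sh
# Hypothetical stand-in for inspecting a source archive; NOT the real gen_hash.sh.
# Prints: <size in bytes> <sha256 digest> <file name>
# The digest algorithm and the field order are assumptions.
describe_archive() {
    file=$1
    # Size in bytes (tr removes padding that some wc versions add).
    size=$(wc -c < "$file" | tr -d ' ')
    # First field of sha256sum output is the hex digest.
    digest=$(sha256sum "$file" | awk '{print $1}')
    printf '%s %s %s\n' "$size" "$digest" "$(basename "$file")"
}

# Example (commented out because it needs the archive to be present):
# describe_archive MLNX_OFED_LINUX-4.6-1.0.1.1-rhel7.4-x86_64.tgz
```

If the size printed here differs from the size recorded for the same filename in binary_hashes, that would be consistent with the "Verifying size" failure in the build log.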
Hello, thanks for your time and support. Yes, you are right, I missed this step. I have now built my Rocks roll, but the issue is still the same: the roll does not install.
[root@mnode mlnx-ofed-roll]# rocks list roll
NAME VERSION ARCH ENABLED
base: 7.0 x86_64 yes
CentOS: 7.4.1708 x86_64 yes
core: 7.0 x86_64 yes
ganglia: 7.0 x86_64 yes
hpc: 7.0 x86_64 yes
htcondor: 8.6.8 x86_64 yes
kernel: 7.0 x86_64 yes
python: 7.0 x86_64 yes
sge: 7.0 x86_64 yes
Updates-CentOS-7.4.1708: 2017-12-01 x86_64 yes
cuda: 7.0 x86_64 yes
slurm: 7.0.0.220 x86_64 yes
sdsc: 7.0 x86_64 yes
mlnx-ofed-4.6-1.0.1.1-3.10.0-693.5.2.el7: 7.0 x86_64 yes
[root@mnode mlnx-ofed-roll]# rocks run roll mlnx-ofed-4.6-1.0.1.1-3.10.0-693.5.2.el7 | bash
Loaded plugins: fastestmirror, langpacks, nvidia
Cleaning repos: Rocks-7.0
Cleaning up everything
Maybe you want: rm -rf /var/cache/yum, to also free up space taken by orphaned data from disabled or removed repos
Cleaning up list of fastest mirrors.
[root@mnode mlnx-ofed-roll]#
Kindly help me resolve this.
Thanks Sandip
Did you do what I suggested in my first response and implement the changes in the version.mk file or, alternately (and I did not suggest this), check out the branch with the fix before attempting to build?
Hello,
Thanks for your support. Yesterday I deleted everything and re-cloned, but I don't know why version.mk was not updated. Today I updated version.mk, rebuilt the ISO, removed and re-added the roll, and then ran the install again.
But today I found a new error:
Error: Package: 1:compat-dapl-devel-1.2.19-4.el7.x86_64 (@anaconda/7.0)
    Requires: libdaplscm.so.1()(64bit)
    Removing: 1:compat-dapl-1.2.19-4.el7.x86_64 (@anaconda/7.0)
        libdaplscm.so.1()(64bit)
    Obsoleted By: mlnx-ofed-all-user-only-4.6-1.0.1.1.skip.distro.check.noarch (Rocks-7.0)
        Not found
Error: Package: openmpi-devel-1.10.6-2.el7.x86_64 (@anaconda/7.0)
    Requires: liboshmem.so.8()(64bit)
    Removing: openmpi-1.10.6-2.el7.x86_64 (@anaconda/7.0)
        liboshmem.so.8()(64bit)
    Updated By: openmpi-4.0.2a1-1.46101.x86_64 (Rocks-7.0)
        ~liboshmem.so.40()(64bit)
Error: Package: libfabric-1.4.2-1.el7.i686 (Rocks-7.0)
    Requires: librdmacm.so.1(RDMACM_1.0)
Error: Package: openmpi-devel-1.10.6-2.el7.x86_64 (@anaconda/7.0)
    Requires: libmpi_mpifh.so.12()(64bit)
    Removing: openmpi-1.10.6-2.el7.x86_64 (@anaconda/7.0)
        libmpi_mpifh.so.12()(64bit)
    Updated By: openmpi-4.0.2a1-1.46101.x86_64 (Rocks-7.0)
        ~libmpi_mpifh.so.40()(64bit)
Error: Package: openmpi-devel-1.10.6-2.el7.x86_64 (@anaconda/7.0)
    Requires: libmpi_usempi.so.5()(64bit)
    Removing: openmpi-1.10.6-2.el7.x86_64 (@anaconda/7.0)
        libmpi_usempi.so.5()(64bit)
    Updated By: openmpi-4.0.2a1-1.46101.x86_64 (Rocks-7.0)
        ~libmpi_usempi.so.40()(64bit)
Error: Package: openmpi-1.10.6-2.el7.i686 (Rocks-7.0)
    Requires: libibverbs.so.1
Error: Package: libfabric-1.4.2-1.el7.i686 (Rocks-7.0)
    Requires: libibverbs.so.1
Error: Package: 1:compat-dapl-devel-1.2.19-4.el7.x86_64 (@anaconda/7.0)
    Requires: libdaplcma.so.1()(64bit)
    Removing: 1:compat-dapl-1.2.19-4.el7.x86_64 (@anaconda/7.0)
        libdaplcma.so.1()(64bit)
    Obsoleted By: mlnx-ofed-all-user-only-4.6-1.0.1.1.skip.distro.check.noarch (Rocks-7.0)
        Not found
Error: Package: libfabric-1.4.2-1.el7.i686 (Rocks-7.0)
    Requires: librdmacm.so.1
Error: Package: openmpi-1.10.6-2.el7.i686 (Rocks-7.0)
    Requires: libibverbs.so.1(IBVERBS_1.0)
Error: Package: openmpi-1.10.6-2.el7.i686 (Rocks-7.0)
    Requires: librdmacm.so.1
Error: Package: libfabric-1.4.2-1.el7.i686 (Rocks-7.0)
    Requires: libibverbs.so.1(IBVERBS_1.1)
Error: Package: openmpi-1.10.6-2.el7.i686 (Rocks-7.0)
    Requires: libibverbs.so.1(IBVERBS_1.1)
Error: Package: openmpi-1.10.6-2.el7.i686 (Rocks-7.0)
    Requires: libosmcomp.so.3
Error: Package: libfabric-1.4.2-1.el7.i686 (Rocks-7.0)
    Requires: libibverbs.so.1(IBVERBS_1.0)
Error: Package: openmpi-1.10.6-2.el7.i686 (Rocks-7.0)
    Requires: librdmacm.so.1(RDMACM_1.0)
Error: Package: 1:compat-dapl-devel-1.2.19-4.el7.x86_64 (@anaconda/7.0)
    Requires: compat-dapl = 1:1.2.19-4.el7
    Removing: 1:compat-dapl-1.2.19-4.el7.x86_64 (@anaconda/7.0)
        compat-dapl = 1:1.2.19-4.el7
    Obsoleted By: mlnx-ofed-all-user-only-4.6-1.0.1.1.skip.distro.check.noarch (Rocks-7.0)
        Not found
Error: Package: 1:compat-dapl-devel-1.2.19-4.el7.x86_64 (@anaconda/7.0)
    Requires: libdat.so.1()(64bit)
    Removing: 1:compat-dapl-1.2.19-4.el7.x86_64 (@anaconda/7.0)
        libdat.so.1()(64bit)
    Obsoleted By: mlnx-ofed-all-user-only-4.6-1.0.1.1.skip.distro.check.noarch (Rocks-7.0)
        Not found
 You could try using --skip-broken to work around the problem
 You could try running: rpm -Va --nofiles --nodigest
/etc/rc.d/rocksconfig.d/RCS/post-97-mlnx-ofed-tuning,v <-- /etc/rc.d/rocksconfig.d/post-97-mlnx-ofed-tuning
initial revision: 1.1
done
/etc/rc.d/rocksconfig.d/RCS/post-97-mlnx-ofed-tuning,v --> /etc/rc.d/rocksconfig.d/post-97-mlnx-ofed-tuning
revision 1.1 (locked)
done
touch: cannot touch ‘/etc/infiniband/openib.conf’: No such file or directory
mkdir: cannot create directory ‘/etc/infiniband/RCS’: No such file or directory
chown: cannot access ‘/etc/infiniband/RCS’: No such file or directory
ci: /etc/infiniband/RCS/openib.conf: No such file or directory
co: /etc/infiniband/RCS/openib.conf,v: No such file or directory
/tmp/tmpj9DgMV: line 65: /etc/infiniband/openib.conf: No such file or directory
/bin/cp: cannot stat ‘/etc/infiniband/openib.conf’: No such file or directory
[root@mnode install]#
Please help me. I know I am asking you about every step, but please help me, I am stuck.
Thanks Sandip
Unfortunately there is not enough context in your comment to determine what commands you ran and what may have gone wrong although it appears that install of RPMs may have been attempted suggesting you've been more successful than before.
I suggest you repeat your steps saving all commands and output and put it all in a Gist.
Reply here with a link to the Gist and I can have a look.
Please provide the output of the following commands as well...
From your frontend and the system you are building the mlnx-ofed-linux roll on (if different)...
# hostname -s
# cat /etc/{rocks,redhat}*release
# uname -a
# rpm -qa | grep -iE "ofed"
# lsmod 2>&1 | awk '/^mlx/ {print $1}' | xargs modinfo -F filename
From the frontend of your system only...
# rocks list roll | grep -Ei "name|kernel|base|core|cent|mlnx|sdsc"
# rocks list host | grep -iE "membership|frontend|devel"
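One possible way to capture commands together with their output for a Gist is a small wrapper like the sketch below. The function name run_logged and the default log file name diagnostics.log are hypothetical; nothing here is specific to Rocks.

```shell
#!/bin/sh
# Hypothetical helper: record each command line and its combined output
# in a log file that can later be pasted into a Gist.
LOGFILE=${LOGFILE:-diagnostics.log}

run_logged() {
    # Write the command itself as a comment line.
    printf '# %s\n' "$*" >> "$LOGFILE"
    # Run the command, appending stdout and stderr to the log.
    "$@" >> "$LOGFILE" 2>&1
    status=$?
    printf '# exit status: %s\n\n' "$status" >> "$LOGFILE"
    return "$status"
}

# Example invocations (commented out so the block has no side effects):
# run_logged hostname -s
# run_logged uname -a
# run_logged sh -c 'rpm -qa | grep -iE "ofed"'
```

Commands that contain a pipe need to be wrapped in sh -c '...' as in the last example, because run_logged only runs a single command with its arguments.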
Hello,
Thanks for your time and support.
I cleaned up and reinstalled everything. Kindly check: I created two files, one is the build log and the other is the OFED reinstall.
https://gist.github.com/sandipnetcom/ea6cbaa3e5f74f584b2249342d43e319.js
[root@mnode ~]# hostname -s
mnode
[root@mnode ~]# cat /etc/{rocks,redhat}*release
Rocks release 7.0 (Manzanita)
CentOS Linux release 7.4.1708 (Core)
[root@mnode ~]# uname -a
Linux mnode.nml.local 3.10.0-693.5.2.el7.x86_64 #1 SMP Fri Oct 20 20:32:50 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux
[root@mnode ~]# rpm -qa | grep -iE "ofed"
[root@mnode ~]#
[root@mnode ~]# lsmod 2>&1 | awk '/^mlx/ {print $1}' | xargs modinfo -F filename
/lib/modules/3.10.0-693.5.2.el7.x86_64/kernel/drivers/infiniband/hw/mlx4/mlx4_ib.ko.xz
/lib/modules/3.10.0-693.5.2.el7.x86_64/kernel/drivers/net/ethernet/mellanox/mlx4/mlx4_en.ko.xz
/lib/modules/3.10.0-693.5.2.el7.x86_64/kernel/drivers/net/ethernet/mellanox/mlx4/mlx4_core.ko.xz
[root@mnode ~]#
[root@mnode ~]# rocks list roll | grep -Ei "name|kernel|base|core|cent|mlnx|sdsc"
NAME VERSION ARCH ENABLED
base: 7.0 x86_64 yes
CentOS: 7.4.1708 x86_64 yes
core: 7.0 x86_64 yes
kernel: 7.0 x86_64 yes
Updates-CentOS-7.4.1708: 2017-12-01 x86_64 yes
sdsc: 7.0 x86_64 yes
mlnx-ofed-4.6-1.0.1.1-3.10.0-957.27.2.el7: 7.0 x86_64 yes
[root@mnode ~]# rocks list host | grep -iE "membership|frontend|devel"
HOST MEMBERSHIP CPUS RACK RANK RUNACTION INSTALLACTION
mnode: Frontend 24 0 0 os install
[root@mnode ~]#
Kindly help me
Thanks Sandip
Hello,
kindly help me please.
Thanks Sandip
Looking at your output this appears to be a Linux install issue and is no longer a roll-build issue. That's good news!
You appear to be attempting to run the roll on your cluster frontend. While this is possible in many cases, in your particular case OpenMPI related RPMs (among others) are blocking the install. This is caused by MLNX_OFED_LINUX RPMs obsoleting distro provided RPMs the OpenMPI RPMs depend on. It's likely that removal of the OpenMPI RPMs from your frontend will allow the install to complete.
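To get a quick list of which packages yum is complaining about, saved install output can be filtered with something like the sketch below. The function name blocking_packages is hypothetical; it only extracts the names that follow "Error: Package:" and does not decide what is safe to remove.

```shell
#!/bin/sh
# Hypothetical helper: list the unique package names that appear after
# "Error: Package:" in yum output read from standard input.
blocking_packages() {
    # -o prints each match on its own line, so this also works when the
    # yum output has been pasted as one long line.
    grep -oE 'Error: Package: [^ ]+' | awk '{print $3}' | sort -u
}

# Example (commented out; install.log is an assumed file name):
# blocking_packages < install.log
```

Run against the error output posted earlier in this thread, this would list the compat-dapl-devel, libfabric, openmpi and openmpi-devel packages.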
The mlnx-ofed-linux roll is constructed to support install of the MLNX_OFED_LINUX software stack on Rocks systems during kickstart. The roll doesn't always cleanly install on nodes with distro provided Infiniband stack and/or previous versions of this roll. While I have replicated removal / install steps for a similar version of CentOS and mlnx-ofed-linux roll I don't have a matching system to be able to further debug installation specific issues you are having.
Please try reinstalling a compute node which has an Infiniband card to verify that the RPMs from mlnx-ofed-linux roll are being installed correctly. Assuming that is successful, and I have every reason to believe it will be, you can move back to working on the update of your frontend.
This may require removal of all Infiniband related software in order for the install to complete and, depending on the services provided by your frontend, may affect the Infiniband network in your entire system. You should plan accordingly.
Further assistance with correctly installing mlnx-ofed-linux should be requested on the Rocks Email Discussion List. More information about the Rocks Mailing List can be found here.
Hello, thanks for your support. I tried to reinstall a compute node with the OFED roll enabled, but it shows an error during installation. With the OFED roll disabled, the compute node installs successfully.
I am still in trouble. I need to install OFED; can you help me with how to install it in a Rocks cluster?
Thanks
Please take support requests for use of mlnx-ofed-roll to the Rocks Mailing List.
Hello, please, can anyone help me? I configured everything as written in the document, but I cannot install the roll.
[root@mnode install]# rocks run roll mlnx-ofed | bash
Command output
Loaded plugins: fastestmirror, langpacks, nvidia
NVIDIA
Cleaning repos: Rocks-7.0
Cleaning up everything
Maybe you want: rm -rf /var/cache/yum, to also free up space taken by orphaned data from disabled or removed repos
Cleaning up list of fastest mirrors
[root@mnode install]#
rocks roll list
[root@mnode ~]# rocks list roll
NAME VERSION ARCH ENABLED
base: 7.0 x86_64 yes
CentOS: 7.4.1708 x86_64 yes
core: 7.0 x86_64 yes
ganglia: 7.0 x86_64 yes
hpc: 7.0 x86_64 yes
htcondor: 8.6.8 x86_64 yes
kernel: 7.0 x86_64 yes
python: 7.0 x86_64 yes
sge: 7.0 x86_64 yes
Updates-CentOS-7.4.1708: 2017-12-01 x86_64 yes
cuda: 7.0 x86_64 yes
slurm: 7.0.0.220 x86_64 yes
mlnx-ofed-4.3-1.0.1.0-3.10.0-693.5.2.el7: 7.0 x86_64 yes
Please help me find what I am missing.
Thanks,