sdsc / mlnx-ofed-roll

SDSC mlnx-ofed roll

roll not install #4

Closed sandipnetcom closed 5 years ago

sandipnetcom commented 5 years ago

Hello. Can anyone please help me? I configured everything as written in the documentation, but the roll does not install.

[root@mnode install]# rocks run roll mlnx-ofed | bash

Command output:

Loaded plugins: fastestmirror, langpacks, nvidia
Cleaning repos: Rocks-7.0
Cleaning up everything
Maybe you want: rm -rf /var/cache/yum, to also free up space taken by orphaned data from disabled or removed repos
Cleaning up list of fastest mirrors
[root@mnode install]#

Output of rocks list roll:

[root@mnode ~]# rocks list roll
NAME VERSION ARCH ENABLED
base: 7.0 x86_64 yes
CentOS: 7.4.1708 x86_64 yes
core: 7.0 x86_64 yes
ganglia: 7.0 x86_64 yes
hpc: 7.0 x86_64 yes
htcondor: 8.6.8 x86_64 yes
kernel: 7.0 x86_64 yes
python: 7.0 x86_64 yes
sge: 7.0 x86_64 yes
Updates-CentOS-7.4.1708: 2017-12-01 x86_64 yes
cuda: 7.0 x86_64 yes
slurm: 7.0.0.220 x86_64 yes
mlnx-ofed-4.3-1.0.1.0-3.10.0-693.5.2.el7: 7.0 x86_64 yes

Please help me find what I am missing.

Thanks,

tcooper commented 5 years ago

This problem will be addressed by changes currently residing in the APS-1030-update-mlnx-ofed-roll-to-4.6-1.0.1.1 branch which targets CentOS 7.6 and the MLNX_OFED_LINUX 4.6-1.0.1.1.

To implement the fix for rocks run roll behavior you can change the ROLLNAME value in the top level version.mk to match the kernel version of your build host and MLNX_OFED_LINUX version (which appear to be 4.3-1.0.1.0-3.10.0-693.5.2.el7) and then, assuming all intermediate RPMs are still in your repository clone, rebuild the roll profile and ISO with the following two commands...

% make profile
% make reroll
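As a rough illustration of the naming described above, the roll name appears to combine the MLNX_OFED_LINUX version with the kernel version of the build host. The sketch below assembles such a name. The helper `roll_name` is hypothetical (it is not part of the roll), and dropping the architecture suffix reported by `uname -r` is an assumption inferred from the roll names shown in this thread.

```shell
# Hypothetical helper: build a roll name of the form seen in `rocks list roll`,
# e.g. mlnx-ofed-4.3-1.0.1.0-3.10.0-693.5.2.el7.
# $1: MLNX_OFED_LINUX version, $2: kernel release as printed by `uname -r`.
roll_name() {
    ofed_version=$1
    kernel=$2
    # Assumption: the trailing architecture (.x86_64) is not part of the roll name.
    kernel=${kernel%.x86_64}
    printf 'mlnx-ofed-%s-%s\n' "$ofed_version" "$kernel"
}

# Example: print the name for the running kernel.
# roll_name 4.3-1.0.1.0 "$(uname -r)"
```

Under these assumptions, the printed value is what ROLLNAME in the top-level version.mk would need to match.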

Then, copy the roll ISO to your frontend and remove/replace the current mlnx-ofed roll and rebuild your distribution with...

% rocks remove roll mlnx-ofed-4.3-1.0.1.0-3.10.0-693.5.2.el7
% rocks add roll mlnx-ofed-4.3-1.0.1.0-3.10.0-693.5.2.el7*.iso
% rocks enable roll mlnx-ofed-4.3-1.0.1.0-3.10.0-693.5.2.el7
% cd /export/rocks/install
% rocks create distro
% rocks run roll mlnx-ofed-4.3-1.0.1.0-3.10.0-693.5.2.el7 | bash

Thanks for the report and sorry for the problems.

sandipnetcom commented 5 years ago

Hello,

Thank you very much for your help.

I am using Rocks 7 with CentOS 7.4. Can you please explain how to build the roll for my environment?

I also downloaded the latest version, 4.6, but after building, the roll still shows 4.3.

Please help me with the build process.

Sorry, I am new to HPC rolls. Your document is very easy to understand; kindly share the same kind of document for my environment.

Thanks

tcooper commented 5 years ago

Please follow the instructions in the updated README.md file. If you are going to change to a different version of MLNX_OFED_LINUX then you should probably re-clone the repository source to ensure your build environment is clean.

sandipnetcom commented 5 years ago

Hello,

I don't know what I am missing. Today I removed the roll with rocks remove roll mlnx-ofed-4.3-1.0.1.0-3.10.0-693.5.2.el7, deleted the whole folder (export/site-roll/rocks/src/mlnx-ofed-roll), cloned the Git repository again, downloaded the OFED file (MLNX_OFED_LINUX-4.6-1.0.1.1-rhel7.4-x86_64.tgz), and placed it in /export/site-roll/rocks/src/mlnx-ofed-roll/src/mlnx-ofed-linux. Then I changed the version.mk file as below:

[root@mnode mlnx-ofed-linux]# cat version.mk
NAME = mlnx-ofed-linux
VERSION = 4.6
RELEASE = 1.0.1.1
DISTRO = rhel7.4
EXTRA = $(DISTRO)-x86_64
PKGROOT = /opt/mlnx-ofed-linux
SRC_SUBDIR = mlnx-ofed-linux
SOURCE_NAME = MLNX_OFED_LINUX
SOURCE_SUFFIX = tgz
SOURCE_VERSION = $(VERSION)-$(RELEASE)-$(EXTRA)
SOURCE_PKG = $(SOURCE_NAME)-$(SOURCE_VERSION).$(SOURCE_SUFFIX)
SOURCE_PKG_EXT = $(SOURCE_NAME)-$(SOURCE_VERSION)-ext.$(SOURCE_SUFFIX)
SOURCE_DIR = $(SOURCE_NAME)-$(SOURCE_VERSION)
SOURCE_DIR_EXT = $(SOURCE_NAME)-$(SOURCE_VERSION)-ext
TGZ_PKGS = $(SOURCE_PKG)
RPM.EXTRAS = AutoReq:No
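For reference, the variables in the version.mk shown above determine the tarball filename that the build looks for. The following sketch mirrors that expansion in shell. The function `source_pkg` is a hypothetical helper for illustration only; the roll itself performs this expansion in make.

```shell
# Mirrors the make variable expansion in the version.mk shown above.
# Hypothetical helper; not part of the roll.
source_pkg() {
    version=$1   # VERSION, e.g. 4.6
    release=$2   # RELEASE, e.g. 1.0.1.1
    distro=$3    # DISTRO,  e.g. rhel7.4
    extra="${distro}-x86_64"                          # EXTRA = $(DISTRO)-x86_64
    source_version="${version}-${release}-${extra}"   # SOURCE_VERSION
    printf 'MLNX_OFED_LINUX-%s.tgz\n' "$source_version"   # SOURCE_PKG
}

source_pkg 4.6 1.0.1.1 rhel7.4
# prints MLNX_OFED_LINUX-4.6-1.0.1.1-rhel7.4-x86_64.tgz
```

The printed name matches the downloaded file mentioned in this comment, so the filename itself is consistent with these version.mk values.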

After that I ran the build with make default 2>&1 | tee build.log, and this error appeared:

RPM build errors:
    Bad exit status from /var/tmp/rpm-tmp.fezdtw (%build)
make[2]: [rpm] Error 1
make[2]: Leaving directory `/state/partition1/site-roll/rocks/src/mlnx-ofed-roll/src/mlnx-ofed-linux'
make[1]: [rpm] Error 2
make[1]: Leaving directory `/state/partition1/site-roll/rocks/src/mlnx-ofed-roll/src'
make: *** [rpms] Error 2

Please help me

thanks Sandip

tcooper commented 5 years ago

Please look closely at the steps in the Alternate Versions section of the updated README.md.

It appears you may have missed the step where the binary_hashes file is updated with the file size, hash, and filename of your selected MLNX_OFED_LINUX source tar archive using the gen_hash.sh script.

The link to download that script from the sdsc/skeleton-roll repository on GitHub is provided in that section.
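To make the three values concrete, here is a minimal sketch that computes a size, hash, and filename for an archive. It is not gen_hash.sh: the thread does not show that script's output format or hash algorithm, so the function name `describe_archive`, the field order, and the use of SHA-256 are all assumptions. Use gen_hash.sh itself to produce the real binary_hashes entry.

```shell
# Hypothetical stand-in for illustration only; the real binary_hashes entry
# should come from gen_hash.sh, whose format and hash algorithm may differ.
describe_archive() {
    archive=$1
    size=$(( $(wc -c < "$archive") ))                  # file size in bytes
    hash=$(sha256sum "$archive" | awk '{print $1}')    # assumption: SHA-256
    printf '%s %s %s\n' "$size" "$hash" "$(basename "$archive")"
}

# Example:
# describe_archive MLNX_OFED_LINUX-4.6-1.0.1.1-rhel7.4-x86_64.tgz
```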

sandipnetcom commented 5 years ago

Hello, thanks for your time and support. Yes, you are right, I missed this step. I have now built my Rocks roll, but the issue is still the same: the roll does not install.

[root@mnode mlnx-ofed-roll]# rocks list roll
NAME VERSION ARCH ENABLED
base: 7.0 x86_64 yes
CentOS: 7.4.1708 x86_64 yes
core: 7.0 x86_64 yes
ganglia: 7.0 x86_64 yes
hpc: 7.0 x86_64 yes
htcondor: 8.6.8 x86_64 yes
kernel: 7.0 x86_64 yes
python: 7.0 x86_64 yes
sge: 7.0 x86_64 yes
Updates-CentOS-7.4.1708: 2017-12-01 x86_64 yes
cuda: 7.0 x86_64 yes
slurm: 7.0.0.220 x86_64 yes
sdsc: 7.0 x86_64 yes
mlnx-ofed-4.6-1.0.1.1-3.10.0-693.5.2.el7: 7.0 x86_64 yes
[root@mnode mlnx-ofed-roll]# rocks run roll mlnx-ofed-4.6-1.0.1.1-3.10.0-693.5.2.el7 | bash
Loaded plugins: fastestmirror, langpacks, nvidia
Cleaning repos: Rocks-7.0
Cleaning up everything
Maybe you want: rm -rf /var/cache/yum, to also free up space taken by orphaned data from disabled or removed repos
Cleaning up list of fastest mirrors.
[root@mnode mlnx-ofed-roll]#

kindly help me how to resolve this

Thanks Sandip

tcooper commented 5 years ago

Did you do what I suggested in my first response and implement the changes in the version.mk file or, alternately (and I did not suggest this), checkout the branch with the fix before attempting to build?

sandipnetcom commented 5 years ago

hello,

Thanks for your support. Yesterday I deleted everything and re-cloned, but I don't know why version.mk was not updated. Today I updated version.mk, rebuilt the ISO, removed and re-added the roll, and then ran the install again.

But today I found a new error:

Error: Package: 1:compat-dapl-devel-1.2.19-4.el7.x86_64 (@anaconda/7.0)
    Requires: libdaplscm.so.1()(64bit)
    Removing: 1:compat-dapl-1.2.19-4.el7.x86_64 (@anaconda/7.0)
        libdaplscm.so.1()(64bit)
    Obsoleted By: mlnx-ofed-all-user-only-4.6-1.0.1.1.skip.distro.check.noarch (Rocks-7.0)
        Not found
Error: Package: openmpi-devel-1.10.6-2.el7.x86_64 (@anaconda/7.0)
    Requires: liboshmem.so.8()(64bit)
    Removing: openmpi-1.10.6-2.el7.x86_64 (@anaconda/7.0)
        liboshmem.so.8()(64bit)
    Updated By: openmpi-4.0.2a1-1.46101.x86_64 (Rocks-7.0)
        ~liboshmem.so.40()(64bit)
Error: Package: libfabric-1.4.2-1.el7.i686 (Rocks-7.0)
    Requires: librdmacm.so.1(RDMACM_1.0)
Error: Package: openmpi-devel-1.10.6-2.el7.x86_64 (@anaconda/7.0)
    Requires: libmpi_mpifh.so.12()(64bit)
    Removing: openmpi-1.10.6-2.el7.x86_64 (@anaconda/7.0)
        libmpi_mpifh.so.12()(64bit)
    Updated By: openmpi-4.0.2a1-1.46101.x86_64 (Rocks-7.0)
        ~libmpi_mpifh.so.40()(64bit)
Error: Package: openmpi-devel-1.10.6-2.el7.x86_64 (@anaconda/7.0)
    Requires: libmpi_usempi.so.5()(64bit)
    Removing: openmpi-1.10.6-2.el7.x86_64 (@anaconda/7.0)
        libmpi_usempi.so.5()(64bit)
    Updated By: openmpi-4.0.2a1-1.46101.x86_64 (Rocks-7.0)
        ~libmpi_usempi.so.40()(64bit)
Error: Package: openmpi-1.10.6-2.el7.i686 (Rocks-7.0)
    Requires: libibverbs.so.1
Error: Package: libfabric-1.4.2-1.el7.i686 (Rocks-7.0)
    Requires: libibverbs.so.1
Error: Package: 1:compat-dapl-devel-1.2.19-4.el7.x86_64 (@anaconda/7.0)
    Requires: libdaplcma.so.1()(64bit)
    Removing: 1:compat-dapl-1.2.19-4.el7.x86_64 (@anaconda/7.0)
        libdaplcma.so.1()(64bit)
    Obsoleted By: mlnx-ofed-all-user-only-4.6-1.0.1.1.skip.distro.check.noarch (Rocks-7.0)
        Not found
Error: Package: libfabric-1.4.2-1.el7.i686 (Rocks-7.0)
    Requires: librdmacm.so.1
Error: Package: openmpi-1.10.6-2.el7.i686 (Rocks-7.0)
    Requires: libibverbs.so.1(IBVERBS_1.0)
Error: Package: openmpi-1.10.6-2.el7.i686 (Rocks-7.0)
    Requires: librdmacm.so.1
Error: Package: libfabric-1.4.2-1.el7.i686 (Rocks-7.0)
    Requires: libibverbs.so.1(IBVERBS_1.1)
Error: Package: openmpi-1.10.6-2.el7.i686 (Rocks-7.0)
    Requires: libibverbs.so.1(IBVERBS_1.1)
Error: Package: openmpi-1.10.6-2.el7.i686 (Rocks-7.0)
    Requires: libosmcomp.so.3
Error: Package: libfabric-1.4.2-1.el7.i686 (Rocks-7.0)
    Requires: libibverbs.so.1(IBVERBS_1.0)
Error: Package: openmpi-1.10.6-2.el7.i686 (Rocks-7.0)
    Requires: librdmacm.so.1(RDMACM_1.0)
Error: Package: 1:compat-dapl-devel-1.2.19-4.el7.x86_64 (@anaconda/7.0)
    Requires: compat-dapl = 1:1.2.19-4.el7
    Removing: 1:compat-dapl-1.2.19-4.el7.x86_64 (@anaconda/7.0)
        compat-dapl = 1:1.2.19-4.el7
    Obsoleted By: mlnx-ofed-all-user-only-4.6-1.0.1.1.skip.distro.check.noarch (Rocks-7.0)
        Not found
Error: Package: 1:compat-dapl-devel-1.2.19-4.el7.x86_64 (@anaconda/7.0)
    Requires: libdat.so.1()(64bit)
    Removing: 1:compat-dapl-1.2.19-4.el7.x86_64 (@anaconda/7.0)
        libdat.so.1()(64bit)
    Obsoleted By: mlnx-ofed-all-user-only-4.6-1.0.1.1.skip.distro.check.noarch (Rocks-7.0)
        Not found
You could try using --skip-broken to work around the problem
You could try running: rpm -Va --nofiles --nodigest
/etc/rc.d/rocksconfig.d/RCS/post-97-mlnx-ofed-tuning,v <-- /etc/rc.d/rocksconfig.d/post-97-mlnx-ofed-tuning
initial revision: 1.1
done
/etc/rc.d/rocksconfig.d/RCS/post-97-mlnx-ofed-tuning,v --> /etc/rc.d/rocksconfig.d/post-97-mlnx-ofed-tuning
revision 1.1 (locked)
done
touch: cannot touch ‘/etc/infiniband/openib.conf’: No such file or directory
mkdir: cannot create directory ‘/etc/infiniband/RCS’: No such file or directory
chown: cannot access ‘/etc/infiniband/RCS’: No such file or directory
ci: /etc/infiniband/RCS/openib.conf: No such file or directory
co: /etc/infiniband/RCS/openib.conf,v: No such file or directory
/tmp/tmpj9DgMV: line 65: /etc/infiniband/openib.conf: No such file or directory
/bin/cp: cannot stat ‘/etc/infiniband/openib.conf’: No such file or directory
[root@mnode install]#

Please help me. I know I am asking you about every step, but I am stuck.

Thanks Sandip

tcooper commented 5 years ago

Unfortunately there is not enough context in your comment to determine what commands you ran and what may have gone wrong although it appears that install of RPMs may have been attempted suggesting you've been more successful than before.

I suggest you repeat your steps saving all commands and output and put it all in a Gist.

Reply here with a link to the Gist and I can have a look.

Please provide the output of the following commands as well...

From your frontend and the system you are building the mlnx-ofed-linux roll on (if different)...

# hostname -s
# cat /etc/{rocks,redhat}*release
# uname -a
# rpm -qa | grep -iE "ofed"
# lsmod 2>&1 | awk '/^mlx/ {print $1}' | xargs modinfo -F filename

From the frontend of your system only...

# rocks list roll | grep -Ei "name|kernel|base|core|cent|mlnx|sdsc"
# rocks list host | grep -iE "membership|frontend|devel"
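One way to capture the requested output in a single file for a Gist is sketched below. The helper name `collect` and the report path are hypothetical. It only runs the command strings it is given and records each command followed by its output, using `bash -c` because one of the commands above relies on brace expansion.

```shell
# Hypothetical helper: run each command string and append the command and its
# output (stdout and stderr) to one report file.
collect() {
    report=$1
    shift
    : > "$report"                       # start with an empty report
    for cmd in "$@"; do
        printf '### %s\n' "$cmd" >> "$report"
        # Record a non-zero exit status instead of stopping at the first failure.
        bash -c "$cmd" >> "$report" 2>&1 || printf '(exit status %s)\n' "$?" >> "$report"
    done
}

# Example with three of the commands listed above:
# collect /tmp/mlnx-ofed-report.txt 'hostname -s' 'cat /etc/{rocks,redhat}*release' 'uname -a'
```

Keeping each command next to its output avoids the ambiguity of pasting output alone.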

sandipnetcom commented 5 years ago

Hello, thanks for your time and support. I cleaned up and reinstalled everything. Kindly check: I created two files, one is the build log and the other is the OFED reinstall log.
https://gist.github.com/sandipnetcom/ea6cbaa3e5f74f584b2249342d43e319.js

[root@mnode ~]# hostname -s
mnode
[root@mnode ~]# cat /etc/{rocks,redhat}*release
Rocks release 7.0 (Manzanita)
CentOS Linux release 7.4.1708 (Core)
[root@mnode ~]# uname -a
Linux mnode.nml.local 3.10.0-693.5.2.el7.x86_64 #1 SMP Fri Oct 20 20:32:50 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux
[root@mnode ~]# rpm -qa | grep -iE "ofed"
[root@mnode ~]#
[root@mnode ~]# lsmod 2>&1 | awk '/^mlx/ {print $1}' | xargs modinfo -F filename
/lib/modules/3.10.0-693.5.2.el7.x86_64/kernel/drivers/infiniband/hw/mlx4/mlx4_ib.ko.xz
/lib/modules/3.10.0-693.5.2.el7.x86_64/kernel/drivers/net/ethernet/mellanox/mlx4/mlx4_en.ko.xz
/lib/modules/3.10.0-693.5.2.el7.x86_64/kernel/drivers/net/ethernet/mellanox/mlx4/mlx4_core.ko.xz
[root@mnode ~]#
[root@mnode ~]# lsmod 2>&1 | awk '/^mlx/ {print $1}' | xargs modinfo -F filename
/lib/modules/3.10.0-693.5.2.el7.x86_64/kernel/drivers/infiniband/hw/mlx4/mlx4_ib.ko.xz
/lib/modules/3.10.0-693.5.2.el7.x86_64/kernel/drivers/net/ethernet/mellanox/mlx4/mlx4_en.ko.xz
/lib/modules/3.10.0-693.5.2.el7.x86_64/kernel/drivers/net/ethernet/mellanox/mlx4/mlx4_core.ko.xz
[root@mnode ~]# rocks list roll | grep -Ei "name|kernel|base|core|cent|mlnx|sdsc"
NAME VERSION ARCH ENABLED
base: 7.0 x86_64 yes
CentOS: 7.4.1708 x86_64 yes
core: 7.0 x86_64 yes
kernel: 7.0 x86_64 yes
Updates-CentOS-7.4.1708: 2017-12-01 x86_64 yes
sdsc: 7.0 x86_64 yes
mlnx-ofed-4.6-1.0.1.1-3.10.0-957.27.2.el7: 7.0 x86_64 yes
[root@mnode ~]# rocks list host | grep -iE "membership|frontend|devel"
HOST MEMBERSHIP CPUS RACK RANK RUNACTION INSTALLACTION
mnode: Frontend 24 0 0 os install
[root@mnode ~]#

Kindly help me

Thanks Sandip

sandipnetcom commented 5 years ago

Hello,

kindly help me please.

Thanks Sandip

tcooper commented 5 years ago

Looking at your output, this appears to be a Linux install issue and is no longer a roll-build issue. That's good news!

You appear to be attempting to run the roll on your cluster frontend. While this is possible in many cases, in your particular case OpenMPI-related RPMs (among others) are blocking the install. This is caused by MLNX_OFED_LINUX RPMs obsoleting distro-provided RPMs that the OpenMPI RPMs depend on. It's likely that removal of the OpenMPI RPMs from your frontend may allow the install to complete.
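To see which already-installed packages are named in the dependency errors, the yum output can be filtered. This is a minimal sketch, assuming the error lines have the form `Error: Package: NAME (@repo)` as in the output posted earlier in this thread, where yum marks installed packages with a repo id starting with `@`. The function name `blocking_packages` is hypothetical.

```shell
# Hypothetical helper: read yum output on stdin and print the unique packages
# from "Error: Package:" entries whose repo id starts with "@" (installed).
blocking_packages() {
    grep -oE 'Error: Package: [^ ]+ \(@[^)]*\)' \
        | awk '{print $3}' \
        | LC_ALL=C sort -u
}

# Example, assuming the failed install output was saved to install.log:
# blocking_packages < install.log
```

This only lists candidates. Whether removing any of them is safe depends on what else on the frontend uses them.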

The mlnx-ofed-linux roll is constructed to support install of the MLNX_OFED_LINUX software stack on Rocks systems during kickstart. The roll doesn't always cleanly install on nodes with distro provided Infiniband stack and/or previous versions of this roll. While I have replicated removal / install steps for a similar version of CentOS and mlnx-ofed-linux roll I don't have a matching system to be able to further debug installation specific issues you are having.

Please try reinstalling a compute node which has an Infiniband card to verify that the RPMs from mlnx-ofed-linux roll are being installed correctly. Assuming that is successful, and I have every reason to believe it will be, you can move back to working on the update of your frontend.

This may require removal of all Infiniband related software in order for the install to complete and, depending on the services provided by your frontend, may affect the Infiniband network in your entire system. You should plan accordingly.

Further assistance with correctly installing mlnx-ofed-linux should be requested on the Rocks Email Discussion List. More information about the Rocks Mailing List can be found here.

sandipnetcom commented 5 years ago

Hello, thanks for your support. I tried to reinstall a compute node with the OFED roll enabled, but it shows an error during installation. With the OFED roll disabled, the compute node installs successfully.

I am still in trouble. I need to install OFED; can you help me with how to install it in a Rocks cluster?

thanks

tcooper commented 5 years ago

Please take support requests for use of mlnx-ofed-roll to the Rocks Mailing List.