oraclebase / vagrant

Vagrant Builds
https://oracle-base.com/
GNU General Public License v3.0

RAC Install issue stuck at rpm part #24

AJKGitHub21 closed this issue 3 years ago.

AJKGitHub21 commented 3 years ago

Hello Guys,

git clone https://github.com/oraclebase/vagrant.git

After following the instructions for the RAC installation (DNS completed, node2 stuck, node1 pending), it gets stuck indefinitely at the point below. Can you please help me?

default:   Installing : 32:bind-utils-9.11.4-26.P2.el7_9.5.x86_64                  36/37

default: Transaction test succeeded
default: Running transaction
default: Installing : 32:bind-license-9.11.4-26.P2.el7_9.5.noarch 1/37
default: Installing : libXau-1.0.8-2.1.el7.x86_64 2/37
default: Installing : libxcb-1.13-1.el7.x86_64 3/37
default: Installing : libaio-0.3.109-13.el7.x86_64 4/37
default: Installing : libICE-1.0.9-9.el7.x86_64 5/37
default: Installing : libSM-1.2.2-2.el7.x86_64 6/37
default: Installing : libaio-devel-0.3.109-13.el7.x86_64 7/37
default: Installing : compat-libstdc++-33-3.2.3-72.el7.x86_64 8/37
default: Installing : compat-libcap1-1.10-7.el7.x86_64 9/37
default: Installing : ksh-20120801-142.0.1.el7.x86_64 10/37
default: Installing : libX11-common-1.6.7-3.el7_9.noarch 11/37
default: Installing : libX11-1.6.7-3.el7_9.x86_64 12/37
default: Installing : libXext-1.3.3-3.el7.x86_64 13/37
default: Installing : libXi-1.7.9-1.el7.x86_64 14/37
default: Installing : libXrender-0.9.10-1.el7.x86_64 15/37
default: Installing : libXrandr-1.5.1-2.el7.x86_64 16/37
default: Installing : libXtst-1.2.3-1.el7.x86_64 17/37
default: Installing : libXxf86misc-1.0.3-7.1.el7.x86_64 18/37
default: Installing : libdmx-1.1.3-3.el7.x86_64 19/37
default: Installing : libXinerama-1.1.3-2.1.el7.x86_64 20/37
default: Installing : libXv-1.0.11-1.el7.x86_64 21/37
default: Installing : libXxf86vm-1.1.4-1.el7.x86_64 22/37
default: Installing : libXxf86dga-1.1.4-2.1.el7.x86_64 23/37
default: Installing : xorg-x11-utils-7.5-23.el7.x86_64 24/37
default: Installing : libXt-1.1.5-3.el7.x86_64 25/37
default: Installing : libXmu-1.1.2-2.el7.x86_64 26/37
default: Installing : 1:xorg-x11-xauth-1.0.9-1.el7.x86_64 27/37
default: Installing : 1:smartmontools-7.0-2.el7.x86_64 28/37
default: Installing : lm_sensors-libs-3.4.0-8.20160601gitf9185e5.el7.x86_64 29/37
default: Installing : sysstat-10.1.5-19.el7.x86_64 30/37
default: Installing : libstdc++-devel-4.8.5-44.0.3.el7.x86_64 31/37
default: Installing : geoipupdate-2.5.0-1.el7.x86_64 32/37
default: Installing : GeoIP-1.5.0-14.el7.x86_64 33/37
default: Installing : 32:bind-libs-lite-9.11.4-26.P2.el7_9.5.x86_64 34/37
default: Installing : 32:bind-libs-9.11.4-26.P2.el7_9.5.x86_64 35/37
default: Installing : 32:bind-utils-9.11.4-26.P2.el7_9.5.x86_64 36/37

Thanks Ajay Kumar

oraclebase commented 3 years ago

Hi.

Not sure what to say on this. This is basic package installation, and I have no idea why it would fail. It's doing the install, so clearly the package downloads have completed. I think the only things that could stop it at this point are a lack of free CPU on your host or a problem with the disk on your host. Both seem unlikely, or you would have said something, I guess.

I think the best I can suggest is that you press CTRL+C, then run the following and see if it works the second time round.

vagrant destroy -f
vagrant up

Node 2 is really simple, so it's worrying you are failing at that point. It doesn't bode well for Node 1, which has to do all the hard work of installing the system.

Cheers

Tim...

AJKGitHub21 commented 3 years ago

Hello Tim,

Thanks for your response

I have a 6-core CPU and 32 GB of RAM with an SSD. Yes, you are right: I don't see any resource crunch.

60% of memory is available, 85% of CPU is available, and the SSD is only about 2% utilized.

On node2, I ran the following:

vagrant destroy -f
vagrant up

I performed the above steps two or three times and was able to see node1 and node2; however, I got some errors in the log:

default: The execution of the script is complete.
default: Check /u01/app/12.2.0.1/grid/install/root_ol7-122-rac1.localdomain_2021-08-14_08-59-31-009868796.log for the output of root script
default: Check /u01/app/12.2.0.1/grid/install/root_ol7-122-rac2.localdomain_2021-08-14_09-18-11-171304519.log for the output of root script
default: packet_write_wait: Connection to 192.168.56.102 port 22: Broken pipe

/u01/app/12.2.0.1/grid/root.sh

I'm not sure whether this script was also executed on node2.

default: Checking Temp space: must be greater than 500 MB. Actual 59731 MB Passed
default: Checking swap space: must be greater than 150 MB. Actual 2046 MB Passed
default: Preparing to launch Oracle Universal Installer from /tmp/OraInstall2021-08-14_11-35-50AM. Please wait ...[WARNING] [INS-06009] SSH performance is detected to be slow, which may impact performance during remote node operations like copying the software and executing prerequisite checks.
default: ACTION: Consider optimizing the ssh configuration.
default: [FATAL] [INS-35361] One or more selected nodes are down.
default: CAUSE: One or more selected nodes cannot be reached
default: ACTION: Ensure that all the nodes selected for operation are reachable.
default: **
default: Run DB root scripts. Sat Aug 14 11:36:10 UTC 2021
default: **
default: sh: /u01/app/oracle/product/12.2.0.1/dbhome_1/root.sh: No such file or directory
default: ssh: connect to host ol7-122-rac2 port 22: No route to host
default: **
default: Create database. Sat Aug 14 11:36:13 UTC 2021
default: **
default: /vagrant/scripts/oracle_create_database.sh: line 6: dbca: command not found
default: **
default: Save state of PDB to enable auto-start. Sat Aug 14 11:36:13 UTC 2021
default: **
default: /vagrant/scripts/oracle_create_database.sh: line 31: sqlplus: command not found
default: **
default: Check cluster configuration. Sat Aug 14 11:36:13 UTC 2021
default: **


It seems an SSH connectivity issue occurred during the installation.

CRS is not properly installed on node2, and the DB binaries are not installed on either node:

[oracle@ol7-122-rac2 ~]$ crsctl check crs
CRS-4638: Oracle High Availability Services is online
CRS-4535: Cannot communicate with Cluster Ready Services
CRS-4530: Communications failure contacting Cluster Synchronization Services daemon
CRS-4534: Cannot communicate with Event Manager
[oracle@ol7-122-rac2 ~]$
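
The `crsctl check crs` output above is the quickest health signal on a node: a fully started Clusterware stack reports four "online" messages (CRS-4638, CRS-4537, CRS-4529 and CRS-4533), whereas node2 only reports CRS-4638. A minimal sketch that turns this into a pass/fail check; the function name is hypothetical, and it assumes the `crsctl` output is piped in on stdin.

```shell
#!/usr/bin/env bash
# Sketch: decide from "crsctl check crs" output whether the Clusterware
# stack on a node is fully up. ASSUMPTION: the output arrives on stdin,
# so it can come from a live command or from a saved copy.

crs_stack_healthy() {
  local out code missing=0
  out="$(cat)"
  # CRS-4638: OHAS online, CRS-4537: CRS online,
  # CRS-4529: CSS online,  CRS-4533: EVM online.
  for code in CRS-4638 CRS-4537 CRS-4529 CRS-4533; do
    if ! printf '%s\n' "$out" | grep -q "^$code:"; then
      echo "missing: $code" >&2
      missing=1
    fi
  done
  # 0 only when all four "online" messages were present.
  return "$missing"
}
```

For example, on each node: `crsctl check crs | crs_stack_healthy || echo "clusterware is not fully up on $(hostname)"`. `crsctl check cluster -all` gives a similar per-node view for the whole cluster from a single node.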

Can you please help me fix the issue?

The log (log.txt) is attached for reference.

Thanks Ajay

oraclebase commented 3 years ago

You should not touch node 1 until node 2 is up and successful.

Destroy everything and follow the instructions in order.

AJKGitHub21 commented 3 years ago

Hello Tim,

I followed the instructions as per the readme file.

I believe you must have found something in the log that was not right. It would be great if you could let me know what the issue is.

In the meantime, let me try to start over.

Thanks Ajay

oraclebase commented 3 years ago

Your response is unclear. Please confirm you did the following.

Destroyed everything. Started DNS without errors. Once DNS start was complete, started node2 without errors. Once node2 start was complete, started node1.
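
The ordered start-up described here can be scripted so that each VM is started only after the previous `vagrant up` has finished. A minimal sketch, assuming a base directory that contains `dns`, `node2` and `node1` subdirectories, each with its own Vagrantfile (for example `rac/ol7_122` in the clone; check your checkout). `VAGRANT_CWD` is Vagrant's standard variable for pointing it at a Vagrantfile directory; the `DRY_RUN` switch and the function names are hypothetical additions for illustration.

```shell
#!/usr/bin/env bash
# Sketch: destroy all three VMs, then start DNS, node2 and node1 in that
# order, stopping at the first "vagrant up" that exits with an error.
# ASSUMPTIONS: <base>/dns, <base>/node2 and <base>/node1 each hold a
# Vagrantfile; DRY_RUN is a hypothetical preview switch.

# Print the command instead of running it when DRY_RUN is non-empty.
run() {
  if [ -n "${DRY_RUN:-}" ]; then
    echo "$*"
  else
    "$@"
  fi
}

rebuild_rac() {
  local base="${1:?usage: rebuild_rac <base directory>}"
  local vm
  # Destroy every VM first; VAGRANT_CWD selects the Vagrantfile directory.
  for vm in node1 node2 dns; do
    run env VAGRANT_CWD="$base/$vm" vagrant destroy -f || return 1
  done
  # Start in the required order; never move on after a failed start.
  for vm in dns node2 node1; do
    if ! run env VAGRANT_CWD="$base/$vm" vagrant up; then
      echo "vagrant up failed for $vm; not starting the next VM" >&2
      return 1
    fi
  done
}
```

For example, `DRY_RUN=1 rebuild_rac ~/vagrant/rac/ol7_122` previews the commands, and dropping `DRY_RUN` runs them. Stopping on failure only helps when `vagrant up` actually returns a non-zero status; errors inside the provisioning scripts (like the `dbca: command not found` lines in this thread) may not propagate, so the output of each step still needs reading before moving on.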

Note: At the bottom of the README.txt it explains possible issues with the DHCP server. I've not seen your error before, but it appears to be network-related, so it's worth disabling the DHCP server as instructed. Your original error and this one both appear to be network-related.
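
In a default VirtualBox setup, the host-only DHCP server typically hands out addresses from 192.168.56.101 upwards, which can overlap the static 192.168.56.102 seen for node2 in the logs in this thread. Besides the VirtualBox GUI, the DHCP server can be disabled with `VBoxManage dhcpserver`. A minimal sketch, assuming VirtualBox 6.1 option syntax (`--interface=`; older releases used different option names) and a host-only interface name such as `vboxnet0`; the function name is hypothetical, and interface names differ between hosts, so confirm yours with `VBoxManage list dhcpservers` first.

```shell
#!/usr/bin/env bash
# Sketch: disable the VirtualBox DHCP server on a host-only network.
# ASSUMPTIONS: VirtualBox 6.1 option syntax (--interface=...), and the
# host-only interface name (e.g. vboxnet0) passed as the first argument.

disable_hostonly_dhcp() {
  local ifname="${1:?usage: disable_hostonly_dhcp <host-only interface>}"
  # List the DHCP servers VirtualBox knows about, so the interface
  # name can be checked against what is actually configured.
  VBoxManage list dhcpservers
  # Turn off the DHCP server attached to that host-only interface.
  VBoxManage dhcpserver modify --interface="$ifname" --disable
}
```

For example, `disable_hostonly_dhcp vboxnet0`, then destroy and rebuild the VMs in order (DNS, node2, node1).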

There is really no point moving forward to node1 if node 2 has not started without error!

AJKGitHub21 commented 3 years ago

Yep Tim, I followed exactly the steps you described here and in the README file :)

Destroyed everything. Started DNS without errors. Once DNS start was complete, started node2 without errors. Once node2 start was complete, started node1.

Thanks Ajay

oraclebase commented 3 years ago

I don't know what to say then.

Your first error looked like the VM just hanging. Your second error looked like a networking fault. I've not encountered these issues before, and I've run this installation many times...

What I would try next is:

This is clutching at straws because I don't know why you are having these problems.

AJKGitHub21 commented 3 years ago

Hello Tim,

I followed the steps again as per the README.


default: Run grid root scripts. Sat Aug 14 19:45:03 UTC 2021
default: ******************************************************************************
default: Changing permissions of /u01/app/oraInventory.
default: Adding read,write permissions for group.
default: Removing read,write,execute permissions for world.
default:
default: Changing groupname of /u01/app/oraInventory to oinstall.
default: The execution of the script is complete.
default: Changing permissions of /u01/app/oraInventory.
default: Adding read,write permissions for group.
default: Removing read,write,execute permissions for world.
default:
default: Changing groupname of /u01/app/oraInventory to oinstall.
default: The execution of the script is complete.

default: Check /u01/app/12.2.0.1/grid/install/root_ol7-122-rac1.localdomain_2021-08-14_19-45-03-429277002.log for the output of root script
default: Check /u01/app/12.2.0.1/grid/install/root_ol7-122-rac2.localdomain_2021-08-14_20-04-41-855430194.log for the output of root script
default: packet_write_wait: Connection to 192.168.56.102 port 22: Broken pipe ===> The problem seems to start here <====
default: **
default: Do grid configuration. Sat Aug 14 22:17:40 UTC 2021
default: **

default: packet_write_wait: Connection to 192.168.56.102 port 22: Broken pipe => at this point, a connection issue occurs.

Attached: logNode1.txt, logNode2.txt, dnsLog.txt and logof2files.txt (logof2files.txt contains root_ol7-122-rac1.localdomain_2021-08-14_19-45-03-429277002.log and /u01/app/12.2.0.1/grid/install/root_ol7-122-rac2.localdomain_2021-08-14_20-04-41-855430194.log).

During the installation, I disabled DHCP on the network interface.

Could you please have a look?

I've attached the logs for reference.

Thanks Ajay Kumar

oraclebase commented 3 years ago

Looking at the log for node1, I see this.

default: ******************************************************************************
default: Run grid root scripts. Sat Aug 14 19:45:03 UTC 2021
default: ******************************************************************************
default: Changing permissions of /u01/app/oraInventory.
default: Adding read,write permissions for group.
default: Removing read,write,execute permissions for world.
default:
default: Changing groupname of /u01/app/oraInventory to oinstall.
default: The execution of the script is complete.
default: Changing permissions of /u01/app/oraInventory.
default: Adding read,write permissions for group.
default: Removing read,write,execute permissions for world.
default:
default: Changing groupname of /u01/app/oraInventory to oinstall.
default: The execution of the script is complete.
default: Check /u01/app/12.2.0.1/grid/install/root_ol7-122-rac1.localdomain_2021-08-14_19-45-03-429277002.log for the output of root script
default: Check /u01/app/12.2.0.1/grid/install/root_ol7-122-rac2.localdomain_2021-08-14_20-04-41-855430194.log for the output of root script
default: packet_write_wait: Connection to 192.168.56.102 port 22: Broken pipe

So at the point where it is trying to run the root scripts on node2 after the grid installation, there is a network failure. If you look at the output below that, you can see the cluster services are only running on node1, not node2.
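
Finding this point meant reading the node1 output for the first network error. The same search can be done mechanically on a saved log, for example one captured with `vagrant up 2>&1 | tee logNode1.txt`. A minimal sketch; the function name is hypothetical, and the patterns are simply the error messages quoted in this thread, not a complete list of possible failures.

```shell
#!/usr/bin/env bash
# Sketch: print "line:text" for the first line of a saved vagrant log
# that matches one of the failure messages seen in this thread.
# Returns non-zero (and prints nothing) when no such line exists.

first_failure() {
  local log="${1:?usage: first_failure <log file>}"
  grep -n -m 1 -E \
    'packet_write_wait|Broken pipe|No route to host|INS-35361|command not found|No such file or directory' \
    "$log"
}
```

For example, `first_failure logNode1.txt` would point at the `packet_write_wait` line discussed above, which is the place to start reading rather than the later `dbca`/`sqlplus` errors that follow from it.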

It seems that at some point during the installation process your node2 is either dying or dropping off the network. That suggests there is something wrong with VirtualBox on your host, or with the host machine itself. There is nothing I can do about this.

AJKGitHub21 commented 3 years ago

Thanks Tim,

Let me try to fix that issue. I am sure something is not right when the SSH connection is made to node2.

Thanks Ajay

oraclebase commented 3 years ago

FYI: I spent yesterday running builds for 21c, including RAC builds for OL7 and OL8. Both worked fine. I based those on the ol7_19 and ol8_19 builds.

AJKGitHub21 commented 3 years ago

OK, great. However, I am still diagnosing the issue :( :(

oraclebase commented 3 years ago

I only mentioned it to show that the approach is working consistently for me, so it's not like a general flaw in the process...

AJKGitHub21 commented 3 years ago

Yep, I understand that the process works. I guess I may have hit a VirtualBox bug or something. I upgraded to VirtualBox version 6.1.26 r145957 (Qt5.6.2) and things went fine.

Thanks again..!

Regards Ajay Kumar

oraclebase commented 3 years ago

Great. Thanks for the feedback.