ibmdb / python-ibmdb

Automatically exported from code.google.com/p/ibm-db
Apache License 2.0
304 stars 191 forks source link

Second connection to HADR virtual IP hung #647

Closed mikegong closed 2 years ago

mikegong commented 3 years ago

ibm_db.connect was hung in its second connection. Tested to connect to the same HADR primary server, first connection use host name, second connection use IP address, or vise versa, worked. If use the same host name or same IP address in two connections, it did not work. This problem only happened with HADR database. There is no problem to connect to a standalone database repeatedly.

It appeared that something was cached and messed things up. Is there anyway to workaround this? Thanks.

imavo commented 3 years ago

Maybe your problem is nothing to do with python, but instead linked your HADR and clidriver/connection-string configuration?

Please provide the following details:

(1) Has the DBA correctly enabled automatic client reroute on both the primary and secondaries? Give details how that configuration was performed.

(2) How have you verified that the HADR failover/takeover correctly moved the virtual IP-address/hostname to the standby?

(3) What is the version of your clidriver for python ibm_db (or Db2 driver) - that is to say, what is the text output of its db2level command on the hostname where python runs?

(4) Is your Db2-server configuration a purescale environment?

(5) On your client workstation where python runs, are you using a DSN (configured in db2dsdriver.cfg), or are you hardcoding the connection-string in the python code in the ibm_db_connect() ?

(6) in your db2dsdriver.cfg on the client, or in your connection-string in python, have you set QueryTimeout, or any other ACR related configuration setting?

The Db2-LUW documentation describes at least eight parameters (for db2dsdriver.cfg or connection string) that control how the automatic client reroute behaves for non-java clients (i.e. for cli, like python). Pay attention to their default values, which might differ from your assumptions, and test different settings in your environment.

See the documentation here https://www.ibm.com/docs/en/db2/11.5?topic=njcshacd-configuration-db2-automatic-client-reroute-support-applications-other-than-java

mikegong commented 3 years ago

Maybe your problem is nothing to do with python, but instead linked your HADR and clidriver/connection-string configuration?

Please provide the following details:

(1) Has the DBA correctly enabled automatic client reroute on both the primary and secondaries? Give details how that configuration was performed.

(2) How have you verified that the HADR failover/takeover correctly moved the virtual IP-address/hostname to the standby?

(3) What is the version of your clidriver for python ibm_db (or Db2 driver) - that is to say, what is the text output of its db2level command on the hostname where python runs?

(4) Is your Db2-server configuration a purescale environment?

(5) On your client workstation where python runs, are you using a DSN (configured in db2dsdriver.cfg), or are you hardcoding the connection-string in the python code in the ibm_db_connect() ?

(6) in your db2dsdriver.cfg on the client, or in your connection-string in python, have you set QueryTimeout, or any other ACR related configuration setting?

The Db2-LUW documentation describes at least eight parameters (for db2dsdriver.cfg or connection string) that control how the automatic client reroute behaves for non-java clients (i.e. for cli, like python). Pay attention to their default values, which might differ from your assumptions, and test different settings in your environment.

See the documentation here https://www.ibm.com/docs/en/db2/11.5?topic=njcshacd-configuration-db2-automatic-client-reroute-support-applications-other-than-java

Hi imavo,

Thanks a lot for the information! Here is my test, using connection string, no db2dsdriver.cfg setup.

1) ibm_db.connect() to primary node with success 2) ibm_db.close() 3) ibm_db.connect() to primary node again 4) step 3 was hung

In db2diag.log on client side,

FUNCTION: DB2 UDB, common communication, sqlcctcpconnr, probe:42 MESSAGE : DIA3222E The host name "" was not found, TCP/IP function "getaddrinfo" error return code = "11001".

FUNCTION: DB2 UDB, DRDA Communication Manager, sqljcCommConnect, probe:10 MESSAGE : ZRC=0x81360012=-2127167470=SQLZ_RC_CMERR, SQLT_SQLJC "External Comm error" DATA #1 : String, 11 bytes CCI Error: DATA #2 : unsigned integer, 4 bytes 97

FUNCTION: DB2 UDB, common communication, sqlcctcplogIPAddress, probe:50 MESSAGE : DIA3220W Client sent data from IP address = "", port = "25066".

FUNCTION: DB2 UDB, DRDA Application Requester, sqljrRetrySetup, probe:14 MESSAGE : ZRC=0x8037006D=-2143879059=SQLJR_CABLT "CA BUILT" DATA #1 : String, 30 bytes Client Reroute is starting....

FUNCTION: DB2 UDB, DRDA Application Requester, sqljrRetrySetup, probe:15 DATA #1 : String, 50 bytes The correlation token of the failed connection is DATA #2 : String, 32 bytes

FUNCTION: DB2 UDB, DRDA Application Requester, sqljrListToConnect, probe:20 DATA #1 : String, 40 bytes Reconnecting to Hostname/IP Address --> DATA #2 : String, 13 bytes

DATA #3 : String, 45 bytes Reconnecting to Service name/Port number --> DATA #4 : String, 5 bytes FUNCTION: DB2 UDB, common communication, sqlcctcpconnr, probe:42 MESSAGE : DIA3222E The host name "" was not found, TCP/IP function "getaddrinfo" error return code = "11001". I tried to disable ACR using db2dsdriver.cfg and still got problem. BTW, in the cached srvrlst.xml, I saw domain name of primary node was missing in alternate server list. Most likely the problem is the configured ALTERNATE SERVER on standby node is missing domain name. However, the behavior of ibm_db is still questionable. Why it tried to connect to standby node when the code actually wanted to connect to primary node even if connection to primary node was not a problem? Is there any chance to clean up cached alternate server list? Thanks
imavo commented 3 years ago

ACR cannot operate properly when the Db2-server configuration is inadequate, and Db2-client configuration is inadquate, and no workstation specific workaround is specified.

See the technote https://www.ibm.com/support/pages/db2-automatic-client-reroute-alternate-server-list-srvrlstxml-contains-incorrect-or-unresolvable-hostname

This is not a matter for python ibm_db, it is purely a configuration matter for the Db2-server and Db2-client configuration.

You can prove this by temporarily removing python-and-ibm_db by recreating your test scenario with the db2cli tooling - which can run a connection test at the command line for properly configured clidriver db2dsdriver.cfg configurations.

mikegong commented 3 years ago

I tested Db2 native connection by setting up catalog, and JDBC connection with connection string. There was no problem. There was also no problem with python ibm_db without Db2 client.

imavo commented 3 years ago

jdbc does not share any code with any ibm supplied CLI client. python ibm_db will not use a jdbc driver , although bridging software can force this to happen.

Consequently for the purposes of this question, the jdbc behaviour is irrelevant for CLI driver operation.

The python ibm_db module requires a Db2 client to communicate with Db2, and will not connect withou a CLI interface provider (e.g a CLI driver). For that reason, python ibm_db installs by default the clidriver (a tiny footprint Db2 client ), although python ibm_db can use any comapatible CLI driver if so configured , from IBM or third parties.

mikegong commented 3 years ago

More tests show this hung problem occurred on AIX but not on RedHat Linux. This hung problem doesn't happen with JDBC driver either. Kindly please consider removing hung from ibm_db driver. A workaround could be adding a config parameter to disable ACR from ibm_db connection. BTW, ibm_db connection crashed on AIX with Db2 CLI driver. Thanks!

amukherjee28 commented 3 years ago

Hi @mikegong

I have tested the HADR failover scenario using python application and were able to complete my tests without any error.

Please find below the details of my analysis.

APPLICATION CODE

import ibm_db
import time

print("IBM_DB_VERSION : " + str(ibm_db.__version__))
print("=> Creating Connection")

#conn = ibm_db.connect("DATABASE=HADRDB;HOSTNAME=HADRHOSTNAME;PORT=PORT;UID=USER;PWD=PASSWORD;PROTOCOL=TCPIP",'','')
conn = ibm_db.connect("DSN=MEM1;UID=USER;PWD=PASSWORD",'','')

print("=> Creating Tables")

try:
    res = ibm_db.exec_immediate(conn,"DROP TABLE HADR_TEST")
except:
    print("Already dropped")

create_stmt = "Create TABLE HADR_TEST (c1 int)"
print("=> RUNNING : " + create_stmt)
try:
    result = ibm_db.exec_immediate(conn,create_stmt)
    print(result)
except:
    #result = ibm_db.exec_immediate(conn,"DROP TABLE hadr_test")
    #result = ibm_db.exec_immediate(conn,create_stmt)
    print("=> Table exists....re-created")

insert_val = "insert into hadr_test values (1)"
print("=> RUNNING : " + insert_val)
try:
    ibm_db.exec_immediate(conn,insert_val)
except Exception as e:
    print(str(e))
insert_val = "insert into hadr_test values (2)"
print("=> RUNNING : " + insert_val)
try:
    ibm_db.exec_immediate(conn,insert_val)
except Exception as e:
    print(str(e))
insert_val = "insert into hadr_test values (3)"
print("=> RUNNING : " + insert_val)
try:
    ibm_db.exec_immediate(conn,insert_val)
except Exception as e:
    print(str(e))

time.sleep(40)

insert_val = "insert into hadr_test values (4)"
print("=> RUNNING : " + insert_val)
try:
    ibm_db.exec_immediate(conn,insert_val)
except Exception as e:
    print(str(e))
insert_val = "insert into hadr_test values (5)"
print("=> RUNNING : " + insert_val)
try:
    ibm_db.exec_immediate(conn,insert_val)
except Exception as e:
    print(str(e))
insert_val = "insert into hadr_test values (6)"
print("=> RUNNING : " + insert_val)
try:
    ibm_db.exec_immediate(conn,insert_val)
except Exception as e:
    print(str(e))

print("=> Getting field type of the table")

res = "select * from hadr_test"

exe = ibm_db.exec_immediate(conn,res)

row = ibm_db.fetch_tuple(exe)
while row:
    print(row[0])
    row = ibm_db.fetch_tuple(exe)

SCENARIO 1

In this case I have configured the alternate server list in the db2dsdriver.cfg file and used the alias name to create connection to the active member of the HADR. Once the first connection is established, during the sleep time in the application code I am switching the HADR members roles and once the application comes out from the sleep I could see the FAILOVER happening and all workloads are re-routed via the next member.

CFG FILE USE

<configuration>
<dsncollection>
<dsn alias="MEM1" name="HADRDB" host="HOSTNAME1" port="PORT1"/>
</dsncollection>
<databases>
   <database name="HADRDB" host="HOSTNAME1" port="PORT1">
        <acr>
         <parameter name="enableACR" value="true"/>
         <parameter name="enableSeamlessACR" value="false"/>
         <alternateserverlist>
                <server name="S1" hostname="HOSTNAME1" port="PORT1"/>
                <server name="S2" hostname="HOSTNAME2" port="PORT2"/>
         </alternateserverlist>
         <affinitylist>
                <list name="list1" serverorder="S1,S2"/>
         </affinitylist>
         <clientaffinitydefined>
                <client name="client1" hostname="localhost" listname="list1"/>
         </clientaffinitydefined>
      </acr>
   </database>
</databases>
<parameters>
</parameters>
</configuration>

OUTPUT

python3 hadr1.py
IBM_DB_VERSION : 3.0.4
=> Creating Connection
=> Creating Tables
=> RUNNING : Create TABLE HADR_TEST (c1 int)
<ibm_db.IBM_DBStatement object at 0x7f195548c3b0>
=> RUNNING : insert into hadr_test values (1)
=> RUNNING : insert into hadr_test values (2)
=> RUNNING : insert into hadr_test values (3)
=> RUNNING : insert into hadr_test values (4)
[IBM][CLI Driver][DB2/NT64] SQL30108N  A connection failed in an automatic client reroute environment. The transaction was rolled back. Host name or IP address: "HOSTNAME". Service name or port number: "50000". Reason code: "1". Connection failure code: "2". Underlying error: "*".  SQLSTATE=08506 SQLCODE=-30108
=> RUNNING : insert into hadr_test values (5)
=> RUNNING : insert into hadr_test values (6)
=> Getting field type of the table
1
2
3
5
6

SCENARIO 2

In this case the alternate server is configured as part of the server configuration using the command

db2 update alternate server for database DATABASE_NAME using hostname HOSTNAME2 port PORT2

So the failover details are already configured as part of the server configuration.

Hence in this case the application used full connection string for making the first connection and during the sleep the member role was switched.

Trace shows the failover happened during the role switch.

SQLExecDirectW( hStmt=1:3, pszSqlStr="insert into hadr_test values (4)", cbSqlStr=-3 )
    ---> Time elapsed - +0.000000E+000 seconds
( StmtOut="insert into hadr_test values (4)" )
( Package="SYSSH200          ", Section=65 )
( ROLLBACK REPLY RECEIEVED=1 )
( ROLLBACK REPLY RECEIEVED=1 )
( Unretrieved error message="[IBM][CLI Driver][DB2/NT64] SQL30108N  A connection failed in an automatic client reroute environment. The transaction was rolled back. Host name or IP address: "HOSTNAME". Service name or port number: "PORT". Reason code: "1". Connection failure code: "2". Underlying error: "*".  SQLSTATE=08506
" )
( Unretrieved error message="[IBM][CLI Driver][DB2/NT64] SQL30108N  A connection failed in an automatic client reroute environment. The transaction was rolled back. Host name or IP address: "HOSTNAME". Service name or port number: "PORT". Reason code: "1". Connection failure code: "2". Underlying error: "*".  SQLSTATE=08506
" )
( StmtOut="insert into hadr_test values (4)" )
( Package="SYSSH200          ", Section=65 )
( COMMIT REPLY RECEIVED=1 )

SQLExecDirectW( )
    <--- SQL_SUCCESS   Time elapsed - +0.000000E+000 seconds

OUTPUT

 python3 hadr1.py
IBM_DB_VERSION : 3.0.4
=> Creating Connection
=> Creating Tables
=> RUNNING : Create TABLE HADR_TEST (c1 int)
<ibm_db.IBM_DBStatement object at 0x7f7bd88b73b0>
=> RUNNING : insert into hadr_test values (1)
=> RUNNING : insert into hadr_test values (2)
=> RUNNING : insert into hadr_test values (3)
=> RUNNING : insert into hadr_test values (4)
=> RUNNING : insert into hadr_test values (5)
=> RUNNING : insert into hadr_test values (6)
=> Getting field type of the table
1
2
3
4
5
6

Please let me know if the above explanation is in par with your application handling as well.

Thanks

amukherjee28 commented 3 years ago

Is there any update on this case.

In case the explanation provided above work can this issue be closed?

amukherjee28 commented 2 years ago

Closing on the case as there is no further update on this. Explanation with example provided.