oracle / odpi

ODPI-C: Oracle Database Programming Interface for Drivers and Applications
https://oracle.github.io/odpi/
Other
264 stars 75 forks source link

core dumps generated in libclntsh when timeouts are configured #165

Open quidstone opened 2 years ago

quidstone commented 2 years ago
  1. What version of ODPI-C are you using (see dpi.h)? the version is 4.2.1
  2. What OS (and version) is your application on? Red Hat Enterprise Linux Server release 7.9
  3. What compiler (and version) did you use? gcc (GCC) 4.8.5
  4. What is your version of the Oracle Client (e.g. Instant Client)? How was it installed? Where it is installed? 19.11, I installed it as an rpm. installed location is /usr/lib/oracle/19.11/client64
  5. What is your Oracle Database version? 11.2
  6. What is the PATH environment variable (on Windows) or LD_LIBRARY_PATH (on Linux) set to? LD_LIBRARY_PATH is set to the odpi-c lib location
  7. What environment variables did you set? How exactly did you set them? application is running as a service.
  8. What problem or error(s) you are seeing? We have added timeouts is sqlnet.ora as we face db space issue for a partitioned table quite a lot. After adding that whenever the issue occurs, every time we get few core dumps. I am sharing both core dump backtrace and sqlnet.ora file entry. backtrace:(some of the stack is omitted)
    Program terminated with signal 11, Segmentation fault.
    #0  0x00007fe09b68bd00 in pthread_mutex_lock () from /lib64/libpthread.so.0
    Missing separate debuginfos, use: debuginfo-install bzip2-libs-1.0.6-13.el7.x86_64 elfutils-libelf-0.176-5.el7.x86_64 elfutils-libs-0.176-5.el7.x86_64 glibc-2.17-324.el7_9.x86_64 libaio-0.3.109-13.el7.x86_64 libattr-2.4.46-13.el7.x86_64 libcap-2.22-11.el7.x86_64 libgcc-4.8.5-44.el7.x86_64 numactl-libs-2.0.12-5.el7.x86_64 systemd-libs-219-78.el7_9.3.x86_64 xz-libs-5.2.2-1.el7.x86_64 zlib-1.2.7-19.el7_9.x86_64
    (gdb) bt
    #0  0x00007fe09b68bd00 in pthread_mutex_lock () from /lib64/libpthread.so.0
    #1  0x00007fe097d59323 in kpucopyuserhndl () from /usr/lib/oracle/19.11/client64/lib/libclntsh.so
    #2  0x00007fe097df8142 in kpuspextend () from /usr/lib/oracle/19.11/client64/lib/libclntsh.so
    #3  0x00007fe097dfa5ac in kpuspgetfreesession () from /usr/lib/oracle/19.11/client64/lib/libclntsh.so
    #4  0x00007fe097df6f6f in kpuspgetpooledsession () from /usr/lib/oracle/19.11/client64/lib/libclntsh.so
    #5  0x00007fe097df304d in kpuspsessionget () from /usr/lib/oracle/19.11/client64/lib/libclntsh.so
    #6  0x00007fe097b33d01 in OCISessionGet () from /usr/lib/oracle/19.11/client64/lib/libclntsh.so
    #7  0x00007fe09b4623a4 in dpiOci__sessionGet (envHandle=0x23bd320, handle=handle@entry=0x7fdfbc00d680, authInfo=authInfo@entry=0x7fe0300391f0,
    connectString=connectString@entry=0x23f2088 "OCI:SP:zlx4B3WlO3fgU6hJZQrcuC", connectStringLength=connectStringLength@entry=29, tag=<optimized out>,
    tagLength=0, outTag=outTag@entry=0x7fe0707d78c8, outTagLength=outTagLength@entry=0x7fe0707d78d0, found=found@entry=0x7fe0707d78d4, mode=mode@entry=1,
    error=error@entry=0x7fe0707d7850) at src/dpiOci.c:3136
    #8  0x00007fe09b446f1c in dpiConn__getSession (error=0x7fe0707d7850, authInfo=0x7fe0300391f0, params=0x7fe0707d7870, connectStringLength=29,
    connectString=0x23f2088 "OCI:SP:zlx4B3WlO3fgU6hJZQrcuC", mode=1, conn=0x7fdfbc00d660) at src/dpiConn.c:861
    #9  dpiConn__get (error=0x7fe0707d7850, pool=0x23bd180, createParams=0x7fe0707d7870, connectStringLength=29,
    connectString=0x23f2088 "OCI:SP:zlx4B3WlO3fgU6hJZQrcuC", passwordLength=<optimized out>, password=<optimized out>, userNameLength=<optimized out>,
    userName=<optimized out>, conn=0x7fdfbc00d660) at src/dpiConn.c:636
    #10 dpiConn__create (conn=0x7fdfbc00d660, context=<optimized out>, userName=userName@entry=0x0, userNameLength=userNameLength@entry=0,
    password=password@entry=0x0, passwordLength=passwordLength@entry=0, connectString=0x23f2088 "OCI:SP:zlx4B3WlO3fgU6hJZQrcuC", connectStringLength=29,
    pool=pool@entry=0x23bd180, commonParams=commonParams@entry=0x0, createParams=createParams@entry=0x7fe0707d7870, error=error@entry=0x7fe0707d7850)
    at src/dpiConn.c:414
    #11 0x00007fe09b450d71 in dpiPool__acquireConnection (pool=pool@entry=0x23bd180, userName=userName@entry=0x0, userNameLength=userNameLength@entry=0,
    password=password@entry=0x0, passwordLength=0, params=params@entry=0x7fe0707d7870, conn=conn@entry=0x7fe0707d7958, error=error@entry=0x7fe0707d7850)
    at src/dpiPool.c:36
    #12 0x00007fe09b450f1d in dpiPool_acquireConnection (pool=pool@entry=0x23bd180, userName=userName@entry=0x0, userNameLength=userNameLength@entry=0,
    password=password@entry=0x0, passwordLength=passwordLength@entry=0, params=0x7fe0707d7870, params@entry=0x0, conn=conn@entry=0x7fe0707d7958)
    at src/dpiPool.c:341

sqlnet.ora file contents:

SQLNET.RECV_TIMEOUT=200ms
SQLNET.OUTBOUND_CONNECT_TIMEOUT=300ms
DIAG_ADR_ENABLED=OFF
DIAG_SIGHANDLER_ENABLED=FALSE
DIAG_DDE_ENABLED=FALSE
  1. Do you have a runnable code snippet to describe the problem? no
cjbj commented 2 years ago
quidstone commented 2 years ago

Fatal NI connect error 12170.

VERSION INFORMATION: TNS for Linux: Version 19.0.0.0.0 - Production TCP/IP NT Protocol Adapter for Linux: Version 19.0.0.0.0 - Production Version 19.11.0.0.0 Time: 15-OCT-2021 14:13:35 Tracing not turned on. Tns error struct: ns main err code: 12535 TNS-12535: Message 12535 not found; No message file for product=network, facility=TNS ns secondary err code: 12609 nt main err code: 0 nt secondary err code: 0 nt OS err code: 0 Client address: (ADDRESS=(PROTOCOL=tcp)(HOST=)(PORT=))


- Our production is running with 11.2 and we don't have any other options right now.
quidstone commented 2 years ago

now we are getting this,

#0  0x00007f496269b4fb in raise () from /lib64/libpthread.so.0
Missing separate debuginfos, use: debuginfo-install bzip2-libs-1.0.6-13.el7.x86_64 elfutils-libelf-0.176-5.el7.x86_64 elfutils-libs-0.176-5.el7.x86_64 glibc-2.17-324.el7_9.x86_64 libaio-0.3.109-13.el7.x86_64 libattr-2.4.46-13.el7.x86_64 libcap-2.22-11.el7.x86_64 libgcc-4.8.5-44.el7.x86_64 numactl-libs-2.0.12-5.el7.x86_64 systemd-libs-219-78.el7_9.3.x86_64 xz-libs-5.2.2-1.el7.x86_64 zlib-1.2.7-19.el7_9.x86_64
(gdb) bt
#0  0x00007f496269b4fb in raise () from /lib64/libpthread.so.0
#1  0x00007f495fe7e1bf in skgesigOSCrash () from /usr/lib/oracle/19.11/client64/lib/libclntsh.so
#2  0x00007f4960572e3d in kpeDbgSignalHandler () from /usr/lib/oracle/19.11/client64/lib/libclntsh.so
#3  0x00007f495fe7e492 in skgesig_sigactionHandler () from /usr/lib/oracle/19.11/client64/lib/libclntsh.so
#4  <signal handler called>
#5  0x00007f495ed62849 in kpucopyuserhndl () from /usr/lib/oracle/19.11/client64/lib/libclntsh.so
#6  0x00007f495ee02142 in kpuspextend () from /usr/lib/oracle/19.11/client64/lib/libclntsh.so
#7  0x00007f495ee045ac in kpuspgetfreesession () from /usr/lib/oracle/19.11/client64/lib/libclntsh.so
#8  0x00007f495ee00f6f in kpuspgetpooledsession () from /usr/lib/oracle/19.11/client64/lib/libclntsh.so
#9  0x00007f495edfd04d in kpuspsessionget () from /usr/lib/oracle/19.11/client64/lib/libclntsh.so
#10 0x00007f495eb3dd01 in OCISessionGet () from /usr/lib/oracle/19.11/client64/lib/libclntsh.so
#11 0x00007f496246c3a4 in dpiOci__sessionGet (envHandle=0x1685d00, handle=handle@entry=0x7f48c400d680, authInfo=authInfo@entry=0x7f4865fd7618,
    connectString=connectString@entry=0x16baa68 "OCI:SP:zpuyRFdHlCPgU6hJZQr8DC", connectStringLength=connectStringLength@entry=29, tag=<optimized out>,
    tagLength=0, outTag=outTag@entry=0x7f493f7e58c8, outTagLength=outTagLength@entry=0x7f493f7e58d0, found=found@entry=0x7f493f7e58d4, mode=mode@entry=1,
    error=error@entry=0x7f493f7e5850) at src/dpiOci.c:3136
#12 0x00007f4962450f1c in dpiConn__getSession (error=0x7f493f7e5850, authInfo=0x7f4865fd7618, params=0x7f493f7e5870, connectStringLength=29,
    connectString=0x16baa68 "OCI:SP:zpuyRFdHlCPgU6hJZQr8DC", mode=1, conn=0x7f48c400d660) at src/dpiConn.c:861
#13 dpiConn__get (error=0x7f493f7e5850, pool=0x1685b60, createParams=0x7f493f7e5870, connectStringLength=29,
    connectString=0x16baa68 "OCI:SP:zpuyRFdHlCPgU6hJZQr8DC", passwordLength=<optimized out>, password=<optimized out>, userNameLength=<optimized out>,
    userName=<optimized out>, conn=0x7f48c400d660) at src/dpiConn.c:636
#14 dpiConn__create (conn=0x7f48c400d660, context=<optimized out>, userName=userName@entry=0x0, userNameLength=userNameLength@entry=0,
    password=password@entry=0x0, passwordLength=passwordLength@entry=0, connectString=0x16baa68 "OCI:SP:zpuyRFdHlCPgU6hJZQr8DC", connectStringLength=29,
    pool=pool@entry=0x1685b60, commonParams=commonParams@entry=0x0, createParams=createParams@entry=0x7f493f7e5870, error=error@entry=0x7f493f7e5850)
    at src/dpiConn.c:414
#15 0x00007f496245ad71 in dpiPool__acquireConnection (pool=pool@entry=0x1685b60, userName=userName@entry=0x0, userNameLength=userNameLength@entry=0,
    password=password@entry=0x0, passwordLength=0, params=params@entry=0x7f493f7e5870, conn=conn@entry=0x7f493f7e5958, error=error@entry=0x7f493f7e5850)
    at src/dpiPool.c:36
#16 0x00007f496245af1d in dpiPool_acquireConnection (pool=pool@entry=0x1685b60, userName=userName@entry=0x0, userNameLength=userNameLength@entry=0,
    password=password@entry=0x0, passwordLength=passwordLength@entry=0, params=0x7f493f7e5870, params@entry=0x0, conn=conn@entry=0x7f493f7e5958)
    at src/dpiPool.c:341
#17 0x0000000000406a5e in dpi__getConnFromPool (pool=0x1685b60) at db_util.c:197

sqlnet.log

SQLNET.RECV_TIMEOUT=200ms
SQLNET.OUTBOUND_CONNECT_TIMEOUT=300ms
DISABLE_OOB=on
ADR_BASE=/data01/oracle_diag

We had exactly this issue before adding DIAG* params in sqlnet.ora file. Here is a stackoverflow thread which suggested DIAG*'s addition.

tgulacsi commented 2 years ago

Maybe

poolCreateParams.sessionIncrement = 1; poolCreateParams.timeout = 100;

Would be more real-world (what libclntsh writers awaited for)?

quidstone commented 2 years ago

from doc here

dpiPoolCreateParams.timeout

Specifies the length of time (in seconds) after which idle sessions in the pool are terminated. Note that termination only occurs when the pool is accessed. The default value is 0 which means that no idle sessions are terminated. This value can be set after the pool has been created using the function dpiPool_setTimeout() and acquired using the function dpiPool_getTimeout().

dpiPoolCreateParams.sessionIncrement

Specifies the number of sessions that will be created by the session pool when more sessions are required and the number of sessions is less than the maximum allowed. This value is ignored if the dpiPoolCreateParams.homogeneous member has a value of 0. This value added to the dpiPoolCreateParams.minSessions member value must not exceed the dpiPoolCreateParams.maxSessions member value. The default value is 0.

I need a constant number of connections available at all time. so I added,

poolCreateParams.maxSessions = maxSessions;
poolCreateParams.minSessions = maxSessions;

Both have the same value assigned. I don't want to terminate idle connections ,so there will be no cost of creating one when needed. It can fail if there is none available within the pool timeout time. So I decided to assign session increment to zero. As there is no termination, max and min has same value assigned to the maxSessions param.

quidstone commented 2 years ago

@cjbj, Should we consider updating instant client to 19.12? Where should we go from here? Changing DB version is not an option right now.

I am getting the feeling that these tools work better with relatively new versions of oracle DB. But to be honest, this timeout feature is too basic to fail this way. So, I am not sure why it is not working as it should. Is there any other info needed for you guys to understand it better?

tgulacsi commented 2 years ago

I cannot unsee the similarity with https://github.com/godror/godror/issues/192#issuecomment-947808421 - we se SIGSEGVs when the pool is full (timeouts set), and acquiring new connections concurrently and fast enough.

Limiting the number of sessions on the call side (Go) evades this.

cjbj commented 2 years ago

The Go issue could be unrelated, so let's treat the issues separately for now.

I just finished a discussion with the OCI developers. They want to know what's going but, but as expected, a testcase would help diagnose the problem.

We could look at trace. Overall it would be best if you could create an Oracle support request for the problem so "sensitive" files like traces can be shared; I don't know whether you can do this since you have 11.2 DB? It would be channelled through the OCI support team.

cjbj commented 2 years ago

@quidstone if you're unable to createa support request (I wonder what your support status of 11.2 is?), I am happy to have a call with you to gather information. Send me email (see my profile).

quidstone commented 2 years ago

hi @cjbj , I have already created a support request through our db-admin at out company. I admit that things are moving slow. This is more of our problem than anyone else's. We have to provide oracle support team traces from both servers(DB, client) when there is an issue. This is taking more time than it should. Unless I share those traces, I can't really tell what's next. We can talk about it a bit though. I will send you an email. thanks.