getConnection() method hang after few hours

Hash17677 commented 7 months ago

I have created the express server application which connect to oracle database in order to perform DB operations. That application works fine in dev environment and then I published it in ubuntu VM using pm2. Initially it works fine but after few hours getConnection('hr') method hang the request.

Inside application, First I create the oracle db connection pool at the startup of the express app. Here is the code for connection pool.

const initialize = async () => {
    const pool = await oracledb.createPool({
        user: 'username',
        password: 'password',
        connectString: 'x.x.x.x:1521/sampledev',
        poolMin: 5,
        poolMax: 10,
        poolIncrement: 1,
        poolAlias: "hr",
    });

    console.log("Connected to the oracle db pool");
}

Here is the endpoint that use oracle connection.

app.get('/dbstatus', async (req, res) => {
    try {
        oracleConnApps = await oracledb.getConnection('hr');
        console.log("DB Connected")

        // do stuff

        res.status(200).send({ status: true, data: "OK" });
    } catch (err) {
        console.log("CATCH EXECUTE")
        console.log({ status: true, data: err.message })
        res.status(400).send({ status: false, data: err.message });
    } finally {
        console.log("FINALLY EXECUTE")
        if (oracleConnApps) await oracleConnApps.close();
    }
})

I used oracledb: 6.2.0 version and thin mode to connect DB.

Initially it works fine. After few hours getConnection method hang the server and it does not trigger the catch block. If I restart node app using pm2 then it works. But same error happen after few hours.

sharadraju commented 7 months ago

Thank you for using node-oracledb.

Can you try enabling Thick mode and see if the problem is reproduced.

Hash17677 commented 7 months ago

I try the Thick mode as well. But it does not fix the error. @sharadraju

anthony-tuininga commented 7 months ago

A "hang" like you are reporting suggests that connections are not being returned to the pool. You can examine the pool statistics to see whether that is indeed the case. documentation.

cjbj commented 7 months ago

@Hash17677 as I mentioned in your original stackoverflow post, as well as checking pool stats, also see Always Use Connection Pools — and How.

Hash17677 commented 7 months ago

I monitor the statistics of the pool and notice connectionsOpen decrease the amount gradually. As I mentioned above I set poolMin to 5. But I notice connectionsOpen decrease the value under the poolmin. But It never increase.

@cjbj @anthony-tuininga

cjbj commented 7 months ago

@Hash17677 Is there any sign of queuing?

Hash17677 commented 7 months ago

No @cjbj I have not notice any queuing in the stats.

cjbj commented 7 months ago

@Hash17677 If you would like to share some stats and/or a reproducible testcase then I might be able to help you. Maybe it isn't connection exhaustion but is some DB lock in your app - you'll need to do some debugging.

If the code snippet you shared is actually what you are using, check your scope for oracleConnApps to make sure the close() is being called.

boriborm commented 6 months ago

I have same problem. Call ping inside getConnection hangs 11-16 minutes and catches ETIMEDOUT inside net stream.

stream error:
Error: read ETIMEDOUT
    at TCP.onStreamRead (node:internal/stream_base_commons:217:20) {
  errno: -110,
  code: 'ETIMEDOUT',
  syscall: 'read'
}

NJS-500: connection to the Oracle Database was broken
NJS-501: connection to host 172.20.188.174 port 1521 terminated unexpectedly. (CONNECTION_ID=UpVDSqbkhcGXyrjvZjKU5A==)

read ETIMEDOUT
    at NTTCP.checkErr (/snapshot/nodeoraserver/node_modules/oracledb/lib/thin/sqlnet/ntTcp.js:327:29)
    at NTTCP.receive (/snapshot/nodeoraserver/node_modules/oracledb/lib/thin/sqlnet/ntTcp.js:465:12)
    at processTicksAndRejections (node:internal/process/task_queues:96:5)
    at async NetworkSession._recvPacket (/snapshot/nodeoraserver/node_modules/oracledb/lib/thin/sqlnet/networkSession.js:378:22)
    at async NetworkSession.recvPacket (/snapshot/nodeoraserver/node_modules/oracledb/lib/thin/sqlnet/networkSession.js:443:12)
    at async ReadPacket.waitForPackets (/snapshot/nodeoraserver/node_modules/oracledb/lib/thin/protocol/packet.js:293:20)
    at async Protocol._decodeMessage (/snapshot/nodeoraserver/node_modules/oracledb/lib/thin/protocol/protocol.js:70:5)
    at async Protocol._processMessage (/snapshot/nodeoraserver/node_modules/oracledb/lib/thin/protocol/protocol.js:153:9)
    at async ThinConnectionImpl.ping (/snapshot/nodeoraserver/node_modules/oracledb/lib/thin/connection.js:977:5)
    at async ThinPoolImpl.acquire (/snapshot/nodeoraserver/node_modules/oracledb/lib/thin/pool.js:505:11)

TCP/IP connections may be broken by router/switch (restart). Why ping not working periodically (pingInterval) for free connections and not clean failure connections? In log i see errors only after call getConnection. Nothing for periodically pings? Why not working keepAlive for tcp connection?

Statistics before call getConnection:

Driver:
...thin mode: true
Pool statistics:
...gathered at: 2023-12-21T07:04:55.978Z
...up time (milliseconds): 73507015
...up time from last reset (milliseconds) 73506988
...connection requests: 10
...requests enqueued: 0
...requests dequeued: 0
...requests failed: 0
...requests exceeding queueMax: 0
...requests exceeding queueTimeout: 0
...current queue length: 0
...maximum queue length: 0
...sum of time in queue (milliseconds): 0
...minimum time in queue (milliseconds): 0
...maximum time in queue (milliseconds): 0
...average time in queue (milliseconds): 0
...pool connections in use: 0
...pool connections open: 2
Pool attributes:
...connectString: BANK
...edition: 
...events: false
...externalAuth: false
...homogeneous: true
...poolAlias: WORKER-2
...poolIncrement: 1
...poolMax: 5
...poolMaxPerShard: undefined
...poolMin: 2
...poolPingInterval (seconds): 30
...poolTimeout (seconds): 60
...queueMax: 500
...queueTimeout (milliseconds): 60000
...sessionCallback: undefined
...sodaMetaDataCache: undefined
...stmtCacheSize: 30
...user: USER
Related environment variables:
...UV_THREADPOOL_SIZE: 10

In project i use pkg and compile it with node16-linux-x64 target (node.js v16).

ptavasci-provart commented 6 months ago

Same problem here, using version 6.0.0 and docker container running on custom VLAN. We are catching ORA-03113 on Sentry on previous requests. Once it hangs, the node app doesn't respond anymore. I'll be updating on this topic once we run some stress test, trying to bring down the network connection with different methods.

NJS-500: connection to the Oracle Database was broken DPI-1080: connection was closed by ORA-3113

boriborm commented 6 months ago

Maybe I found solution.

Article https://blog.cloudflare.com/when-tcp-sockets-refuse-to-die Part "Busy ESTAB socket is not forever" This is my situation. Keep alive working, but on Linux default value of TCP_KEEPALIVE_TIME = 7300 (>120 minutes). When socket broken and ping calls earlier then first keep alive after break, socket hangs ~16 minutes.

In article used parameter TCP_USER_TIMEOUT. This is analog of net.socket.setTimeout()?

What about set timeout of connection stream manually from connection options, or set timeout as TCP_USER_TIMEOUT = TCP_KEEPIDLE + TCP_KEEPINTVL TCP_KEEPCNT where [TCP_KEEPINTVL TCP_KEEPCNT = 1 * 10](https://nodejs.org/api/net.html#socketsetkeepaliveenable-initialdelay)

P.S. I set connection option expireTime to 1 minute and then in log i see droping of broken connections before call getConnection. But if ping calls within period before keep alive (1 minute), ping packet sends 16 minutes and then timeout (after send packet to socket, keep alive not working). Needs use setTimeout

sreguna commented 6 months ago

@boriborm setTimeout may timeout idle but valid connections. I think the correct solution would be a combination of expireTime and callTimeout (https://node-oracledb.readthedocs.io/en/latest/api_manual/connection.html#connection.callTimeout)

expireTime would timeout idle connections in the pool before it is acquired. If the connection is acquired before the keep alive idle timeout, then callTimeout would kick in and timeout the connection.

boriborm commented 6 months ago

@sreguna callTimeout is not variant. It sets timeout to send breaking packet. Sending breaking packet is the same as sending ping packet. Hangs ~16 minutes on TCP/IP level.

 async _processMessage(message) {
    let callTimer;
    let callTimeoutExpired = false;
    try {
      if (this.callTimeout > 0) {
        callTimer = setTimeout(() => {
          callTimeoutExpired = true;
          this.breakMessage();          <-----
        }, this.callTimeout);
      }
      await this._encodeMessage(message);
...

  breakMessage() {
    this._breakInProgress = true;
    this.nsi.sendBreak();         <----
  }

sendBreak() {
    if (this.isBreak)
      return; /* Already in a break */

    if (!this.connected) {
      this.isBreak = true; /* Not yet connected. Post the break */
      this.breakPosted = true;
      return;
    }

    this.isBreak = true;
    this.markerPkt.prepare(constants.NSPMKTD1, constants.NIQIMARK);
    this._sendPacket(this.markerPkt.buf);                 <----
  }

I think simple solution is use timeout only for ping. If ping timed out, break streaming of ping packet, force closing of stream and invalidate connection.

jd-apprentice commented 6 months ago

@sreguna callTimeout is not variant. It sets timeout to send breaking packet. Sending breaking packet is the same as sending ping packet. Hangs ~16 minutes on TCP/IP level.

 async _processMessage(message) {
    let callTimer;
    let callTimeoutExpired = false;
    try {
      if (this.callTimeout > 0) {
        callTimer = setTimeout(() => {
          callTimeoutExpired = true;
          this.breakMessage();          <-----
        }, this.callTimeout);
      }
      await this._encodeMessage(message);
...

  breakMessage() {
    this._breakInProgress = true;
    this.nsi.sendBreak();         <----
  }

sendBreak() {
    if (this.isBreak)
      return; /* Already in a break */

    if (!this.connected) {
      this.isBreak = true; /* Not yet connected. Post the break */
      this.breakPosted = true;
      return;
    }

    this.isBreak = true;
    this.markerPkt.prepare(constants.NSPMKTD1, constants.NIQIMARK);
    this._sendPacket(this.markerPkt.buf);                 <----
  }

I think simple solution is use timeout only for ping. If ping timed out, break streaming of ping packet, force closing of stream and invalidate connection.

The hang is consistent? because I've been having this issue and when it happens it maybe 16 minutes or hangs forever

sudarshan12s commented 6 months ago

I think after setting expireTime (i.e net socket setKeepAlive ) , indefinite hang is not seen .

sudarshan12s commented 6 months ago

I have same problem. Call ping inside getConnection hangs 11-16 minutes and catches ETIMEDOUT inside net stream.

stream error:
Error: read ETIMEDOUT
    at TCP.onStreamRead (node:internal/stream_base_commons:217:20) {
  errno: -110,
  code: 'ETIMEDOUT',
  syscall: 'read'
}

NJS-500: connection to the Oracle Database was broken
NJS-501: connection to host 172.20.188.174 port 1521 terminated unexpectedly. (CONNECTION_ID=UpVDSqbkhcGXyrjvZjKU5A==)

read ETIMEDOUT
    at NTTCP.checkErr (/snapshot/nodeoraserver/node_modules/oracledb/lib/thin/sqlnet/ntTcp.js:327:29)
    at NTTCP.receive (/snapshot/nodeoraserver/node_modules/oracledb/lib/thin/sqlnet/ntTcp.js:465:12)
    at processTicksAndRejections (node:internal/process/task_queues:96:5)
    at async NetworkSession._recvPacket (/snapshot/nodeoraserver/node_modules/oracledb/lib/thin/sqlnet/networkSession.js:378:22)
    at async NetworkSession.recvPacket (/snapshot/nodeoraserver/node_modules/oracledb/lib/thin/sqlnet/networkSession.js:443:12)
    at async ReadPacket.waitForPackets (/snapshot/nodeoraserver/node_modules/oracledb/lib/thin/protocol/packet.js:293:20)
    at async Protocol._decodeMessage (/snapshot/nodeoraserver/node_modules/oracledb/lib/thin/protocol/protocol.js:70:5)
    at async Protocol._processMessage (/snapshot/nodeoraserver/node_modules/oracledb/lib/thin/protocol/protocol.js:153:9)
    at async ThinConnectionImpl.ping (/snapshot/nodeoraserver/node_modules/oracledb/lib/thin/connection.js:977:5)
    at async ThinPoolImpl.acquire (/snapshot/nodeoraserver/node_modules/oracledb/lib/thin/pool.js:505:11)

TCP/IP connections may be broken by router/switch (restart). Why ping not working periodically (pingInterval) for free connections and not clean failure connections? In log i see errors only after call getConnection. Nothing for periodically pings? Why not working keepAlive for tcp connection?

Statistics before call getConnection:

Driver:
...thin mode: true
Pool statistics:
...gathered at: 2023-12-21T07:04:55.978Z
...up time (milliseconds): 73507015
...up time from last reset (milliseconds) 73506988
...connection requests: 10
...requests enqueued: 0
...requests dequeued: 0
...requests failed: 0
...requests exceeding queueMax: 0
...requests exceeding queueTimeout: 0
...current queue length: 0
...maximum queue length: 0
...sum of time in queue (milliseconds): 0
...minimum time in queue (milliseconds): 0
...maximum time in queue (milliseconds): 0
...average time in queue (milliseconds): 0
...pool connections in use: 0
...pool connections open: 2
Pool attributes:
...connectString: BANK
...edition: 
...events: false
...externalAuth: false
...homogeneous: true
...poolAlias: WORKER-2
...poolIncrement: 1
...poolMax: 5
...poolMaxPerShard: undefined
...poolMin: 2
...poolPingInterval (seconds): 30
...poolTimeout (seconds): 60
...queueMax: 500
...queueTimeout (milliseconds): 60000
...sessionCallback: undefined
...sodaMetaDataCache: undefined
...stmtCacheSize: 30
...user: USER
Related environment variables:
...UV_THREADPOOL_SIZE: 10

In project i use pkg and compile it with node16-linux-x64 target (node.js v16).

just to clarify, currently periodic cleanup of free connections only removes the connections that are idle more thanpoolTimeout configured value. poolPingInterval is checked during getConnection as it involves round trip. expireTime should enable keepAlive.

sudarshan12s commented 5 months ago

@boriborm, Thanks for suggestion of pingTimeout in comment. I am checking the feasibility of the same.

sudarshan12s commented 5 months ago

Same problem here, using version 6.0.0 and docker container running on custom VLAN. We are catching ORA-03113 on Sentry on previous requests. Once it hangs, the node app doesn't respond anymore. I'll be updating on this topic once we run some stress test, trying to bring down the network connection with different methods.

NJS-500: connection to the Oracle Database was broken DPI-1080: connection was closed by ORA-3113

Thick mode has SQLNET.RECV_TIMEOUT to limit the time for executes . details here

sharadraju commented 5 months ago

We have added poolPingTimeout parameter in the patch here. This will be included in the node-oracledb 6.4 release.

sharadraju commented 3 months ago

This has fixed in node-oracledb 6.4.

oracle / node-oracledb

getConnection() method hang after few hours #1626