sijms / go-ora

Pure go oracle client
MIT License
796 stars 177 forks source link

Strange problem occurs in production #496

Closed freeNestor closed 6 months ago

freeNestor commented 9 months ago

We have a program running in production that has a query method like below. We set 30s read timeout and have multiple goroutines excuting this function for loop. srconn is connection pool shared among these goroutines, we set 15 connections as max limit actually.

func (ora *OraSource) QueryDataSlice(_ int, sqlstr string, srconn *sqlx.DB, _ *TableObject, start, end string, parallel any) ([]any, error) {
    var getst, qst, st time.Time
    var data []any
    ctxTimeout, cancel := context.WithTimeout(context.Background(), time.Second*time.Duration(prodcfg.ReadTimout))
    defer cancel()
    defer func() {
        Zaplogger.Debug("end rows scan start and append...", zap.String("sql", sqlstr), zap.Any("bind", []string{start, end}),
            zap.Int("parallel", parallel.(int)), zap.Int("count", len(data)),
            zap.Duration("getconn-duration", qst.Sub(getst)),
            zap.Duration("query-duration", st.Sub(qst)),
            zap.Duration("total-duration", time.Now().Local().Sub(getst)))
    }()
    getst = time.Now().Local()
    getconn, err := srconn.Connx(ctxTimeout)
    if err != nil {
        return nil, fmt.Errorf("query db data, getconn error: %v - sql: %s", err, sqlstr)
    }
    defer getconn.Close()
    qst = time.Now().Local()
    res, err := getconn.QueryxContext(ctxTimeout, sqlstr, start, end)
    if err != nil {
        return nil, fmt.Errorf("query db data error: %v - sql: %s", err, sqlstr)
    }
    defer res.Close()

    data = make([]any, 0, 500)
    st = time.Now().Local()
    for res.Next() {
        tmpd := make(map[string]any, 0)
        err = res.MapScan(tmpd)
        if err != nil {
            return nil, fmt.Errorf("data scaned error: %v - sql: %s", err, sqlstr)
        }
        data = append(data, tmpd)
    }
    err = res.Err()
    if err != nil {
        return nil, fmt.Errorf("iteration rows data error: %v - sql: %s", err, sqlstr)
    }
    return data, nil
}

Recently we have encountered some goroutines excuted timeout occasionally, logs below which is expected, but there's unexpected issue that one goroutine among these goroutines was blocking when executing query method above, never timeout, stopped logging. We forced restarting this program finally. So I wanna ask your help if there's some place handle exceptions in driver when using context with timeout.

[2024-01-11T06:04:51.741][ERROR][data/flow.go:182]{"info": "flowid:7,src-table:tab1", "driver": "oracle", "error": "query db current time error: ORA-01013: user requested cancel of current operation\n - sql: SELECT TO_CHAR(CURRENT_TIMESTAMP(6),'YYYY-MM-DD HH24:MI:SS.FF6') FROM dual"}
[2024-01-11T06:04:51.762][ERROR][data/flow.go:1817]{"database": "src", "task": "maincom", "info": "flowid:2,src-table:tab6", "timeslice": "2024-01-11 06:03:40.000000_2024-01-11 06:03:41.000000", "error": "iteration rows data error: ORA-01013: user requested cancel of current operation\n - sql: SELECT /*+ PARALLEL(1) */ ID,UPDATE_TIME,ORDERNO FROM tab6 WHERE UPDATE_TIME >= TO_TIMESTAMP(:1,'YYYY-MM-DD HH24:MI:SS.FF6') and UPDATE_TIME < TO_TIMESTAMP(:2,'YYYY-MM-DD HH24:MI:SS.FF6')"}
[2024-01-11T06:04:52.004][ERROR][data/flow.go:277]{"info": "flowid:3,src-table:tab2", "driver": "oracle", "error": "query db data error: ORA-01013: user requested cancel of current operation\n - sql: SELECT /*+ PARALLEL(1) */ ID,APPLICATION_ID,ORDER_TIME,BIZORDERNO,MFEE,FEECYCLE,MERCHANT_ID,FEE_RECORD_TIME,CREATE_TIME,UPDATE_TIME FROM tab1 WHERE UPDATE_TIME >= TO_TIMESTAMP(:1,'YYYY-MM-DD HH24:MI:SS.FF6') and UPDATE_TIME < TO_TIMESTAMP(:2,'YYYY-MM-DD HH24:MI:SS.FF6')"}
[2024-01-11T06:04:52.005][ERROR][data/flow.go:277]{"info": "flowid:72,src-table:tab3", "driver": "oracle", "error": "iteration rows data error: ORA-01013: user requested cancel of current operation\n - sql: SELECT /*+ PARALLEL(1) */ ID,DOMAINID,USERID,SK_AMOUNT,YS_AMOUNT,TK_AMOUNT,UNFREEZE_AMOUNT,BIZID,YS_TK_AMOUNT,YS_UNFREEZE_AMOUNT,CREATE_TIME,PAYEE_NAME,PAYER_NAME,PAYER_USER_ID,BIZ_PAYER_ID,UPDATE_TIME,APPLICATION_ID,BIZ_ORDERNO FROM tab2 WHERE UPDATE_TIME >= TO_TIMESTAMP(:1,'YYYY-MM-DD HH24:MI:SS.FF6') and UPDATE_TIME < TO_TIMESTAMP(:2,'YYYY-MM-DD HH24:MI:SS.FF6')"}
[2024-01-11T06:04:52.007][ERROR][data/flow.go:277]{"info": "flowid:73,src-table:tab4", "driver": "oracle", "error": "query db data error: ORA-01013: user requested cancel of current operation\n - sql: SELECT /*+ PARALLEL(1) */ ID,DOMAINID,AGENTORDER_ID,AGENTORDER_LABEL,BIZUSERID,USERID,SK_AMOUNT,YS_AMOUNT,TK_AMOUNT,UNFREEZE_AMOUNT,BIZID,YS_TK_AMOUNT,YS_UNFREEZE_AMOUNT,CREATE_TIME,PAYEE_NAME,PAYER_NAME,PAYER_USER_ID,BIZ_PAYER_ID,UPDATE_TIME,APPLICATION_ID,BIZ_ORDERNO FROM tab4 WHERE UPDATE_TIME >= TO_TIMESTAMP(:1,'YYYY-MM-DD HH24:MI:SS.FF6') and UPDATE_TIME < TO_TIMESTAMP(:2,'YYYY-MM-DD HH24:MI:SS.FF6')"}
[2024-01-11T06:04:52.008][ERROR][data/flow.go:277]{"info": "flowid:70,src-table:tab5", "driver": "oracle", "error": "iteration rows data error: TTC error: received code 6 during response reading - sql: SELECT /*+ PARALLEL(1) */ ID,DOMAINID,AGENTORDER_ID,AGENTORDER_LABEL,BIZUSERID,USERID,SK_AMOUNT,YS_AMOUNT,TK_AMOUNT,UNFREEZE_AMOUNT,BIZID,YS_TK_AMOUNT,YS_UNFREEZE_AMOUNT,CREATE_TIME,PAYEE_NAME,PAYER_NAME,PAYER_USER_ID,BIZ_PAYER_ID,UPDATE_TIME,APPLICATION_ID,BIZ_ORDERNO FROM tab5 WHERE UPDATE_TIME >= TO_TIMESTAMP(:1,'YYYY-MM-DD HH24:MI:SS.FF6') and UPDATE_TIME < TO_TIMESTAMP(:2,'YYYY-MM-DD HH24:MI:SS.FF6')"}
[2024-01-11T06:04:52.009][ERROR][data/flow.go:277]{"info": "flowid:77,src-table:tab6", "driver": "oracle", "error": "iteration rows data error: ORA-01013: user requested cancel of current operation\n - sql: SELECT /*+ PARALLEL(1) */ ID,DOMAINID,AGENTORDER_ID,AGENTORDER_LABEL,BIZUSERID,USERID,SK_AMOUNT,YS_AMOUNT,TK_AMOUNT,UNFREEZE_AMOUNT,BIZID,YS_TK_AMOUNT,YS_UNFREEZE_AMOUNT,CREATE_TIME,PAYEE_NAME,PAYER_NAME,PAYER_USER_ID,BIZ_PAYER_ID,UPDATE_TIME,APPLICATION_ID,BIZ_ORDERNO FROM tab7 WHERE UPDATE_TIME >= TO_TIMESTAMP(:1,'YYYY-MM-DD HH24:MI:SS.FF6') and UPDATE_TIME < TO_TIMESTAMP(:2,'YYYY-MM-DD HH24:MI:SS.FF6')"}
[2024-01-11T06:04:52.049][ERROR][data/flow.go:182]{"info": "flowid:79,src-table:tab7", "driver": "oracle", "error": "query db current time error: ORA-01013: user requested cancel of current operation\n - sql: SELECT TO_CHAR(CURRENT_TIMESTAMP(6),'YYYY-MM-DD HH24:MI:SS.FF6') FROM dual"}
sijms commented 9 months ago

did you try to increase timeout more than 30 seconds? after 30 seconds driver will send interrupt message to the server and cancel the current operation

freeNestor commented 9 months ago

did you try to increase timeout more than 30 seconds? after 30 seconds driver will send interrupt message to the server and cancel the current operation

According to the logs I provided above, some goroutines did time out and cancel the operation. My point is that although timeout is set, one of them is blocked without timeout. I'm confused.

sijms commented 9 months ago

for the driver there are 2 timeouts you can set 1- timeout of individual operation using context 2- global timeout. set using url option timeout default=120 seconds

the timeout with context will first try to cancel the current operation (which need network read/write) the global timeout will stop blocking operation when server refuse to respond

freeNestor commented 9 months ago

for the driver there are 2 timeouts you can set 1- timeout of individual operation using context 2- global timeout. set using url option timeout default=120 seconds

the timeout with context will first try to cancel the current operation (which need network read/write) the global timeout will stop blocking operation when server refuse to respond

Both of 1 and 2, we have set. Maybe we need print more debug logs for tracing. Thanks for answers, I need to wait until next time the problem arises and catch pprof profiles.

sijms commented 8 months ago

I find a bug may be related to this problem sometime server will send data packet with data flag = 0x40 which means final packet so I should check for that and close the connection and send driver.ErrBadConn and this is what I will do in next release

I also add new url option trace dir so debugging connection in parallel is easy. with this option the driver will create a trace file for each connection

freeNestor commented 8 months ago

Good news

jttodd commented 7 months ago

We noticed this in sijms/go-ora/v2 v2.7.17 with Oracle Exadata ... reducing the number of active bind variables to < 3,500 greatly reduces the occurrences