godror / godror

GO DRiver for ORacle DB

Panic in godror during database failover #47

Closed: veqryn closed this issue 4 years ago

veqryn commented 4 years ago

Describe the bug: We run a Multi-AZ AWS RDS with Oracle. This means it has one replica/clone that can be failed over to at any time. We tested a failover today and immediately received tons of panics from the godror library, bringing down our servers. If the driver is having trouble connecting during the half-second that the failover takes, I would normally expect regular errors or something else that can be dealt with gracefully.

To Reproduce: That might be tough...

Expected behavior: Regular errors, while the driver tries to reconnect in the background.

Screenshots

panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x158 pc=0xbb6109]

goroutine 99 [running]:
github.com/godror/godror.(*statement).QueryContext(0xc0010b8f00, 0x1148b00, 0xc00335cb40, 0x183d370, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0)
    /go/src/github.com/ReturnPath/ecm/vendor/github.com/godror/godror/stmt.go:514 +0x179
database/sql.ctxDriverStmtQuery(0x1148b00, 0xc00335cb40, 0x1148d00, 0xc0010b8f00, 0x183d370, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
    /usr/local/go/src/database/sql/ctxutil.go:82 +0x210
database/sql.rowsiFromStatement(0x1148b00, 0xc00335cb40, 0x1143080, 0xc0015a84e0, 0xc000c8e280, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
    /usr/local/go/src/database/sql/sql.go:2604 +0x16a
database/sql.(*Stmt).QueryContext(0xc000134360, 0x1148b00, 0xc00335cb40, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0)
    /usr/local/go/src/database/sql/sql.go:2556 +0x210
github.com/jmoiron/sqlx.(*qStmt).QueryxContext(0xc000c50dd8, 0x1148b00, 0xc00335cb40, 0x0, 0x0, 0x0, 0x0, 0x0, 0x1810440, 0x1148b00, ...)
    /go/src/github.com/ReturnPath/ecm/vendor/github.com/jmoiron/sqlx/sqlx_context.go:332 +0x73
github.com/jmoiron/sqlx.(*Stmt).QueryxContext(0xc0005faa00, 0x1148b00, 0xc00335cb40, 0x0, 0x0, 0x0, 0x0, 0x45cb4e, 0xc015851da0)
    /go/src/github.com/ReturnPath/ecm/vendor/github.com/jmoiron/sqlx/sqlx_context.go:324 +0x88

Your oracle client version e.g. 12.1.0.2.0

Your godror version e.g. v0.13.4

Your go version e.g. 1.14.1

Your gcc version e.g. 7.3.0

Machine (please complete the following information):

Additional context: Our code surrounding the panic is pretty simple. We have a prepared statement that looks roughly like this: select ID, OTHER_ID from TABLE_OWNER.ID_TO_ID_T

It gets prepared on startup, then the location of the panic was here:

sqlxConn.selectAllIdAndOtherIDsStmt.QueryxContext(ctx)
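
For illustration, the surrounding code is roughly the sketch below (the connection string, variable names, and error handling are invented; only the query and the QueryxContext call mirror the real code):

package main

import (
    "context"
    "log"

    _ "github.com/godror/godror"
    "github.com/jmoiron/sqlx"
)

func main() {
    // Placeholder DSN; the real one points at the Multi-AZ RDS endpoint.
    db, err := sqlx.Connect("godror", "user/pass@host:1521/service")
    if err != nil {
        log.Fatal(err)
    }
    defer db.Close()

    // Prepared once on startup.
    stmt, err := db.PreparexContext(context.Background(),
        "select ID, OTHER_ID from TABLE_OWNER.ID_TO_ID_T")
    if err != nil {
        log.Fatal(err)
    }

    // This is the call that panicked during the failover.
    rows, err := stmt.QueryxContext(context.Background())
    if err != nil {
        log.Fatal(err)
    }
    defer rows.Close()
}
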
tgulacsi commented 4 years ago

You can try v0.13.6; I've added a check to see whether the underlying connection has vanished, but the problem will reappear somewhere else - for example in #48.

tgulacsi commented 4 years ago

BTW what is a failover? The other end (TCP) just vanishes? Or becomes a black hole? Or what?

veqryn commented 4 years ago

I probably actually prefer a panic to what happened in the other ticket. At least with the panic, when my pods crashed, they got restarted, and then didn't encounter any new issues after their restart. The other service just failed for hours and hours straight until someone noticed the logs.

veqryn commented 4 years ago

A failover is when we switch which of the two Oracle instances is the main and which is the replica/clone. This can happen if say one availability zone went down, or the instance itself went down. The endpoint/dns name doesn't change, but I guess the connection gets broken and then has to be re-established. https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/Concepts.MultiAZ.html

cjbj commented 4 years ago

What are your values of enableEvents and standaloneConnection, and your setting for SetMaxIdleConns()?

tgulacsi commented 4 years ago

A failover is when we switch which of the two Oracle instances is the main and which is the replica/clone. This can happen if say one availability zone went down, or the instance itself went down. The endpoint/dns name doesn't change, but I guess the connection gets broken and then has to be re-established. https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/Concepts.MultiAZ.html

That is practically just a mishmash, with nothing about the technical realization of the switch.

I've tried to replicate such a test (see TestConnCut on the master branch) by creating a proxy and then closing it, but that case is handled. Maybe the error is in my testing method: closing the proxy closes the TCP connections cleanly, not the way AWS does it...

Any ideas about reproducing the failover?

veqryn commented 4 years ago

SetMaxIdleConns was not used, so it is whatever the default is.

Same for enableEvents and standaloneConnection: I'm not setting them anywhere, so they are whatever the defaults are.

veqryn commented 4 years ago

We could probably just do the failover again and see what happens. But unless I know where to look to find the exact fixable problem, I think we'd just be back at square one.

tgulacsi commented 4 years ago

Without setting SetMaxIdleConns, any number of connections may be held in the *sql.DB (go) pool. So any number of spoiled connections can appear after the failover.

The problematic places:

Reproducing this with a self-contained test case would be best. If that's not possible, it has to be tested by those who can reproduce it...

So please try the failover again with the current master (or v0.15.0-pre1), maybe tcpdump / wireshark the connection (TestConnCut detects the connection host:port) to see what's happening, and tweak TestConnCut till it reproduces the failure, if possible.

cjbj commented 4 years ago

Our current thoughts are that for best HA you should be using the Oracle session pool, which can be made aware of database service events (via having enableEvents on and connecting to a DB service that is sending events such as FAN events). The session pool can also protect against various dead-session cases, even without FAN. To use the session pool you should set SetMaxIdleConns to 0 and set standaloneConnection to false.
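
A minimal sketch of that configuration, assuming a placeholder DSN (standaloneConnection already defaults to false, so the only explicit step on the Go side is SetMaxIdleConns(0)):

package main

import (
    "database/sql"
    "log"

    _ "github.com/godror/godror"
)

func main() {
    // Placeholder DSN; standaloneConnection defaults to false, so the
    // Oracle session pool (not database/sql) owns the physical connections.
    db, err := sql.Open("godror", "user/pass@host:1521/service")
    if err != nil {
        log.Fatal(err)
    }
    defer db.Close()

    // Don't let database/sql keep idle connections of its own;
    // leave pooling to the Oracle session pool.
    db.SetMaxIdleConns(0)
}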

This is generic info for Oracle DB. I don't know how multi-az AWS RDS is configured.

If you don't use the Oracle session pool then ideally godror would give an application error instead of a panic.

veqryn commented 4 years ago

@cjbj ok, so I know how to call SetMaxIdleConns on the godror connection, but I have no idea what enableEvents and standaloneConnection are or how to set them. Can you explain?

@tgulacsi I will talk to the team and see if we can test a failover again, maybe on one of our non-prod DBs. Can you give me any instructions for using tcpdump / wireshark? I haven't used those before.

thx!

tgulacsi commented 4 years ago

standaloneConnection is false (0) by default. enableEvents: P := ParseConnString("..."); P.EnableEvents = true; sql.Open("godror", P.StringWithPassword()), or directly assembling ConnectionParams.
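
Spelled out as a runnable sketch (the connection string is a placeholder):

package main

import (
    "database/sql"
    "log"

    "github.com/godror/godror"
)

func main() {
    // Parse the DSN into ConnectionParams, flip EnableEvents, and open with the result.
    P, err := godror.ParseConnString("user/pass@host:1521/service")
    if err != nil {
        log.Fatal(err)
    }
    P.EnableEvents = true

    db, err := sql.Open("godror", P.StringWithPassword())
    if err != nil {
        log.Fatal(err)
    }
    defer db.Close()
}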

tcpdump -w capture-$(date '+%Y%m%d_%H%M%S').pcap -i eth0 'port 1521' 

or something like that. Then

tshark -r capture-*.pcap

would spit out the communication.

veqryn commented 4 years ago

Ok, so we have a test Oracle database, but it isn't Multi-Availability-Zone. Only our prod DB is, and unfortunately we aren't planning to do another failover for quite some time. I did however contact AWS support and asked for additional details on what happens during a failover. Here is what they said, below. I can ask additional questions on your behalf, if you'd like.

  1. What is an AWS RDS failover doing behind the scenes?

If you have enabled Multi-AZ for the RDS instance, then in the event of a planned or unplanned outage of your DB instance, Amazon RDS automatically switches to a standby replica in another Availability Zone.

High Availability (Multi-AZ) for Amazon RDS - Failover Process for Amazon RDS - https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/Concepts.MultiAZ.html#Concepts.MultiAZ.Failover

The time it takes for the failover to complete depends on the database activity and other conditions at the time the primary DB instance became unavailable. Failover times are typically 60-120 seconds. However, large transactions or a lengthy recovery process can increase failover time. When the failover is complete, it can take additional time for the RDS console UI to reflect the new Availability Zone.

Amazon RDS handles failovers automatically so you can resume database operations as quickly as possible without administrative intervention. The primary DB instance switches over automatically to the standby replica if any of the following conditions occur:

  1. An Availability Zone outage

  2. The primary DB instance fails

  3. The DB instance's server type is changed

  4. The operating system of the DB instance is undergoing software patching

  5. A manual failover of the DB instance was initiated using Reboot with failover

  2. How is it managing existing connections?

The existing connections with the primary instance will be terminated. The failover mechanism automatically changes the DNS record of the DB instance to point to the standby DB instance. As a result of the failover, you will need to re-establish the connections from the application.

  3. What is it expected to do to prepared statements?

From the provided logs (https://github.com/godror/godror/issues/48), you have prepared statements like the one below,

i.e., select OTHER_ID from TABLE_OWNER.ID_TO_ID_T where ID = :1

As I mentioned above, during an RDS failover, database connections need to be re-established after the DB has started. When the DB is restarted or has crashed, Oracle performs crash recovery; during this, any uncommitted transactions are rolled back and committed transactions are rolled forward.

  4. How are drivers expected to handle this?

I'm afraid that I won't be able to comment on how individual drivers handle this. However, I referred to the RDS public documentation. I would recommend checking whether your Java client application is caching the Domain Name Service (DNS) data of your DB instances; if so, set a time-to-live (TTL) value of less than 30 seconds. Because the underlying IP address of a DB instance will change after a failover, caching the DNS data for an extended time can lead to connection failures if your application tries to connect to an IP address that's no longer in service.

Setting the JVM TTL for DNS Name Lookups: https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/Concepts.MultiAZ.html#Concepts.MultiAZ.Failover

  5. Any additional help you can give?

If checkpoints have not happened in a long time, consider tuning checkpointing with the help of the "fast_start_mttr_target" parameter; if possible, tune your workload to keep transactions short.

In order to tune fast_start_mttr_target, please seek help from your in-house DBA.

Also, the DB instance class used can play an important role in the overall performance of the failover operation, as a bigger instance class, with more CPU and I/O throughput, can result in a faster recovery process than a smaller DB instance class.

tgulacsi commented 4 years ago

Thanks.

So,

  1. the connection must be reestablished
  2. DNS must be reacquired and should return the new address.

The first can be obeyed, but the second is harder: it's in the hands of the underlying OCI libraries. But I think that works, too. The real problem is with that TTL: for half a minute, DNS will still return the old, non-responding IP. After that we get the new IP, but the DB is still starting, and the listener may or may not accept connections for another 30-60 seconds.

As I can't guarantee that all this is handled gracefully, or loudly enough (panic), the best would be some kind of trigger, event, or hook that is fired by the failover controller and restarts the application.

A workaround can be a program/goroutine pinging the database, and restarting the application when the IP has changed. This is crude, but maybe the simplest and most reliable solution.
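
For what it's worth, such a watchdog could look roughly like the sketch below; the interval, the failure threshold, and the choice to exit on repeated ping failures instead of watching the resolved IP are all arbitrary assumptions.

package main

import (
    "context"
    "database/sql"
    "log"
    "time"

    _ "github.com/godror/godror"
)

// watchDB pings the database periodically and exits the process after too
// many consecutive failures, so the orchestrator (e.g. Kubernetes) restarts it.
func watchDB(db *sql.DB) {
    const maxFailures = 5
    failures := 0
    for range time.Tick(10 * time.Second) {
        ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
        err := db.PingContext(ctx)
        cancel()
        if err != nil {
            failures++
            log.Printf("db ping failed (%d/%d): %v", failures, maxFailures, err)
            if failures >= maxFailures {
                log.Fatal("database unreachable; exiting so the process gets restarted")
            }
            continue
        }
        failures = 0
    }
}

func main() {
    db, err := sql.Open("godror", "user/pass@host:1521/service") // placeholder DSN
    if err != nil {
        log.Fatal(err)
    }
    defer db.Close()
    go watchDB(db)
    select {} // the real application would do its work here
}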

But of course if you find other panic or misbehaviour in godror, I'll be happy to fix it!

veqryn commented 4 years ago

So, we may be closing this prematurely.

We run in Kubernetes, so we have both liveness and readiness checks running, about every 10 seconds each, against each pod (application). They hit an endpoint in our app. If that endpoint fails twice, the app stops receiving traffic, and if it fails 4 or 5 times, the pod gets (gracefully) killed and restarted. On my app's side, that endpoint checks that our app is in a healthy state by making sure it still has access to its dependencies, including by calling Ping on our Oracle DB connection.

Can we have Ping return an error if we get into a bad state? That would be better than panicking, and also better than what happened in #48, where all prepared statements simply stopped working. If not Ping, I need another way to gracefully shut down the app.
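
For context, our liveness endpoint is roughly of this shape (a sketch; the handler name, path, and timeout are made up):

package health

import (
    "context"
    "database/sql"
    "net/http"
    "time"
)

// Handler reports the pod unhealthy whenever PingContext fails, so Kubernetes
// eventually stops routing traffic to it and restarts it.
func Handler(db *sql.DB) http.HandlerFunc {
    return func(w http.ResponseWriter, r *http.Request) {
        ctx, cancel := context.WithTimeout(r.Context(), 5*time.Second)
        defer cancel()
        if err := db.PingContext(ctx); err != nil {
            http.Error(w, "db unreachable: "+err.Error(), http.StatusServiceUnavailable)
            return
        }
        w.WriteHeader(http.StatusOK)
    }
}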

tgulacsi commented 4 years ago

Ping should work.

See TestConnCut, add Ping, and test it.