tigase / tigase-server

(M) Highly optimized, extremely modular and very flexible XMPP/Jabber server
https://tigase.net
GNU Affero General Public License v3.0
326 stars 106 forks source link

problem with HA on AWS aurora mysql for DNS cache #196

Closed davidemarrone closed 3 months ago

davidemarrone commented 1 year ago

Issue migrated to: https://tigase.dev/tigase/_server/server-core/~issues/1496


Problem with DNS cache I was testing the AWS Aurora mysql failover procedure with tigase, reading the documentation I have found: https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/Concepts.MultiAZSingleStandby.html scrolling to: "Setting the JVM TTL for DNS name lookups"

explains that is important to set a TTL for the java cache otherwise the failover does not work properly.

I was testing it on an instance of tigase and during my test I have found that tigase does not auto recover, after triggering a failover manually, in the logs there is:

[2023-04-21 16:39:01:728] [WARNING ] [         cluster-nodes ] ClConSQLRepository.removeItem()  : Problem removing elements from DB: 
java.sql.SQLException: The MySQL server is running with the --read-only option so it cannot execute this statement
    at com.mysql.jdbc.SQLError.createSQLException(SQLError.java:965)
    at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:3978)
    at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:3914)
    at com.mysql.jdbc.MysqlIO.sendCommand(MysqlIO.java:2530)
    at com.mysql.jdbc.MysqlIO.sqlQueryDirect(MysqlIO.java:2683)
    at com.mysql.jdbc.ConnectionImpl.execSQL(ConnectionImpl.java:2495)
    at com.mysql.jdbc.PreparedStatement.executeInternal(PreparedStatement.java:1903)
    at com.mysql.jdbc.PreparedStatement.executeUpdateInternal(PreparedStatement.java:2124)
    at com.mysql.jdbc.PreparedStatement.executeUpdateInternal(PreparedStatement.java:2058)
    at com.mysql.jdbc.PreparedStatement.executeLargeUpdate(PreparedStatement.java:5158)
    at com.mysql.jdbc.PreparedStatement.executeUpdate(PreparedStatement.java:2043)
    at jdk.internal.reflect.GeneratedMethodAccessor59.invoke(Unknown Source)
    at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.base/java.lang.reflect.Method.invoke(Method.java:566)
    at tigase.db.jdbc.PreparedStatementInvocationHandler.invoke(PreparedStatementInvocationHandler.java:38)
    at com.sun.proxy.$Proxy35.executeUpdate(Unknown Source)
    at tigase.cluster.repo.ClConSQLRepository.removeItem(ClConSQLRepository.java:137)
    at tigase.cluster.repo.ClConConfigRepository.itemLoaded(ClConConfigRepository.java:178)
    at tigase.cluster.repo.ClConSQLRepository.reload(ClConSQLRepository.java:221)
    at tigase.db.comp.ConfigRepository$1.run(ConfigRepository.java:81)
    at java.base/java.util.TimerThread.mainLoop(Timer.java:556)
    at java.base/java.util.TimerThread.run(Timer.java:506)

this means that is still using the old endpoint of the DNS and it not switching to the new endpoint, I have left over 10 minutes always printing the same message.

Restarting the server resolves the problem but I need an automatic recovery.

I have change on the system in /etc/java-11-openjdk/security/java.security the value for networkaddress.cache.ttl as suggested in the AWS doc but I have the same results. There is any other way to set this parameter? Do you know why tigase do not consider the system configuration?

System info:

woj-tek commented 1 year ago

We already set this property: https://github.com/tigase/tigase-utils/blob/f681e263b95013886a983c9b0706c2c391cfd2ba/src/main/java/tigase/util/dns/DNSResolverDefault.java#L170-L170

Your instance should reconnect correctly after 2 minutes thuen.

davidemarrone commented 1 year ago

thank you for the feedback, unfortunately from my tests it seems that it does not work. I have done several tries and also after 10/20 minutes tigase does not reconnect.

In the comment I see: "we are caching ourselves" what does this exactly mean? There is some other cache level somewhere else?

Anyway could be a good idea to move the value of networkaddress.cache.ttl in some config params since is not possible to override if is present in the code.

woj-tek commented 1 year ago

In the comment I see: "we are caching ourselves" what does this exactly mean? There is some other cache level somewhere else?

There is a cache for IP and SRV records (https://github.com/tigase/tigase-utils/blob/88bc1e649e281e3eeece1abc7acc32647a93a398/src/main/java/tigase/util/dns/DNSResolverDefault.java#L46-L46)

Anyway could be a good idea to move the value of networkaddress.cache.ttl in some config params since is not possible to override if is present in the code.

I've made a small change that will virtually respect change from $JAVA_HOME/jre/lib/security/java.security (we will set it ourselves if it's not set prior/in system)

davidemarrone commented 1 year ago

In the comment I see: "we are caching ourselves" what does this exactly mean? There is some other cache level somewhere else?

There is a cache for IP and SRV records (https://github.com/tigase/tigase-utils/blob/88bc1e649e281e3eeece1abc7acc32647a93a398/src/main/java/tigase/util/dns/DNSResolverDefault.java#L46-L46)

ok but this should not interact / cache the dataSource uri, right?

Anyway could be a good idea to move the value of networkaddress.cache.ttl in some config params since is not possible to override if is present in the code.

I've made a small change that will virtually respect change from $JAVA_HOME/jre/lib/security/java.security (we will set it ourselves if it's not set prior/in system)

ok thanks, have you ever done a failover test with AWS aurora mysql with success?

Tomorrow I will try a newer version to see if the 8.2 or 8.3 have the same problem

davidemarrone commented 1 year ago

I have done a test upgrading to 8.2.2 but I have the same problem, I have also tried to add useCallableMysqlWorkaround = 'true' just to test if this could be the problem but nothing change, after 120 secs tigase still tries to connect to the old instance of mysql logging the exception The MySQL server is running with the --read-only option so it cannot execute this statement

woj-tek commented 1 year ago

ok but this should not interact / cache the dataSource uri, right?

Correct, internal cache of DNS resolver is only used in XMPP layer. DNS resolution for data sources is done via JDBC driver.

ok thanks, have you ever done a failover test with AWS aurora mysql with success?

Yes, but it was a couple of years back (when we changed the JVM DNS caching setting) and it was working correctly.

Regarding your current issue - it seems that for some reason, during failover, Tigase gets disconnected from the database but it manages to reconnect to the older instance that was made slave/read-only instance.

Which RDS endpoint address do you use (the cluster one?). Does it point to write replica after performing failover? Which type of failover do you perform?

davidemarrone commented 1 year ago

I'm triggering the failover from the console, please note that I'm using an AWS aurora cluster with mysql compatible with 5.7 (and not AWS RDS multi az that has a different failover strategy). In this scenario I just select the writer instance of the cluster and I click "failover", the cluster has just a writer and one replica.

Yes I'm using the cluster endpoint and not the instance one. If I restart tigase all works fine. So it means that after restart without changing the conf it just resolve to new address.

I do not think that during failover Tigase gets disconnected, I guess there is something cached and tigase and it is still use for some reason the old connection with host resolved before the failover, in the test config i use 'pool-size' = '12'

Is it possible that it is trying to use another connection from the pool already bounded to old mysql instance but not used yet?

If I remember correctly tigase create all the connections of the db pool at startup so I guess that during the failover all the connections just remain attached to the old instances since aurora switch the DNS and I think doesn't close active connections, maybe the only solution is to manage explicitly that exception in tigase and re-create all connections of the db pool.

The manual failover is useful to change the instance type with limited down and it is also used by AWS during the upgrade of mysql version

woj-tek commented 1 year ago

Indeed, if there is no disconnection then the existing database connection pool will remain with all the connections.

If you open mysql connection from the shell and do the failover does the connection stay open? Can you then operate on it or (as it seems) you are still connected to the same instance but it's made RO (just like in Tigase case)?

davidemarrone commented 1 year ago

I have made the test during the failover, the mysql client do not trigger any error but then if I make a query this is the result:

mysql> select * from tig_users order by uid desc limit 0,2 \G
ERROR 2013 (HY000): Lost connection to MySQL server during query
No connection. Trying to reconnect...
Connection id:    11
Current database: tigasedb

I have added to the test netstat -tpn | grep 3306 there is nothing on the machine beside the mysql client when is connected before the failover:

tcp        0      0 172.31.30.220:1332      172.31.11.137:3306      ESTABLISHED 2280/mysql

after the failover

tcp        1      0 172.31.30.220:1332      172.31.11.137:3306      CLOSE_WAIT  2280/mysql    

so it seems that the failover actually close the TCP connection, so I have then run another test with this script:

date
mysql --ssl-mode=DISABLED -h clusterhostname  -u tigase -ppassword -D tigasedb -e "select 1"  2>&1 |grep -v "Using a password on the command line interface"

dig +short clusterhostname

this is the output

Wed 26 Apr 2023 03:27:07 PM UTC
1
1
cluster-instance1.eu-west-1.rds.amazonaws.com.
172.31.11.137
Wed 26 Apr 2023 03:27:08 PM UTC
1
1
cluster-instance1.eu-west-1.rds.amazonaws.com.
172.31.11.137
Wed 26 Apr 2023 03:27:09 PM UTC
1
1
cluster-instance1.eu-west-1.rds.amazonaws.com.
172.31.11.137
Wed 26 Apr 2023 03:27:10 PM UTC
ERROR 2003 (HY000): Can't connect to MySQL server on 'clusterendpoint.eu-west-1.rds.amazonaws.com:3306' (111)
cluster-instance1.eu-west-1.rds.amazonaws.com.
172.31.11.137
Wed 26 Apr 2023 03:27:11 PM UTC
ERROR 2003 (HY000): Can't connect to MySQL server on 'clusterendpoint.eu-west-1.rds.amazonaws.com:3306' (111)
cluster-instance1.eu-west-1.rds.amazonaws.com.
172.31.11.137
Wed 26 Apr 2023 03:27:12 PM UTC
ERROR 2003 (HY000): Can't connect to MySQL server on 'clusterendpoint.eu-west-1.rds.amazonaws.com:3306' (111)
cluster-instance1.eu-west-1.rds.amazonaws.com.
172.31.11.137
Wed 26 Apr 2023 03:27:13 PM UTC
ERROR 2003 (HY000): Can't connect to MySQL server on 'clusterendpoint.eu-west-1.rds.amazonaws.com:3306' (111)
cluster-instance1.eu-west-1.rds.amazonaws.com.
172.31.11.137
Wed 26 Apr 2023 03:27:14 PM UTC
ERROR 2003 (HY000): Can't connect to MySQL server on 'clusterendpoint.eu-west-1.rds.amazonaws.com:3306' (111)
cluster-instance1.eu-west-1.rds.amazonaws.com.
172.31.11.137
Wed 26 Apr 2023 03:27:15 PM UTC
ERROR 2003 (HY000): Can't connect to MySQL server on 'clusterendpoint.eu-west-1.rds.amazonaws.com:3306' (111)
cluster-instance1.eu-west-1.rds.amazonaws.com.
172.31.11.137
Wed 26 Apr 2023 03:27:16 PM UTC
ERROR 2003 (HY000): Can't connect to MySQL server on 'clusterendpoint.eu-west-1.rds.amazonaws.com:3306' (111)
cluster-instance1.eu-west-1.rds.amazonaws.com.
172.31.11.137
Wed 26 Apr 2023 03:27:17 PM UTC
ERROR 2003 (HY000): Can't connect to MySQL server on 'clusterendpoint.eu-west-1.rds.amazonaws.com:3306' (111)
cluster-instance1.eu-west-1.rds.amazonaws.com.
172.31.11.137

# -> this is really strage it seems that for some seconds it resolve with the secondary and allow connection:
# -> ----- start
Wed 26 Apr 2023 03:27:18 PM UTC
1
1
cluster-instance1.eu-west-1.rds.amazonaws.com.
172.31.11.137
Wed 26 Apr 2023 03:27:19 PM UTC
1
1
cluster-instance1.eu-west-1.rds.amazonaws.com.
172.31.11.137
Wed 26 Apr 2023 03:27:20 PM UTC
1
1
cluster-instance1.eu-west-1.rds.amazonaws.com.
172.31.11.137
Wed 26 Apr 2023 03:27:21 PM UTC
1
1
cluster-instance1.eu-west-1.rds.amazonaws.com.
172.31.11.137
Wed 26 Apr 2023 03:27:22 PM UTC
1
1
cluster-instance1.eu-west-1.rds.amazonaws.com.
172.31.11.137
Wed 26 Apr 2023 03:27:23 PM UTC
1
1
# -> ----- end

cluster-instance2.eu-west-1.rds.amazonaws.com.
172.31.24.7
Wed 26 Apr 2023 03:27:24 PM UTC
1
1
cluster-instance2.eu-west-1.rds.amazonaws.com.
172.31.24.7
Wed 26 Apr 2023 03:27:25 PM UTC
1
1
cluster-instance2.eu-west-1.rds.amazonaws.com.
172.31.24.7
Wed 26 Apr 2023 03:27:26 PM UTC
1
1

as in the comment it seems that short after the failover the DNS still point to the old primary writer allowing connection but in read only, it last for some seconds, so I have done this test to confirm it:

date
mysql --ssl-mode=DISABLED -h clusterhostname  -u tigaseUser -ppassword -D tigasedb -e "select @@hostname;  update tig_offline_messages set expired = now() where msg_id = 1; "  2>&1 |grep -v "Using a password on the command line interface"
dig +short clusterhostname
echo 

and as expected this is the problem

Wed 26 Apr 2023 03:40:54 PM UTC
@@hostname
ip-10-4-2-74
cluster-instance2.eu-west-1.rds.amazonaws.com.
172.31.24.7

Wed 26 Apr 2023 03:40:55 PM UTC
@@hostname
ip-10-4-2-74
cluster-instance2.eu-west-1.rds.amazonaws.com.
172.31.24.7

Wed 26 Apr 2023 03:40:56 PM UTC
@@hostname
ip-10-4-2-74
ERROR 2013 (HY000) at line 1: Lost connection to MySQL server during query
cluster-instance2.eu-west-1.rds.amazonaws.com.
172.31.24.7

Wed 26 Apr 2023 03:40:57 PM UTC
ERROR 2003 (HY000): Can't connect to MySQL server on 'clusterendpoint.eu-west-1.rds.amazonaws.com:3306' (111)
cluster-instance2.eu-west-1.rds.amazonaws.com.
172.31.24.7

Wed 26 Apr 2023 03:40:58 PM UTC
ERROR 2003 (HY000): Can't connect to MySQL server on 'clusterendpoint.eu-west-1.rds.amazonaws.com:3306' (111)
cluster-instance2.eu-west-1.rds.amazonaws.com.
172.31.24.7

Wed 26 Apr 2023 03:40:59 PM UTC
ERROR 2003 (HY000): Can't connect to MySQL server on 'clusterendpoint.eu-west-1.rds.amazonaws.com:3306' (111)
cluster-instance2.eu-west-1.rds.amazonaws.com.
172.31.24.7

Wed 26 Apr 2023 03:41:00 PM UTC
ERROR 2003 (HY000): Can't connect to MySQL server on 'clusterendpoint.eu-west-1.rds.amazonaws.com:3306' (111)
cluster-instance2.eu-west-1.rds.amazonaws.com.
172.31.24.7

Wed 26 Apr 2023 03:41:01 PM UTC
ERROR 2003 (HY000): Can't connect to MySQL server on 'clusterendpoint.eu-west-1.rds.amazonaws.com:3306' (111)
cluster-instance2.eu-west-1.rds.amazonaws.com.
172.31.24.7

Wed 26 Apr 2023 03:41:02 PM UTC
ERROR 2003 (HY000): Can't connect to MySQL server on 'clusterendpoint.eu-west-1.rds.amazonaws.com:3306' (111)
cluster-instance2.eu-west-1.rds.amazonaws.com.
172.31.24.7

Wed 26 Apr 2023 03:41:03 PM UTC
ERROR 2003 (HY000): Can't connect to MySQL server on 'clusterendpoint.eu-west-1.rds.amazonaws.com:3306' (111)
cluster-instance2.eu-west-1.rds.amazonaws.com.
172.31.24.7

Wed 26 Apr 2023 03:41:04 PM UTC
ERROR 2003 (HY000): Can't connect to MySQL server on 'clusterendpoint.eu-west-1.rds.amazonaws.com:3306' (111)
cluster-instance2.eu-west-1.rds.amazonaws.com.
172.31.24.7

Wed 26 Apr 2023 03:41:05 PM UTC
ERROR 2003 (HY000): Can't connect to MySQL server on 'clusterendpoint.eu-west-1.rds.amazonaws.com:3306' (111)
cluster-instance2.eu-west-1.rds.amazonaws.com.
172.31.24.7

# after the failover still resolving with old ip and accepting connection but istance is in read only mode
Wed 26 Apr 2023 03:41:06 PM UTC
@@hostname
ip-10-4-2-74
ERROR 1290 (HY000) at line 1: The MySQL server is running with the --read-only option so it cannot execute this statement
cluster-instance2.eu-west-1.rds.amazonaws.com.
172.31.24.7

Wed 26 Apr 2023 03:41:07 PM UTC
@@hostname
ip-10-4-2-74
ERROR 1290 (HY000) at line 1: The MySQL server is running with the --read-only option so it cannot execute this statement
cluster-instance2.eu-west-1.rds.amazonaws.com.
172.31.24.7

Wed 26 Apr 2023 03:41:08 PM UTC
@@hostname
ip-10-4-2-74
ERROR 1290 (HY000) at line 1: The MySQL server is running with the --read-only option so it cannot execute this statement
cluster-instance2.eu-west-1.rds.amazonaws.com.
172.31.24.7

Wed 26 Apr 2023 03:41:09 PM UTC
@@hostname
ip-10-4-2-74
ERROR 1290 (HY000) at line 1: The MySQL server is running with the --read-only option so it cannot execute this statement
cluster-instance2.eu-west-1.rds.amazonaws.com.
172.31.24.7

Wed 26 Apr 2023 03:41:10 PM UTC
@@hostname
ip-10-4-2-74
ERROR 1290 (HY000) at line 1: The MySQL server is running with the --read-only option so it cannot execute this statement
cluster-instance2.eu-west-1.rds.amazonaws.com.
172.31.24.7

# after some secs goes to the correct ip
Wed 26 Apr 2023 03:41:11 PM UTC
@@hostname
ip-10-4-1-206
cluster-instance1.eu-west-1.rds.amazonaws.com.
172.31.11.137

Wed 26 Apr 2023 03:41:12 PM UTC
@@hostname
ip-10-4-1-206
cluster-instance1.eu-west-1.rds.amazonaws.com.
172.31.11.137

Wed 26 Apr 2023 03:41:13 PM UTC
@@hostname
ip-10-4-1-206
cluster-instance1.eu-west-1.rds.amazonaws.com.
172.31.11.137

with an update query I can trigger the "The MySQL server is running with the --read-only", after 5 secs it come back to the correct host, so this is what happening, tigase connects immediately to old writer instance in read only mode and stay stuck there forever, I guess you must manage explicitly that exception

davidemarrone commented 1 year ago

I have found the same problem described here: https://proxysql.com/blog/failover-comparison-in-aurora-mysql-2-10-0-using-proxysql-vs-auroras-cluster-endpoint/

woj-tek commented 1 year ago

So, they ran into the same issue but reading your previous comment:

as in the comment it seems that short after the failover the DNS still point to the old primary writer allowing connection but in read only, it last for some seconds, so I have done this test to confirm it:

I wonder if this isn't an issue with the RDS. I mean - if they do the failover they should coordinate the disconnection/DNS update/making instance available again. Because currently they currently do disconnect, make (now reader) instance available under cluster enpoint, update dns. They should switch the last two steps.

Have you considered rising the issue with the AWS tech support?

davidemarrone commented 1 year ago

Yes it seems a bug on AWS, I will try to open a case also if I know it will take a long time and is not guarantee that they will fix it. Do you think you are going to manage it and make a workaround on your side?

woj-tek commented 1 year ago

I filed a ticket for it (ref: server-1354) but it's not on our immediate agenda and doesn't have ETA.

davidemarrone commented 1 year ago

thank you, I will update here if I have some answers from AWS

davidemarrone commented 1 year ago

anyway I think should be also a quite important topic on your side. If I send a message to tigase to a user that is offline, and the DB is in that state the messages get lost without returning any error to the sender, it would better if tigase just shut down completely

woj-tek commented 1 year ago

Hmm, most database exceptions (and this is just a SQLException in the end) should yield an error returned. Could you share ST/error log from the case resulting in message loss?

davidemarrone commented 1 year ago

This is the exception:

[2023-04-28 17:00:19:922] [WARNING ] [              in_4-amp ] JDBCMsgRepository.storeMessage() : Problem adding new entry to DB: 
java.sql.SQLException: The MySQL server is running with the --read-only option so it cannot execute this statement
    at com.mysql.jdbc.SQLError.createSQLException(SQLError.java:965)
    at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:3978)
    at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:3914)
    at com.mysql.jdbc.MysqlIO.sendCommand(MysqlIO.java:2530)
    at com.mysql.jdbc.MysqlIO.sqlQueryDirect(MysqlIO.java:2683)
    at com.mysql.jdbc.ConnectionImpl.execSQL(ConnectionImpl.java:2495)
    at com.mysql.jdbc.PreparedStatement.executeInternal(PreparedStatement.java:1903)
    at com.mysql.jdbc.PreparedStatement.executeQuery(PreparedStatement.java:2011)
    at com.mysql.jdbc.CallableStatement.executeQuery(CallableStatement.java:872)
    at jdk.internal.reflect.GeneratedMethodAccessor52.invoke(Unknown Source)
    at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.base/java.lang.reflect.Method.invoke(Method.java:566)
    at tigase.db.jdbc.PreparedStatementInvocationHandler.invoke(PreparedStatementInvocationHandler.java:38)
    at com.sun.proxy.$Proxy35.executeQuery(Unknown Source)
    at tigase.server.amp.db.JDBCMsgRepository.storeMessage(JDBCMsgRepository.java:414)
    at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.base/java.lang.reflect.Method.invoke(Method.java:566)
    at tigase.stats.StatisticsInvocationHandler.invoke(StatisticsInvocationHandler.java:75)
    at com.sun.proxy.$Proxy38.storeMessage(Unknown Source)
    at tigase.server.amp.db.MsgRepository$MsgRepositoryMDBean.storeMessage(MsgRepository.java:326)
    at tigase.server.amp.action.Store.execute(Store.java:116)
    at tigase.server.amp.AmpComponent.processPacket(AmpComponent.java:198)
    at tigase.cluster.AmpComponentClustered.processPacket(AmpComponentClustered.java:72)
    at tigase.server.AbstractMessageReceiver$QueueListener.run(AbstractMessageReceiver.java:1405)

I do not receive any error on the XMPP submit, I guess that tigase is async and accept the message that needs to be delivered and sends an ack to the sender then it found that the recipient is offline and tries to store the message to the MsgRepository. If the recipient is online the delivery works also if mysql is not working but if is online this exception is triggered without any errors for the sender

woj-tek commented 1 year ago

Tigase in general operates in async manner (it's based on message passing, so in a way it's event based).

In this case there should be an error returned, we will add it in the future (ref: server-1355)

davidemarrone commented 1 year ago

I'm having a big argue with AWS support about this bug, basically they says:

I have conducted some testing in my lab and have observed the same outcome as yourself. I also noticed a 5-second delay in updating the DNS, which I would like to clarify is not a bug, but rather it is by design. Aurora DNS zones have a short time-to-live (TTL) of 5 seconds, and it is crucial that your network and client configurations do not further increase this. This is how Aurora is designed, and the delay you are experiencing is indeed expected.

linking me this article: https://repost.aws/knowledge-center/aurora-mysql-db-cluser-read-only-error

anyway now I have another doubt, why at some point tigase do not start to resolve the correct DNS entry? Also if after the down it connects to the wrong instance and starts to issue the errors describe at the beginning:

[2023-04-21 16:39:01:728] [WARNING ] [         cluster-nodes ] ClConSQLRepository.removeItem()  : Problem removing elements from DB: 
java.sql.SQLException: The MySQL server is running with the --read-only option so it cannot execute this statement

after the networkaddress.cache.ttl it should start to resolving the correct DNS entry for database and stop logging errors but I left it also for 10 minutes and nothing change, still logging the error.

Is it possible that there are other cache further than networkaddress.cache.ttl for the DNS resolution of the database address in tigase?

davidemarrone commented 1 year ago

I have also found this article: https://aws.amazon.com/blogs/database/improve-application-availability-on-amazon-aurora/

that exactly explains this problem, is written by AWS developers, as described in the article:

"the client kept trying to connect to the old writer, including a couple of seconds after that writer has already been demoted to a reader. "

is a issue related to AWS aurora and the solution is to decrease the DNS TTL or use RDS proxy or maybe use AWS mysql drivers.

Anyway as written in the previous comment I do not understand why tigase after the networkaddress.cache.ttl do not starts to resolve the correct DNS address

woj-tek commented 1 year ago

after the networkaddress.cache.ttl it should start to resolving the correct DNS entry for database and stop logging errors but I left it also for 10 minutes and nothing change, still logging the error.

Is it possible that there are other cache further than networkaddress.cache.ttl for the DNS resolution of the database address in tigase?

Unfortunately not. This exception is not connection exception but merely execution exception so Tigase successfully connected (TCP connection) to the instance (albeit wrong one) and persist the connection. So unless the connection is terminated one way or another (a restart in this case) it will keep those connections in the pool.

So, adding handling of this situation and recreating connection pool would resolve it.

But I'd argue, that it's still a bug on AWS, and they should update the failover procedure so the moment it happens, the endpoint already points to the correct instance.

EDIT: sorry, missed your other reply.

is a issue related to AWS aurora and the solution is to decrease the DNS TTL or use RDS proxy or maybe use AWS mysql drivers.

Anyway as written in the previous comment I do not understand why tigase after the networkaddress.cache.ttl do not starts to resolve the correct DNS address

Tigase does not start to (re-)connect to the old DNS address - when the failover happens and connection is terminated Tigase reconnects to the database at once (thus, within 5s window), which results in old DNS entry. After that there is no DNS resolution as there is no reconnection attempt.

We do use official MySQL JDBC driver, though AWS seems to suggest using (forked) MariaDB JDBC driver, which supposedly has support for the use-case.

davidemarrone commented 1 year ago

This exception is not connection exception but merely execution exception so Tigase successfully connected (TCP connection) to the instance (albeit wrong one) and persist the connection

got it

But I'd argue, that it's still a bug on AWS, and they should update the failover procedure so the moment it happens, the endpoint already points to the correct instance.

I have wrote a comment on their blog https://aws.amazon.com/blogs/database/improve-application-availability-on-amazon-aurora/ upvote or maybe comment yourself if you like

davidemarrone commented 1 year ago

The AWS developers has given an official response in the comments of: https://aws.amazon.com/blogs/database/improve-application-availability-on-amazon-aurora/

TL;DR -> They are not going to make any changes so must be handled from tigase side

woj-tek commented 1 year ago

Thanks for the update. I'm not really thrilled by the AWS response, and not sure if it's exactly accurate... They could simply avoid making instance available again after failover before updating the DNS, they even highlight it in their example:

ERROR 2003 (HY000): Can't connect to MySQL server on 'aurora-test.cluster-xxx.us-west-2.rds.amazonaws.com:3306' (111)
ERROR 2003 (HY000): Can't connect to MySQL server on 'aurora-test.cluster-xxx.us-west-2.rds.amazonaws.com:3306' (111)
2022-12-09 18:57:12 reader
2022-12-09 18:57:13 reader
2022-12-09 18:57:14 writer

They could simply wait until DNS update and then bring both instances up...

Or they could even use their RDS Aurora proxy internally, that would solve the issue and would increase HA...

At any rate, we will probably add handling for this use-case, but there is no ETA.

davidemarrone commented 1 year ago

It is exactly what I commented on: wait until the DNS update and then bring both instances up. But I guess that by doing it this way, they have to introduce a longer downtime for those who use smart drivers. I think that is the reason why they have done it this way. For sure, they could use RDS Aurora proxy internally, but I guess they prefer that customers pay for it :).

I do not think they will change it in the future, as pointed out in the comments the only solutions to this problem are:

So without modifying tigase the only external solution is to use RDS proxy

woj-tek commented 1 year ago

As I said - we will probably add support for that, but in your case you can use suggested JDBC driver to mitigate the issue - either by swapping with one from tarball distribution package or modifying docker image (will require your own build though).

davidemarrone commented 1 year ago

Ok, thanks for the feedback, I will try to swap the JDBC driver

davidemarrone commented 1 year ago

I have tryed to swap the driver but I get:

[2023-05-30 07:27:47:000] [SEVERE  ] [ForkJoinPool-1-worker-23 ] DataRepositoryImpl.initialize(): null
java.lang.ClassNotFoundException: com.mysql.cj.jdbc.Driver
    at java.base/jdk.internal.loader.BuiltinClassLoader.loadClass(BuiltinClassLoader.java:581)
    at java.base/jdk.internal.loader.ClassLoaders$AppClassLoader.loadClass(ClassLoaders.java:178)
    at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:522)
    at java.base/java.lang.Class.forName0(Native Method)
    at java.base/java.lang.Class.forName(Class.java:398)
    at tigase.db.jdbc.DataRepositoryImpl.initialize(DataRepositoryImpl.java:286)
    at tigase.db.beans.DataSourceBean$DataSourceMDConfigBean.initRepository(DataSourceBean.java:322)
    at tigase.db.beans.DataSourceBean$DataSourceMDConfigBean.initRepository(DataSourceBean.java:248)
    at tigase.db.beans.MDPoolConfigBean.lambda$setInstances$0(MDPoolConfigBean.java:166)
    at java.base/java.util.concurrent.ForkJoinTask$AdaptedCallable.exec(ForkJoinTask.java:1448)
    at java.base/java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:290)
    at java.base/java.util.concurrent.ForkJoinPool$WorkQueue.topLevelExec(ForkJoinPool.java:1020)
    at java.base/java.util.concurrent.ForkJoinPool.scan(ForkJoinPool.java:1656)
    at java.base/java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:1594)
    at java.base/java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:183)

from AWS:

Use the driver name: software.aws.rds.jdbc.mysql.Driver. You'll need the driver name when loading the driver explicitly to the driver manager.

how can I override database.driverClassName from startup command line or config.tdsl?

woj-tek commented 1 year ago

Unfortunately, it turns out, we do have driver classes hardcoded, so the swapping won't work. I filed a ticket to remove it in the next version.

davidemarrone commented 1 year ago

Hi, I'm trying to fix this problem using AWS RDS proxy, it seems it works well, the only problem that I have is that the proxy force to close all the idle connection after 24 hours

RDS Proxy closes database connections some time after 24 hours when they are no longer in use. The proxy performs this action regardless of the value of the maximum idle connections setting.

and I get some error from in the tigase-console.log like com.mysql.cj.jdbc.exceptions.CommunicationsException: Communications link failure or com.mysql.cj.exceptions.StatementIsClosedException: No operations allowed after statement closed and some operations fails. In the configuration of tigase in the section of dataSource -> default can I use: 'conn-valid-query' = 'select 1' or is reserved only for authRepository? Do you have any idea how can I keep the mysql connection alive with ping or something similar? I saw in the driver the parameter autoReconnectForPools, reading on stackoverflow it seems what I need, but is not well documented from mysql ...

woj-tek commented 1 year ago

Query validation query is not limited to only auth repository but it's used in DataSource altogether, see: tigase.db.jdbc.DataRepositoryImpl#JDBC_CONNVALID_QUERY. There is a connection watchdog, which should be executed every hour (using select 1) on every connection to keep it up (see tigase.db.beans.DataSourceBean.DataSourceMDConfigBean#executeWatchdog)

I'm not familiar with autoReconnectForPools option and I'm not sure how it will work here.

davidemarrone commented 1 year ago

There is a connection watchdog, which should be executed every hour (using select 1) on every connection to keep it up (see tigase.db.beans.DataSourceBean.DataSourceMDConfigBean#executeWatchdog)

I have tried to add in the config in logging -> loggers

        'tigase.db.beans' {
            level = 'ALL'
        }

and in the datasource config

        'watchdog-frequency' = 'PT10M'

but in the logs grepping for watchdog I haven't find anything, how can I check if the watchdog is running or not?

davidemarrone commented 1 year ago

I have read the code of the mysql driver and unfortunately there is no way to manage directly the retry of the query from the driver, there is the autoReconnect parameter but in the native session the connection is re-established but the exception is thrown anyway.

I have also explored the functions of Failover and Load Balancing using the same host (of the proxy) multiple time but also this does not work because the driver re-connect correctly to the "other" host for the next query but anyway throw back the exception to the client using the method dealWithInvocationException. This method is used both for Failover and Load Balancing configuration

I hope there is a way on tigase to fix the watchdog to keep the connection alive.

woj-tek commented 1 year ago

and in the datasource config

Have you configured it in dataSource () {} or in particular data source, for example default () {}? (it will work only in the latter) and then you should see in the logs:

[2023-08-07 17:51:20:168] [FINEST  ] [       pool-4-thread-1 ] DataSourceBean$DataSourceMDConfigBean.executeWatchdog(): execution of watchdog for data source default
[2023-08-07 17:51:20:170] [FINEST  ] [       pool-4-thread-1 ] DataRepositoryImpl.checkConnectivity(): checking connection status tigase.db.jdbc.DataRepositoryImpl@51c15508, last query executed 1,691,423,480,170ms ago
davidemarrone commented 1 year ago

Ok thanks, I see now in the logs but it still doesn't work, can it be related to the pool of connections? the checkConnectivity test that just one connection is working or that all the connections in the pool are working? I do not understand what happens from the RDS proxy, at the moment the configuration of tigase for the DB is:

dataSource {
    default () {
        'pool-size' = '48'
        uri = 'jdbc:mysql://.......'

        'watchdog-frequency' = 'PT10M'
    }
}
userRepository {
    default () {
        'watchdog-frequency' = 'PT10M'
    }
}

authRepository {
    default () {      
      'update-login-time-query' = null
      'watchdog-frequency' = 'PT10M'
    }
}

From the doc

RDS Proxy considers a database connection in it's pool to be idle when there's been no activity on the connection for five minutes.

RDS Proxy closes database connections some time after 24 hours when they are no longer in use. The proxy performs this action regardless of the value of the maximum idle connections setting.

What I have see is that after 24 hours of using the proxy I get the errors in the tigase-console.log coming from the JDBC driver of

Caused by: com.mysql.cj.jdbc.exceptions.CommunicationsException: Communications link failure
Caused by: com.mysql.cj.exceptions.StatementIsClosedException: No operations allowed after statement closed.

I guess that the proxy close all the idle and for some reason tigase try to use one of that connection. But if the watchdog is running always maybe only the same or first connection of the pool it does not the problem, correct? What do you think? Is possible to run the watchdog on all the connections of the pool?

woj-tek commented 1 year ago

Ok thanks, I see now in the logs but it still doesn't work, can it be related to the pool of connections? the checkConnectivity test that just one connection is working or that all the connections in the pool are working?

It checks all the connections, see

https://github.com/tigase/tigase-server/blob/1c8a8843277c339a0d7a66847b2790d50996761a/src/main/java/tigase/db/DataRepositoryPool.java#L72-L74

Could you try setting watchdog to 4M (considering RDS Proxy 5 minutes mark)?

davidemarrone commented 1 year ago

Could you try setting watchdog to 4M (considering RDS Proxy 5 minutes mark)?

Good point! I will try and let you know.

Via JMX I use "tigase.stats:type=StatisticsProvider getAllStats 0" to get all stats, there is a way to know how many of the connections of the pools are actually used? I do not see it in the stats list.

woj-tek commented 1 year ago

Via JMX I use "tigase.stats:type=StatisticsProvider getAllStats 0" to get all stats, there is a way to know how many of the connections of the pools are actually used? I do not see it in the stats list.

What do you mean "used"? There is a pool of connections and its size is the only metric. There is no detailed statistics regarding usage of particular connections as the assumption is there should be used equally.

davidemarrone commented 1 year ago

What do you mean "used"?

With "used" I mean how many connections are running a query so I can know how many are "free" and how many are "used"

woj-tek commented 1 year ago

We don't do statistics about that. You can get this info from mysql and it's processlist.

davidemarrone commented 1 year ago

Could you try setting watchdog to 4M (considering RDS Proxy 5 minutes mark)?

unfortunately it didn't work, tigase still logs errors exactly after 24 hours for some queries failed with:

Caused by: com.mysql.cj.exceptions.StatementIsClosedException: No operations allowed after statement closed.
Caused by: com.mysql.cj.jdbc.exceptions.CommunicationsException: Communications link failure
woj-tek commented 1 year ago

Do you get watchdog entries in logs (i.e. DataRepositoryImpl.checkConnectivity(): checking connection status tigase.db.jdbc.DataRepositoryImpl…) for all connections from the pool? (there should be as many log entries as you have configured the pool)

davidemarrone commented 1 year ago

no, I do not see an entry for each connection of the pool

with dataSource -> default -> 'pool-size' = '5' there are 2 entries in the logs

[2023-08-30 13:03:58:006] [FINEST  ] [       pool-3-thread-1 ] DataSourceBean$DataSourceMDConfigBean.executeWatchdog(): execution of watchdog for data source default
[2023-08-30 13:07:58:006] [FINEST  ] [       pool-3-thread-1 ] DataSourceBean$DataSourceMDConfigBean.executeWatchdog(): execution of watchdog for data source default
[2023-08-30 13:07:58:006] [FINEST  ] [       pool-3-thread-1 ] DataRepositoryImpl.checkConnectivity(): checking connection status tigase.db.jdbc.DataRepositoryImpl@386f88b3, last query executed 466,008ms ago
[2023-08-30 13:07:58:008] [FINEST  ] [       pool-3-thread-1 ] DataRepositoryImpl.checkConnectivity(): checking connection status tigase.db.jdbc.DataRepositoryImpl@1885905b, last query executed 475,142ms ago
[2023-08-30 13:11:58:006] [FINEST  ] [       pool-3-thread-1 ] DataSourceBean$DataSourceMDConfigBean.executeWatchdog(): execution of watchdog for data source default
[2023-08-30 13:15:58:006] [FINEST  ] [       pool-3-thread-1 ] DataSourceBean$DataSourceMDConfigBean.executeWatchdog(): execution of watchdog for data source default
[2023-08-30 13:15:58:006] [FINEST  ] [       pool-3-thread-1 ] DataRepositoryImpl.checkConnectivity(): checking connection status tigase.db.jdbc.DataRepositoryImpl@386f88b3, last query executed 343,790ms ago
[2023-08-30 13:15:58:007] [FINEST  ] [       pool-3-thread-1 ] DataRepositoryImpl.checkConnectivity(): checking connection status tigase.db.jdbc.DataRepositoryImpl@1885905b, last query executed 343,788ms ago
[2023-08-30 13:19:58:006] [FINEST  ] [       pool-3-thread-1 ] DataSourceBean$DataSourceMDConfigBean.executeWatchdog(): execution of watchdog for data source default
[2023-08-30 13:23:58:006] [FINEST  ] [       pool-3-thread-1 ] DataSourceBean$DataSourceMDConfigBean.executeWatchdog(): execution of watchdog for data source default
[2023-08-30 13:23:58:006] [FINEST  ] [       pool-3-thread-1 ] DataRepositoryImpl.checkConnectivity(): checking connection status tigase.db.jdbc.DataRepositoryImpl@386f88b3, last query executed 480,000ms ago
[2023-08-30 13:23:58:007] [FINEST  ] [       pool-3-thread-1 ] DataRepositoryImpl.checkConnectivity(): checking connection status tigase.db.jdbc.DataRepositoryImpl@1885905b, last query executed 480,000ms ago
[2023-08-30 13:27:58:006] [FINEST  ] [       pool-3-thread-1 ] DataSourceBean$DataSourceMDConfigBean.executeWatchdog(): execution of watchdog for data source default
[2023-08-30 13:31:58:006] [FINEST  ] [       pool-3-thread-1 ] DataSourceBean$DataSourceMDConfigBean.executeWatchdog(): execution of watchdog for data source default
[2023-08-30 13:31:58:006] [FINEST  ] [       pool-3-thread-1 ] DataRepositoryImpl.checkConnectivity(): checking connection status tigase.db.jdbc.DataRepositoryImpl@386f88b3, last query executed 480,000ms ago
[2023-08-30 13:31:58:007] [FINEST  ] [       pool-3-thread-1 ] DataRepositoryImpl.checkConnectivity(): checking connection status tigase.db.jdbc.DataRepositoryImpl@1885905b, last query executed 480,000ms ago
[2023-08-30 13:35:58:006] [FINEST  ] [       pool-3-thread-1 ] DataSourceBean$DataSourceMDConfigBean.executeWatchdog(): execution of watchdog for data source default

with dataSource -> default -> 'pool-size' = '10' there are 7 entries in the logs

[2023-08-30 13:43:32:594] [FINEST  ] [       pool-3-thread-1 ] DataSourceBean$DataSourceMDConfigBean.executeWatchdog(): execution of watchdog for data source default
[2023-08-30 13:47:32:594] [FINEST  ] [       pool-3-thread-1 ] DataSourceBean$DataSourceMDConfigBean.executeWatchdog(): execution of watchdog for data source default
[2023-08-30 13:47:32:594] [FINEST  ] [       pool-3-thread-1 ] DataRepositoryImpl.checkConnectivity(): checking connection status tigase.db.jdbc.DataRepositoryImpl@52463255, last query executed 463,492ms ago
[2023-08-30 13:47:32:600] [FINEST  ] [       pool-3-thread-1 ] DataRepositoryImpl.checkConnectivity(): checking connection status tigase.db.jdbc.DataRepositoryImpl@53f94afe, last query executed 465,772ms ago
[2023-08-30 13:47:32:600] [FINEST  ] [       pool-3-thread-1 ] DataRepositoryImpl.checkConnectivity(): checking connection status tigase.db.jdbc.DataRepositoryImpl@51097500, last query executed 463,070ms ago
[2023-08-30 13:47:32:601] [FINEST  ] [       pool-3-thread-1 ] DataRepositoryImpl.checkConnectivity(): checking connection status tigase.db.jdbc.DataRepositoryImpl@46507dda, last query executed 466,959ms ago
[2023-08-30 13:47:32:602] [FINEST  ] [       pool-3-thread-1 ] DataRepositoryImpl.checkConnectivity(): checking connection status tigase.db.jdbc.DataRepositoryImpl@66f5770e, last query executed 463,119ms ago
[2023-08-30 13:47:32:602] [FINEST  ] [       pool-3-thread-1 ] DataRepositoryImpl.checkConnectivity(): checking connection status tigase.db.jdbc.DataRepositoryImpl@386f88b3, last query executed 463,060ms ago
[2023-08-30 13:47:32:603] [FINEST  ] [       pool-3-thread-1 ] DataRepositoryImpl.checkConnectivity(): checking connection status tigase.db.jdbc.DataRepositoryImpl@1885905b, last query executed 474,829ms ago
[2023-08-30 13:51:32:594] [FINEST  ] [       pool-3-thread-1 ] DataSourceBean$DataSourceMDConfigBean.executeWatchdog(): execution of watchdog for data source default
[2023-08-30 13:55:32:594] [FINEST  ] [       pool-3-thread-1 ] DataSourceBean$DataSourceMDConfigBean.executeWatchdog(): execution of watchdog for data source default
[2023-08-30 13:55:32:594] [FINEST  ] [       pool-3-thread-1 ] DataRepositoryImpl.checkConnectivity(): checking connection status tigase.db.jdbc.DataRepositoryImpl@52463255, last query executed 343,107ms ago
[2023-08-30 13:55:32:595] [FINEST  ] [       pool-3-thread-1 ] DataRepositoryImpl.checkConnectivity(): checking connection status tigase.db.jdbc.DataRepositoryImpl@53f94afe, last query executed 343,122ms ago
[2023-08-30 13:55:32:596] [FINEST  ] [       pool-3-thread-1 ] DataRepositoryImpl.checkConnectivity(): checking connection status tigase.db.jdbc.DataRepositoryImpl@51097500, last query executed 343,120ms ago
[2023-08-30 13:55:32:596] [FINEST  ] [       pool-3-thread-1 ] DataRepositoryImpl.checkConnectivity(): checking connection status tigase.db.jdbc.DataRepositoryImpl@46507dda, last query executed 343,119ms ago
[2023-08-30 13:55:32:596] [FINEST  ] [       pool-3-thread-1 ] DataRepositoryImpl.checkConnectivity(): checking connection status tigase.db.jdbc.DataRepositoryImpl@66f5770e, last query executed 343,118ms ago
[2023-08-30 13:55:32:597] [FINEST  ] [       pool-3-thread-1 ] DataRepositoryImpl.checkConnectivity(): checking connection status tigase.db.jdbc.DataRepositoryImpl@386f88b3, last query executed 343,117ms ago
[2023-08-30 13:55:32:597] [FINEST  ] [       pool-3-thread-1 ] DataRepositoryImpl.checkConnectivity(): checking connection status tigase.db.jdbc.DataRepositoryImpl@1885905b, last query executed 343,115ms ago
[2023-08-30 13:59:32:594] [FINEST  ] [       pool-3-thread-1 ] DataSourceBean$DataSourceMDConfigBean.executeWatchdog(): execution of watchdog for data source default
[2023-08-30 14:03:32:594] [FINEST  ] [       pool-3-thread-1 ] DataSourceBean$DataSourceMDConfigBean.executeWatchdog(): execution of watchdog for data source default
[2023-08-30 14:03:32:594] [FINEST  ] [       pool-3-thread-1 ] DataRepositoryImpl.checkConnectivity(): checking connection status tigase.db.jdbc.DataRepositoryImpl@52463255, last query executed 480,000ms ago
[2023-08-30 14:03:32:595] [FINEST  ] [       pool-3-thread-1 ] DataRepositoryImpl.checkConnectivity(): checking connection status tigase.db.jdbc.DataRepositoryImpl@53f94afe, last query executed 480,000ms ago
[2023-08-30 14:03:32:595] [FINEST  ] [       pool-3-thread-1 ] DataRepositoryImpl.checkConnectivity(): checking connection status tigase.db.jdbc.DataRepositoryImpl@51097500, last query executed 479,999ms ago
[2023-08-30 14:03:32:596] [FINEST  ] [       pool-3-thread-1 ] DataRepositoryImpl.checkConnectivity(): checking connection status tigase.db.jdbc.DataRepositoryImpl@46507dda, last query executed 480,000ms ago
[2023-08-30 14:03:32:596] [FINEST  ] [       pool-3-thread-1 ] DataRepositoryImpl.checkConnectivity(): checking connection status tigase.db.jdbc.DataRepositoryImpl@66f5770e, last query executed 480,000ms ago
[2023-08-30 14:03:32:596] [FINEST  ] [       pool-3-thread-1 ] DataRepositoryImpl.checkConnectivity(): checking connection status tigase.db.jdbc.DataRepositoryImpl@386f88b3, last query executed 479,999ms ago
[2023-08-30 14:03:32:597] [FINEST  ] [       pool-3-thread-1 ] DataRepositoryImpl.checkConnectivity(): checking connection status tigase.db.jdbc.DataRepositoryImpl@1885905b, last query executed 480,000ms ago
[2023-08-30 14:07:32:594] [FINEST  ] [       pool-3-thread-1 ] DataSourceBean$DataSourceMDConfigBean.executeWatchdog(): execution of watchdog for data source default
[2023-08-30 14:11:32:594] [FINEST  ] [       pool-3-thread-1 ] DataSourceBean$DataSourceMDConfigBean.executeWatchdog(): execution of watchdog for data source default
[2023-08-30 14:11:32:594] [FINEST  ] [       pool-3-thread-1 ] DataRepositoryImpl.checkConnectivity(): checking connection status tigase.db.jdbc.DataRepositoryImpl@52463255, last query executed 480,000ms ago
[2023-08-30 14:11:32:595] [FINEST  ] [       pool-3-thread-1 ] DataRepositoryImpl.checkConnectivity(): checking connection status tigase.db.jdbc.DataRepositoryImpl@53f94afe, last query executed 480,000ms ago
[2023-08-30 14:11:32:595] [FINEST  ] [       pool-3-thread-1 ] DataRepositoryImpl.checkConnectivity(): checking connection status tigase.db.jdbc.DataRepositoryImpl@51097500, last query executed 480,000ms ago
[2023-08-30 14:11:32:596] [FINEST  ] [       pool-3-thread-1 ] DataRepositoryImpl.checkConnectivity(): checking connection status tigase.db.jdbc.DataRepositoryImpl@46507dda, last query executed 480,000ms ago
[2023-08-30 14:11:32:596] [FINEST  ] [       pool-3-thread-1 ] DataRepositoryImpl.checkConnectivity(): checking connection status tigase.db.jdbc.DataRepositoryImpl@66f5770e, last query executed 480,000ms ago
[2023-08-30 14:11:32:596] [FINEST  ] [       pool-3-thread-1 ] DataRepositoryImpl.checkConnectivity(): checking connection status tigase.db.jdbc.DataRepositoryImpl@386f88b3, last query executed 480,000ms ago
[2023-08-30 14:11:32:597] [FINEST  ] [       pool-3-thread-1 ] DataRepositoryImpl.checkConnectivity(): checking connection status tigase.db.jdbc.DataRepositoryImpl@1885905b, last query executed 480,000ms ago

The config is:

dataSource {
    default () {
        'pool-size' = '10'
        'watchdog-frequency' = 'PT4M'
    }
}
userRepository {
    default () {
        'watchdog-frequency' = 'PT4M'
    }
}
authRepository {
    default () {
      'watchdog-frequency' = 'PT4M'
    }
}
davidemarrone commented 1 year ago

To try to understand the problem I have left an instance of the cluster running with 10 connections and without access from the clients logging all the traffic on 3306 with:

tcpdump -n -s0 port 3306 -w out.pcap

Then I have extracted the streams of each the 10 connections with tshark:

for i in $(seq 0 9); do echo "stream $i"; tshark -r out.pcap -Y 'tcp.stream eq '$i' && mysql.query != ""' -T fields  -e frame.time -e mysql.query  > streams/$i.txt ; done

in this ways I have the exact logs of what happens on each mysql connection extracted from the wire.

What I found is that on the idle connections of the pool, where there is no mysql traffic, the watchdog is working but it seems that it use the double of the values configured on watchdog-frequency, from the traffic I see the select 1 every 8 minutes with the configuration 'watchdog-frequency' = 'PT4M' for example:

Aug 30, 2023 22:26:01.462827000 CEST    select 1
Aug 30, 2023 22:34:01.463039000 CEST    select 1
Aug 30, 2023 22:42:01.462689000 CEST    select 1
Aug 30, 2023 22:50:01.468309000 CEST    select 1
Aug 30, 2023 22:58:01.462660000 CEST    select 1
Aug 30, 2023 23:06:01.462794000 CEST    select 1
Aug 30, 2023 23:14:01.462622000 CEST    select 1
Aug 30, 2023 23:22:01.462915000 CEST    select 1
Aug 30, 2023 23:30:01.462836000 CEST    select 1
Aug 30, 2023 23:38:01.463022000 CEST    select 1
Aug 30, 2023 23:46:01.463433000 CEST    select 1

instead in the stream of the connection where there is mysql traffic I see a select 1 every minute, that is strange but in 3 different streams it seems that there are scheduled task with these logs:

1:

Aug 31, 2023 16:19:15.596559000 CEST    select 1
Aug 31, 2023 16:19:15.598961000 CEST    CALL TigAllUsersCount() 
Aug 31, 2023 16:20:08.550052000 CEST    select nid as nid1 from tig_nodes where (uid = 6) AND (parent_nid is null) AND (node = 'root')
Aug 31, 2023 16:20:15.596342000 CEST    select 1
Aug 31, 2023 16:20:15.598472000 CEST    CALL TigAllUsersCount() 
Aug 31, 2023 16:21:08.551253000 CEST    select nid as nid1 from tig_nodes where (uid = 6) AND (parent_nid is null) AND (node = 'root')
Aug 31, 2023 16:21:15.596732000 CEST    select 1
Aug 31, 2023 16:21:15.598943000 CEST    CALL TigAllUsersCount() 
Aug 31, 2023 16:22:08.550469000 CEST    select nid as nid1 from tig_nodes where (uid = 6) AND (parent_nid is null) AND (node = 'root')
Aug 31, 2023 16:22:15.596559000 CEST    select 1

2:

Aug 31, 2023 16:26:34.900282000 CEST    select 1
Aug 31, 2023 16:26:34.903945000 CEST    update tig_cluster_nodes set hostname= 'testHOSTNAME', secondary= 'testHOSTNAME', password= 'HASHEDPASSWORD', last_update = '2023-08-31 14:26:34.903',port= 5277, cpu_usage= 0.16130766, mem_usage= 0.61106557  where hostname= 'testHOSTNAME'
Aug 31, 2023 16:26:34.908116000 CEST    insert into tig_cluster_nodes (hostname, secondary, password, last_update, port, cpu_usage, mem_usage)  (select 'testHOSTNAME', 'testHOSTNAME', 'HASHEDPASSWORD', '2023-08-31 14:26:34.903', 5277, 0.16130766, 0.61106557 from tig_cluster_nodes WHERE hostname='testHOSTNAME' HAVING count(*)=0)
Aug 31, 2023 16:26:34.910158000 CEST    select hostname, secondary, password, last_update, port, cpu_usage, mem_usage from tig_cluster_nodes
Aug 31, 2023 16:26:49.900293000 CEST    update tig_cluster_nodes set hostname= 'testHOSTNAME', secondary= 'testHOSTNAME', password= 'HASHEDPASSWORD', last_update = '2023-08-31 14:26:49.9',port= 5277, cpu_usage= 0.23255815, mem_usage= 0.61106557  where hostname= 'testHOSTNAME'
Aug 31, 2023 16:26:49.904794000 CEST    insert into tig_cluster_nodes (hostname, secondary, password, last_update, port, cpu_usage, mem_usage)  (select 'testHOSTNAME', 'testHOSTNAME', 'HASHEDPASSWORD', '2023-08-31 14:26:49.9', 5277, 0.23255815, 0.61106557 from tig_cluster_nodes WHERE hostname='testHOSTNAME' HAVING count(*)=0)
Aug 31, 2023 16:26:49.906823000 CEST    select hostname, secondary, password, last_update, port, cpu_usage, mem_usage from tig_cluster_nodes
Aug 31, 2023 16:27:04.900972000 CEST    update tig_cluster_nodes set hostname= 'testHOSTNAME', secondary= 'testHOSTNAME', password= 'HASHEDPASSWORD', last_update = '2023-08-31 14:27:04.9',port= 5277, cpu_usage= 0.111656986, mem_usage= 0.61106557  where hostname= 'testHOSTNAME'
Aug 31, 2023 16:27:04.905272000 CEST    insert into tig_cluster_nodes (hostname, secondary, password, last_update, port, cpu_usage, mem_usage)  (select 'testHOSTNAME', 'testHOSTNAME', 'HASHEDPASSWORD', '2023-08-31 14:27:04.9', 5277, 0.111656986, 0.61106557 from tig_cluster_nodes WHERE hostname='testHOSTNAME' HAVING count(*)=0)
Aug 31, 2023 16:27:04.907095000 CEST    select hostname, secondary, password, last_update, port, cpu_usage, mem_usage from tig_cluster_nodes
Aug 31, 2023 16:27:19.901170000 CEST    update tig_cluster_nodes set hostname= 'testHOSTNAME', secondary= 'testHOSTNAME', password= 'HASHEDPASSWORD', last_update = '2023-08-31 14:27:19.901',port= 5277, cpu_usage= 0.23250407, mem_usage= 0.61106557  where hostname= 'testHOSTNAME'
Aug 31, 2023 16:27:19.905292000 CEST    insert into tig_cluster_nodes (hostname, secondary, password, last_update, port, cpu_usage, mem_usage)  (select 'testHOSTNAME', 'testHOSTNAME', 'HASHEDPASSWORD', '2023-08-31 14:27:19.901', 5277, 0.23250407, 0.61106557 from tig_cluster_nodes WHERE hostname='testHOSTNAME' HAVING count(*)=0)
Aug 31, 2023 16:27:19.907239000 CEST    select hostname, secondary, password, last_update, port, cpu_usage, mem_usage from tig_cluster_nodes
Aug 31, 2023 16:27:34.900557000 CEST    select 1
Aug 31, 2023 16:27:34.902963000 CEST    update tig_cluster_nodes set hostname= 'testHOSTNAME', secondary= 'testHOSTNAME', password= 'HASHEDPASSWORD', last_update = '2023-08-31 14:27:34.902',port= 5277, cpu_usage= 0.10753845, mem_usage= 0.61106557  where hostname= 'testHOSTNAME'
Aug 31, 2023 16:27:34.907235000 CEST    insert into tig_cluster_nodes (hostname, secondary, password, last_update, port, cpu_usage, mem_usage)  (select 'testHOSTNAME', 'testHOSTNAME', 'HASHEDPASSWORD', '2023-08-31 14:27:34.902', 5277, 0.10753845, 0.61106557 from tig_cluster_nodes WHERE hostname='testHOSTNAME' HAVING count(*)=0)
Aug 31, 2023 16:27:34.909258000 CEST    select hostname, secondary, password, last_update, port, cpu_usage, mem_usage from tig_cluster_nodes
Aug 31, 2023 16:27:49.900470000 CEST    update tig_cluster_nodes set hostname= 'testHOSTNAME', secondary= 'testHOSTNAME', password= 'HASHEDPASSWORD', last_update = '2023-08-31 14:27:49.9',port= 5277, cpu_usage= 4.1860466, mem_usage= 0.6149776  where hostname= 'testHOSTNAME'
Aug 31, 2023 16:27:49.904426000 CEST    insert into tig_cluster_nodes (hostname, secondary, password, last_update, port, cpu_usage, mem_usage)  (select 'testHOSTNAME', 'testHOSTNAME', 'HASHEDPASSWORD', '2023-08-31 14:27:49.9', 5277, 4.1860466, 0.6149776 from tig_cluster_nodes WHERE hostname='testHOSTNAME' HAVING count(*)=0)
Aug 31, 2023 16:27:49.906330000 CEST    select hostname, secondary, password, last_update, port, cpu_usage, mem_usage from tig_cluster_nodes
Aug 31, 2023 16:28:04.900826000 CEST    update tig_cluster_nodes set hostname= 'testHOSTNAME', secondary= 'testHOSTNAME', password= 'HASHEDPASSWORD', last_update = '2023-08-31 14:28:04.9',port= 5277, cpu_usage= 0.111656986, mem_usage= 0.6149776  where hostname= 'testHOSTNAME'
Aug 31, 2023 16:28:04.905136000 CEST    insert into tig_cluster_nodes (hostname, secondary, password, last_update, port, cpu_usage, mem_usage)  (select 'testHOSTNAME', 'testHOSTNAME', 'HASHEDPASSWORD', '2023-08-31 14:28:04.9', 5277, 0.111656986, 0.6149776 from tig_cluster_nodes WHERE hostname='testHOSTNAME' HAVING count(*)=0)
Aug 31, 2023 16:28:04.907056000 CEST    select hostname, secondary, password, last_update, port, cpu_usage, mem_usage from tig_cluster_nodes
Aug 31, 2023 16:28:19.901094000 CEST    update tig_cluster_nodes set hostname= 'testHOSTNAME', secondary= 'testHOSTNAME', password= 'HASHEDPASSWORD', last_update = '2023-08-31 14:28:19.901',port= 5277, cpu_usage= 0.23255815, mem_usage= 0.6149776  where hostname= 'testHOSTNAME'
Aug 31, 2023 16:28:19.905489000 CEST    insert into tig_cluster_nodes (hostname, secondary, password, last_update, port, cpu_usage, mem_usage)  (select 'testHOSTNAME', 'testHOSTNAME', 'HASHEDPASSWORD', '2023-08-31 14:28:19.901', 5277, 0.23255815, 0.6149776 from tig_cluster_nodes WHERE hostname='testHOSTNAME' HAVING count(*)=0)
Aug 31, 2023 16:28:19.907409000 CEST    select hostname, secondary, password, last_update, port, cpu_usage, mem_usage from tig_cluster_nodes
Aug 31, 2023 16:28:34.900660000 CEST    select 1
Aug 31, 2023 16:28:34.902812000 CEST    update tig_cluster_nodes set hostname= 'testHOSTNAME', secondary= 'testHOSTNAME', password= 'HASHEDPASSWORD', last_update = '2023-08-31 14:28:34.902',port= 5277, cpu_usage= 0.10753845, mem_usage= 0.6149776  where hostname= 'testHOSTNAME'
Aug 31, 2023 16:28:34.906663000 CEST    insert into tig_cluster_nodes (hostname, secondary, password, last_update, port, cpu_usage, mem_usage)  (select 'testHOSTNAME', 'testHOSTNAME', 'HASHEDPASSWORD', '2023-08-31 14:28:34.902', 5277, 0.10753845, 0.6149776 from tig_cluster_nodes WHERE hostname='testHOSTNAME' HAVING count(*)=0)
Aug 31, 2023 16:28:34.908863000 CEST    select hostname, secondary, password, last_update, port, cpu_usage, mem_usage from tig_cluster_nodes
Aug 31, 2023 16:28:49.900281000 CEST    update tig_cluster_nodes set hostname= 'testHOSTNAME', secondary= 'testHOSTNAME', password= 'HASHEDPASSWORD', last_update = '2023-08-31 14:28:49.9',port= 5277, cpu_usage= 0.11630612, mem_usage= 0.6149776  where hostname= 'testHOSTNAME'
Aug 31, 2023 16:28:49.904726000 CEST    insert into tig_cluster_nodes (hostname, secondary, password, last_update, port, cpu_usage, mem_usage)  (select 'testHOSTNAME', 'testHOSTNAME', 'HASHEDPASSWORD', '2023-08-31 14:28:49.9', 5277, 0.11630612, 0.6149776 from tig_cluster_nodes WHERE hostname='testHOSTNAME' HAVING count(*)=0)
Aug 31, 2023 16:28:49.907835000 CEST    select hostname, secondary, password, last_update, port, cpu_usage, mem_usage from tig_cluster_nodes
Aug 31, 2023 16:29:04.901131000 CEST    update tig_cluster_nodes set hostname= 'testHOSTNAME', secondary= 'testHOSTNAME', password= 'HASHEDPASSWORD', last_update = '2023-08-31 14:29:04.901',port= 5277, cpu_usage= 0.0, mem_usage= 0.6149776  where hostname= 'testHOSTNAME'
Aug 31, 2023 16:29:04.905193000 CEST    insert into tig_cluster_nodes (hostname, secondary, password, last_update, port, cpu_usage, mem_usage)  (select 'testHOSTNAME', 'testHOSTNAME', 'HASHEDPASSWORD', '2023-08-31 14:29:04.901', 5277, 0.0, 0.6149776 from tig_cluster_nodes WHERE hostname='testHOSTNAME' HAVING count(*)=0)
Aug 31, 2023 16:29:04.907183000 CEST    select hostname, secondary, password, last_update, port, cpu_usage, mem_usage from tig_cluster_nodes
Aug 31, 2023 16:29:19.900397000 CEST    update tig_cluster_nodes set hostname= 'testHOSTNAME', secondary= 'testHOSTNAME', password= 'HASHEDPASSWORD', last_update = '2023-08-31 14:29:19.9',port= 5277, cpu_usage= 0.0, mem_usage= 0.6149776  where hostname= 'testHOSTNAME'
Aug 31, 2023 16:29:19.904640000 CEST    insert into tig_cluster_nodes (hostname, secondary, password, last_update, port, cpu_usage, mem_usage)  (select 'testHOSTNAME', 'testHOSTNAME', 'HASHEDPASSWORD', '2023-08-31 14:29:19.9', 5277, 0.0, 0.6149776 from tig_cluster_nodes WHERE hostname='testHOSTNAME' HAVING count(*)=0)
Aug 31, 2023 16:29:19.906553000 CEST    select hostname, secondary, password, last_update, port, cpu_usage, mem_usage from tig_cluster_nodes
Aug 31, 2023 16:29:34.900852000 CEST    select 1

3:

Aug 31, 2023 09:42:08.357990000 CEST    select 1
Aug 31, 2023 09:42:08.360018000 CEST    CALL TigGetUserDBUid('vhost-manager') 
Aug 31, 2023 09:42:08.364809000 CEST    select pval from tig_pairs where (nid = 6) AND (pkey = 'vhosts-lists')
Aug 31, 2023 09:43:08.357765000 CEST    select 1
Aug 31, 2023 09:43:08.359838000 CEST    CALL TigGetUserDBUid('vhost-manager') 
Aug 31, 2023 09:43:08.365848000 CEST    select pval from tig_pairs where (nid = 6) AND (pkey = 'vhosts-lists')
Aug 31, 2023 09:44:08.357575000 CEST    select 1
Aug 31, 2023 09:44:08.359731000 CEST    CALL TigGetUserDBUid('vhost-manager') 
Aug 31, 2023 09:44:08.364440000 CEST    select pval from tig_pairs where (nid = 6) AND (pkey = 'vhosts-lists')
Aug 31, 2023 09:45:08.357410000 CEST    select 1
davidemarrone commented 1 year ago

Using 'watchdog-frequency' = 'PT2M' I see the SELECT 1 every 4 minutes, unfortunately the proxy still drop the connection after 24 hours. It seems a bug on the RDS proxy, I will open a ticket on AWS ...

Do you have any ETA for the new version that support driver swapping so I can use the AWS jdbc driver?

davidemarrone commented 1 year ago

bad news from AWS:

I would like to mention that the rds proxy connection get closed after 24 hours is expected behavior. RDS Proxy won't close connection when there is ongoing transaction. But once the transaction is completed and the database connection age has passed 24 hours, it will be added to the termination queue.

Soon it will be closed again where its not within a transaction. When 24 hrs time is reached, the proxy checks if the client connection is in transaction. If so, the proxy waits at most 15 more mins. If after 15 mins the client connections is still in transaction, then proxy would drop the client connection regardless." .This is expected behavior due to architectural design which is similar in Aurora serverless.

"Aurora Serverless v1 closes connections that are older than 24 hours." [+] https://aws.amazon.com/blogs/database/best-practices-for-working-with-amazon-aurora-serverless/

As you are looking for more clarification about the below in the documentation above mentioned is not updated in our public documentation, but will be updated soon. As this is required to undergo a change process, I am not in a position to provide the ETA on when the document will be updated.

so at the moment tigase does not support in any way AWS aurora with HA, there is no way to make it working correctly, could you please re-think to the priority that you have assigned to ticket server-1358?

An alternative fix can be to refresh the connections as described in the doc in the link of Aurora Serverless v1:

Aurora Serverless v1 closes connections that are older than 24 hours. Make sure that your connection pool refreshes connections frequently.

woj-tek commented 1 year ago

@davidemarrone could you share output of the query: SHOW GLOBAL VARIABLES LIKE '%read_only%'; on read-only RDS instance?

davidemarrone commented 1 year ago

on the slave - RO:

mysql> SHOW GLOBAL VARIABLES LIKE '%read_only%'; 
+-----------------------+-------+
| Variable_name         | Value |
+-----------------------+-------+
| innodb_read_only      | ON    |
| read_only             | OFF   |
| super_read_only       | OFF   |
| transaction_read_only | OFF   |
+-----------------------+-------+
4 rows in set (0.01 sec)

on the master - RW:

SHOW GLOBAL VARIABLES LIKE '%read_only%'; 
+-----------------------+-------+
| Variable_name         | Value |
+-----------------------+-------+
| innodb_read_only      | OFF   |
| read_only             | OFF   |
| super_read_only       | OFF   |
| transaction_read_only | OFF   |
+-----------------------+-------+
4 rows in set (0.00 sec)
woj-tek commented 1 year ago

@davidemarrone I added handling of this case (checking if database read-only using JDBC API and also additional workaround for AWS RDS as they set different variable than the one that the MySQL Driver checks…). It's in the current master branch and will be released in Tigase Server 8.4.

davidemarrone commented 1 year ago

Ok, thank you