odpi / egeria

Egeria core
https://egeria-project.org
Apache License 2.0
812 stars 261 forks source link

[BUG] Cannot shutdown integration service #7145

Closed planetf1 closed 1 year ago

planetf1 commented 2 years ago

Is there an existing issue for this?

Current Behavior

When an integration service is running:

One integration service defined for integration daemon intserv

Integration Service: Database Integrator OMIS
  Integration Connector Reports: 
      Integration Connector: jdbcresource
         status:                   RUNNING
         lastStatusChange:         2022-11-29T12:01:11.303+00:00
         minMinutesBetweenRefresh: 60

is running within this 'intserv server', but it is not possible to restart it by calling ..../instance - this works for most servers normally

The notebook function 'activateServerOnPlatform' REST call will hang

The log entries show the request is clearly received, but not actioned:

➜  egeria curl -k -X POST https://localhost:9443/open-metadata/admin-services/users/garygeeke/servers/intserv/instance
Tue Nov 29 12:37:52 GMT 2022 intserv Shutdown OMAG-ADMIN-0005 The intserv server has begun the shutdown process
Tue Nov 29 12:37:52 GMT 2022 intserv Shutdown INTEGRATION-DAEMON-SERVICES-0020 The integration daemon intserv is shutting down

The request does not return

Repeating the status query shows the server and it's integration connector is still running -- and has not restarted

Expected Behavior

When calling /instance :

Steps To Reproduce

Environment

- Egeria:
- OS:
- Java:
- Browser (for UI issues):
- Additional connectors and integration:

Any Further Information?

No response

planetf1 commented 2 years ago

This also occurs when explicitly trying to stop the instance

curl -k -X DELETE https://localhost:9443/open-metadata/admin-services/users/garygeeke/servers/intserv/instance

This also results in the same log entries, and also hangs

planetf1 commented 2 years ago

I tried deleting the server that was supporting the integration service (in this case 'dbisrv').

This then allowed a further call to shutdown this integration service and complete - and provided clearer indication as to the problem:

First we shutdown the supporting metadata access point:

➜  ~ curl -k -X DELETE https://localhost:9443/open-metadata/admin-services/users/garygeeke/servers/dbisrv/instance

Good -- see it really is:

➜  ~ curl -k -X DELETE https://localhost:9443/open-metadata/admin-services/users/garygeeke/servers/dbisrv/instance
{"class":"VoidResponse","relatedHTTPCode":404,"exceptionClassName":"org.odpi.openmetadata.frameworks.connectors.ffdc.InvalidParameterException","actionDescription":"deactivateTemporarily","exceptionErrorMessage":"OMAG-MULTI-TENANT-404-001 The OMAG Server dbisrv is not available to service a request from user garygeeke","exceptionErrorMessageId":"OMAG-MULTI-TENANT-404-001","exceptionErrorMessageParameters":["dbisrv","garygeeke"],"exceptionSystemAction":"The system is unable to process the request because the server is not running on the called platform.","exceptionUserAction":"Verify that the correct server is being called on the correct platform and that this server is running. Retry the request when the server is available.","exceptionProperties":{"serverName":"dbisrv","parameterName":"serverName"}}%

and in the logs we see

Tue Nov 29 13:02:17 GMT 2022 dbisrv Shutdown OMAG-ADMIN-0006 The dbisrv server has completed shutdown

So now looking back at this 'stuck' server, we try to shutdown (I included logs)

➜  ~ curl -k -X DELETE https://localhost:9443/open-metadata/admin-services/users/garygeeke/servers/intserv/instance
Tue Nov 29 13:09:50 GMT 2022 intserv Shutdown OMAG-ADMIN-0005 The intserv server has begun the shutdown process
Tue Nov 29 13:09:50 GMT 2022 intserv Shutdown INTEGRATION-DAEMON-SERVICES-0020 The integration daemon intserv is shutting down
2022-11-29 13:09:51.540  WARN 96227 --- [ionDaemonThread] o.o.o.f.auditlog.MessageFormatter        : Missing parameter for message An SQL exception was received by method {0}. Exception message is: {1}
Tue Nov 29 13:09:51 GMT 2022 intserv Exception JDBC-INTEGRATION-CONNECTOR-0005 An SQL exception was received by method connect. Exception message is: {1}
Tue Nov 29 13:09:51 GMT 2022 intserv Exception JDBC-INTEGRATION-CONNECTOR-0005 Supplementary information: log record id 7c43462e-1fd2-43fd-b6f9-93578318bad2 com.ibm.db2.jcc.am.DisconnectNonTransientConnectionException returned message of [jcc][t4][2030][11211][4.32.28] A communication error occurred during operations on the connection's underlying socket, socket input stream,
or socket output stream.  Error location: Reply.fill() - socketInputStream.read (-1).  Message: PKIX path building failed: sun.security.provider.certpath.SunCertPathBuilderException: unable to find valid certification path to requested target. ERRORCODE=-4499, SQLSTATE=08001 and stacktrace of
com.ibm.db2.jcc.am.DisconnectNonTransientConnectionException: [jcc][t4][2030][11211][4.32.28] A communication error occurred during operations on the connection's underlying socket, socket input stream,
or socket output stream.  Error location: Reply.fill() - socketInputStream.read (-1).  Message: PKIX path building failed: sun.security.provider.certpath.SunCertPathBuilderException: unable to find valid certification path to requested target. ERRORCODE=-4499, SQLSTATE=08001
    at com.ibm.db2.jcc.am.b7.a(b7.java:338)
    at com.ibm.db2.jcc.t4.a.a(a.java:577)
    at com.ibm.db2.jcc.t4.a.a(a.java:561)
    at com.ibm.db2.jcc.t4.a.a(a.java:556)
    at com.ibm.db2.jcc.t4.a.b(a.java:1727)
    at com.ibm.db2.jcc.t4.a.a(a.java:1767)
    at com.ibm.db2.jcc.t4.a.h(a.java:503)
    at com.ibm.db2.jcc.t4.a.flush_(a.java:420)
    at com.ibm.db2.jcc.am.Agent.flowOutsideUOW(Agent.java:281)
    at com.ibm.db2.jcc.t4.b.b(b.java:1357)
    at com.ibm.db2.jcc.t4.b.b(b.java:891)
    at com.ibm.db2.jcc.t4.b.a(b.java:862)
    at com.ibm.db2.jcc.t4.b.a(b.java:457)
    at com.ibm.db2.jcc.t4.b.a(b.java:430)
    at com.ibm.db2.jcc.t4.b.<init>(b.java:368)
    at com.ibm.db2.jcc.DB2SimpleDataSource.getConnection(DB2SimpleDataSource.java:243)
    at com.ibm.db2.jcc.DB2SimpleDataSource.getConnection(DB2SimpleDataSource.java:200)
    at com.ibm.db2.jcc.DB2Driver.connect(DB2Driver.java:491)
    at com.ibm.db2.jcc.DB2Driver.connect(DB2Driver.java:117)
    at java.sql/java.sql.DriverManager.getConnection(DriverManager.java:681)
    at java.sql/java.sql.DriverManager.getConnection(DriverManager.java:229)
    at org.odpi.openmetadata.adapters.connectors.resource.jdbc.JdbcConnector$JdbcConnectorAsDataSource.getConnection(JdbcConnector.java:35)
    at org.odpi.openmetadata.adapters.connectors.integration.jdbc.JdbcIntegrationConnector.connect(JdbcIntegrationConnector.java:69)
    at org.odpi.openmetadata.adapters.connectors.integration.jdbc.JdbcIntegrationConnector.refresh(JdbcIntegrationConnector.java:42)
    at org.odpi.openmetadata.governanceservers.integrationdaemonservices.handlers.IntegrationConnectorHandler.refreshConnector(IntegrationConnectorHandler.java:487)
    at org.odpi.openmetadata.governanceservers.integrationdaemonservices.threads.IntegrationDaemonThread.run(IntegrationDaemonThread.java:96)
    at java.base/java.lang.Thread.run(Thread.java:833)
Caused by: javax.net.ssl.SSLHandshakeException: PKIX path building failed: sun.security.provider.certpath.SunCertPathBuilderException: unable to find valid certification path to requested target
    at java.base/sun.security.ssl.Alert.createSSLException(Alert.java:131)
    at java.base/sun.security.ssl.TransportContext.fatal(TransportContext.java:371)
    at java.base/sun.security.ssl.TransportContext.fatal(TransportContext.java:314)
    at java.base/sun.security.ssl.TransportContext.fatal(TransportContext.java:309)
    at java.base/sun.security.ssl.CertificateMessage$T12CertificateConsumer.checkServerCerts(CertificateMessage.java:654)
    at java.base/sun.security.ssl.CertificateMessage$T12CertificateConsumer.onCertificate(CertificateMessage.java:473)
    at java.base/sun.security.ssl.CertificateMessage$T12CertificateConsumer.consume(CertificateMessage.java:369)
    at java.base/sun.security.ssl.SSLHandshake.consume(SSLHandshake.java:396)
    at java.base/sun.security.ssl.HandshakeContext.dispatch(HandshakeContext.java:480)
    at java.base/sun.security.ssl.HandshakeContext.dispatch(HandshakeContext.java:458)
    at java.base/sun.security.ssl.TransportContext.dispatch(TransportContext.java:201)
    at java.base/sun.security.ssl.SSLTransport.decode(SSLTransport.java:172)
    at java.base/sun.security.ssl.SSLSocketImpl.decode(SSLSocketImpl.java:1505)
    at java.base/sun.security.ssl.SSLSocketImpl.readHandshakeRecord(SSLSocketImpl.java:1420)
    at java.base/sun.security.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:455)
    at java.base/sun.security.ssl.SSLSocketImpl.ensureNegotiated(SSLSocketImpl.java:920)
    at java.base/sun.security.ssl.SSLSocketImpl$AppOutputStream.write(SSLSocketImpl.java:1290)
    at com.ibm.db2.jcc.t4.ae.b(ae.java:2069)
    at com.ibm.db2.jcc.t4.ae.a(ae.java:2011)
    at com.ibm.db2.jcc.t4.a.h(a.java:488)
    ... 20 more
Caused by: sun.security.validator.ValidatorException: PKIX path building failed: sun.security.provider.certpath.SunCertPathBuilderException: unable to find valid certification path to requested target
    at java.base/sun.security.validator.PKIXValidator.doBuild(PKIXValidator.java:439)
    at java.base/sun.security.validator.PKIXValidator.engineValidate(PKIXValidator.java:306)
    at java.base/sun.security.validator.Validator.validate(Validator.java:264)
    at java.base/sun.security.ssl.X509TrustManagerImpl.checkTrusted(X509TrustManagerImpl.java:231)
    at java.base/sun.security.ssl.X509TrustManagerImpl.checkServerTrusted(X509TrustManagerImpl.java:132)
    at java.base/sun.security.ssl.CertificateMessage$T12CertificateConsumer.checkServerCerts(CertificateMessage.java:638)
    ... 35 more
Caused by: sun.security.provider.certpath.SunCertPathBuilderException: unable to find valid certification path to requested target
    at java.base/sun.security.provider.certpath.SunCertPathBuilder.build(SunCertPathBuilder.java:141)
    at java.base/sun.security.provider.certpath.SunCertPathBuilder.engineBuild(SunCertPathBuilder.java:126)
    at java.base/java.security.cert.CertPathBuilder.build(CertPathBuilder.java:297)
    at java.base/sun.security.validator.PKIXValidator.doBuild(PKIXValidator.java:434)
    ... 40 more

Tue Nov 29 13:09:51 GMT 2022 intserv Information JDBC-INTEGRATION-CONNECTOR-0001 Exiting from method JdbcIntegrationConnector.refresh as a result of a failed connection
Tue Nov 29 13:09:51 GMT 2022 intserv Information INTEGRATION-DAEMON-SERVICES-0043 The integration connector jdbcresource in integration daemon intserv has completed refresh processing in 648979 millisecond
Tue Nov 29 13:09:51 GMT 2022 intserv Shutdown INTEGRATION-DAEMON-SERVICES-0026 The integration daemon intserv has completed shutdown
Tue Nov 29 13:09:51 GMT 2022 intserv Shutdown INTEGRATION-DAEMON-SERVICES-0026 The integration daemon intserv has completed shutdown
Tue Nov 29 13:09:51 GMT 2022 intserv Shutdown OMRS-AUDIT-0010 The Open Metadata Repository Services (OMRS) is disconnecting the open metadata repositories
Tue Nov 29 13:09:51 GMT 2022 intserv Shutdown INTEGRATION-DAEMON-SERVICES-0026 The integration daemon intserv has completed shutdown
Tue Nov 29 13:09:51 GMT 2022 intserv Shutdown OMRS-AUDIT-0011 The Open Metadata Repository Services (OMRS) has disconnected from the open metadata repositories
{"class":"VoidResponse","relatedHTTPCode":200}Tue Nov 29 13:09:51 GMT 2022 intserv Shutdown OMRS-AUDIT-0010 The Open Metadata Repository Services (OMRS) is disconnecting the open metadata repositories
Tue Nov 29 13:09:51 GMT 2022 intserv Shutdown OMAG-ADMIN-0006 The intserv server has completed shutdown
Tue Nov 29 13:09:51 GMT 2022 intserv Shutdown OMRS-AUDIT-0010 The Open Metadata Repository Services (OMRS) is disconnecting the open metadata repositories
{"class":"VoidResponse","relatedHTTPCode":404,"exceptionClassName":"org.odpi.openmetadata.frameworks.connectors.ffdc.InvalidParameterException","actionDescription":"deactivateTemporarily","exceptionErrorMessage":"OMAG-MULTI-TENANT-404-001 The OMAG Server intserv is not available to service a request from user garygeeke","exceptionErrorMessageId":"OMAG-MULTI-TENANT-404-001","exceptionErrorMessageParameters":["intserv","garygeeke"],"exceptionSystemAction":"The system is unable to process the request because the server is not running on the called platform.","exceptionUserAction":"Verify that the correct server is being called on the correct platform and that this server is running. Retry the request when the server is available.","exceptionProperties":{"serverName":"intserv","parameterName":"serverName"}}%       Tue Nov 29 13:09:51 GMT 2022 intserv Shutdown OMRS-AUDIT-0011 The Open Metadata Repository Services (OMRS) has disconnected from the open metadata repositories
Tue Nov 29 13:09:51 GMT 2022 intserv Shutdown OMAG-ADMIN-0006 The intserv server has completed shutdown
Tue Nov 29 13:09:51 GMT 2022 intserv Shutdown OMRS-AUDIT-0011 The Open Metadata Repository Services (OMRS) has disconnected from the open metadata repositories
Tue Nov 29 13:09:51 GMT 2022 intserv Shutdown OMAG-ADMIN-0006 The intserv server has completed shutdown
[1]  - 97923 done       curl -k -X DELETE
➜  ~ Tue Nov 29 13:09:52 GMT 2022 intserv Shutdown INTEGRATION-DAEMON-SERVICES-0044 The integration daemon thread for integration daemon intserv is shutting down
➜  ~

So from this:

planetf1 commented 2 years ago

One user error in this case was appending ':sslConnection=true;' onto the jdbc URL (when configuring the connector), when it should have been ';sslConnection=true;'. However there is something going wrong in handling this case.

mandy-chessell commented 2 years ago

Looking at this issue, I checked the base class of the Integration Connector and it does not pass on the disconnect() call to its embedded connectors. Similar problem with:

I am not sure if this is the cause of the problem - but it certainly does not help because the embedded connectors don't know that the service is shutting down. I'll push a PR with this issue in its description and see if it helps ...

planetf1 commented 2 years ago

Still investigating root cause. the mention of the connection string being wrong above was incorrect -- if that is invalud, the db2 driver gets an exception, and that audit entry is correctly reported (though the connector still shows as running - see https://github.com/odpi/egeria-connector-jdbc/issues/5

I think a contributory factor HERE is that back with the original connection string, I am seeing the connector hang in db2. I will raise an issue in the jdbc repo on this.

How that is interrupted, whether it's interrupted during a shutdown could be a factor, but if we are hosting these services we need to do what we can to ensure they handle interruptions. Currently I'm having to restart the server chassis - ie the java process

planetf1 commented 2 years ago

It looks like a hung connection attempt - though the same URL is working on some standalone java code & a third party tool.

Where we have a risk of any threads we manage being stuck (especially with network connections - so I'd include kafka, our connectors etc) we may need to Thread.interrupt(that thread), and also take care we don't swallow any interrupts.

On this latter point actually I had this issue in the kafka connector where we needed to ensure we reraised the interrupt so that it didn't get lost, so I think that might be slightly better (but still worthy of review). (ref https://www.baeldung.com/java-interrupted-exception )

planetf1 commented 2 years ago

Ok, I now see that the default timeout of that connection attempt is 10 minutes (it is configurable as a connection string property - and this feels rather long)

During that time the connector is not shutting down -- so useful to get into this error case so we can address here & check interrupt handling elsewhere

Here's the bigger picture:

Tue Nov 29 17:51:42 GMT 2022 intserv Startup OMRS-AUDIT-0064 The Open Metadata Repository Services (OMRS) has initialized the audit log for the Integration Daemon called intserv
Tue Nov 29 17:51:42 GMT 2022 intserv Startup OMAG-ADMIN-0001 The intserv server is configured with a max page size of 1000
Tue Nov 29 17:51:42 GMT 2022 intserv Startup OMRS-AUDIT-0001 The Open Metadata Repository Services (OMRS) is initializing the subsystems to support a new server
Tue Nov 29 17:51:42 GMT 2022 intserv Startup OMRS-AUDIT-0007 The Open Metadata Repository Services (OMRS) has initialized
Tue Nov 29 17:51:42 GMT 2022 intserv Startup OMAG-ADMIN-0100 The governance services subsystem for the Integration Daemon called intserv is about to start
Tue Nov 29 17:51:42 GMT 2022 intserv Startup INTEGRATION-DAEMON-SERVICES-0001 The integration daemon services are initializing in server intserv
Tue Nov 29 17:51:42 GMT 2022 intserv Startup OMIS-DATABASE-INTEGRATOR-0001 The database integrator context manager is being initialized for calls to server dbisrv on platform https://localhost:9443
Tue Nov 29 17:51:42 GMT 2022 intserv Startup INTEGRATION-DAEMON-SERVICES-0008 A new integration connector named jdbcresource is initializing in integration service Database Integrator OMIS running in integration daemon intserv, permitted synchronization is: From Third Party
Tue Nov 29 17:51:42 GMT 2022 intserv Startup OMIS-DATABASE-INTEGRATOR-0002 Creating context for integration connector jdbcresource (c60925ac-47b4-4929-9346-8243a2fa1d1b) connecting to third party technology jdbctest with permitted synchronization of From Third Party and service options of null
Tue Nov 29 17:51:42 GMT 2022 intserv Startup INTEGRATION-DAEMON-SERVICES-0040 The integration daemon thread for integration daemon intserv has started
Tue Nov 29 17:51:42 GMT 2022 intserv Information INTEGRATION-DAEMON-SERVICES-0041 The integration daemon thread is refreshing integration connector jdbcresource for the first time in the intserv integration daemon instance
Tue Nov 29 17:51:42 GMT 2022 intserv Startup INTEGRATION-DAEMON-SERVICES-0013 The integration daemon intserv has initialized
Tue Nov 29 17:51:42 GMT 2022 intserv Startup OMAG-ADMIN-0101 The governance services subsystem for the Integration Daemon called intserv has completed start up
Tue Nov 29 17:51:42 GMT 2022 intserv Startup OMAG-ADMIN-0004 The intserv server has successfully completed start up.  The following services are running: [Open Metadata Repository Services (OMRS), Database Integrator OMIS, Integration Daemon Services]
Tue Nov 29 18:01:50 GMT 2022 intserv Exception JDBC-INTEGRATION-CONNECTOR-0005 An SQL exception was received by method connect. Exception message is: {1}
Tue Nov 29 18:01:50 GMT 2022 intserv Exception JDBC-INTEGRATION-CONNECTOR-0005 Supplementary information: log record id d85fdead-1439-4aa0-bc0d-fdfae1f6e5a3 com.ibm.db2.jcc.am.DisconnectNonTransientConnectionException returned message of [jcc][t4][2030][11211][4.32.28] A communication error occurred during operations on the connection's underlying socket, socket input stream,
Tue Nov 29 18:01:50 GMT 2022 intserv Information JDBC-INTEGRATION-CONNECTOR-0001 Exiting from method JdbcIntegrationConnector.refresh as a result of a failed connection
Tue Nov 29 18:01:50 GMT 2022 intserv Information INTEGRATION-DAEMON-SERVICES-0043 The integration connector jdbcresource in integration daemon intserv has completed refresh processing in 608770 millisecond
Tue Nov 29 18:02:51 GMT 2022 intserv Information INTEGRATION-DAEMON-SERVICES-0042 The integration daemon thread is refreshing integration connector jdbcresource in integration daemon intserv
Tue Nov 29 18:13:00 GMT 2022 intserv Exception JDBC-INTEGRATION-CONNECTOR-0005 An SQL exception was received by method connect. Exception message is: {1}
Tue Nov 29 18:13:00 GMT 2022 intserv Exception JDBC-INTEGRATION-CONNECTOR-0005 Supplementary information: log record id a7e2938d-1a75-43f6-8fd4-2ce29301d5e5 com.ibm.db2.jcc.am.DisconnectNonTransientConnectionException returned message of [jcc][t4][2030][11211][4.32.28] A communication error occurred during operations on the connection's underlying socket, socket input stream,
Tue Nov 29 18:13:00 GMT 2022 intserv Information JDBC-INTEGRATION-CONNECTOR-0001 Exiting from method JdbcIntegrationConnector.refresh as a result of a failed connection
Tue Nov 29 18:13:00 GMT 2022 intserv Information INTEGRATION-DAEMON-SERVICES-0043 The integration connector jdbcresource in integration daemon intserv has completed refresh processing in 609038 millisecond
Tue Nov 29 18:14:00 GMT 2022 intserv Information INTEGRATION-DAEMON-SERVICES-0042 The integration daemon thread is refreshing integration connector jdbcresource in integration daemon intserv
github-actions[bot] commented 1 year ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 20 days if no further activity occurs. Thank you for your contributions.

github-actions[bot] commented 1 year ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 20 days if no further activity occurs. Thank you for your contributions.

github-actions[bot] commented 1 year ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 20 days if no further activity occurs. Thank you for your contributions.