Status / health check information for cohort connectivity

cmgrote commented 3 years ago

The objective of this issue is to discuss options and a proposed approach for exposing basic status / health check information on the connectivity of a given server to its cohort(s).

Understanding of current interactions:

Currently there are a number of OMRSTopicConnector instances that appear to define the interactions with a cohort
These appear to be managed in the OMRSCohortManager class
A number of instances of this OMRSCohortManager class (one per cohort?) are then available within the OMRSMetadataHighwayManager class, which manages the connectivity to each cohort that a local server is a member of
This OMRSMetadataHighwayManager class is then in turn exposed via APIs whose logic is implemented through the OMRSMetadataHighwayRESTServices class and ultimately bound to APIs (using Spring) in MetadataHighwayServicesResource
The actual connectivity to event buses within the OMRSTopicConnector appears to be through a list of OpenMetadataTopicConnector instances
Today there is no status information that is exposed through either the OMRSTopicConnector class (which itself implements a number of interfaces, notably OMRSTopic and OpenMetadataTopicListener as well as a base class ConnectorBase) or the OpenMetadataTopicConnector class (which also extends ConnectorBase and implements a different interface: OpenMetadataTopic)

Suggestions for providing status / health check information:

If we extend one or more of these interfaces / abstract classes with a method for retrieving status, this could then be served through underlying implementations
Examining an example underlying implementation (KafkaOpenMetadataTopicConnector, which extends OpenMetadataTopicConnector), we can see that this ultimately extends the ConnectorBase abstract class
Considering further that it could be useful for any connector that can be started and disconnected to be able to communicate basic status / health-check information, perhaps it would make sense to extend the underlying Connector abstract class with such a status retrieval method (?)
There is already an isActive() method defined at the ConnectorBase level from which everything extends, but this is very binary and currently based purely on whether the connector has been started or disconnected

So as a proposed approach:

Add a method to the Connector abstract class to retrieve a connector status object (to be defined, but likely including at least an enumerated status, with specific values to be defined, and a more "free-form" informational string field)
Provide a default implementation of this new method in ConnectorBase that simply re-uses the isActive() method also defined there to translate the binary (boolean) of isActive() into a basic status object
Allow any implementation of a Connector to override this new method with a more granular detection of various statuses (non-binary)
Use this connector-level method to surface a status (per cohort) from the OpenMetadataTopicConnector (and similar) classes upwards to the APIs

This would therefore not change any of the existing interfaces of a Connector while providing a default implementation of the logic that is based on already-existing and self-contained methods in the top-level abstract implementation (ConnectorBase), so I believe would also be backwards-compatible (?)
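
For illustration, a minimal sketch of the proposed approach. All type names here (`ConnectorState`, `ConnectorStatus`, `SketchConnector`, `SketchConnectorBase`, `UnchangedConnector`, `SketchTopicConnector`) are hypothetical stand-ins rather than the actual OCF classes, and the enum values are placeholders since the specific statuses are still to be defined.

```java
// Illustrative stand-ins only: none of these types exist in Egeria under these names.
enum ConnectorState { ACTIVE, INACTIVE, DEGRADED } // placeholder values

final class ConnectorStatus {
    private final ConnectorState state;
    private final String detail; // the "free-form" informational field

    ConnectorStatus(ConnectorState state, String detail) {
        this.state = state;
        this.detail = detail;
    }

    ConnectorState getState() { return state; }
    String getDetail() { return detail; }
}

// Stand-in for the Connector abstract class: declares the new status method.
abstract class SketchConnector {
    abstract void start();
    abstract void disconnect();
    abstract ConnectorStatus getStatus();
}

// Stand-in for ConnectorBase: the default status is derived purely from isActive().
abstract class SketchConnectorBase extends SketchConnector {
    private volatile boolean active = false;

    @Override void start() { active = true; }
    @Override void disconnect() { active = false; }
    boolean isActive() { return active; }

    @Override ConnectorStatus getStatus() {
        return isActive()
                ? new ConnectorStatus(ConnectorState.ACTIVE, "connector has been started")
                : new ConnectorStatus(ConnectorState.INACTIVE, "connector not started or disconnected");
    }
}

// An existing connector needs no changes: it inherits the default status.
class UnchangedConnector extends SketchConnectorBase { }

// A connector that opts in to more granular (non-binary) status detection.
class SketchTopicConnector extends SketchConnectorBase {
    private volatile boolean brokerReachable = true; // assumed to be updated by the polling logic

    void setBrokerReachable(boolean reachable) { this.brokerReachable = reachable; }

    @Override ConnectorStatus getStatus() {
        if (isActive() && !brokerReachable) {
            return new ConnectorStatus(ConnectorState.DEGRADED, "started, but the event bus is unreachable");
        }
        return super.getStatus();
    }
}
```

In this sketch `UnchangedConnector` compiles without implementing anything new, which is the backwards-compatibility property described above, while `SketchTopicConnector` shows how an implementation could report a state that `isActive()` alone cannot express.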

guptaneeru commented 3 years ago

Thank you Chris @cmgrote for opening this issue. This will be useful information to know the status of cohort/s for a local server. The way it is proposed, it will be connector agnostic.

guptaneeru commented 3 years ago

In addition to providing a health check, we should also revisit the polling logic. Currently, Egeria tries to connect to the topic server in some sort of loop, which fills up the logs. This has been an issue whenever Kafka or the server is not reachable: the logs roll over. We should poll at intervals rather than in a loop, and also suppress the repeated log entries if we can...
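
As a rough sketch of what "poll at intervals" could look like, the helper below computes an exponentially growing, capped delay between reconnection attempts and decides which failures are worth logging. `ReconnectBackoff` and its methods are hypothetical and are not part of Egeria or the Kafka client; the doubling schedule and the "log only on the 1st, 2nd, 4th, 8th... failure" rule are assumptions chosen for illustration.

```java
import java.time.Duration;

// Hypothetical helper, not an existing Egeria or Kafka class.
final class ReconnectBackoff {
    private final Duration initial;
    private final Duration max;

    ReconnectBackoff(Duration initial, Duration max) {
        if (initial.isZero() || initial.isNegative()) {
            throw new IllegalArgumentException("initial delay must be positive");
        }
        if (max.compareTo(initial) < 0) {
            throw new IllegalArgumentException("max delay must be >= initial delay");
        }
        this.initial = initial;
        this.max = max;
    }

    // Delay before retry number `attempt` (0-based): initial * 2^attempt, capped at max.
    Duration delayFor(int attempt) {
        if (attempt < 0) {
            throw new IllegalArgumentException("attempt must be >= 0");
        }
        long initialMs = initial.toMillis();
        long maxMs = max.toMillis();
        int shift = Math.min(attempt, 62);
        // If doubling `shift` times would exceed the cap (or overflow), return the cap.
        if (initialMs > (maxMs >> shift)) {
            return max;
        }
        return Duration.ofMillis(initialMs << shift);
    }

    // Log only when the consecutive failure count is a power of two (1, 2, 4, 8, ...).
    boolean shouldLog(int consecutiveFailures) {
        return consecutiveFailures > 0 && Integer.bitCount(consecutiveFailures) == 1;
    }
}
```

A reconnect loop could sleep for `delayFor(attempt)` between attempts and only write a log record when `shouldLog(failures)` is true, so an unreachable broker produces a handful of entries rather than a continuous stream.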

planetf1 commented 3 years ago

@guptaneeru there are probably multiple points here:

a) Whether an audit event is generated within that connection attempt. I'd err on the side of "probably", but I haven't looked in enough detail at how tight that loop is.
b) The behaviour of the default audit log providers. For example, they could handle repeated events better ("last event occurred 10 times"), or a wrapping logger could be provided.
c) More than b), the fact that the audit log framework is pluggable, so a new logger could be written to better suit your needs (including log cycling etc.).
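
To illustrate the "wrapping logger" idea in b), here is a minimal sketch that collapses consecutive identical messages into a single summary line. `RepeatCollapsingLogger` is hypothetical and deliberately does not use the Audit Log Framework's real interfaces; the destination is modelled as a plain `Consumer<String>`, and the wording of the summary line is an assumption.

```java
import java.util.function.Consumer;

// Hypothetical wrapper; it does not implement any real Audit Log Framework interface.
final class RepeatCollapsingLogger {
    private final Consumer<String> delegate; // stand-in for the wrapped log destination
    private String lastMessage;
    private int repeats;

    RepeatCollapsingLogger(Consumer<String> delegate) {
        this.delegate = delegate;
    }

    // Forward a message unless it is identical to the previous one, in which case just count it.
    synchronized void log(String message) {
        if (message.equals(lastMessage)) {
            repeats++;
            return;
        }
        flush();
        lastMessage = message;
        delegate.accept(message);
    }

    // Emit a summary for any suppressed repeats (call on shutdown or on a timer).
    synchronized void flush() {
        if (repeats > 0) {
            delegate.accept("last event repeated " + repeats + " more time(s): " + lastMessage);
        }
        repeats = 0;
    }
}
```

For example, logging the same message three times followed by a different one would forward the first message, one summary line noting two repeats, and then the new message.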

guptaneeru commented 3 years ago

Thank you Nigel @planetf1. How can I add our own audit logger?

planetf1 commented 3 years ago

@guptaneeru :

Info on the Audit Log Framework (ALF) is at https://egeria.odpi.org/open-metadata-implementation/frameworks/audit-log-framework/
Guidance on using it is at https://egeria.odpi.org/open-metadata-implementation/admin-services/docs/user/omag-server-platform-logging.html
The ALF code is at https://github.com/odpi/egeria/tree/master/open-metadata-implementation/frameworks/audit-log-framework
Implementations of an audit logger are at https://github.com/odpi/egeria/tree/master/open-metadata-implementation/adapters/open-connectors/repository-services-connectors/audit-log-connectors

In terms of the polling of Kafka specifically, I've taken a look at the code. Issue odpi/egeria#5681 touches on this, but is very specifically about the state of the server during the initialisation period, and odpi/egeria-docs#447 is to clarify and document the startup behaviour. This issue is exploring specific approaches to understand more broadly the health of the system's connectors, which may go some way towards addressing that issue. I think what's important is that the status can be understood at various levels (platform, server, connector(s)), depending on the needs of the caller, and matching the appropriate APIs that act upon the config. The need for this becomes more acute when we have replicas of a server, since we want to direct requests to the set of working replicas, not the bad ones...
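
One possible way to picture status at several levels is a simple "worst of the parts" roll-up: a server is only as healthy as its least healthy connector, and a caller routing requests keeps only the replicas that report as healthy. The sketch below assumes this aggregation rule; `Health` and `HealthRollup` are hypothetical names, and nothing here reflects an existing Egeria API.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.Collections;
import java.util.List;
import java.util.Map;

// Hypothetical health levels, declared from best to worst so enum order can be compared.
enum Health { UP, DEGRADED, DOWN }

final class HealthRollup {
    private HealthRollup() { }

    // Roll a set of lower-level statuses (e.g. connectors) up into one higher-level status
    // (e.g. server). An empty collection is treated as UP, which is an assumption.
    static Health worstOf(Collection<Health> parts) {
        Health worst = Health.UP;
        for (Health part : parts) {
            if (part.compareTo(worst) > 0) {
                worst = part;
            }
        }
        return worst;
    }

    // Names of the replicas that are fully healthy, sorted for a stable result.
    static List<String> healthyReplicas(Map<String, Health> healthByReplica) {
        List<String> healthy = new ArrayList<>();
        for (Map.Entry<String, Health> entry : healthByReplica.entrySet()) {
            if (entry.getValue() == Health.UP) {
                healthy.add(entry.getKey());
            }
        }
        Collections.sort(healthy);
        return healthy;
    }
}
```

The same `worstOf` call could be applied again to roll server statuses up to the platform level, so each caller reads the level of detail it needs.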

github-actions[bot] commented 2 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 20 days if no further activity occurs. Thank you for your contributions.

odpi / egeria

Status / health check information for cohort connectivity #5471