odpi / egeria

Egeria core
https://egeria-project.org
Apache License 2.0

[BUG] OMRS connectivity issues if Kafka is down during the metadata repository start up #6791

Closed: pmadugundu closed this 2 years ago

pmadugundu commented 2 years ago

Is there an existing issue for this?

Current Behavior

A related issue: https://github.com/odpi/egeria/issues/5471.

Note: At present, there is no proper capability in the Egeria library to determine the health of the OMRS connection for a cohort. I think https://github.com/odpi/egeria/issues/5471 will fill this gap. The WKC services determine the status of OMRS health by calling the OMRSRepositoryRESTServices.getMetadataCollectionId() method. If the method returns a non-null metadata collection ID, the status is considered CONNECTED, otherwise DISCONNECTED. I understand that this is not the right approach.
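For clarity, the heuristic described above amounts to something like the following. This is a hypothetical illustration only, not WKC's actual code; as noted, a non-null metadata collection ID merely shows the local repository is initialised, not that the cohort topic connection is healthy.

```python
def omrs_health_heuristic(metadata_collection_id):
    """The heuristic described above: a non-null metadata collection ID is
    treated as CONNECTED, otherwise DISCONNECTED. It says nothing about the
    Kafka topic connection, which is why it can report CONNECTED while
    Kafka is down."""
    return "CONNECTED" if metadata_collection_id else "DISCONNECTED"
```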

I would like to report the following 2 issues to improve the resilience of OMRS connectivity from the metadata repository:

  1. Consider a metadata repository that is already registered to a cohort and is restarted while Kafka is down. Egeria attempts to connect to Kafka for 30 minutes. During this 30-minute period, the present method (explained above) gives the wrong status, CONNECTED, and the OMRS messages sent to the event mapper are lost. After 30 minutes, the present method returns the right status, DISCONNECTED.

  2. If Kafka is restored within 30 minutes, the OMRS connection is established and OMRS messages are sent to the cohort. But if Kafka is restored after 30 minutes, the connection is never re-established.
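The second issue describes a bounded retry window: once it expires, no further reconnection is attempted. A minimal sketch of that behaviour (not Egeria's implementation; the interval, window, and injected clock/sleep are assumptions for illustration and testing):

```python
import time

def connect_with_retry(connect, retry_window_secs=1800, interval_secs=10,
                       clock=time.monotonic, sleep=time.sleep):
    """Keep calling connect() until it succeeds or the retry window expires.

    Mirrors the reported behaviour: once the window (30 minutes by default)
    has elapsed, the function gives up and never tries again, so a broker
    restored after the window is never reconnected to.
    """
    deadline = clock() + retry_window_secs
    while clock() < deadline:
        if connect():
            return True   # broker came back inside the window
        sleep(interval_secs)
    return False          # window expired: connection stays down for good
```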

Expected Behavior

  1. The Egeria library should provide the right status all the time.

  2. The OMRS connection should be re-established even when Kafka is restored after 30 minutes.

Steps To Reproduce

No response

Environment

- Egeria: 3.10
- OS:
- Java:
- Browser (for UI issues):
- Additional connectors and integration:

Any Further Information?

No response

mandy-chessell commented 2 years ago

I am struggling to understand the description above so forgive me as I seek clarification. Am I understanding the request correctly?

Background description of cohort functionality

Connectivity to a cohort is via multiple connections. There are the event bus connections (Kafka in this case) to the three topics, and there are the REST API connections to all other members to support federated queries. Kafka can be down and the cohort can still be delivering metadata to its callers without error, and in that sense it is still connected.

Kafka is needed to connect a new member. If it is down, the registration request needs to be retried. This is reflected in the cohort connection status. The cohort connection status is not a reflection of whether Kafka is up or down, but of the state of the registration to the cohort and whether the cohort manager is running. Longer term, the plan is to allow a server to control its connectivity to the cohort, such as registering and unregistering from a cohort, but this is not implemented at the moment.

The OMRSRepositoryRESTServices.getMetadataCollectionId() call listed above does not reflect whether the Kafka topic connection is up or down. It calls the local repository to see if it is initialised. A metadata server can run successfully without connecting to any cohorts. If the setting of this value seems to coincide with the connection to the cohort topics in your set-up, it is a coincidence, because both features are initialised at similar times.

Kafka provides background exchange of events about types and instances. The protocol is such that the loss of these events is not a problem, because federated queries drive refresh events. In general, small outages in Kafka have no impact on the working of the cohort, so changing the cohort connection status when Kafka is down would be invalid.

My understanding of the request

What I think is being asked for is an ability to determine the health of each of the connections.

This is a good idea but it is a new feature request rather than a bug.

There has been some background work to give connectors the ability to post status and statistics. Neither the REST nor the kafka connector takes advantage of this feature yet and this would be an area of enhancement.

It is possible to see the status and statistics of the connectors through the audit log report, but since this gives the status of all components in the server, it is a cumbersome mechanism. Ideally there would be a new metadata highway REST API call to allow the status of the topic/REST connectors to be queried.

Is this what you are asking for?

pmadugundu commented 2 years ago

@mandy-chessell Thanks for your response.

The ability to determine the health of each of the cohort topic connections will address the first issue in my comment. And please note that, at present, the REST API is not used.

Regarding the second issue in my first comment, it seems Egeria attempts to reconnect only for 30 minutes. If Kafka is restored after 30 minutes, the connection is never restored. Please consider this issue as well.

mandy-chessell commented 2 years ago

Federated queries are an important part of the design and are used by other organizations. In fact, although not all members need to support them, we do not recommend running a cohort without at least one member using federated queries.

A solution that does not consider the whole cohort protocol is not attractive.

mandy-chessell commented 2 years ago

Thinking about this more, it seems there are two design approaches.

From the description above, the connector tries to reconnect for 30 minutes, and then, if Kafka fails to start in time, the connector does not bring the server down as it is supposed to do. So the behaviour we have is in limbo between the two approaches, which makes the server difficult to control. We should make a choice about which one to go with.

I am currently in favour of the second approach because a server could be connected to multiple event buses, and if one was flaky it would be annoying to have it cause the server to fail. And even if all event buses were down, the server could still be operating correctly via the REST APIs.

However, I would also like @planetf1 's view on this since he is working through the design for the automated operator.

planetf1 commented 2 years ago

@pmadugundu Apologies for delay (vacation - through to Wed). First a response to the original text:

planetf1 commented 2 years ago

@mandy-chessell Now to respond to your excellent description of behaviour

My inclination, therefore, @pmadugundu, at least for now, is to:

planetf1 commented 2 years ago

I took a look at reported status.

{{baseURL}}/open-metadata/platform-services/users/{{user}}/server-platform/servers/{{server}}/status

For cocoMDS2, during a period when Kafka is down but the server is starting, this reports as follows.

This can appear confusing, since the server is only in the process of starting. But we could argue that the fact it is starting up, doing something, justifies a state of 'active'; it is just not ready yet, i.e.:

{
    "class": "ServerStatusResponse",
    "relatedHTTPCode": 200,
    "serverName": "cocoMDS2",
    "serverType": "Metadata Access Store",
    "serverStartTime": "2022-08-07T17:07:31.688+00:00",
    "active": true
}

{{baseURL}}/open-metadata/platform-services/users/{{user}}/server-platform/servers/{{server}}/services

{
    "class": "ServerServicesListResponse",
    "relatedHTTPCode": 200,
    "serverName": "cocoMDS2",
    "serverServicesList": [
        "Open Metadata Repository Services (OMRS)",
        "OMAG Server Operational Services"
    ]
}

{{baseURL}}/open-metadata/platform-services/users/{{user}}/server-platform/servers/active

{
    "class": "ServerListResponse",
    "relatedHTTPCode": 200,
    "serverList": [
        "cocoMDS2"
    ]
}
planetf1 commented 2 years ago

But the key is to use the more fine-grained service, which shows whether the service is starting or actually started -- for example:

{{baseURL}}/open-metadata/admin-services/users/{{user}}/servers/{{server}}/instance/status

{
    "class": "OMAGServerStatusResponse",
    "relatedHTTPCode": 200,
    "serverStatus": {
        "serverName": "cocoMDS2",
        "serverType": "Metadata Access Store",
        "serverActiveStatus": "STARTING",
        "services": [
            {
                "serviceName": "Open Metadata Repository Services (OMRS)",
                "serviceStatus": "STARTING"
            }
        ]
    }
}
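A monitoring client could parse this response and treat the server as ready only when the OMRS service reports RUNNING. A minimal sketch (the field names are taken from the JSON above; the helper function itself is hypothetical):

```python
import json

def omrs_service_status(response_text):
    """Extract (serverActiveStatus, OMRS serviceStatus or None) from an
    OMAGServerStatusResponse payload shaped like the one shown above."""
    body = json.loads(response_text)
    status = body.get("serverStatus", {})
    omrs = next((s["serviceStatus"] for s in status.get("services", [])
                 if s["serviceName"] == "Open Metadata Repository Services (OMRS)"),
                None)
    return status.get("serverActiveStatus"), omrs
```

Fed the STARTING payload above, this returns ("STARTING", "STARTING"), i.e. the server is up but not yet ready.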
planetf1 commented 2 years ago

Here you can see 'serverActiveStatus' says 'STARTING', i.e. it is not ready yet.

You'll see audit log messages during this time:

Sun Aug 07 18:50:34 BST 2022 cocoMDS2 Startup OCF-KAFKA-TOPIC-CONNECTOR-0015 The local server is attempting to connect to Kafka brokers at localhost:9092 [ attempt 3 of 10 ]

You can then wait, and if the timeout expires you will then get:

{
    "class": "OMAGServerStatusResponse",
    "relatedHTTPCode": 404,
    "exceptionClassName": "org.odpi.openmetadata.frameworks.connectors.ffdc.InvalidParameterException",
    "actionDescription": "getActiveServerStatus",
    "exceptionErrorMessage": "OMAG-MULTI-TENANT-404-001 The OMAG Server cocoMDS2 is not available to service a request from user garygeeke",
    "exceptionErrorMessageId": "OMAG-MULTI-TENANT-404-001",
    "exceptionErrorMessageParameters": [
        "cocoMDS2",
        "garygeeke"
    ],
    "exceptionSystemAction": "The system is unable to process the request because the server is not running on the called platform.",
    "exceptionUserAction": "Verify that the correct server is being called on the correct platform and that this server is running. Retry the request when the server is available.",
    "exceptionProperties": {
        "serverName": "cocoMDS2",
        "parameterName": "serverName"
    }
}

Or, if Kafka does come up in the time you will get:

{
    "class": "OMAGServerStatusResponse",
    "relatedHTTPCode": 200,
    "serverStatus": {
        "serverName": "cocoMDS2",
        "serverType": "Metadata Access Store",
        "serverActiveStatus": "RUNNING",
        "services": [
            {
                "serviceName": "Subject Area OMAS",
                "serviceStatus": "RUNNING"
            },
            {
                "serviceName": "Security Officer OMAS",
                "serviceStatus": "RUNNING"
            },
            {
                "serviceName": "Open Metadata Repository Services (OMRS)",
                "serviceStatus": "RUNNING"
            },
            {
                "serviceName": "Data Privacy OMAS",
                "serviceStatus": "RUNNING"
            },
            {
                "serviceName": "Community Profile OMAS",
                "serviceStatus": "RUNNING"
            },
            {
                "serviceName": "Asset Consumer OMAS",
                "serviceStatus": "RUNNING"
            },
            {
                "serviceName": "Asset Lineage OMAS",
                "serviceStatus": "RUNNING"
            },
            {
                "serviceName": "Asset Catalog OMAS",
                "serviceStatus": "RUNNING"
            },
            {
                "serviceName": "IT Infrastructure OMAS",
                "serviceStatus": "RUNNING"
            },
            {
                "serviceName": "Asset Owner OMAS",
                "serviceStatus": "RUNNING"
            },
            {
                "serviceName": "Connected Asset Services",
                "serviceStatus": "STARTING"
            },
            {
                "serviceName": "Digital Architecture OMAS",
                "serviceStatus": "RUNNING"
            },
            {
                "serviceName": "Glossary View OMAS",
                "serviceStatus": "RUNNING"
            },
            {
                "serviceName": "Governance Program OMAS",
                "serviceStatus": "RUNNING"
            },
            {
                "serviceName": "Project Management OMAS",
                "serviceStatus": "RUNNING"
            },
            {
                "serviceName": "Governance Engine OMAS",
                "serviceStatus": "RUNNING"
            }
        ]
    }
}

(in this example, from our labs, we have a variety of OMAS running too)
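Putting the response shapes together, a deployment script could classify the instance/status reply into the three cases shown in this thread. A hypothetical sketch (field names taken from the payloads above; not an Egeria API):

```python
import json

def classify_instance_status(http_code, response_text):
    """Classify an instance/status reply into the three cases shown above:
    still STARTING, RUNNING, or NOT_AVAILABLE (the 404 returned once the
    Kafka retry window has expired and the server is no longer running)."""
    if http_code == 404:
        return "NOT_AVAILABLE"
    body = json.loads(response_text)
    state = body.get("serverStatus", {}).get("serverActiveStatus")
    return "RUNNING" if state == "RUNNING" else "STARTING"
```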

So I think you have the ability to

This seems to be sufficient, even without any changes to how connector status is reported, or any changes to whether Kafka is essential or not (a discussion we can continue).

@pmadugundu does this make sense?

I'd also be inclined to suggest that the DEFAULT for the Kafka topic connector is to NOT do any retries. I feel it's much simpler that way, though I wouldn't propose removing the capability.

planetf1 commented 2 years ago

I think we have addressed the startup case. I've opened #6813 as a follow-on to track the discussion around behaviour once the server is up.