jina-ai / serve

☁️ Build multimodal AI applications with cloud-native stack
https://jina.ai/serve
Apache License 2.0

Pods can be closed, as in a rolling update, at any point without deadlocks #2255

Closed cristianmtr closed 3 years ago

cristianmtr commented 3 years ago

A Pod/Pea can be terminated at any point in time without causing deadlocks.

See https://github.com/jina-ai/jina/issues/2235

Handle processing a data request when a termination message arrives. The Gateway will wait until it receives the message, even in Async Flows. Introduce a timeout (needs research; it might not belong at the gateway level).

The Gateway should have a timeout on messages and simply return (see the sketch below).

Terminate a Pod/Pea and test that the Gateway can still recover.

Test for Async Flows too
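
A minimal sketch of the timeout idea above, assuming the gateway keeps one asyncio future per in-flight request (the `pending` dict, `request_id` and `wait_for_response` names are hypothetical, not Jina's actual gateway internals): if the Pea that should answer has been terminated, nothing ever resolves the future, and `asyncio.wait_for` turns the would-be deadlock into an error the gateway can return.

import asyncio

# hypothetical gateway-side bookkeeping: one future per in-flight request,
# resolved when the matching response message comes back from the Pods
pending = {}

async def wait_for_response(request_id, timeout=5.0):
    fut = asyncio.get_running_loop().create_future()
    pending[request_id] = fut
    try:
        # if the replica serving this request was terminated mid-flight,
        # nothing resolves `fut`; the timeout prevents waiting forever
        return await asyncio.wait_for(fut, timeout=timeout)
    except asyncio.TimeoutError:
        # "just return": surface an error instead of blocking the client
        raise RuntimeError(f'request {request_id} timed out, Pod may have been closed')
    finally:
        pending.pop(request_id, None)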

FionnD commented 3 years ago

Why is this a separate issue and not part of #2235?

cristianmtr commented 3 years ago

> Why is this a separate issue and not part of #2235?

So I can close it without closing the main issue. It is part of it, as a subtask. But it's also a general issue in the gateway that we only uncovered now, and it might be behind some other tests hanging.

JoanFM commented 3 years ago

This should be considered once dynamic starting of Pods and Peas is there, but right now it should not be a problem. TERMINATION just happens when the Flow context manager is finished, right?
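
For reference, a minimal sketch of the lifecycle described here (the Pod configuration is illustrative): Pods/Peas are started when the Flow context is entered, and TERMINATION only happens when the `with` block exits, so in the current model a Pod is not closed while the Flow is still serving requests.

from jina import Flow

f = Flow().add(name='INDEXER', parallel=2)

with f:
    # Pods/Peas are started on entering the context and serve requests here
    pass
# leaving the `with` block is what triggers TERMINATION of all Pods/Peas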

cristianmtr commented 3 years ago

> This should be considered once dynamic starting of Pods and Peas is there, but right now it should not be a problem. TERMINATION just happens when the Flow context manager is finished, right?

Yes, but it should be its own PR, and something that won't be merged into that one, to make it easier to split the work and work in parallel.

FionnD commented 3 years ago

We talked about this in the morning team lead meeting. If my notes are correct, this is needed for production-ready Jina but won't be worked on until after the MVP is implemented, right? (Due to capacity)

CC @cristianmtr @maximilianwerk

cristianmtr commented 3 years ago

Yes. Also, I'm not sure we've managed to reproduce it yet, even with the 'CANCEL' request PR. @florian-hoenicke?

JoanFM commented 3 years ago

Revisiting this issue.

What was the observed problem? It would be good to have a detailed explanation of the issue.

cristianmtr commented 3 years ago

I don't remember exactly now. It was related to the flow around the CANCEL message. If the Replica Router still has a pending request, and that request is terminated (when the inner Peas are terminated because the replica is being taken offline), the gateway will wait forever for the request to be finished, but it never will be. I haven't been able to reproduce it.

@florian-hoenicke did you?
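
A toy asyncio model of the hang described above, under the assumption that the gateway simply awaits a reply queue (none of these names are Jina code): the "replica" task is cancelled while it still holds a request, so the reply never arrives, and only a timeout keeps the await from blocking forever.

import asyncio

async def replica(inbox, outbox):
    # stand-in for the replica's Peas: take a request, work, then reply
    while True:
        req = await inbox.get()
        await asyncio.sleep(1.0)          # pretend to search/index
        await outbox.put(f'reply to {req}')

async def main():
    inbox, outbox = asyncio.Queue(), asyncio.Queue()
    worker = asyncio.create_task(replica(inbox, outbox))

    await inbox.put('request-1')
    await asyncio.sleep(0.1)
    worker.cancel()                        # replica taken offline mid-request

    try:
        # without a timeout this await never returns, matching the observed
        # behaviour; with one, the gateway could report an error instead
        print(await asyncio.wait_for(outbox.get(), timeout=1.0))
    except asyncio.TimeoutError:
        print('no reply: the replica was terminated while the request was pending')

asyncio.run(main())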

JoanFM commented 3 years ago

There is a cyclic reference between this issue and #2235

JoanFM commented 3 years ago

Let me share what I found by trying to reproduce the test:

import os
import threading

from jina import Flow

# `config` is the pytest fixture and `get_doc` the document helper defined
# elsewhere in this test module; `cur_dir` points at the test directory
cur_dir = os.path.dirname(os.path.abspath(__file__))


def test_vector_indexer_thread(config):
    with Flow().add(
            name='INDEXER',
            uses=os.path.join(cur_dir, 'yaml/mock_index_vector.yml'),
            replicas=2,
            parallel=2,
    ) as flow:
        # warm the Flow up with a few searches
        for i in range(5):
            flow.search(get_doc(i))

        # run the rolling update of the INDEXER Pod in a separate thread
        x = threading.Thread(target=flow.rolling_update, args=('INDEXER',))
        # TODO there is a problem with the gateway even after request times out - open issue
        # TODO remove the join to make it asynchronous again
        for i in range(50):
            print(f' search {i}')
            flow.search(get_doc(i))
            if i == 3:
                # start the rolling update while searches are still being sent
                x.start()

        print(f' search finished? ')
        x.join()

First of all, one thing surprises me: it seems that the HeadPea inside every replica is not properly running as a router. Inside every replica, both shards seem to receive the data, so the ROUTER-DEALER pattern is not working. I'm not sure if this is expected.

Then to the problem: the test blocks because a SearchRequest arrives at REPLICA0/head before it is closed, while the first shard of the replica has already been closed. And since there is no proper ROUTER/DEALER pattern, the tail of the replica keeps waiting for the second part of the message.

Potential problems seen:

I would prefer to first have a diagnosis of the first problem.
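
To make the first observation above concrete, here is a minimal pyzmq sketch (not Jina's actual head/shard wiring) of what a working distribution pattern should do: each message goes to exactly one of the connected shards, rather than both shards receiving every message.

import time
import zmq

ctx = zmq.Context()

head = ctx.socket(zmq.DEALER)                 # stands in for the replica head
head.bind('tcp://127.0.0.1:5555')

shards = [ctx.socket(zmq.DEALER) for _ in range(2)]   # as with `parallel=2`
for s in shards:
    s.connect('tcp://127.0.0.1:5555')
time.sleep(0.2)                               # let both shards connect first

for i in range(4):
    head.send_string(f'msg-{i}')              # round-robined over the shards

time.sleep(0.2)
for n, s in enumerate(shards):
    while s.poll(timeout=0):
        print(f'shard {n} got', s.recv_string())
# expected: each shard prints two of the four messages, i.e. the traffic is
# split across shards, not duplicated to both
ctx.destroy()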

FionnD commented 3 years ago

@JoanFM closing this as mentioned in the meeting.