Closed: cristianmtr closed this issue 3 years ago
Why is this a separate issue and not part of #2235?
So I can close it and not close the main issue. It is part of it, as a subtask. But it's also a general issue in the gateway that we only uncovered now, and it might be behind some other hanging tests.
This should be considered when dynamic start of Pods and Peas is there, but right now it should not be a problem. TERMINATION just happens when the Flow context manager is finished, right?
Yes, but it should be its own PR, and something that won't be merged into that PR, to make it easier to split work and work in parallel.
We talked about this in the morning team lead meeting. If my notes are correct, this is needed for production-ready Jina but won't be worked on until after the MVP is implemented, right? (Due to capacity)
CC @cristianmtr @maximilianwerk
Yes. Also, not sure if we've managed to reproduce it yet, even with the 'CANCEL' request PR. @florian-hoenicke?
Revisiting this issue.
What was the observed problem? It would be good to have a detailed explanation of the issue.
I don't remember exactly now. It was related to the flow around the CANCEL message. If the Replica Router still has a pending request, and that request is terminated (because the inner Peas are terminated when the replica is taken offline), the gateway will wait forever for the request to be finished, but it never will be. I haven't been able to reproduce it.
@florian-hoenicke did you?
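To make the hang concrete, here is a minimal asyncio sketch of the failure mode (hypothetical names such as wait_for_response; this is not the actual gateway code): the gateway keeps a Future per in-flight request and awaits it until a response arrives. If the Peas serving the request are terminated while it is pending, nothing ever resolves the Future and the await blocks forever; a timeout such as asyncio.wait_for would let the gateway give up and report an error instead.

import asyncio


async def wait_for_response(pending: asyncio.Future, timeout=None):
    # with timeout=None this reproduces the current behaviour: block forever if the response is lost
    if timeout is None:
        return await pending
    try:
        return await asyncio.wait_for(pending, timeout=timeout)
    except asyncio.TimeoutError:
        raise RuntimeError('request lost: the serving Pea was probably terminated')


async def main():
    pending = asyncio.get_running_loop().create_future()
    # nobody ever sets a result on `pending`, simulating a request whose replica was terminated
    await wait_for_response(pending, timeout=1.0)


# asyncio.run(main())  # raises RuntimeError after ~1 s instead of hanging forever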
There is a cyclic reference between this issue and #2235
Let me share what I found by trying to reproduce the test:
import os
import threading

from jina import Flow

# `cur_dir`, `get_doc` and the `config` fixture are defined elsewhere in the test module


def test_vector_indexer_thread(config):
    with Flow().add(
        name='INDEXER',
        uses=os.path.join(cur_dir, 'yaml/mock_index_vector.yml'),
        replicas=2,
        parallel=2,
    ) as flow:
        for i in range(5):
            flow.search(get_doc(i))

        x = threading.Thread(target=flow.rolling_update, args=('INDEXER',))
        # TODO there is a problem with the gateway even after request times out - open issue
        # TODO remove the join to make it asynchronous again
        for i in range(50):
            print(f' search {i}')
            flow.search(get_doc(i))
            if i == 3:
                x.start()
        print(f' search finished? ')
        x.join()
First of all, one thing surprises me: it seems that the HeadPea inside every replica is not properly running as a router. Inside every Replica, it seems that both shards receive the data and the ROUTER-DEALER pattern is not working. Not sure if this is expected.
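For comparison, here is a minimal pyzmq sketch of how a ROUTER-DEALER head is expected to behave (illustration only, with made-up socket names and an inproc address, not jina's implementation): every DEALER shard announces itself, and the ROUTER addresses each request to exactly one shard identity, so the two shards should never both receive the same message.

import threading

import zmq

ctx = zmq.Context.instance()


def shard(identity: bytes):
    sock = ctx.socket(zmq.DEALER)
    sock.setsockopt(zmq.IDENTITY, identity)
    sock.connect('inproc://head')
    sock.send(b'READY')    # announce ourselves to the router
    request = sock.recv()  # only requests addressed to our identity arrive here
    print(identity, 'received', request)
    sock.close()


head = ctx.socket(zmq.ROUTER)
head.bind('inproc://head')  # bind before the DEALERs connect (inproc requirement)

threads = [threading.Thread(target=shard, args=(i,)) for i in (b'shard-0', b'shard-1')]
for t in threads:
    t.start()

# learn which shard identities are connected (each READY carries the sender identity)
workers = [head.recv_multipart()[0] for _ in range(2)]

# prefixing the identity frame routes each request to exactly one shard, no broadcast
for i, ident in enumerate(workers):
    head.send_multipart([ident, f'request-{i}'.encode()])

for t in threads:
    t.join()
head.close()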
Then to the problem: the test blocks because a SearchRequest arrives at the REPLICA0/head before it is closed, while the first shard of the replica has already been closed. And since there is no proper ROUTER-DEALER pattern, the tail of the replica keeps waiting for the second part of the message.
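A hedged sketch of that blocking behaviour (assumed reduce logic, socket types and names; not the actual tail Pea code): the tail expects one message part per shard before it can forward the result, so if one shard is already gone a plain blocking recv() never returns, whereas polling with a timeout at least detects the situation.

import zmq

NUM_SHARDS = 2

ctx = zmq.Context.instance()
tail = ctx.socket(zmq.PULL)
tail.bind('inproc://tail')  # in this sketch nothing ever sends, like a shard that was already closed

poller = zmq.Poller()
poller.register(tail, zmq.POLLIN)

parts = []
while len(parts) < NUM_SHARDS:
    # a blocking tail.recv() here would hang forever once a shard has been closed
    events = dict(poller.poll(timeout=2000))  # milliseconds
    if tail in events:
        parts.append(tail.recv())
    else:
        raise TimeoutError(
            f'only {len(parts)}/{NUM_SHARDS} message parts arrived; '
            'a shard was probably terminated mid-request'
        )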
Potential problems seen:
- The ROUTER-DEALER behaviour between the shards inside a replica. (These configurations should be heavily tested in #2347)
- The closing order of the Peas inside a regular Pod.
I would preferably first have a diagnosis of the 1st problem.
@JoanFM closing this as mentioned in meeting.
A Pod/Pea can be terminated at any point in time without causing deadlocks.
See https://github.com/jina-ai/jina/issues/2235
Handle processing a data request when a termination message arrives: the gateway will wait until it receives the message, even on Async. Introduce a timeout. Needs research. Might not be at the gateway level.
Gateway should have a timeout on messages and just return.
Terminate a Pod/Pea and test that the gateway can still recover.
Test for Async Flows too.