yuliiamir opened 11 months ago
It seems there is another 'face' of this problem. Sometimes the flow gets stuck in the 'Down' state rather than the 'In Progress' one. The logs are attached. It looks like FLOW_VALIDATION_SPEAKER_WORKER ignores the message that confirms successful installation of the non-ingress rules:
FLOW_VALIDATION_SPEAKER_WORKER:32 drop worker async response. because 33ecfdb2-cbe6-11ee-8ccc-159f8849f32d : 48c85531-d8a2-46d6-97a9-cb5bd609cb66 : fn-tests-feef6158-a013-4581-b864-45ed811f4273 key is not listed in pending response list [key: 33ecfdb2-cbe6-11ee-8ccc-159f8849f32d : 48c85531-d8a2-46d6-97a9-cb5bd609cb66 : fn-tests-feef6158-a013-4581-b864-45ed811f4273, message: SpeakerFlowSegmentResponse(super=SpeakerResponse(super=AbstractMessage(messageContext=MessageContext(correlationId=33b7e3f0-cbe6-11ee-8ccc-4ba1d5498525 : 48c85531-d8a2-46d6-97a9-cb5bd609cb66 : fn-tests-feef6158-a013-4581-b864-45ed811f4273, createTime=1707990006567), messageCookie=null), commandId=33ecfdb2-cbe6-11ee-8ccc-159f8849f32d, switchId=00:00:00:00:00:00:00:02), metadata=FlowSegmentMetadata(flowId=15Feb093937_966_saffron8334, cookie=0x2000000000011E55, multiTable=true), success=true, requestCreateTime=1707990006567, executionTime=2000000), context: CommandContext(correlationId=33b7e3f0-cbe6-11ee-8ccc-4ba1d5498525 : 48c85531-d8a2-46d6-97a9-cb5bd609cb66 : fn-tests-feef6158-a013-4581-b864-45ed811f4273, createTime=1707990008657, kafkaTopic=kilda.speaker.flowhs.priv, kafkaPartition=1, kafkaOffset=1)]
Hopefully, the logs will help to investigate why this happens. Attachments: logs.json, flow history.json
Logs for another flow with a Timeout error are attached.
The logs were checked, and it was noticed that the 'key is not listed in pending response list' entries appeared after the `Global timeout` from Floodlight.
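That ordering suggests a race between the timeout handler and the late speaker response. Below is a minimal sketch of that race, using hypothetical class and method names (SpeakerWorkerSketch, PendingRequest are not actual open-kilda classes): once the Global timeout evicts the key from the pending-response map, the successful response arriving afterwards can no longer be matched and is dropped, so the flow stays in Down.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch of the race that produces the
// "key is not listed in pending response list" log line.
public class SpeakerWorkerSketch {
    private final Map<String, PendingRequest> pendingResponses = new ConcurrentHashMap<>();

    /** Called when the worker sends a command to Floodlight. */
    void registerRequest(String key, PendingRequest request) {
        pendingResponses.put(key, request);
    }

    /** Called when the Global timeout fires before Floodlight answers. */
    void onGlobalTimeout(String key) {
        // The key is evicted here, so any later (successful) speaker response
        // can no longer be matched and the flow ends up in Down.
        PendingRequest evicted = pendingResponses.remove(key);
        if (evicted != null) {
            evicted.failWithTimeout();
        }
    }

    /** Called when an async response eventually arrives from Floodlight. */
    void onSpeakerResponse(String key, Object response) {
        PendingRequest pending = pendingResponses.remove(key);
        if (pending == null) {
            // Mirrors the observed log entry: the late success is dropped.
            System.out.printf("drop worker async response, because %s key is not "
                    + "listed in pending response list%n", key);
            return;
        }
        pending.complete(response);
    }

    interface PendingRequest {
        void complete(Object response);
        void failWithTimeout();
    }
}
```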
Initial study of the problem shows that the Global timeout of 1 minute is reached because of overly long queries to OrientDB.
On the following screenshot, the time between two messages where we are about to execute a query to OrientDB is approximately 10 seconds.
Moving further along the flow creation algorithm, we have the following screenshot: the time between two messages where we are about to execute another query to OrientDB is approximately 20 seconds. The screenshot also shows a timeout on the northbound side at 17:22:31.
And after 1 minute, the Global timeout occurred.
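The two measured gaps of roughly 10 s and 20 s alone consume half of the 1-minute budget. The sketch below (a hypothetical helper, not existing open-kilda code) shows one way to make this visible: wrap each persistence query with timing and report how much of the global budget has already been spent.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.function.Supplier;

// Hypothetical helper for diagnosing which queries eat the Global timeout budget.
public class TimeoutBudgetSketch {
    private static final Duration GLOBAL_TIMEOUT = Duration.ofMinutes(1);

    private final Instant operationStart = Instant.now();

    /** Runs a query, logs its duration, and warns if the overall budget is exceeded. */
    <T> T timedQuery(String description, Supplier<T> query) {
        Instant before = Instant.now();
        T result = query.get();
        Duration queryTime = Duration.between(before, Instant.now());
        Duration elapsed = Duration.between(operationStart, Instant.now());

        System.out.printf("query '%s' took %d ms, %d ms of the global budget used%n",
                description, queryTime.toMillis(), elapsed.toMillis());

        if (elapsed.compareTo(GLOBAL_TIMEOUT) > 0) {
            System.out.printf("global timeout of %d s already exceeded%n",
                    GLOBAL_TIMEOUT.toSeconds());
        }
        return result;
    }
}
```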
A regular flow deployment failure has been caught during test execution on Jenkins. It is a flaky issue, and when it occurs it becomes a root cause of some test failures (switch validation, meter checks). Two timeouts occurred:
kilda-NB: Error MessageError{correlation_id=9482853d-fe70-4bd7-8a1e-bcf5c54ca944 : fn-tests-4cf5d794-e4e3-497e-bca5-d3b17518ecd5, timestamp=1700246791401, error-type=Operation has timed out, error-message=No response received, error-description=Timeout exceeded} caught.
kilda-storm: Flow 17Nov184559_878_saffloweroil4690 error - Global timeout reached for create operation on flow "17Nov184559_878_saffloweroil4690"
BUT the flow was created. Request: