telstra / open-kilda

OpenKilda is an open-source OpenFlow controller initially designed for use in a global network with high control-plane latency and a heavy emphasis on latency-centric data path optimisation.

Regular Flow: Deployment failure: Operation has timed out #5480

Open yuliiamir opened 11 months ago

yuliiamir commented 11 months ago

A regular flow deployment failure has been caught during test execution on Jenkins. It is a flaky issue, and when it occurs it becomes the root cause of some test failures (switch validation, meter checks). Two timeouts occurred:

kilda-NB: Error MessageError{correlation_id=9482853d-fe70-4bd7-8a1e-bcf5c54ca944 : fn-tests-4cf5d794-e4e3-497e-bca5-d3b17518ecd5, timestamp=1700246791401, error-type=Operation has timed out, error-message=No response received, error-description=Timeout exceeded} caught.

kilda-storm: Flow 17Nov184559_878_saffloweroil4690 error - Global timeout reached for create operation on flow "17Nov184559_878_saffloweroil4690"

BUT the flow was created. Request:

curl -X POST \
http://localhost:8080/api/v2/flows \
-H 'Accept: application/json, application/*+json' \
-H 'correlation_id: fn-tests-4cf5d794-e4e3-497e-bca5-d3b17518ecd5' \
-H 'Content-Type: application/json' \
-H 'Content-Length: 797' \
-d '{
  "flow_id" : "17Nov184559_878_saffloweroil4690",
  "source" : {
    "switch_id" : "00:00:00:00:00:01:00:01",
    "port_number" : 10,
    "vlan_id" : 3241,
    "inner_vlan_id" : 0,
    "detect_connected_devices" : {
      "lldp" : false,
      "arp" : false
    }
  },
  "destination" : {
    "switch_id" : "00:00:00:00:00:01:00:02",
    "port_number" : 10,
    "vlan_id" : 637,
    "inner_vlan_id" : 0,
    "detect_connected_devices" : {
      "lldp" : false,
      "arp" : false
    }
  },
  "maximum_bandwidth" : 500,
  "ignore_bandwidth" : false,
  "strict_bandwidth" : false,
  "periodic_pings" : false,
  "description" : "autotest flow: It seems she hangs upon the cheek of night like a rich jewel in an Ethiope's ear.",
  "max_latency" : null,
  "max_latency_tier2" : null,
  "priority" : null,
  "diverse_flow_id" : null,
  "affinity_flow_id" : null,
  "pinned" : false,
  "allocate_protected_path" : false,
  "encapsulation_type" : null,
  "path_computation_strategy" : null,
  "statistics" : null
}' 
==================== request end ====================
2023-11-17T18:46:31,464 DEBUG LoggingRequestInterceptor:57 - 
==================== response begin ====================
Status code  : 408 REQUEST_TIMEOUT
Status text  : 
Headers      : [correlation_id:"9482853d-fe70-4bd7-8a1e-bcf5c54ca944 : fn-tests-4cf5d794-e4e3-497e-bca5-d3b17518ecd5", X-Content-Type-Options:"nosniff", X-XSS-Protection:"1; mode=block", Cache-Control:"no-cache, no-store, max-age=0, must-revalidate", Pragma:"no-cache", Expires:"0", X-Frame-Options:"DENY", Content-Type:"application/json", Transfer-Encoding:"chunked", Date:"Fri, 17 Nov 2023 18:46:31 GMT", Connection:"close"]
Response body: {
  "correlation_id" : "9482853d-fe70-4bd7-8a1e-bcf5c54ca944 : fn-tests-4cf5d794-e4e3-497e-bca5-d3b17518ecd5",
  "timestamp" : 1700246791401,
  "error-type" : "Operation has timed out",
  "error-message" : "No response received",
  "error-description" : "Timeout exceeded"
}
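
Note that despite the 408 response the flow ends up created, so a follow-up read of the flow is a quick way to confirm the actual state. A minimal sketch using java.net.http.HttpClient; the GET /api/v2/flows/{flow_id} path and the correlation_id value are assumptions for illustration, not taken from the logs above:

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class FlowStatusCheck {
    public static void main(String[] args) throws Exception {
        // Hypothetical follow-up check: read the flow back after a 408 from the create call.
        // The /api/v2/flows/{flow_id} path is assumed here; adjust to the deployed northbound API.
        String flowId = "17Nov184559_878_saffloweroil4690";
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://localhost:8080/api/v2/flows/" + flowId))
                .header("Accept", "application/json")
                .header("correlation_id", "fn-tests-manual-recheck") // any unique id for tracing
                .GET()
                .build();

        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());

        // 200 with a flow payload confirms the "BUT the flow was created" observation;
        // 404 would mean the create really failed.
        System.out.println(response.statusCode());
        System.out.println(response.body());
    }
}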

Attached: FlowDeploymentFailureTimoutIssue

pkazlenka commented 8 months ago

It seems there's another 'face' of this problem. Sometimes the flow gets stuck in the 'Down' state instead of 'In Progress'. The logs are attached. It looks like FLOW_VALIDATION_SPEAKER_WORKER ignores the message which confirms successful installation of the non-ingress rules:

FLOW_VALIDATION_SPEAKER_WORKER:32 drop worker async response. because 33ecfdb2-cbe6-11ee-8ccc-159f8849f32d : 48c85531-d8a2-46d6-97a9-cb5bd609cb66 : fn-tests-feef6158-a013-4581-b864-45ed811f4273 key is not listed in pending response list [key: 33ecfdb2-cbe6-11ee-8ccc-159f8849f32d : 48c85531-d8a2-46d6-97a9-cb5bd609cb66 : fn-tests-feef6158-a013-4581-b864-45ed811f4273, message: SpeakerFlowSegmentResponse(super=SpeakerResponse(super=AbstractMessage(messageContext=MessageContext(correlationId=33b7e3f0-cbe6-11ee-8ccc-4ba1d5498525 : 48c85531-d8a2-46d6-97a9-cb5bd609cb66 : fn-tests-feef6158-a013-4581-b864-45ed811f4273, createTime=1707990006567), messageCookie=null), commandId=33ecfdb2-cbe6-11ee-8ccc-159f8849f32d, switchId=00:00:00:00:00:00:00:02), metadata=FlowSegmentMetadata(flowId=15Feb093937_966_saffron8334, cookie=0x2000000000011E55, multiTable=true), success=true, requestCreateTime=1707990006567, executionTime=2000000), context: CommandContext(correlationId=33b7e3f0-cbe6-11ee-8ccc-4ba1d5498525 : 48c85531-d8a2-46d6-97a9-cb5bd609cb66 : fn-tests-feef6158-a013-4581-b864-45ed811f4273, createTime=1707990008657, kafkaTopic=kilda.speaker.flowhs.priv, kafkaPartition=1, kafkaOffset=1)]

Hopefully, the logs will help to investigate why this happens. Attached: logs.json, flow history.json
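
For context, the "drop worker async response" message above is a symptom of the usual pending-response bookkeeping: each outstanding speaker request is tracked by key, and once the key has been removed (for example after the global timeout fires), a late success response is simply discarded. A rough sketch of that pattern, with hypothetical class and method names rather than the actual FLOW_VALIDATION_SPEAKER_WORKER code:

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical illustration of the pending-response bookkeeping described above,
// not the real OpenKilda worker implementation.
class SpeakerWorkerSketch {
    private final Map<String, Runnable> pendingResponses = new ConcurrentHashMap<>();

    void sendRequest(String key, Runnable onSuccess) {
        // Remember the key so the async speaker response can be matched later.
        pendingResponses.put(key, onSuccess);
    }

    void onGlobalTimeout(String key) {
        // The operation is declared failed and the key is forgotten.
        pendingResponses.remove(key);
    }

    void onSpeakerResponse(String key) {
        Runnable callback = pendingResponses.remove(key);
        if (callback == null) {
            // This is the situation in the log above: the speaker answered with success,
            // but the key is no longer in the pending list, so the response is dropped
            // and the flow stays Down / In progress.
            System.out.println("drop worker async response, key not pending: " + key);
            return;
        }
        callback.run();
    }
}

Under this model, the behaviour in the attached logs (flow stuck in 'Down' even though the speaker reported success=true) follows directly from the key being removed before the response arrived.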

yuliiamir commented 8 months ago

Logs for another flow with the timeout error were checked, and it was noticed that the 'key is not listed in pending response list' log lines appeared after the 'Global timeout' from Floodlight. Attached: TimeoutDuringFlowCreation, ResponseFromFlWasSentAfterGlobalTimeout, KeyIsNotListenedInPendingResponseListAfterGlobalTimeout

elasticsearch_dump (1).json.gz

IvanChupin commented 3 months ago

Initial study of the problem shows that the global timeout of 1 minute is reached because of overly long queries to OrientDB.

On the following screenshot

Screenshot 2024-08-08 at 12 21 03

the time between the two messages where we are about to execute a query to OrientDB is approximately 10 seconds.

Moving further along the flow creation algorithm, we have the following screenshot

Screenshot 2024-08-08 at 12 21 46

The time between the two messages where we are about to execute another query to OrientDB is approximately 20 seconds. The screenshot also shows that there is a timeout on the northbound side at 17:22:31.

And after 1 minute, the global timeout occurred:

image
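
One way to confirm this locally is to time the persistence calls and log anything slower than a threshold, so the 10-20 second OrientDB queries stand out against the 1 minute global timeout. A minimal sketch; the helper, the threshold, and the wrapped repository call are assumptions, not OpenKilda code:

import java.time.Duration;
import java.time.Instant;
import java.util.function.Supplier;

// Hypothetical helper for timing persistence (OrientDB) queries to spot the
// calls that eat into the 1 minute global timeout.
final class QueryTimer {
    private static final Duration SLOW_THRESHOLD = Duration.ofSeconds(5); // assumed threshold

    static <T> T timed(String queryName, Supplier<T> query) {
        Instant start = Instant.now();
        try {
            return query.get();
        } finally {
            Duration elapsed = Duration.between(start, Instant.now());
            if (elapsed.compareTo(SLOW_THRESHOLD) > 0) {
                System.out.printf("SLOW QUERY %s took %d ms%n", queryName, elapsed.toMillis());
            }
        }
    }

    // Example usage (repository and method are hypothetical):
    // FlowPath path = QueryTimer.timed("findPath", () -> repository.findPath(flowId));
}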