signalfx / signalfx-python

Python client library and instrumentation bindings for SignalFx (Deprecated)
https://www.signalfx.com
Apache License 2.0
43 stars 47 forks source link

make sure we only yield when the batch is complete #136

Closed dloucasfx closed 1 year ago

dloucasfx commented 1 year ago

The logic here expects that batched data events (ie: data messages with same timestamp) come right after another, otherwise the logic will break. The issue occurs when the server sends events out of order; for example, the below series of events, which randomly happens, is messing up the logic.

Dispatching {
  "data" : [ {
    "tsId" : "AAAAAETu1XY",
    "value" : 64
  } ],
  "logicalTimestampMs" : 1672538700000,
  "maxDelayMs" : 9000
}...
Dispatching {
  "logicalTimestampMs" : 1672538400000,
  "message" : {
    "blockContexts" : [ {
      "column" : 8,
      "line" : 3
    } ],
    "blockSerialNumbers" : [ 3 ],
    "contents" : {
      "aperiodicity" : true,
      "nativeness" : false
    },
    "messageCode" : "FETCH_NATIVE_RESOLUTION",
    "messageLevel" : "INFO",
    "numInputTimeSeries" : 2,
    "timestampMs" : 1672530900000
  }
}...
Dispatching {
  "logicalTimestampMs" : 1672538400000,
  "message" : {
    "blockContexts" : [ {
      "column" : 7,
      "line" : 2
    } ],
    "blockSerialNumbers" : [ 0 ],
    "contents" : {
      "aperiodicity" : false,
      "nativeness" : false
    },
    "messageCode" : "FETCH_NATIVE_RESOLUTION",
    "messageLevel" : "INFO",
    "numInputTimeSeries" : 3,
    "timestampMs" : 1672530900000
  }
}...
Dispatching {
  "data" : [ {
    "tsId" : "AAAAAJ61Dec",
    "value" : 64
  } ],
  "logicalTimestampMs" : 1672538700000,
  "maxDelayMs" : 9000
}...

If I understand this correctly, the yield in the info event here should “and meant to” only occur once, after the first data event batch is built.

However, when the logic process the events [data1-Time1]->[info]->[info]->[data2-Time2] , the yield in info is being triggered on a single data message [data1-Time1] causing _current_batch_count to get reset from 1 to 0 in _get_batch_to_yield . Next, when [data2-Time2] gets processed, the _current_batch_count is incremented to 1 but because there is no more events with the same timestamps, _current_batch_count will always be 1 and we never meet this condition as _expected_batches is determined to be 2.

The fix here adds an extra check in the info to make sure we only yield when the batch is complete

cc: @mahdibh @mpetazzoni

UPDATE: This solution assumes we always get data for the full batch; I think it will fail if we don't as the added condition will not be met and we'll never yield. I am not sure if this a valid scenario and how to address it. I was hoping that the info message will have a hint in its message code to signal that the data event is missing at a particular timestamp, but I don't see any here

Signed-off-by: Dani Louca dlouca@splunk.com

dloucasfx commented 1 year ago

@rmfitzpatrick was reviewing this PR when we checked other lib, Java and Ruby follows the same logic https://github.com/signalfx/signalfx-java/blob/main/signalfx-java/src/main/java/com/signalfx/signalflow/Computation.java#L272 https://github.com/signalfx/signalfx-ruby/blob/master/lib/signalfx/signalflow/computation.rb#L149

but not GO I wonder if this is a server side issue

dloucasfx commented 1 year ago

Issue is caused by server side change, closing PR