Open mathnogueira opened 2 years ago
On the polling solution, should you be able to specify the polling interval in addition to the max number of polls?
On the solution where you wait for a particular span, should we add a 'maximum time to wait'?
Today we have that in tracetest's global configuration:
pollingConfig:
  timeout: 10m
  retryDelay: 5s
Maybe it makes sense to have that as the default configuration and allow tests to override it.
@schoren can you add your proposal here?
I thought we could keep our current global timeout configuration and, on top of that, add a per-test override. A YAML file with this new setting could look like this:
type: Test
spec:
  id: 1234
  name: The Test name
  # this is the new part
  completeCondition:
    selector: span[name="Persist Cart Shipping"]
    value: attr:tracetest.selected_spans.count = 1
    timeout: 1h
  trigger:
    # ...
completeCondition defines an override to the Tracetest-configured trace poll timeout. It has 2 options:
selector/value: Keep polling until this selector/value is ok. Example: You're testing a purchase process that needs to have a trace up to the shipping approval. Instead of setting a huge overall timeout, make this one particular test keep polling until the selector/value condition is met.
timeout: This is just a time-based override. Example: most of your tests take 2 minutes to finish, except for a few long-running processes. Instead of giving all tests a long timeout, give them a short one and override it for the tests that take longer.
Both conditions can be set together: if your selector/value condition is never met, the custom timeout still stops the polling. Without it there would be no timeout at all, since completeCondition overrides the general timeout setting.
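To spell out how the two options interact with the global timeout, here is a minimal Python sketch of my reading of the proposal. It is not Tracetest code: CompleteCondition and effective_timeout are hypothetical names, and timeouts are plain seconds for simplicity.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CompleteCondition:
    # All fields are optional, mirroring the proposed YAML keys.
    selector: Optional[str] = None
    value: Optional[str] = None
    timeout_seconds: Optional[float] = None


def effective_timeout(global_timeout_seconds: float,
                      condition: Optional[CompleteCondition]) -> Optional[float]:
    """Return the timeout that applies to a test, or None for "no upper bound".

    Assumption: a test without completeCondition uses the global timeout,
    while a test with completeCondition uses only its own timeout (if any).
    """
    if condition is None:
        return global_timeout_seconds
    return condition.timeout_seconds


selector_only = CompleteCondition(
    selector='span[name="Persist Cart Shipping"]',
    value="attr:tracetest.selected_spans.count = 1",
)
both = CompleteCondition(selector=selector_only.selector,
                         value=selector_only.value,
                         timeout_seconds=3600)

print(effective_timeout(600, None))           # global timeout applies
print(effective_timeout(600, selector_only))  # None: polls until the condition is met
print(effective_timeout(600, both))           # the per-test timeout applies
```

The selector-only case is the one to watch: with no per-test timeout, a condition that is never met would keep the test polling indefinitely.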
Does that make sense?
Couple of comments:
It would be good if the config info globally had the same (or similar) structure to the per test override. Labeling the strategy / strategy type makes sense - we have identified a 'sameSpanCount' and 'assert' - there may be others in the future.
I could see a case for defining some common polling scenarios at a global level... but this might be (is?) overkill.
I might tend to have this at a global level:
polling:
  default: fastTest
  options:
    - longRunningTest:
        name: Long running test
        strategy: sameSpanCount
        executions: 3 # if the number of spans is the same in 3 subsequent executions, the trace is marked as complete
        retryDelay: 60s
        timeout: 120m
    - mediumTest:
        name: Medium length test (1 to 10 minutes)
        strategy: sameSpanCount
        executions: 3 # if the number of spans is the same in 3 subsequent executions, the trace is marked as complete
        retryDelay: 15s
        timeout: 10m
    - fastTest:
        name: Fast test (less than a minute)
        default: true
        strategy: sameSpanCount
        executions: 3 # if the number of spans is the same in 3 subsequent executions, the trace is marked as complete
        retryDelay: 2s
        timeout: 1m
At the test level, I would have the exact same structure:
polling:
  default: fastTest
or, if they have set a custom one:
polling:
  default: custom
  options:
    - custom:
        name: check for db span at end
        strategy: assert
        selector: span[name = "insert pokemon into database"]
        assertion: tracetest.selected_spans.count = 1
        retryDelay: 10s
        timeout: 10m
We also need to consider how this would appear in the UI, @olha23, once we reach consensus on what to implement (the above may be too much).
I like your suggestion, @kdhamric. Maybe we could have polling profiles:
polling:
  defaultProfile: fastTest
  profiles:
    - longRunningTest:
        name: Long running test
        strategy: sameSpanCount
        executions: 3 # if the number of spans is the same in 3 subsequent executions, the trace is marked as complete
        retryDelay: 60s
        timeout: 120m
    - mediumTest:
        name: Medium length test (1 to 10 minutes)
        strategy: sameSpanCount
        executions: 3 # if the number of spans is the same in 3 subsequent executions, the trace is marked as complete
        retryDelay: 15s
        timeout: 10m
    - fastTest:
        name: Fast test (less than a minute)
        default: true
        strategy: sameSpanCount
        executions: 3 # if the number of spans is the same in 3 subsequent executions, the trace is marked as complete
        retryDelay: 2s
        timeout: 1m
polling:
  profile: fastTest
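As a rough illustration of how a test-level profile reference could be resolved against the global profiles, here is a small Python sketch. Everything in it is an assumption for discussion: resolve_profile is a hypothetical helper, the profiles are kept in a dict keyed by name (a simplification of the YAML list), and durations stay as unparsed strings.

```python
from typing import Any, Dict, Optional

# Global configuration, loosely mirroring the `polling` YAML above.
GLOBAL_POLLING: Dict[str, Any] = {
    "defaultProfile": "fastTest",
    "profiles": {
        "longRunningTest": {"strategy": "sameSpanCount", "executions": 3,
                            "retryDelay": "60s", "timeout": "120m"},
        "fastTest": {"strategy": "sameSpanCount", "executions": 3,
                     "retryDelay": "2s", "timeout": "1m"},
    },
}


def resolve_profile(global_polling: Dict[str, Any],
                    test_polling: Optional[Dict[str, Any]] = None) -> Dict[str, Any]:
    """Pick the polling profile a test should use.

    Uses the test's `profile` when it is set, otherwise the global
    `defaultProfile`. An unknown profile name raises KeyError instead of
    silently falling back to the default.
    """
    name = (test_polling or {}).get("profile") or global_polling["defaultProfile"]
    return global_polling["profiles"][name]


# A test without a `polling` section gets the default profile.
print(resolve_profile(GLOBAL_POLLING))
# A test that names a profile gets that profile.
print(resolve_profile(GLOBAL_POLLING, {"profile": "longRunningTest"}))
```

Failing on an unknown name is a deliberate choice in this sketch: a typo in a test file would otherwise quietly run with the wrong timeout.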
@kdhamric, from a UI perspective, should we allow users to define the number of times a trace can be polled before being marked as “complete” in the test configuration?
@olha23 Yes, we should. Eventually, we will allow people to define multiple profiles as given above. For that, they would need to be able to specify all the fields: name, strategy, executions, retry delay, and timeout. I am not sure when we will tackle this work - it would not hurt to begin to mock some of it.
Team, giving my +1 here since our polling config is affecting integrations with the OTel Collector. I like the idea of having global polling configs (like profiles), and I believe that we could think about them in a similar way to how k8s deals with probes (k8s API).
I believe that we could add one more option, initialDelaySeconds, and change executions to stopThreshold.
polling:
  defaultProfile: fastTest
  profiles:
    - somePolling:
        name: Some polling
        default: true
        strategy: sameSpanCount
        stopThreshold: 3 # if our strategy signals that we should stop three times in a row,
                         # we should stop polling
        initialDelay: 1s
        retryDelay: 2s
        timeout: 1m
With that, we can model quick pollings, like API calls:
profiles:
  - apiPolling:
      name: API Polling
      strategy: sameSpanCount
      stopThreshold: 3 # if our strategy signals that we should stop three times in a row,
                       # we should stop polling
      initialDelay: 0s
      retryDelay: 2s
      timeout: 1m
And longer pollings, like async ones:
profiles:
  - asyncContinuousPolling:
      name: Async continuous Polling
      strategy: sameSpanCount
      stopThreshold: 3 # if our strategy signals that we should stop three times in a row,
                       # we should stop polling
      initialDelay: 0s
      retryDelay: 10s
      timeout: 10m
  - fireAndForgetPolling: # for cases where there is a delay between
                          # starting the action and executing the action
      name: Fire and forget polling
      strategy: sameSpanCount
      stopThreshold: 5 # if our strategy signals that we should stop five times in a row,
                       # we should stop polling
      initialDelay: 1m
      retryDelay: 30s
      timeout: 10m
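To make the interplay of initialDelay, retryDelay, stopThreshold, and timeout concrete, here is a minimal Python sketch that simulates the loop on a virtual clock. It is not Tracetest's implementation: PollingProfile, simulate_polling, and the wants_stop callback are hypothetical names, durations are plain seconds, and I am assuming the timeout is measured from the trigger and includes the initial delay.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class PollingProfile:
    # Field names mirror the proposed YAML keys; values are in seconds.
    initial_delay: float
    retry_delay: float
    timeout: float
    stop_threshold: int


def simulate_polling(profile: PollingProfile,
                     wants_stop: Callable[[float], bool]) -> Tuple[str, List[float]]:
    """Simulate polling on a virtual clock; no real sleeping happens.

    `wants_stop(t)` stands in for the strategy: it returns True when the
    strategy, looking at the trace at time `t`, signals that polling may stop.
    Returns the outcome ("complete" or "timeout") and the time of each poll.
    """
    now = profile.initial_delay
    streak = 0
    polls: List[float] = []
    while now <= profile.timeout:
        polls.append(now)
        # A single "keep going" signal resets the streak.
        streak = streak + 1 if wants_stop(now) else 0
        if streak >= profile.stop_threshold:
            return "complete", polls
        now += profile.retry_delay
    return "timeout", polls


# fireAndForgetPolling from above: 1m initial delay, 30s retries, 10m timeout.
fire_and_forget = PollingProfile(initial_delay=60, retry_delay=30,
                                 timeout=600, stop_threshold=5)
# Pretend the strategy starts signalling "stop" two minutes after the trigger.
outcome, polls = simulate_polling(fire_and_forget, lambda t: t >= 120)
print(outcome, polls)
```

In this run the first poll happens at 60s and the fifth consecutive stop signal arrives at 240s, so the initial delay saves two early polls that could not have seen the trace anyway.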
Problem
When Tracetest triggers a test and waits for the trace to be available, it uses a very simple condition to decide whether the trace is complete: if the trace is polled twice in a row and the number of spans didn't change, it is considered complete. However, this assumption does not hold for long-running operations.
With the current implementation, if we had one test that takes a long time to generate the complete trace, we would have to adjust the polling configuration to add enough time between polling executions to prevent traces from being falsely marked as complete. However, this would slow down all other tests executed by that Tracetest instance.
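The false positive can be reproduced with a few lines of Python. This is only a sketch of the heuristic as described here, not the actual Tracetest code; first_complete_poll and the sample span counts are made up for illustration.

```python
from typing import Optional, Sequence


def first_complete_poll(span_counts: Sequence[int],
                        executions: int = 2) -> Optional[int]:
    """Return the index of the poll at which the trace is marked complete.

    The trace counts as complete once the same span count has been seen in
    `executions` consecutive polls. Returns None if that never happens.
    """
    streak = 0
    previous = None
    for index, count in enumerate(span_counts):
        streak = streak + 1 if count == previous else 1
        previous = count
        if streak >= executions:
            return index
    return None


# A long-running operation: 3 spans arrive quickly, the remaining 4 much later.
slow_trace = [3, 3, 3, 7, 7]
print(first_complete_poll(slow_trace))                # 1: stops with only 3 of 7 spans
print(first_complete_poll(slow_trace, executions=4))  # None: not complete within these polls
```

Raising the number of required executions or the delay between polls avoids the early stop, but as a global setting it applies to every test, which is the slowdown described above.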
Possible solution
Enable the polling condition to be set at a test level
Instead of hardcoding the condition, we could let a test define when a trace is complete and can be asserted. If this is not configured in the test, we would use the default implementation.
Test with default polling configuration
When the polling object is not defined, the default polling strategy would be to count the number of spans and check whether the count has changed. If it is the same in two subsequent polling executions, the trace is marked as complete.
Test using the same default polling strategy, but with different parameters
Test using an assertion to stop polling