Improve trace polling process

mathnogueira commented 2 years ago

Problem

When Tracetest triggers a test and waits for the trace to become available, it uses a very simple condition to decide whether the trace is complete: if the trace is polled twice in a row and the number of spans didn't change, it is considered complete. However, this assumption does not hold for long-running operations.

With the current implementation, if one test takes a long time to generate its complete trace, we have to adjust the polling configuration and add enough time between polling executions to prevent traces from being falsely reported as complete. However, this slows down all other tests executed by that Tracetest instance.

Possible solution

Enable the polling condition to be set at a test level

Instead of hardcoding the condition, we could let a test define when a trace is complete and can be asserted. If this is not configured in the test, we would use the default implementation.

Test with default polling configuration

When the polling object is not defined, the default polling strategy counts the number of spans and compares it with the previous poll. If the count is the same in two subsequent polling executions, the trace is marked as complete.

name: My test
trigger:
    # how to trigger the test

spec:
    # assertions

Test using the same default polling strategy, but with different parameter

name: My test
trigger:
    # how to trigger the test

polling:
    strategy: sameSpanCount
    sameSpanCount:
        executions: 3 # if the number of spans is the same in 3 subsequent executions, the trace is marked as complete

spec:
    # assertions
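As a rough illustration of the sameSpanCount strategy described above, here is a minimal Python sketch. It is not Tracetest's actual code: the function name `polls_until_complete` and the list of per-poll span counts are made up for the example. It returns the poll at which a trace would be declared complete for a given `executions` value.

```python
from typing import Iterable, Optional


def polls_until_complete(span_counts: Iterable[int], executions: int = 2) -> Optional[int]:
    """Return the 1-based poll number at which the trace is declared complete.

    span_counts: span count observed on each successive poll (hypothetical input).
    executions:  how many consecutive polls must report the same count.
    Returns None if the condition is never met.
    """
    streak = 0
    previous = None
    for poll_number, count in enumerate(span_counts, start=1):
        # Extend the streak when the count is unchanged, otherwise restart it.
        streak = streak + 1 if count == previous else 1
        previous = count
        if streak >= executions:
            return poll_number
    return None


# A slow trace: 3 spans arrive quickly, 2 more arrive later.
slow_trace = [3, 3, 5, 5, 5]
early = polls_until_complete(slow_trace, executions=2)  # 2: stops while spans are still missing
late = polls_until_complete(slow_trace, executions=3)   # 5: waits until the count settles at 5
```

With `executions: 2` the slow trace is declared complete at 3 spans, which is the false positive described in the problem statement. Raising `executions` only makes that less likely, it does not rule it out.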

Test using an assertion to stop polling

# slow-test.yaml
name: My slow test
trigger:
   # how to trigger the test

polling:
   # polling would only stop when a span named "insert pokemon into database" is present in the trace
   strategy: assert
   assert:
       selector: span[name = "insert pokemon into database"]
       assertion: tracetest.selected_spans.count = 1

spec:
    # assertions
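To make the assert strategy concrete, here is a small Python sketch of the stop condition. It assumes spans are plain dictionaries and replaces Tracetest's selector language with a simple name comparison, so the function `assert_strategy_complete` and the sample span names other than "insert pokemon into database" are hypothetical.

```python
from typing import Dict, List


def assert_strategy_complete(
    spans: List[Dict[str, str]], span_name: str, expected_count: int = 1
) -> bool:
    """Return True when exactly `expected_count` spans carry the given name.

    This stands in for `selector: span[name = "..."]` combined with
    `tracetest.selected_spans.count = 1`; it is not the real selector engine.
    """
    selected = [span for span in spans if span.get("name") == span_name]
    return len(selected) == expected_count


# Hypothetical snapshots of the same trace at two different polls.
partial_trace = [{"name": "POST /pokemon/import"}, {"name": "validate request"}]
full_trace = partial_trace + [{"name": "insert pokemon into database"}]

keep_polling = not assert_strategy_complete(partial_trace, "insert pokemon into database")
done = assert_strategy_complete(full_trace, "insert pokemon into database")
```

Unlike counting spans, this condition does not depend on how long the gap between spans is, only on whether the expected span has arrived.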

kdhamric commented 2 years ago

On the polling solution, should you be able to specify the polling interval in addition to the max number of polls?

On the solution where you wait for a particular span, should we add a 'maximum time to wait'?

mathnogueira commented 2 years ago

Today we have that in tracetest's global configuration:

pollingConfig:
    timeout: 10m
    retryDelay: 5s

Maybe it makes sense to keep that as the default configuration and allow tests to override it.
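A possible shape for that override, sketched in Python. The helper names `parse_duration` and `effective_polling` are hypothetical, and the duration parser only understands the `s`, `m` and `h` suffixes used in this thread.

```python
import re
from typing import Dict, Optional

_UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600}

# Global defaults, copied from the pollingConfig shown above.
GLOBAL_POLLING = {"timeout": "10m", "retryDelay": "5s"}


def parse_duration(text: str) -> int:
    """Convert strings such as '5s', '10m' or '1h' into seconds."""
    match = re.fullmatch(r"(\d+)([smh])", text)
    if match is None:
        raise ValueError(f"unsupported duration: {text!r}")
    value, unit = match.groups()
    return int(value) * _UNIT_SECONDS[unit]


def effective_polling(test_override: Optional[Dict[str, str]] = None) -> Dict[str, int]:
    """Merge a per-test override on top of the global defaults."""
    merged = {**GLOBAL_POLLING, **(test_override or {})}
    return {key: parse_duration(value) for key, value in merged.items()}


defaults = effective_polling()                    # {'timeout': 600, 'retryDelay': 5}
slow_test = effective_polling({"timeout": "1h"})  # {'timeout': 3600, 'retryDelay': 5}
```

A test that overrides only `timeout` still inherits the global `retryDelay`, so fast tests are unaffected by the slow one.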

mathnogueira commented 2 years ago

@schoren can you add your proposal here?

schoren commented 2 years ago

I thought we could keep our current global timeout configuration and, on top of that, add a per-test override. A YAML file with this new setting could look like this:

type: Test
spec:
  id: 1234
  name: The Test name

  # this is the new part
  completeCondition:
    selector: span[name="Persist Cart Shipping"]
    value: attr:tracetest.selected_spans.count = 1
    timeout: 1h

  trigger:
  # ...

completeCondition defines an override to the tracetest configured trace poll timeout. It has 2 options:

selector/value: Keep polling until this selector/value is ok. Example: You're testing a purchase process that needs to have a trace up to the shipping approval. Instead of setting an overall huge timeout, make this one particular test keep polling until the condition selector/value is met.
timeout: This is just a time based override. Example: most of your tests take 2 mins to finish, except for a few long running processes. Instead of making all tests have a long timeout, make them have a short timeout, and override the ones that are longer.

Both options can be set together: if your selector/value condition is never met, the custom timeout still applies. Without it there would be no timeout at all, since completeCondition overrides the general timeout setting.
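Here is a minimal Python sketch of how the condition and the timeout could interact in a polling loop. The names `poll_until` and `FakeClock` are invented for the example, and the clock is injected so the loop can be exercised without actually waiting.

```python
from typing import Callable


def poll_until(
    condition: Callable[[], bool],
    timeout_s: float,
    retry_delay_s: float,
    now: Callable[[], float],
    sleep: Callable[[float], None],
) -> str:
    """Poll `condition` until it holds ('complete') or the deadline passes ('timeout')."""
    deadline = now() + timeout_s
    while True:
        if condition():
            return "complete"
        # Give up if the next poll would land after the deadline.
        if now() + retry_delay_s > deadline:
            return "timeout"
        sleep(retry_delay_s)


class FakeClock:
    """Deterministic clock: sleeping just advances a counter."""

    def __init__(self) -> None:
        self.t = 0.0

    def now(self) -> float:
        return self.t

    def sleep(self, seconds: float) -> None:
        self.t += seconds


clock = FakeClock()
answers = iter([False, False, True])  # the condition holds on the third poll
result = poll_until(lambda: next(answers), 3600, 5, clock.now, clock.sleep)
# result == "complete" after 10 simulated seconds (polls at t=0, 5, 10)
```

In a real run `now` and `sleep` would be `time.monotonic` and `time.sleep`, and `condition` would evaluate the selector/value pair against the latest trace.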

Does that make sense?

kdhamric commented 2 years ago

Couple of comments:

It would be good if the config info globally had the same (or similar) structure to the per test override. Labeling the strategy / strategy type makes sense - we have identified a 'sameSpanCount' and 'assert' - there may be others in the future.

I could see a case for defining some common polling scenarios at a global level... but this might be (is?) overkill.

I might tend to have this at a global level:

polling:
   default: fastTest
   options:
    - longRunningTest:
        name: Long running test
        strategy: sameSpanCount
        executions: 3 # if the number of spans is the same in 3 subsequent executions, the trace is marked as complete
        retryDelay: 60s
        timeout: 120m
    - mediumTest:
        name: Medium length test (1 to 10 minutes)
        strategy: sameSpanCount
        executions: 3 # if the number of spans is the same in 3 subsequent executions, the trace is marked as complete
        retryDelay: 15s
        timeout: 10m
    - fastTest:
        name: Fast test (less than a minute)
        default: true
        strategy: sameSpanCount
        executions: 3 # if the number of spans is the same in 3 subsequent executions, the trace is marked as complete
        retryDelay: 2s
        timeout: 1m

At the test level, I would use the exact same structure.

 polling:
   default: fastTest

or, if they have set a custom one:

 polling:
   default: custom
   options:
    - custom:
        name: check for db span at end
        strategy: assert
        selector: span[name = "insert pokemon into database"]
        assertion: tracetest.selected_spans.count = 1
        retryDelay: 10s
        timeout: 10m

We also need to consider how this would appear in the UI, @olha23, once we get consensus on what to implement (the above may be too much).

mathnogueira commented 2 years ago

I like your suggestion, @kdhamric. Maybe we could have polling profiles:

polling:
   defaultProfile: fastTest
   profiles:
    - longRunningTest:
        name: Long running test
        strategy: sameSpanCount
        executions: 3 # if the number of spans is the same in 3 subsequent executions, the trace is marked as complete
        retryDelay: 60s
        timeout: 120m
    - mediumTest:
        name: Medium length test (1 to 10 minutes)
        strategy: sameSpanCount
        executions: 3 # if the number of spans is the same in 3 subsequent executions, the trace is marked as complete
        retryDelay: 15s
        timeout: 10m
    - fastTest:
        name: Fast test (less than a minute)
        default: true
        strategy: sameSpanCount
        executions: 3 # if the number of spans is the same in 3 subsequent executions, the trace is marked as complete
        retryDelay: 2s
        timeout: 1m

polling:
  profile: fastTest
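The first block above would live in the global configuration and the second in a test. A hedged Python sketch of how a test's `profile` could be resolved against the global profiles follows. The function `resolve_profile` is hypothetical, and the dictionaries mirror the YAML above with some fields omitted for brevity.

```python
from typing import Any, Dict, Optional

GLOBAL_POLLING: Dict[str, Any] = {
    "defaultProfile": "fastTest",
    # Same shape as the YAML: a list of single-key mappings.
    "profiles": [
        {"longRunningTest": {"strategy": "sameSpanCount", "executions": 3,
                             "retryDelay": "60s", "timeout": "120m"}},
        {"mediumTest": {"strategy": "sameSpanCount", "executions": 3,
                        "retryDelay": "15s", "timeout": "10m"}},
        {"fastTest": {"strategy": "sameSpanCount", "executions": 3,
                      "retryDelay": "2s", "timeout": "1m"}},
    ],
}


def resolve_profile(
    global_polling: Dict[str, Any], test_polling: Optional[Dict[str, Any]] = None
) -> Dict[str, Any]:
    """Pick the profile named by the test, falling back to defaultProfile."""
    profiles: Dict[str, Dict[str, Any]] = {}
    for entry in global_polling["profiles"]:
        profiles.update(entry)
    name = (test_polling or {}).get("profile", global_polling["defaultProfile"])
    if name not in profiles:
        raise KeyError(f"unknown polling profile: {name}")
    return profiles[name]


fast = resolve_profile(GLOBAL_POLLING)                                 # test without a polling section
slow = resolve_profile(GLOBAL_POLLING, {"profile": "longRunningTest"})  # test that names a profile
```

Failing loudly on an unknown profile name is a design choice of this sketch; silently falling back to the default would hide typos in test files.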

olha23 commented 1 year ago

@kdhamric from a UI perspective, should we allow users to define, in the test configuration, the number of times a trace can be polled before being marked as “complete”?

kdhamric commented 1 year ago

@olha23 Yes, we should. Eventually, we will allow people to define multiple profiles as given above. For that, they would need to be able to specify all the fields: name, strategy, executions, retry delay, and timeout. I am not sure when we will tackle this work - it would not hurt to begin to mock some of it.

danielbdias commented 1 year ago

Team, giving my +1 here since our polling config is affecting integrations with OTel Collector. I like the idea of having global polling configs (like profiles), and I believe that we could think about them in a similar way that k8s deals with probes (k8s API).

I believe that we could add one more option, initialDelay (similar to initialDelaySeconds in k8s probes), and rename executions to stopThreshold.

polling:
   defaultProfile: fastTest
   profiles:
    - somePolling:
        name: Some polling
        default: true
        strategy: sameSpanCount
        stopThreshold: 3 # if our strategy signals that we should stop three times in a row,
                         # we should stop polling
        initialDelay: 1s
        retryDelay: 2s
        timeout: 1m

With that, we can model quick pollings, like API calls:

   profiles:
    - apiPolling:
        name: API Polling
        strategy: sameSpanCount
        stopThreshold: 3 # if our strategy signals that we should stop three times in a row,
                         # we should stop polling
        initialDelay: 0s
        retryDelay: 2s
        timeout: 1m

And longer pollings, like async ones:

   profiles:
    - asyncContinuousPolling:
        name: Async continuous Polling
        strategy: sameSpanCount
        stopThreshold: 3 # if our strategy signals that we should stop three times in a row,
                         # we should stop polling
        initialDelay: 0s
        retryDelay: 10s
        timeout: 10m
    - fireAndForgetPolling: # for cases where there is a delay between 
                            # starting the action and executing the action
        name: Fire and forget polling
        strategy: sameSpanCount
        stopThreshold: 5 # if our strategy signals that we should stop five times in a row,
                         # we should stop polling
        initialDelay: 1m
        retryDelay: 30s
        timeout: 10m
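To see how initialDelay, retryDelay, stopThreshold and timeout would play together, here is a small Python simulation. It is only a sketch: the `Profile` class and `simulate` function are invented for the example, the strategy is reduced to a list of per-poll stop signals, and it assumes the timeout is measured from the start of the run, initial delay included.

```python
from dataclasses import dataclass
from typing import Sequence, Tuple


@dataclass(frozen=True)
class Profile:
    stop_threshold: int   # consecutive stop signals needed to finish
    initial_delay_s: int  # wait before the first poll
    retry_delay_s: int    # wait between polls
    timeout_s: int        # assumed to be counted from the start of the run


def simulate(profile: Profile, stop_signals: Sequence[bool]) -> Tuple[str, int]:
    """Return (status, elapsed seconds) for a sequence of per-poll stop signals."""
    elapsed = profile.initial_delay_s
    streak = 0
    for signal in stop_signals:
        if elapsed > profile.timeout_s:
            return "timeout", profile.timeout_s
        # A stop signal extends the streak; anything else resets it.
        streak = streak + 1 if signal else 0
        if streak >= profile.stop_threshold:
            return "complete", elapsed
        elapsed += profile.retry_delay_s
    return "pending", elapsed  # ran out of simulated polls


api_polling = Profile(stop_threshold=3, initial_delay_s=0, retry_delay_s=2, timeout_s=60)
fire_and_forget = Profile(stop_threshold=5, initial_delay_s=60, retry_delay_s=30, timeout_s=600)

api_result = simulate(api_polling, [False, True, True, True])  # ("complete", 6)
async_result = simulate(fire_and_forget, [True] * 5)           # ("complete", 180)
```

With these numbers the API profile can finish within seconds, while the fire-and-forget profile cannot finish in under three minutes, even if every poll signals a stop.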
