open-feature / flagd

A feature flag daemon with a Unix philosophy
https://openfeature.dev
Apache License 2.0
569 stars 67 forks source link

[BUG] Spans from flag evaluation not being created in Java #1369

Open julianocosta89 opened 4 months ago

julianocosta89 commented 4 months ago

Observed behavior

When updating the dependencies for the AdService in the OTel demo, I've noticed that the flagd spans are not being created in Java. We do have the flagd spans connected with RecommendationService (Python), CartService (.NET) and Frontend-proxy (Envoy).

But for AdService no. Even though, we follow the recommended approach documented here: https://flagd.dev/providers/java/#opentelemetry-tracing-rpc-only

https://github.com/open-telemetry/opentelemetry-demo/blob/main/src/adservice/src/main/java/oteldemo/AdService.java#L86-L94

Expected Behavior

Whenever AdService checks the state of the FeatureFlag, I'd expect 3 spans from flagd to be created, as it is happening with other services.

Steps to reproduce

git clone git@github.com:open-telemetry/opentelemetry-demo.git
cd opentelemetry-demo
docker compose up -d

Navigate to Jaeger UI: http://localhost:8080/jaeger/ui/

Check the received traces from AdService and Flagd.

Kavindu-Dodan commented 4 months ago

@julianocosta89 I had a look into this. The service use withGlobalTelemetry(true) which internally obtains an OpenTelemetry instance from GlobalOpenTelemetry.get(). This follows Java manual instrumentation guidelines [1]. And in the AdService, there is no manual configuration initializations and this is the root cause for this observation.

[1] - https://opentelemetry.io/docs/languages/java/instrumentation/#manual-configuration

julianocosta89 commented 4 months ago

@Kavindu-Dodan we are using the OTel Java agent, isn't that done automatically? We have another service, written in Kotlin, and for that it is working fine. https://github.com/open-telemetry/opentelemetry-demo/blob/main/src/frauddetectionservice/src/main/kotlin/frauddetectionservice/main.kt#L32-L36

julianocosta89 commented 4 months ago

I've done some tests again and it seems that it is actually creating spans, but for some reason just once. Here is the trace containing the flagd spans: Screenshot 2024-07-25 at 12 21 23

Here are the metrics from one of the flags from AdService: Screenshot 2024-07-25 at 12 20 43

As we can see, the flag is being queried every time AdService is called, but the spans from flagd are just arriving once. Maybe we have some error in the implementation of AdService.

beeme1mr commented 4 months ago

Hey @julianocosta89, this is an optimization in the Java provider that can disabled. Basically, a value can be cached if it's "static" and the provider is able to establish a persistent connection with flagd to bust the cache when the flag configuration changes.

Here are the configuration options that are available. https://github.com/open-feature/java-sdk-contrib/tree/main/providers/flagd#configuration-options

julianocosta89 commented 4 months ago

Hey @julianocosta89, this is an optimization in the Java provider that can disabled. Basically, a value can be cached if it's "static" and the provider is able to establish a persistent connection with flagd to bust the cache when the flag configuration changes.

Here are the configuration options that are available. https://github.com/open-feature/java-sdk-contrib/tree/main/providers/flagd#configuration-options

Oh, cool TIL.

Any reason the behaviour is different for Java and Kotlin? Both are using the same dependencies and none have the cache disabled.

beeme1mr commented 4 months ago

There are a few reasons why a value wouldn't be cashed. For example, if there's a targeting rule in the flag configuration. I'll take a look tomorrow.

beeme1mr commented 4 months ago

Hey @julianocosta89, can you confirm that the kafkaQueueProblems doesn't have a targeting rule defined in your environment? That would immediately make it ineligible to be cached. I didn't notice any obvious reasons this isn't being cached, but I didn't have a chance to investigate in depth.

julianocosta89 commented 4 months ago

this is what I currently have for kafkaQueueProblems:

"kafkaQueueProblems": {
  "description": "Overloads Kafka queue while simultaneously introducing a consumer side delay leading to a lag spike",
  "state": "ENABLED",
  "variants": {
    "on": 100,
    "off": 0
  },
  "defaultVariant": "off"
},

And this is the kotlin snippet:

fun main() {
    val options = FlagdOptions.builder()
    .withGlobalTelemetry(true)
    .build()
    val flagdProvider = FlagdProvider(options)
    OpenFeatureAPI.getInstance().setProvider(flagdProvider)

    val props = Properties()
    props[KEY_DESERIALIZER_CLASS_CONFIG] = StringDeserializer::class.java.name
    props[VALUE_DESERIALIZER_CLASS_CONFIG] = ByteArrayDeserializer::class.java.name
    props[GROUP_ID_CONFIG] = groupID
    val bootstrapServers = System.getenv("KAFKA_SERVICE_ADDR")
    if (bootstrapServers == null) {
        println("KAFKA_SERVICE_ADDR is not supplied")
        exitProcess(1)
    }
    props[BOOTSTRAP_SERVERS_CONFIG] = bootstrapServers
    val consumer = KafkaConsumer<String, ByteArray>(props).apply {
        subscribe(listOf(topic))
    }

    var totalCount = 0L
    consumer.use {
        while (true) {
            totalCount = consumer
                .poll(ofMillis(100))
                .fold(totalCount) { accumulator, record ->
                    val newCount = accumulator + 1
                    if (getFeatureFlagValue("kafkaQueueProblems") > 0) {
                        logger.info("FeatureFlag 'kafkaQueueProblems' is enabled, sleeping 1 second")
                        Thread.sleep(1000)
                    }
                    val orders = OrderResult.parseFrom(record.value())
                    logger.info("Consumed record with orderId: ${orders.orderId}, and updated total count to: $newCount")
                    newCount
                }
        }
    }
}

/**
* Retrieves the status of a feature flag from the Feature Flag service.
*
* @param ff The name of the feature flag to retrieve.
* @return `true` if the feature flag is enabled, `false` otherwise or in case of errors.
*/
fun getFeatureFlagValue(ff: String): Int {
    val client = OpenFeatureAPI.getInstance().client
    // TODO: Plumb the actual session ID from the frontend via baggage?
    val uuid = UUID.randomUUID()

    val clientAttrs = mutableMapOf<String, Value>()
    clientAttrs["session"] = Value(uuid.toString())
    client.evaluationContext = ImmutableContext(clientAttrs)
    val intValue = client.getIntegerValue(ff, 0)
    return intValue
}
toddbaert commented 3 months ago

@julianocosta89 could it be that you are evaluating the same flag in quick succession, and evaluations are interleaving? The flag value can't be cached until the RPC completes, so if 2 evaluations happen quickly, they will both cause RPCs since theirs no previously resolved value in the cache (yet)..

Besides some transient connection issue, I can't see another reason. If the reason field is coming back as STATIC, subsequent evaluations should result in cache hits with your configuration :thinking:

EDIT:

Oh, one more reason could be that the provider hasn't fully connected its stream before which no caching happens... this is plausible since you're using setProvider (which doesn't block awaiting a connection) instead of setProviderAndWait which blocks until the provider is fully initialized.