open-telemetry / opentelemetry-demo

This repository contains the OpenTelemetry Astronomy Shop, a microservice-based distributed system intended to illustrate the implementation of OpenTelemetry in a near real-world environment.
https://opentelemetry.io/docs/demo/
Apache License 2.0

Demo environment generates errors by default #1628

Open flands opened 3 months ago

flands commented 3 months ago

Bug Report

Which version of the demo you are using? 1.10.0

Symptom

If you start the demo environment from scratch, errors are reported for the adservice.

What is the expected behavior?

Either:

  1. The demo environment doesn't generate errors by default - currently how the documentation reads: https://opentelemetry.io/docs/demo/#scenarios
  2. The demo environment does generate errors by default but these errors are documented and thus expected.

What is the actual behavior?

The adservice generates errors by default yet the documentation seems to indicate you must enable scenarios to generate errors and other problems.

Reproduce

Provide the minimum required steps to result in the issue you're observing.

Additional Context

Log messages for adservice will show:

    ad-service | 2024-06-23 15:11:51 - oteldemo.AdService - GetAds Failed with status Status{code=UNAVAILABLE, description=null, cause=null} trace_id=d963f87608e1ab611dee31ef9ac29860 span_id=84ce83545d6852bb trace_flags=01

src/flagd/demo.flagd.json shows:

    "adServiceFailure": {
      "description": "Fail ad service",
      "state": "ENABLED",
      "variants": {
        "on": true,
        "off": false
      },
      "defaultVariant": "off",
      "targeting": {
        "fractional": [
          ["on", 10],
          ["off", 90]
        ]
      }
    },

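The problem is the "off" weight in the fractional targeting rule, which should be set to 100 by default so that the flag never resolves to "on" unless someone opts in.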
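To illustrate why a 10/90 split produces errors out of the box, here is a minimal Python sketch of weighted bucketing. It is not flagd's implementation: the SHA-256 hash, the `pick_variant` and `on_ratio` helpers, and the `session-N` keys are all assumptions made up for this example. It only shows that roughly one key in ten lands on "on" with the weights above, and none do with 0/100.

```python
import hashlib


def pick_variant(targeting_key: str, weights: list[tuple[str, int]]) -> str:
    """Map a key to a variant in proportion to the given weights.

    Hypothetical stand-in for fractional targeting; the hash choice
    (SHA-256) is an assumption, not what flagd uses.
    """
    total = sum(weight for _, weight in weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive number")
    digest = hashlib.sha256(targeting_key.encode("utf-8")).digest()
    # Reduce the hash to a bucket in the range [0, total).
    bucket = int.from_bytes(digest[:8], "big") % total
    upper = 0
    for variant, weight in weights:
        upper += weight
        if bucket < upper:
            return variant
    raise AssertionError("unreachable: bucket is always below total")


def on_ratio(weights: list[tuple[str, int]], samples: int = 10_000) -> float:
    """Fraction of hypothetical session keys that resolve to 'on'."""
    hits = sum(
        pick_variant(f"session-{i}", weights) == "on" for i in range(samples)
    )
    return hits / samples


if __name__ == "__main__":
    print("10/90 split:", on_ratio([("on", 10), ("off", 90)]))
    print("0/100 split:", on_ratio([("on", 0), ("off", 100)]))
```

With the 0/100 split the "on" branch is never reachable, which matches the behavior the documentation describes.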

puckpuck commented 3 months ago

Cart service also needs to be updated. We should fix both in the same PR, following the format noted in this comment.

julianocosta89 commented 3 months ago

Actually I've tried the solution mentioned by @beeme1mr and everything broke.

recommendation-service   | grpc._channel._InactiveRpcError: <_InactiveRpcError of RPC that terminated with:
recommendation-service   |  status = StatusCode.NOT_FOUND
recommendation-service   |  details = "FlagdError:, FLAG_DISABLED"
recommendation-service   |  debug_error_string = "UNKNOWN:Error received from peer ipv4:172.20.0.5:8013 {grpc_message:"FlagdError:, FLAG_DISABLED", grpc_status:5, created_time:"2024-06-29T03:51:25.86731025+00:00"}"
product-catalog-service  | 2024/06/29 03:51:25 openfeature: FLAG_NOT_FOUND: not_found: FlagdError:, FLAG_DISABLED

It seems that when the feature flag is disabled, evaluation doesn't return false as I expected.

beeme1mr commented 3 months ago

Hey @flands and @julianocosta89, I'll look into this tomorrow. When the flag is disabled, the SDK uses the default value defined in the code. The message @julianocosta89 posted is likely just an overly verbose log message.

julianocosta89 commented 3 months ago

@beeme1mr I think some services do not default to false 😞

dyladan commented 3 months ago

@julianocosta89 I looked into this a bit and there are a few takeaways:

  1. For some reason, in my environment (macOS), changing demo.flagd.json does not trigger changes to the flag definitions in flagd. Restarting the flagd service picks up the changes. This may only affect macOS.
  2. If a flag is disabled, the Python SDK is very verbose in its logs. The recommendationServiceCacheFailure flag does appear to fall back correctly to its False default value. The log messages come from within the OpenFeature SDK.

I talked to @beeme1mr and he is looking into the verbose logging in the SDK. He agrees this situation isn't ideal and maybe shouldn't be logged the same way as other "real" failures. He's also going to look into why the flag file changes aren't being picked up by flagd in the demo setup.

julianocosta89 commented 3 months ago

Thanks for taking a look at it @dyladan! Interestingly enough, when I update my feature flags it works fine, without having to restart the service. I'm running the demo on macOS (M1).

dyladan commented 3 months ago

@julianocosta89 I was able to track the flagd reload issue down to my specific setup. Apparently in Colima (the container runtime I'm using) the WRITE event is not triggered when I write to a mounted file. You can probably ignore it for now.

beeme1mr commented 3 months ago

I also have a quick update. We currently treat disabled flags like missing flags. That means we'll fall back to whatever is defined in the code, but we're also noisy about it because it's assumed you're accidentally using a flag. Obviously, that's not the ideal experience here, and we're working on a solution. It may take a few days to fully implement, but we're actively working on it and will provide an update ASAP.
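As a rough illustration of the behavior described above, here is a small Python sketch in which a disabled flag is handled exactly like a missing one: the caller gets whatever default it passed in, plus a warning. This is not the OpenFeature SDK or flagd API; `evaluate_boolean`, the `FLAGS` dictionary, and the logger name are hypothetical names chosen for the example.

```python
import logging

logger = logging.getLogger("flag_sketch")  # hypothetical logger name


def evaluate_boolean(flags: dict, key: str, code_default: bool) -> bool:
    """Resolve a boolean flag, treating disabled flags like missing ones.

    Hypothetical helper for illustration only; it mirrors the fallback
    described in this thread, not any real SDK signature.
    """
    flag = flags.get(key)
    if flag is None or flag.get("state") != "ENABLED":
        # Noisy on purpose: a missing flag usually signals a mistake.
        logger.warning(
            "flag %r is missing or disabled; using code default %r",
            key,
            code_default,
        )
        return code_default
    return flag["variants"][flag["defaultVariant"]]


# Assumed flag definitions, shaped like the snippet from demo.flagd.json.
FLAGS = {
    "adServiceFailure": {
        "state": "DISABLED",
        "variants": {"on": True, "off": False},
        "defaultVariant": "off",
    },
    "enabledExample": {  # hypothetical flag, not part of the demo
        "state": "ENABLED",
        "variants": {"on": True, "off": False},
        "defaultVariant": "off",
    },
}

if __name__ == "__main__":
    logging.basicConfig(level=logging.WARNING)
    # The result depends entirely on the default the calling service passes.
    print(evaluate_boolean(FLAGS, "adServiceFailure", False))
    print(evaluate_boolean(FLAGS, "adServiceFailure", True))
```

Under this model, a service that passes True as its in-code default would still see the failure behavior after the flag is disabled, which is consistent with the concern raised earlier that some services do not default to false.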

julianocosta89 commented 3 months ago

Thanks for the updates @dyladan and @beeme1mr!