zhchai opened 1 year ago
Duplicate of #2184. I am working on the fix, tentatively planned to be resolved this week.
@zhchai ~~I see that you are using the default temporality for the OTLP exporter, which is Cumulative. With this temporality, even when there is no new measurement for a given instrument within a collection period, the previous metric value is still exported. With Delta temporality, if there is no new measurement within a collection period, no new metrics are generated for that instrument. Could you configure the temporality to Delta and try again?~~
EDIT - Please ignore my comment. I realized this is an Observable Gauge, which should only return the values observed within the collection interval. Will get this fixed.
@lalitb May I ask what the cause of this issue is, and when it will be fixed? Thanks.
@gaochuang The fix is in the temporal storage for the observable aggregation; I am targeting a fix this week, in time for the next release.
@lalitb Thanks.
@gaochuang @zhchai - Sorry for all the confusion here. I had to revisit our implementation and also check the behavior of other language implementations (JS, .NET). So, to summarize the findings:
For an Observable Gauge with Last Value Aggregation (irrespective of the configured temporality), the exporter will emit metrics for all unique attribute sets observed since the start, even if no new measurements are generated for a specific attribute set in subsequent reporting periods.
This is how it is implemented in C++, and also in JS and .NET, as I validated.
In the example shared here, if there are no new measurements for a given process, every collection will still include the metric points for it, carrying the last aggregated timestamp. Perhaps the backend can use this timestamp as the criterion for when the process was last active.
Not directly related to this issue, but I noticed that we don't have the required unit tests to validate Observable Gauge; I will add them as part of closing this issue.
@lalitb Thanks for the clarification. I'd like to ask: if I want to avoid this behavior, how can I do it? Also, do you have a guide or examples?
> if there are no new measurements for a given process, every collection will still include the metric points for it, carrying the last aggregated timestamp. Perhaps the backend can use this timestamp as the criterion for when the process was last active.
Actually, we find that the timestamps of these measurements from exited processes are always updated, which does not match "the last aggregated timestamp".
> if I want to avoid this behavior, how can I do it
Ideally, I wouldn't advise using process_name/pid as keys in the measurement attributes. Please read about the cardinality limit: the set of possible values an attribute set can take should be finite, and the SDK recommends a limit of 2000. In this case, for the attribute set {{"process_name", process.name}, {"pid", n}}, the possible values are unbounded. As processes keep being created and terminated over time in a given system, new attribute values keep being generated, and so do the aggregated data points for those attribute values. These aggregated data points are stored in SDK memory forever (in the case of last-value aggregation), which can lead to a cardinality explosion.
The recommendation would instead be to keep the attribute set bounded. Unfortunately, I don't see an ideal way to build an application that sends the memory usage of all currently running processes as measurements. I would recommend creating a GitHub issue in the otel-specs repo asking for a recommendation.
> Actually we find that the timestamps of these measurements of exited processes are always updated
I earlier tested with the console exporter, and it shows the last measurement timestamp; the same is specified in the spec. I will check whether the OTLP exporter behaves differently.
This issue was marked as stale due to lack of activity.
**Describe your environment**: opentelemetry-cpp 1.9.0 SDK
**Steps to reproduce**: Using one instrument to observe the memory of multiple processes; processes can appear and disappear dynamically. Measurements of disappeared processes are still transmitted to the OpenTelemetry Collector. Code snippet:
**What is the expected behavior?** If a process exits, its metrics should no longer be seen in the OpenTelemetry Collector.
**What is the actual behavior?** Measurements of disappeared processes are still transmitted to the OpenTelemetry Collector.
**Additional context**: See the code snippet.