open-telemetry / opentelemetry-collector-contrib

Contrib repository for the OpenTelemetry Collector
https://opentelemetry.io
Apache License 2.0
3.07k stars 2.37k forks source link

panic: runtime error: invalid memory address or nil pointer dereference #19177

Closed arjan-saly-tfs closed 1 year ago

arjan-saly-tfs commented 1 year ago

Component(s)

receiver/sqlquery

What happened?

Description

When starting the otelcol-contrib application, it starts, gives the message: Everything is ready. Begin running and processing data. , then after a couple of seconds it fails with this message:

panic: runtime error: invalid memory address or nil pointer dereference
[signal 0xc0000005 code=0x0 addr=0x98 pc=0x4d076c9]

Steps to Reproduce

Start with provided configuration

Expected Result

OTEL starts running and collecting metrics

Actual Result

OTEL fails with a runtime error

Collector version

otelcol-contrib Version 0.71.0

Environment information

Environment

OS: Windows 10 Enterprise System Type: 64-bit operating system, x64-based processor

OpenTelemetry Collector configuration

receivers:
  sqlquery:
    driver: postgres
    datasource: "host=localhost port=5432 user=otel password=<removed> dbname=probeer sslmode=disable"
    queries:
      - sql: "SELECT duration_active_nsec, application_name FROM otel.my_stat_activity"
        metrics:
          - metric_name: query_activity.duration
            value_column: "duration_active_nsec"
            data_type: gauge
            attribute_columns: [ "application_name" ]

service:
  pipelines:
    metrics:
      receivers: [sqlquery]
      processors: [batch, resource]
      exporters: [logging]

Log output

panic: runtime error: invalid memory address or nil pointer dereference
[signal 0xc0000005 code=0x0 addr=0x98 pc=0x55076c9]

goroutine 112 [running]:
github.com/open-telemetry/opentelemetry-collector-contrib/receiver/sqlqueryreceiver.dbSQLClient.metricRows.func1()
        github.com/open-telemetry/opentelemetry-collector-contrib/receiver/sqlqueryreceiver@v0.71.0/db_client.go:71 +0x49
github.com/open-telemetry/opentelemetry-collector-contrib/receiver/sqlqueryreceiver.reusableRow.toMetricRow(...)
        github.com/open-telemetry/opentelemetry-collector-contrib/receiver/sqlqueryreceiver@v0.71.0/db_client.go:99
github.com/open-telemetry/opentelemetry-collector-contrib/receiver/sqlqueryreceiver.dbSQLClient.metricRows({0xc000abb1e0?, 0xc000270150?, {0xc0008be000?, 0x18?}}, {0x7e5aae0?, 0xc0013e41b0?})
        github.com/open-telemetry/opentelemetry-collector-contrib/receiver/sqlqueryreceiver@v0.71.0/db_client.go:86 +0x47a
github.com/open-telemetry/opentelemetry-collector-contrib/receiver/sqlqueryreceiver.scraper.Scrape({{{0x712d4b7, 0x10}, {0xc00076c2a0, 0x51}}, {{0xc0008be000, 0x48}, {0xc000a60500, 0x1, 0x1}}, {0x2540be400}, ...}, ...)
        github.com/open-telemetry/opentelemetry-collector-contrib/receiver/sqlqueryreceiver@v0.71.0/scraper.go:63 +0x82
go.opentelemetry.io/collector/receiver/scraperhelper.(*controller).scrapeMetricsAndReport(0xc000a770e0, {0x7e5aa70, 0xc00006c040})
        go.opentelemetry.io/collector@v0.71.0/receiver/scraperhelper/scrapercontroller.go:209 +0x223
go.opentelemetry.io/collector/receiver/scraperhelper.(*controller).startScraping.func1()
        go.opentelemetry.io/collector@v0.71.0/receiver/scraperhelper/scrapercontroller.go:191 +0xd4
created by go.opentelemetry.io/collector/receiver/scraperhelper.(*controller).startScraping
        go.opentelemetry.io/collector@v0.71.0/receiver/scraperhelper/scrapercontroller.go:180 +0x56

Additional context

Using this view definition to get some usable metrics on a simple laptop-level test database, while keeping the query in the config very simple:

create view otel.my_stat_activity
as
select 
      (extract(epoch from (psa.state_change - psa.query_start)) * 100000000)::INTEGER                               as duration_active_nsec
     ,case when coalesce(psa.application_name,'') = ''  THEN 'Backend process'  else psa.application_name   end     as application_name
     ,psa.client_addr
     ,psa.wait_event_type
     ,psa.wait_event
     ,psa.backend_type
     ,psa.usename
from pg_catalog.pg_stat_activity as psa ;

grant usage on schema otel to otel;
grant select on otel.my_stat_activity to otel;

Created in the following context: Databasename: Probeer Schema: otel Used a superuser to create the view User for query: otel, who has pg_monitorprivileges

When I login on the database using this command in PowerShell, the view can be queried without any issues: psql -U otel -d probeer -h localhost -W

github-actions[bot] commented 1 year ago

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

pmcollins commented 1 year ago

Thank you for the report. Will investigate.

arjan-saly-tfs commented 1 year ago

@pmcollins @dmitryax , found some additional information that might be of help for you.

Additional information: Nil values in metric values

It occurred to me that this specific metric can be nil in several cases. This is a valid situation for the database and, to my opinion, should be registered as such.

However, it seems like the receiver can't handle nil as a value for the metric.

How to reproduce

This is how this can be reproduced:

New view

I added a new view, which is the same as the one I already reported, but now nil values are replaced by zero:

/*************************************************************************************************************************************************
 * Replace nills on metric only
 *************************************************************************************************************************************************/
--drop view otel.my_stat_activity_no_nulls_on_metrics;

create view otel.my_stat_activity_no_nulls_on_metrics
as
select 
      COALESCE(extract(epoch from (psa.state_change - psa.query_start)) * 100000000,0)::INTEGER                     as duration_active_nsec      
     ,case when coalesce(psa.application_name,'') = ''  THEN 'Backend process'  else psa.application_name   end     as application_name
     ,psa.client_addr
     ,psa.wait_event_type
     ,psa.wait_event
     ,psa.backend_type
     ,psa.usename
from pg_catalog.pg_stat_activity as psa ;

grant select on otel.my_stat_activity_no_nulls_on_metrics to otel;

select * from otel.my_stat_activity_no_nulls_on_metrics;

Changed config.yaml

I change the OTEL config to look in the new view that replaces nil values by zero:

sqlquery:
    driver: postgres
    datasource: "host=localhost port=5432 user=otel password=<replaced> dbname=probeer sslmode=disable"
    queries:
      # - sql: "SELECT duration_active_nsec, application_name FROM otel.my_stat_activity"
      - sql: "SELECT duration_active_nsec, application_name FROM otel.my_stat_activity_no_nulls_on_metrics"
        metrics:
          - metric_name: query_activity.duration
            value_column: "duration_active_nsec"
            data_type: gauge
            value_type: int
            attribute_columns: [ "application_name" ]

When I run this configuration, it starts grabbing the metrics as expected. When I revert the configuration back to use the original view that returns nil values for metrics, it fails again.

Not sure what the general way of working is in Open Telemetry with respect to nil values in metrics. My humble opinion is that metrics should be handled as they come in. This way it's up to the analyst to decide what to do with it. Drop, replace, report on, show as it is, etc. Also, nil and zero have different impact on averages.

Either way, I don't think a receiver should cause a fatal error like this on a nil value in the metric.

Hope this additional information helps solving the issue.

arjan-saly-tfs commented 1 year ago

Additional information: Nil values in attribute values cause this error as well

Took the previous finding one step further: The receiver also fails when attributes columns hold a nil value.

How to reproduce

The view provides more columns, which are to be added as additional attribute columns. In the views provided in the original description and the previous comment, these attribute columns can hold nil values, which is a valid situation that means "For this process, this information does not exist.".

Reproduce the error on attribute columns

So, now I add the columns to the query and the "attribute_columns" in the OTEL configuration:

  sqlquery:
    driver: postgres
    datasource: "host=localhost port=5432 user=otel password=<replaced> dbname=probeer sslmode=disable"
    queries:
      # - sql: "SELECT duration_active_nsec, application_name FROM otel.my_stat_activity"
      # - sql: "SELECT duration_active_nsec, application_name FROM otel.my_stat_activity_no_nulls_on_metrics"
      - sql: "SELECT duration_active_nsec, application_name, client_addr, wait_event_type, wait_event, backend_type, usename from otel.my_stat_activity_no_nulls_on_metrics;"
      # - sql: "SELECT duration_active_nsec, application_name, client_addr, wait_event_type, wait_event, backend_type, usename from otel.my_stat_activity_no_nulls;"
        metrics:
          - metric_name: query_activity.duration
            value_column: "duration_active_nsec"
            data_type: gauge
            value_type: int
            # attribute_columns: [ "application_name" ]
            attribute_columns: [ "application_name", "client_addr", "wait_event_type", "wait_event", "backend_type", "usename" ]

Again, the OTEL fails at first try to gather metrics. Keep in mind that I use the view where the nil values in the metrics are replaced by zero, so they are not causing this:

2023-03-03T13:03:26.619+0100    info    service/service.go:157  Everything is ready. Begin running and processing data.
panic: runtime error: invalid memory address or nil pointer dereference
[signal 0xc0000005 code=0x0 addr=0x98 pc=0x56776c9]

goroutine 96 [running]:
github.com/open-telemetry/opentelemetry-collector-contrib/receiver/sqlqueryreceiver.dbSQLClient.metricRows.func1()
        github.com/open-telemetry/opentelemetry-collector-contrib/receiver/sqlqueryreceiver@v0.71.0/db_client.go:71 +0x49
github.com/open-telemetry/opentelemetry-collector-contrib/receiver/sqlqueryreceiver.reusableRow.toMetricRow(...)
        github.com/open-telemetry/opentelemetry-collector-contrib/receiver/sqlqueryreceiver@v0.71.0/db_client.go:99
github.com/open-telemetry/opentelemetry-collector-contrib/receiver/sqlqueryreceiver.dbSQLClient.metricRows({0xc00127e750?, 0xc000731260?, {0xc001271040?, 0x18?}}, {0x7fcaae0?, 0xc00127ccc0?})
        github.com/open-telemetry/opentelemetry-collector-contrib/receiver/sqlqueryreceiver@v0.71.0/db_client.go:86 +0x47a
github.com/open-telemetry/opentelemetry-collector-contrib/receiver/sqlqueryreceiver.scraper.Scrape({{{0x729d4b7, 0x10}, {0xc000f700b0, 0xa7}}, {{0xc001271040, 0x9e}, {0xc000f3cd20, 0x1, 0x1}}, {0x2540be400}, ...}, ...)
        github.com/open-telemetry/opentelemetry-collector-contrib/receiver/sqlqueryreceiver@v0.71.0/scraper.go:63 +0x82
go.opentelemetry.io/collector/receiver/scraperhelper.(*controller).scrapeMetricsAndReport(0xc000a72240, {0x7fcaa70, 0xc000072030})
        go.opentelemetry.io/collector@v0.71.0/receiver/scraperhelper/scrapercontroller.go:209 +0x223
go.opentelemetry.io/collector/receiver/scraperhelper.(*controller).startScraping.func1()
        go.opentelemetry.io/collector@v0.71.0/receiver/scraperhelper/scrapercontroller.go:191 +0xd4
created by go.opentelemetry.io/collector/receiver/scraperhelper.(*controller).startScraping
        go.opentelemetry.io/collector@v0.71.0/receiver/scraperhelper/scrapercontroller.go:180 +0x56

Reproduce no error on attribute columns

For this I create another new view where all nil values in the attributes are replaced by an empty string ('') or in case of the ip address the fictional value '1.1.1.1', as PostgreSQL won't accept an empty string for datatype inet:

/*************************************************************************************************************************************************
 * Replace nills on metric and attributes
 *************************************************************************************************************************************************/
drop view otel.my_stat_activity_no_nulls;

create view otel.my_stat_activity_no_nulls
as
select 
      COALESCE(extract(epoch from (psa.state_change - psa.query_start)) * 100000000,0)::INTEGER                         as duration_active_nsec     
     ,case when coalesce(psa.application_name,'') = ''  THEN 'Backend process'      else psa.application_name   end     as application_name
     ,coalesce(psa.client_addr      ,'1.1.1.1')                                                                         as client_addr                                                                                                                
     ,coalesce(psa.wait_event_type  ,'')                                                                                as wait_event_type
     ,coalesce(psa.wait_event       ,'')                                                                                as wait_event     
     ,coalesce(psa.backend_type     ,'')                                                                                as backend_type   
     ,coalesce(psa.usename          ,'')                                                                                as usename        
from pg_catalog.pg_stat_activity as psa ;

grant select on otel.my_stat_activity_no_nulls to otel;

select * from otel.my_stat_activity_no_nulls;

And I change the OTEL configuration to start using this view:

  sqlquery:
    driver: postgres
    datasource: "host=localhost port=5432 user=otel password=<replaced> dbname=probeer sslmode=disable"
    queries:
      # - sql: "SELECT duration_active_nsec, application_name FROM otel.my_stat_activity"
      # - sql: "SELECT duration_active_nsec, application_name FROM otel.my_stat_activity_no_nulls_on_metrics"
      # - sql: "SELECT duration_active_nsec, application_name, client_addr, wait_event_type, wait_event, backend_type, usename from otel.my_stat_activity_no_nulls_on_metrics;"
      - sql: "SELECT duration_active_nsec, application_name, client_addr, wait_event_type, wait_event, backend_type, usename from otel.my_stat_activity_no_nulls;"
        metrics:
          - metric_name: query_activity.duration
            value_column: "duration_active_nsec"
            data_type: gauge
            value_type: int
            # attribute_columns: [ "application_name" ]
            attribute_columns: [ "application_name", "client_addr", "wait_event_type", "wait_event", "backend_type", "usename" ]

Now OTEL starts gathering data nicely as it should.

However, it is not gathering the entire truth as nil is replaced by an empty string, which in the backend can result in different alerts or report-effects and the unknown IP address (valid for internal processes) is replaced by some value, which is never correct.

Again, hope this helps solving the issue better and/or faster.

pmcollins commented 1 year ago

I have a tentative fix for this but it requires a bit of refactoring and add'l tests. I have a higher priority task I need to take care of first so this fix should be ready next week. In the meantime, avoid NULL values in select statements passed to this receiver.

arjan-saly-tfs commented 1 year ago

Thanks for informing me @pmcollins . We're currently in exploration phase to see to what extent Open Telemetry is capable of fulfilling our needs with respect to database monitoring. For this stage, avoiding NULL values is acceptable.

Having it fixed next week works fine for us, thanks.

dmitryax commented 1 year ago

Should be fixed by https://github.com/open-telemetry/opentelemetry-collector-contrib/pull/19750

arjan-saly-tfs commented 1 year ago

Perhaps I misunderstand, but I found this in the Readme file: "NULL values Avoid queries that produce any NULL values. If a query produces a NULL value, a warning will be logged. Furthermore, if a configuration references the column that produces a NULL value, an additional error will be logged. However, in either case, the receiver will continue to operate."

What's not clear to me, is what happens with metrics that produced the data and has nill in either the metric or an attribute? Will it be forwarded anyway? Or is it dropped? (Can't test because most recent build is from a week ago, before you added this fix.)

To me, this description looks like it still can't handle nill values, it just doesn't crash on it anymore. Obviously better than it was, but to my opinion, it should be able to forward the nill to the endpoint. I mean, the fact that part of requested data is not available at a given moment can be very useful information as well!

Hence, is this supposed to be a final solution? Or is it meant to be a workaround, waiting for a proper solution?

Thanks for the work done so far and thanks for your upcoming answers.

Regards, Arjan

pmcollins commented 1 year ago

Hi @arjan-saly-tfs yes with this change, the metric will be skipped if its value is null (I will update the doc to make that clear). It sounds like this is not what you're looking for though. If you wanted some other value for the metric when it's corresponding value is null, could you use something like Postgres's coalesce? Or is there some other functionality you're looking for? Alternatively, feel free to write up an issue with your idea. Thanks.

arjan-saly-tfs commented 1 year ago

Hi @pmcollins,

What I'm looking for is that NULL is accepted as a valid "value" and thus passed through "as-is". Simply put, I don't think the source should decide what to do with NULL. Instead, this decision should me made on the visualization side.

To explain a more on this: We currently use Datadog for data storage and visualization. Datadog can handle NULL values in different ways, for instance it can calculate most acceptable value based on interpolation on surrounding data, with several different interpolation techniques to choose from. Obviously it can also drop the metric or replace it by a default.

I know of cases where an IF THEN construction is used on metrics that come from different sources. For instance: Prefer CPU usage coming from PostgreSQL, but use the OS value if PostgreSQL doesn't deliver. Datadog can do this, but I guess this might be done by a processor as well (although I doubt if that is the right place, see below).

Decisions like this may even depend on who evaluates the value. For instance, the example given might be of interest for "best accuracy" evaluation of CPU usage by PostgreSQL. But to see how busy a database is, the absence of CPU usage may indicate the database is overloaded (if it can't provide a simple figure like this in a timely fashion.....). Two ways of looking at the data, both very useful. Hence, this indicates the decision of "what to do with NULL values" should not be made at source side, but at visualization side.

The last example also shows that NULL, which means absence of data, may give useful information on itself. As in the example: If a simple metric can't be delivered, it might mean we're overloading the database with queries. We might decrease Open Telemetry frequency, or improve other queries on the database. Of course this is just an example, there are many more cases where NULL provides useful information.

Then the impact on aggregation: NULL is ignored when aggregating, while zero is not. Especially for averages and counts this can have a large impact on the result.

Please note that NULL means "this specific column in the row does not have data, but other columns in the record do", while " no row returned" means "no data found at all for this query". They both give absence of data, but on a different level of granularity. The first gives more detail. Also, the latter is harder to detect for being absence. For detection purposes a query may be constructed in a way where the row does exist, but the metric (or an attribute) may give NULL. Only then we now for sure the measurement runs fine, but the specific metric or attribute is not delivered.

In the end, I don't think NULL handling should be done at source/receiver level. I don't even think it should be done by Open Telemetry. Open Telemetry should simply pass through, while the visualizer decides what to do with it.

Using coalesce to replace NULL values means we decide at the source and we loose useful information as we can't detect any difference between zero and "not available". Same applies to using coalesce to replace NULL by an empty string (for attributes). An empty string can be stored as such in the column. If we replace NULL by empty string, we loose the difference between "unknown" and "set to an empty string on purpose". Hence, loosing information on details that may be useful. Obviously NULL can be stored in a column as well, but then it really means "unknown" / "absent", which is the same meaning as "not found on outer join".

I hope this helps you understanding my expectations and why using coalesce is not a final solution to my opinion.

btw: Please understand that I'm not a developer in the sense of building applications, I'm a Business Intelligence Specialist and Data Processing specialist. My development skills are limited to SQL Queries and Stored Procedures for Relational databases. My expertise lies in the translation of separate values into information, thus I also know the impact of loosing detail on the wrong place. But I don't have a clue how difficult I make it for you in this request.... ;-)

pmcollins commented 1 year ago

To my knowledge, there isn't really a concept of null in an otel metric. cc: @dmitryax to correct me if I'm wrong :)

Aneurysm9 commented 1 year ago

null is neither an integer nor a floating point number and thus cannot be represented as a value in OTel data points. It is not a string, boolean, integer, floating point number, or homogeneous array of such types and thus cannot be represented as an attribute value in OTel metrics. There is a flag that can be set on a data point to indicate that it has no recorded value. This is probably the closest there is to a concept of null in the OTel data model, but there is no specification detailing how it is to be interpreted and its only known (to me) use is for conveying Prometheus staleness markers.

arjan-saly-tfs commented 1 year ago

This page in the OTel concepts confirms what you're saying for attributes: https://opentelemetry.io/docs/reference/specification/common/ I'm not sure if I fully agree on this concept, but I think that discussion has to be put on a different level. This receiver certainly should follow the concepts of the whole,

For metrics however, it's not explicitly mentioned (at least not in a way that I can find it as a newby). However, using a flag to mark a datapoint which doesn't have a recorded value, seems to serve the same goal as NULL does to me: Know that the metric value is unknown.

Going back to the origin of this discussion, this does mean that the receiver still should accept NULL values for metrics, which is not the case at this moment. The only twist in the storyline is that NULL may not be passed through as I stated, but is translated into the flag you mentioned.

Hopefully how our visualization solution (datadog) can handle these flags.... :-)

arjan-saly-tfs commented 1 year ago

@Aneurysm9 mentioned the following: "There is a flag that can be set on a data point to indicate that it has no recorded value."

Is this already available for this receiver? If so, how can I activate it? If not, when will it be part of the receiver?