open-telemetry / opentelemetry-python-contrib

OpenTelemetry instrumentation for Python modules
https://opentelemetry.io
Apache License 2.0
705 stars 589 forks source link

Missing db.statement / db.operation sensitive data sanitization option #1543

Open nemoshlag opened 1 year ago

nemoshlag commented 1 year ago

According to semantic-conventions, a DB instrumentation should have the option to sanitize sensitive data from either db.statement or db.operation attributes. Consider using instrumentation.dbapi to reduce instrumentation-specific logic & tests. This issue seems relevant for the following instrumentations:

13042023 Edit: Updating this issue with database specifications to add db.statement attribution only if it's sanitized. image https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/trace/semantic_conventions/database.md#call-level-attributes

avzis commented 1 year ago

I think that the default behavior should be to sanitize the information, and give an option to display the information if needed.

What do you think? for example: #1512

BuffaloWill commented 1 year ago

Should psycopg2 be included in this list? db.statement already looks sanitized (maybe from the parameterized query?). For example:

{
  "value": "INSERT INTO \"db_model\" (\"title\") VALUES (%s) RETURNING \"db_model\".\"id\"",
  "key": "db.statement"
}

In my case I would like to see the raw db.statement instead of the sanitized one.

nemoshlag commented 1 year ago

Since psycopg2 uses dbapi I think you can achieve raw db.statement by setting capture_parameters=True

BuffaloWill commented 1 year ago

I thought so to, but maybe it's not implemented? capture_parameters=true wasn't working for me.

https://github.com/open-telemetry/opentelemetry-python-contrib/blob/main/instrumentation/opentelemetry-instrumentation-psycopg2/src/opentelemetry/instrumentation/psycopg2/__init__.py#L143-L154

nemoshlag commented 1 year ago

Yes, it needs to be added on psycopg2 side, similar to the way enable_commenter does. Here's how it looks from dbapi side.

BuffaloWill commented 1 year ago

Should I create a separate issue for this?

This is how I expected it to work in my code:

Psycopg2Instrumentor().instrument(capture_parameters=True)
nemoshlag commented 1 year ago

Pls do, I'll add it to the aggregated list

tammy-baylis-swi commented 1 year ago

Hi, I've got 2 questions:

  1. If we add db.statement only if it's sanitized, does it make sense if we always sanitize and always set db.statement?
  2. For the opt-in of adding original values, should the values be the full, unsanitized statement or a list of params passed in? Would it depend on the instrumentor?

For example, let's say a database client uses psycopg2 connection cursor to make (a) a parameterized query and (b) a hard-coded query:

# (a)
statement_a = "SELECT * FROM city WHERE id = %s"
city_id = "1818"
cursor.execute(statement_a, [city_id])

# (b)
statement_b = "SELECT * FROM city WHERE id = 1818"
cursor.execute(statement_b)

For both queries, I would expect db.statement to always be set and always sanitized so that it would be "SELECT * FROM city WHERE id = %s" for both (a) and (b). Or maybe a different substitution like "SELECT * FROM city WHERE id = 0".

If the user has opted into retaining the original values, what would they be? Would it be an extraction from (b) so that both (a) and (b) would set "original" parameters ["1818"]? Or would it be an interpolation for (a) so that both (a) and (b) set "original" statement "SELECT * FROM city WHERE id = 1818"? Or something else?

Looking at some existing code: In Python DBAPI instrumentor it captures db.statement.parameters as args (in this example it'd be ["1818"] and only for (a)). In the Python mongodb instrumentor we're capturing the full statement -- I don't know mongo very well so maybe there's a reason for this.