jdbc-observations / datasource-micrometer

Micrometer Observation API instrumentation and Spring Boot 3 Auto-Configuration for JDBC APIs
Apache License 2.0
50 stars 8 forks source link

Connection Observation Issues #26

Closed pbelousov closed 7 months ago

pbelousov commented 7 months ago

We started using the library in Spring Boot 3 projects configured to send metrics and traces through OpenTelemetry. After rolling out the service, we noticed that numerous requests use the same traceId, collapsing all spans under one trace.

After spending a couple of hours, I found that the issue relates to the CONNECTION trace type. When we exclude it from tracing (jdbc.includes=QUERY,FETCH), everything goes well.

From what I see in the DataSourceObservationListener class, you start observations for the CONNECTION type but only stop them in case of exceptions or when the connection has been closed. What if the connection is opened and not closed yet? I'm not an expert in writing custom observation handlers, but at first glance, it looks suspicious, as other examples I've seen always stop observations created within the scope of the request. Actually, I feel like QUERY and RESULT_SET observations can also have a similar issue if any exceptions occur before stopping them.

If my understanding is incorrect, please help me with the next steps to investigate the original issue of having the same traceId for multiple requests.

Thank you!

pbelousov commented 7 months ago

Actually, I feel like QUERY and RESULT_SET observations can also have a similar issue if any exceptions occur before stopping them.

Yep... after some period of time (some issues with stopping observations?) an application starts reporting the same traceIds for requests that even don't use JDBC...

ttddyy commented 7 months ago

It sounds like you have instrumented the JDBC object(DataSource, Connection, etc) used by the connection pool.

You need to instrument the datasource that your application directly uses. Typically: Application -> Instrumented proxy DataSource -> Connection Pool

You want to observe how your application interacts with DataSource, not how the connection pool library interacts with JDBC objects.

Connection pools manage connections independently from the application. For example, it does not close connections in order to reuse them. However, applications should always follow the proper JDBC API lifecycle, closing resources(connection, statement, resultset); otherwise, it causes resource leaks.

Yep... after some period of time (some issues with stopping observations?) an application starts reporting the same traceIds for requests that even don't use JDBC...

So, looks like the problem resides somewhere else. I'll go ahead and close this issue.

pbelousov commented 7 months ago

It sounds like you have instrumented the JDBC object

But that's what the library does automatically in the bean post processor.

You need to instrument the datasource that your application directly uses.

Does that mean, in our case, we cannot use datasource-micrometer-spring-boot and have to handle all the instrumentation ourselves?

ttddyy commented 7 months ago

But that's what the library does automatically in the bean post processor.

Please debug and make sure it is instrumenting the appropriate datasource.

Is your application manually managing the connections and intentionally leaving them open? From the observation perspective, if connections are not closed, there is no clear indication of when the interaction finishes, so it cannot close the observation. Getting a connection and closing it is the lifecycle of the observation in this library, which aligns with the JDBC API usage.

I suggest investigating why your application leaves JDBC resources open and consider managing them appropriately.

pbelousov commented 6 months ago

Well... I finally debugged the issue, and it turns out you fixed it in the datasource-proxy 1.10 release. Let me explain for the sake of history.

We use a similar approach to what is described in the following post - we have Read-Only (RO) and Read+Write (RW) datasource beans. Additionally, we have a routing bean (which extends AbstractRoutingDataSource) responsible for directing transactions to the correct underlying datasource based on whether it is a readonly transaction or not.

All these beans are wrapped by ProxyDataSource, resulting in a chain of classes like: ProxyDataSource1 -> AbstractRoutingDataSourceImplementation -> ProxyDataSource2 -> ReadOnlyDataSource.

When Spring Boot calls the getConnection method, the observation begins. When the connection is closed, the observation stops. Easy. However, in our case, this process occurred in the wrong order - an observation was initially created from ProxyDataSource2, and subsequently new one created from ProxyDataSource1. Both were closed in the same order, closing the nested observation first.

The issue arose from the incorrect order of calling the 'beforeMethod' of ProxyDataSource2 first (here is the line), which was the root cause of the problem.

This problem was addressed (whether intentionally or not) with the introduction of ProxyLogicSupport. However, since this fix was only released three weeks ago and the datasource-micrometer library still uses version 1.8.1, we encountered the problem. Upon manually upgrading to version 1.10 of datasource-proxy, the issue disappeared (at least locally).

I don't think our use case is too frequent, but anyway, I think upgrading datasource-proxy can make sense.

ttddyy commented 5 months ago

@pbelousov Thanks for the detailed explanation. Good to hear that the latest datasource-proxy fixed the issue. The fix was not intentional for this usecase but good it covered here.

About the dependency version upgrade, I thought of upgrading the datasource-proxy version in the patch release. The problem is that datasource-proxy does not follow semantic versioning. Even though datasource-proxy 1.10/1.9/1.8.x are compatible, I just left the version 1.8.1 in datasource-icrometer since it becomes a minor version bump and not a patch version release.

So, for now, please override to use datasource-proxy 1.10 or above, as you did.

I need to think about versioning on the datasource-proxy side.