newrelic / newrelic-java-agent

The New Relic Java agent
Apache License 2.0

Negative `totalCallTime`, `totalExclusiveTime` and very big `sumOfSquares` received #1891

Closed KGOH closed 4 months ago

KGOH commented 5 months ago

Description

I receive stats with bad data from a single Java app, both within one transaction and, as a result, in the whole-system metrics. I haven't encountered such behavior in other Java apps running the same agent. callCount and min/max call time may be incorrect too, though they don't look wrong in this example:

callCount:          1742
totalCallTime:      -638.41095
totalExclusiveTime: -638.41095
minCallTime:        0.005053215
maxCallTime:        17.494854
sumOfSquares:       4.207051618499706e25 // (NOTE the e25 here)

callCount:          406
totalCallTime:      -2350984400
totalExclusiveTime: -2350984400
minCallTime:        0.00000238
maxCallTime:        0.011196813
sumOfSquares:       278452658588430.66

Expected Behavior

I expect:

  1. Positive totalCallTime and totalExclusiveTime.
  2. sumOfSquares <= maxCallTime*maxCallTime*callCount (see the sanity-check sketch below)
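
A quick sanity-check sketch shows why the reported sumOfSquares is implausible. This is standalone illustrative code, not agent code; the class name is made up and the values are copied from the first metric set above. Since every individual call time is at most maxCallTime, the sum of squared call times is bounded by callCount * maxCallTime^2, roughly 5.3e5 here, nowhere near 4.2e25:

// Standalone sanity-check sketch (hypothetical class, not agent code).
public class MetricSanityCheck {
    public static void main(String[] args) {
        // Values copied from the first metric set reported above.
        long callCount = 1742;
        double maxCallTime = 17.494854;
        double sumOfSquares = 4.207051618499706e25;

        // Every individual call time is <= maxCallTime, so the sum of squared
        // call times can never exceed callCount * maxCallTime^2 (~5.3e5 here).
        double upperBound = callCount * maxCallTime * maxCallTime;

        System.out.printf("upper bound = %.1f%n", upperBound);
        System.out.println("reported value plausible? " + (sumOfSquares <= upperBound)); // false
    }
}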

Steps to Reproduce

I have no idea how to reproduce this. It appears to be random behavior from a single app.

Your Environment

Java agent version 8.11.0, OpenJDK 17.0.8

Additional context

Looks like my problem is related to issue https://github.com/newrelic/newrelic-java-agent/issues/709 and to PR https://github.com/newrelic/newrelic-java-agent/pull/1685/files

workato-integration[bot] commented 5 months ago

https://new-relic.atlassian.net/browse/NR-268264

kford-newrelic commented 5 months ago

Hi @KGOH thanks for letting us know about this issue. A few clarifying questions:

Have you noticed this occurring more than once? If yes, approximately how often does it occur? Any repeatable pattern that you can discern?

KGOH commented 5 months ago

@kford-newrelic this happens constantly, approximately every 15 minutes, but only for a specific transaction in a specific Java app. Many metrics within this transaction are broken, such as the database access metrics. I suspect that the transaction may be asynchronous, as I see the lettuce lib in the call stack
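
For context, here is a minimal, hypothetical sketch of lettuce's asynchronous command API; it is not code from the app in question, and the Redis URI and key are made up. Commands issued through RedisAsyncCommands are dispatched immediately and complete later on another thread, which may be why the RedisAsyncCommandsImpl/dispatch entries appear in the call stack:

// Hypothetical sketch of asynchronous lettuce usage; not the app's real code.
import io.lettuce.core.RedisClient;
import io.lettuce.core.RedisFuture;
import io.lettuce.core.api.StatefulRedisConnection;
import io.lettuce.core.api.async.RedisAsyncCommands;

public class LettuceAsyncSketch {
    public static void main(String[] args) {
        // Assumed local Redis; not the app's real configuration.
        RedisClient client = RedisClient.create("redis://localhost:6379");
        StatefulRedisConnection<String, String> conn = client.connect();
        RedisAsyncCommands<String, String> async = conn.async();

        // The command returns immediately with a RedisFuture and
        // completes later on an I/O thread.
        RedisFuture<String> future = async.get("some-key");
        future.thenAccept(value -> System.out.println("value = " + value));

        // Clean up once the command has completed (error handling omitted).
        future.whenComplete((value, error) -> {
            conn.close();
            client.shutdown();
        });
    }
}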

kford-newrelic commented 5 months ago

Any chance you can create a repro app from that transaction that you can share with us?

If we can get a repro, we might stand a chance of figuring this out...

KGOH commented 5 months ago

@kford-newrelic unfortunately, I don't know how to reproduce this :(

jtduffy commented 5 months ago

Is there anything unique about the problem transaction? Does it make external calls or use the Process API in any way? Does it use any sort of async framework? Would it be possible to get some source code or pseudo-code?

We're obviously looking for any pointers to the underlying cause, since we've been unable to repro this ourselves.

KGOH commented 5 months ago

@jtduffy here are the transaction metrics I managed to extract. I erased sensitive information, but this should still convey the overall idea of the technologies used:

|                                                                                                   :name | :avg_calls | :avg_time |
|---------------------------------------------------------------------------------------------------------+------------+-----------|
|                                                                       Datastore/operation/Redis/PEXPIRE |      5.214 |   -30.283 |
|                                                                         Datastore/operation/Redis/RPUSH |     29.951 |   -17.632 |
|                                                                          Datastore/operation/Redis/HGET |      6.600 |   -10.324 |
|                                                                                              Java/other |     23.758 |    -8.868 |
|                                                                 Datastore/statement/Postgres/.../select |     19.414 |    -6.124 |
|                                                                       Datastore/operation/Redis/HEXISTS |      1.196 |    -2.473 |
|                                                                           Datastore/operation/Redis/GET |      0.000 |    -2.529 |
|                                                             Datastore/statement/MongoDB/...Event/insert |      0.399 |    -1.712 |
|                                                                         Datastore/operation/Redis/HSCAN |      0.399 |    -1.712 |
|                             Java/io.lettuce.core.cluster.RedisAdvancedClusterAsyncCommandsImpl/dispatch |     18.740 |    46.605 |
|                                        Java/sun.net.www.protocol.http.HttpURLConnection/getResponseCode |     13.184 |    42.346 |
|                                        Java/sun.net.www.protocol.http.HttpURLConnection/getOutputStream |     12.194 |    35.980 |
|                           Java/org.apache.cxf.binding.soap.interceptor.SoapOutInterceptor/handleMessage |     14.187 |    34.856 |
|                                         Java/sun.net.www.protocol.http.HttpURLConnection/getInputStream |     10.981 |    33.421 |
|                                                   Java/com.zaxxer.hikari.HikariDataSource/getConnection |      9.171 |    28.940 |
|                                                                          Datastore/operation/Redis/HSET |     34.804 |    22.405 |
|                                                                           Datastore/operation/Redis/DEL |     33.911 |     9.934 |
|                                                    Java/io.lettuce.core.RedisAsyncCommandsImpl/dispatch |      0.787 |     6.850 |
|                                                                 Datastore/statement/Postgres/.../select |      0.598 |     3.483 |
|                                                                        Datastore/operation/Redis/LRANGE |      2.570 |     2.980 |
|                                    Java/org.apache.cxf.interceptor.StaxOutEndingInterceptor/handleFault |      0.094 |     2.127 |
|                                                             Datastore/statement/MongoDB/...Event/insert |      0.000 |     1.712 |
|                                                          External/.../HttpURLConnection/getResponseCode |     17.495 |     1.428 |
|                                                                          Datastore/operation/Redis/HDEL |      1.001 |     0.883 |
|                                                                        Datastore/operation/Redis/LINDEX |      8.631 |     0.710 |
|    Java/org.apache.cxf.binding.soap.interceptor.SoapOutInterceptor$SoapOutEndingInterceptor/handleFault |      0.024 |     0.364 |
|                                                                 Datastore/statement/Postgres/.../select |     19.419 |     0.253 |
| Filter/org.springframework.security.web.context.request.async.WebAsyncManagerIntegrationFilter/doFilter |      0.000 |     0.000 |
|     Java/org.apache.cxf.interceptor.MessageSenderInterceptor$MessageSenderEndingInterceptor/handleFault |      0.374 |     0.000 |
|           Filter/org.springframework.security.web.authentication.www.BasicAuthenticationFilter/doFilter |      0.000 |     0.000 |
|            Filter/org.springframework.boot.web.servlet.DelegatingFilterProxyRegistrationBean$1/doFilter |      0.000 |     0.000 |
|              Filter/org.springframework.boot.web.servlet.filter.OrderedCharacterEncodingFilter/doFilter |      0.000 |     0.000 |
|                Filter/org.springframework.boot.actuate.metrics.web.servlet.WebMvcMetricsFilter/doFilter |      0.000 |     0.000 |
|                 Java/org.apache.cxf.binding.soap.interceptor.SoapHeaderOutFilterInterceptor/handleFault |      0.299 |     0.000 |
|                  Java/org.apache.cxf.binding.soap.interceptor.SoapPreProtocolOutInterceptor/handleFault |      0.000 |     0.000 |
|                    Filter/org.springframework.boot.web.servlet.filter.OrderedFormContentFilter/doFilter |      0.000 |     0.000 |
|                         Filter/org.springframework.security.web.session.DisableEncodeUrlFilter/doFilter |      0.000 |     0.000 |
|                           Java/sun.net.www.protocol.https.DelegateHttpsURLConnection/getResponseMessage |      0.000 |     0.000 |
|                           Java/org.apache.cxf.jaxws.interceptors.WrapperClassOutInterceptor/handleFault |      0.075 |     0.000 |
|                             Java/org.apache.cxf.binding.soap.interceptor.SoapOutInterceptor/handleFault |      0.002 |     0.016 |
|                              Filter/org.springframework.security.web.header.HeaderWriterFilter/doFilter |      0.000 |     0.000 |
|                               Java/sun.net.www.protocol.https.DelegateHttpsURLConnection/getInputStream |      0.000 |     0.000 |
|                                Java/org.glassfish.jersey.server.ServerRuntime$Responder/processResponse |      0.000 |     0.000 |
|                                 Java/org.apache.cxf.jaxws.interceptors.HolderOutInterceptor/handleFault |      0.005 |     0.000 |
|                                    Java/org.apache.cxf.wsdl.interceptors.BareOutInterceptor/handleFault |      0.006 |     0.068 |
|                                    Java/org.apache.cxf.jaxws.interceptors.SwAOutInterceptor/handleFault |      0.019 |     0.000 |
|                                    Java/org.apache.cxf.interceptor.MessageSenderInterceptor/handleFault |      0.000 |     0.000 |
|                                    Java/org.apache.cxf.interceptor.AttachmentOutInterceptor/handleFault |      0.000 |     0.001 |
|                                     Filter/org.springframework.web.filter.RequestContextFilter/doFilter |      0.000 |     0.000 |
|                                          Java/org.apache.cxf.ws.policy.PolicyOutInterceptor/handleFault |      0.001 |     0.000 |
|                                          Java/org.apache.cxf.interceptor.StaxOutInterceptor/handleFault |      0.001 |     0.008 |
|                                           Servlet/org.glassfish.jersey.servlet.ServletContainer/service |      0.000 |     0.000 |
|                                            Java/javax.servlet.ServletRequestListener/requestInitialized |      0.000 |     0.000 |
|                                             Filter/org.apache.tomcat.websocket.server.WsFilter/doFilter |      0.000 |     0.000 |
|                                              Java/org.glassfish.jersey.server.ApplicationHandler/handle |      0.000 |     0.000 |
|                                               Filter/org.springframework.web.filter.CorsFilter/doFilter |      0.000 |     0.000 |
|                                                         Java/com...LoggingOutputInterceptor/handleFault |      0.000 |     0.000 |
|                                                          External/.../HttpURLConnection/getResponseCode |      0.000 |     0.000 |
|                                                           Filter/com...RequestIdentifierFilter/doFilter |      0.000 |     0.000 |
|                                                             Datastore/statement/MongoDB/...Event/insert |      0.399 |     0.000 |
|                                                                 Datastore/statement/Postgres/.../select |      0.598 |     0.003 |
|                                                                 Datastore/statement/Postgres/.../select |      0.598 |     0.001 |
|                                                                     Datastore/operation/Redis/PEXPIREAT |      0.000 |     0.000 |
|                                                                        External/.../CommonsHttp/execute |      0.000 |     0.000 |
|                                                                        External/.../CommonsHttp/execute |      0.000 |     0.000 |
|                                                                        Datastore/operation/Redis/INCRBY |      0.000 |     0.000 |
|                                                                         Datastore/operation/Redis/SETEX |      0.000 |     0.000 |
|                                                                         Datastore/operation/Redis/HKEYS |      0.000 |     0.000 |
|                                                                           Java/com...RestImpl/search... |      0.000 |     0.000 |
|                                                                            Java/com...Service/search... |      0.000 |     0.000 |
|                                                                            Java/com...RestClient/get... |      0.000 |     0.000 |
|                                                                                 External/...Client/POST |      0.000 |     0.000 |

KGOH commented 5 months ago

For a counterexample, here's another transaction that also uses Redis and lettuce but never produces negative values:

                                                                       External/.../CommonsHttp/execute 3.66  174.81 ms
                                                                             Java/com...RestImpl/get... 1.00  49.51 ms
                                                                      Datastore/operation/Redis/PEXPIRE 39.87 20.45 ms
                                             Java/org.glassfish.jersey.server.ApplicationHandler/handle 0.80  5.95 ms
                                                                                             Java/other 63.98 3.21 ms
                                                                         Datastore/operation/Redis/HGET 7.00  2.95 ms
                                                                         Datastore/operation/Redis/HSET 1.95  1.02 ms
                                                   Java/io.lettuce.core.RedisAsyncCommandsImpl/dispatch 0.00  28.26 µs
                            Java/io.lettuce.core.cluster.RedisAdvancedClusterAsyncCommandsImpl/dispatch 0.02  21.83 µs
                                            Java/org.apache.http.impl.client.InternalHttpClient/execute 0.00  7.64 µs
                                           Java/javax.servlet.ServletRequestListener/requestInitialized 0.00  5.63 µs
                               Java/org.glassfish.jersey.server.ServerRuntime$Responder/processResponse 0.00  2.46 µs
                                                          Filter/com...RequestIdentifierFilter/doFilter 0.00  1.63 µs
          Filter/org.springframework.security.web.authentication.www.BasicAuthenticationFilter/doFilter 0.00  684.94 ns
                                          Servlet/org.glassfish.jersey.servlet.ServletContainer/service 0.00  440.40 ns
             Filter/org.springframework.boot.web.servlet.filter.OrderedCharacterEncodingFilter/doFilter 0.00  110.96 ns
                                    Filter/org.springframework.web.filter.RequestContextFilter/doFilter 0.00  103.87 ns
                                              Filter/org.springframework.web.filter.CorsFilter/doFilter 0.00  41.10 ns
Filter/org.springframework.security.web.context.request.async.WebAsyncManagerIntegrationFilter/doFilter 0.00  28.14 ns

KGOH commented 5 months ago

@jtduffy Is this info helpful?

jtduffy commented 5 months ago

It's more data, but unfortunately I don't think it gets us closer to a solid repro app. This has been a sporadic issue for years and we've never been able to track it down. It just seems like there's no common denominator across applications when it does pop up. Thanks again for the report and the information you provided.

jasonjkeller commented 5 months ago

@KGOH This is probably a long shot, but if the same transaction is still regularly generating these negative/large metric values every 15 minutes, can you go into the Metric Explorer and check if there are any metrics prefixed with Supportability/API/LegacyAsync/?

If there are, try adding this legacy_async_api_skip_suspend: true config to your agent config yaml (indented 2 spaces under the pre-existing common stanza):

common: &default_settings
  legacy_async_api_skip_suspend: true

I believe the agent should detect this yaml change and apply it dynamically without requiring a JVM restart; picking up the change usually takes only a few minutes. After 15-30 minutes pass, do you still see the negative/large metric values associated with the problematic transaction?

Again, this is a long shot, but based on some of the metrics for the problematic transaction I'm wondering if the async servlet/Jetty continuations APIs are being used under the hood and if this might be related to our legacy async API that traces such activity.
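
For context on the paragraph above, the following is a minimal, hypothetical sketch of the Servlet async pattern being referred to; it is not code from the affected application, and the servlet name and URL pattern are made up. It only illustrates the kind of startAsync/complete activity that the legacy async API mentioned above traces:

// Hypothetical illustration of the Servlet 3.0 async pattern;
// not taken from the affected application.
import javax.servlet.AsyncContext;
import javax.servlet.annotation.WebServlet;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

@WebServlet(urlPatterns = "/async-demo", asyncSupported = true)
public class AsyncDemoServlet extends HttpServlet {
    @Override
    protected void doGet(HttpServletRequest req, HttpServletResponse resp) {
        // Suspend the request; the container thread is released.
        AsyncContext ctx = req.startAsync();
        // Continue the work on another thread and complete the request there.
        ctx.start(() -> {
            try {
                ctx.getResponse().getWriter().write("done");
            } catch (Exception e) {
                // ignored for brevity
            } finally {
                ctx.complete();
            }
        });
    }
}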

kford-newrelic commented 4 months ago

@KGOH apologies, but we don't have enough information to investigate this further. If you later identify a pattern or other details that might help us reproduce this issue, do let us know.