neo4j.core.bolt.request activesupport instrumentation is incorrect

yourpalal commented 3 years ago

The neo4j.core.bolt.requestevent duration seems to generally be very short, often 0ms. This is even the case when running queries that take multiple seconds to run.

Unfortunately, this makes it very hard to get an idea of how neo4j queries are performing, what queries are slow, etc.

My guess at what's wrong here is that it's measuring the time to send the query and get some response, but not to get the full response data. I could be wrong though. I tried to dig into this but I'm not familiar at all with neo4j-ruby-driver and seabolt, so I wasn't able to make much progress.

I was however able to confirm:

the code where this instrumentation is defined
- that the timing calculations in the logging code are correct

example

Below is the timing information for a query being run in the neo4j web interface. This query is shown in the log as taking 0ms. In both cases, the neo4j server is being run on my machine. The query itself is limited to 500 returned entities, but does basically a full scan of my neo4j db.

Runtime information:

versions:

    activegraph (10.1.0)
    neo4j-ruby-driver (1.7.4)
    activesupport (6.1.3)

    ruby 3.0.0 (mri)

neo4j: 4.0.0

yourpalal commented 3 years ago

@klobuczek do you have any ideas on where to start looking in order to fix this? I have a feeling that improving the instrumentation will probably involve getting a better idea of how the driver works?

If you can provide some pointers on where to look I will put together a PR.

yourpalal commented 2 years ago

I noticed that the run handlers in the neo4j-ruby-driver classes have access to a summary object with nice run-time info that appears to come straight from the driver, or maybe from the server.

The run handlers also have a lifecycle that allows for recognizing when a query has actually finished pulling all the related data.

Using this, I did a little monkey patching in a rails initializer to hack together a proof-of-concept!

class Neo4j::Driver::Internal::Handlers::TransactionPullAllResponseHandler
  def after_success(_)
    # instrument new event - finished queries
    ActiveSupport::Notifications.instrument('neo4j.core.bolt.result', summary)
  end
end

module ActiveGraph
  module Base
    def self.subscribe_to_request(&logger_proc)
      super(&logger_proc)

      ActiveSupport::Notifications.subscribe('neo4j.core.bolt.result') do |_, start, finish, _id, summary|
        query = summary.statement.text
        available = summary.result_available_after
        consumed = summary.result_consumed_after

        yield "#{ANSI::RED}#{query}#{ANSI::CLEAR} #{ANSI::BLUE}BOLT:#{ANSI::CLEAR} #{ANSI::YELLOW}#{available}ms + #{consumed}ms#{ANSI::CLEAR}"
      end
    end
  end
end

Obviously this isn't perfect, but I think it shows that there's good information to be had that can be used for instrumentation!

This also confirmed that the current instrumentation numbers don't tell the whole story. In any substantial query, the durations logged with my new instrumentation are longer than those given by the standard neo4j.core.bolt.request instrumentation.

@klobuczek if you can provide a pointer on what the ideal form of this would be I can clean it up and turn it into a nice PR.

presumably all instrumentation should be in the activegraph gem. Would monkey-patching TransactionPullAllResponseHandler (or another handler) be acceptable in that gem?
the order of the neo4j.core.bolt.query and neo4j.core.bolt.result (my addition) instrumentations is no longer be guaranteed. Would it be okay to merge them into a single instrumentation that could thereby ensure the query text + times are correctly associated in the log file?
do you foresee any problems with this approach? One issue I can see is that the time it takes to pull all results is dependent on the time the user application takes to iterate through each on the query. That said, I think this is still very helpful information. Quite often, the results are pulled into an array, which should not add much overhead.

neo4jrb / activegraph

neo4j.core.bolt.request activesupport instrumentation is incorrect #1647

example

Runtime information: