googleapis / java-spanner

Apache License 2.0
64 stars 116 forks source link

Spanner latency issue on streaming read #2757

Closed amb-sebastian-podgorski-pt closed 5 months ago

amb-sebastian-podgorski-pt commented 11 months ago

I found our service has problems with latency and every single read statements is taking more than 40ms.

I tried to use native/derived queries and it always takes more than 40ms.

Reading data is really fast, it is taking below 1ms but from logs I can see that establishing grcp connection to Spanner takes most of the time.

logs from reading

10:16:36.485 [http-nio-8080-exec-2] [,] DEBUG o.s.web.servlet.DispatcherServlet - GET "/test", parameters={} 10:16:36.485 [http-nio-8080-exec-2] [,] DEBUG o.s.w.s.m.m.a.RequestMappingHandlerMapping - Mapped to edu.Controller#test() 10:16:36.485 [http-nio-8080-exec-2] [,] DEBUG o.s.o.j.s.OpenEntityManagerInViewInterceptor - Opening JPA EntityManager in OpenEntityManagerInViewInterceptor 10:16:36.485 [http-nio-8080-exec-2] [,] DEBUG c.g.c.s.d.s.c.SpannerTransactionManager - Creating new transaction with name [Repository.findById]: PROPAGATION_REQUIRED,ISOLATION_DEFAULT,readOnly 10:16:36.485 [http-nio-8080-exec-2] [,] DEBUG c.g.c.s.d.s.core.SpannerTemplate - Executing read on table test_table with keys: {[6fb703b4-59b4-4381-abd5-0594924c093c]} and columns: id, name 10:16:36.485 [http-nio-8080-exec-2] [,] DEBUG c.g.c.s.d.s.core.SpannerTemplate - Read elapsed milliseconds: 0 10:16:36.487 [grpc-default-worker-ELG-1-5] [,] DEBUG i.g.n.s.i.g.netty.NettyClientHandler - [id: 0xea23po983, L:/192.168.1.1:50000 - R:spanner.googleapis.com/200.200.200.200:443] OUTBOUND HEADERS: streamId=11 headers=GrpcHttp2OutboundHeaders[:authority: spanner.googleapis.com:443, :path: /google.spanner.v1.Spanner/BeginTransaction, :method: POST, :scheme: https 10:16:36.487 [grpc-default-worker-ELG-1-5] [,] DEBUG i.g.n.s.i.g.netty.NettyClientHandler - [id: 0xea23po983, L:/192.168.1.1:50000 - R:spanner.googleapis.com/200.200.200.200:443] OUTBOUND DATA: 10:16:36.525 [grpc-default-worker-ELG-1-5] [,] DEBUG i.g.n.s.i.g.netty.NettyClientHandler - [id: 0xea23po983, L:/192.168.1.1:50000 - R:spanner.googleapis.com/200.200.200.200:443] INBOUND HEADERS:

as you can see from timestamps of the last two logs it takes 39ms, I tried to use read only transactions, makes read without transaction, set rcp priority on SpannerTemplate to low but nothing helps and simple call to rest endpoint which is fetching object by id takes more than 40ms when reading from db takes less than 0ms.

Application details

Java 17 google-cloud-spanner - 6.55 spring-cloud-gcp-starter-data-spanner - 4.8.4

arpan14 commented 11 months ago

A few queries to start investigating

  1. What's your session configuration? Please provide the min/max sessions which you are configuring using SessionPoolOptions in SpannerOptions.
  2. What is the max/average QPS for your application?
  3. What is the latency percentile? Can you provide details about the p99, p95 and p50?
  4. Are you facing troubles post using a new client version? Or is this a general problem which has always been present?
  5. Could you provide some details about your schema and a sample query which you are executing?
amb-sebastian-podgorski-pt commented 11 months ago
  1. Default values (max = -1, min = -1)
  2. I don't know yet. I am trying to achieve 1k request per sec for one instance now it is around 100 RPS
  3. p50 - 85ms, p95 - 136ms, p99 - 190ms
  4. It is general problem, I've created a simple app with really simple table and I am facing same issue
  5. Schema is really basic, some strings / integers, query has to be read to face this issue

it looks like for read queries fetching meta data using stream reading takes up to 50ms each time, reading elapsed time is from 0 to 1ms but fetching metadata takes more most of the time.

I also tried native queries / derived / using jdbc and still I am facing same issue.

Exmaple query

select * from test_table limit 1;

amb-sebastian-podgorski-pt commented 11 months ago

For every query stream seems to be in state not_ready what causes starting new stream for every request and it fetches metadata about the table using streaming and it takes more than 40ms which seems to be very long I don't know. As long as you don't call next(); on result set then it is fine but calling next function on resultSet from package com.google.cloud.spanner causes starting this long stream process

arpan14 commented 11 months ago

The default session pool options have 100 min sessions and 400 max sessions. Can you share with me the code snippet how the client object is being created?

amb-sebastian-podgorski-pt commented 10 months ago

@arpan14 the client is being created with default values using spring-cloud-gcp-autoconfigure:5.0.0

config

The values are set to -1 for min and max connections

olavloite commented 8 months ago

@amb-sebastian-podgorski-pt I think that there is an underlying misunderstanding here.

| Reading data is really fast, it is taking below 1ms but from logs I can see that establishing grcp connection to Spanner takes most of the time.

This observation is where the misunderstanding is. The Java client does not execute the query directly when you call the executeQuery(..) method. Instead, the execution of the query is delayed until the first call to ResultSet#next(). That means that:

  1. The less-than-1-ms time that you are seeing when calling executeQuery(..) is not and indication that reading the data is really fast. It is only an indication that the client library is directly ready to execute the query.
  2. The 40ms time that you are seeing the first time that you are calling ResultSet#next() is the actual query time + network latency.

40ms is however a too high value for a very simple query, which indicates that there might be more going on here. It is not clear to me whether the select * from test limit 1 query is the one that you are testing with in all cases, but the best way to test query latency is by:

  1. Execute a query like select * from test where id=@id (so use a parameterized query to select one row based on the primary key).
  2. Make sure you execute the query from a Google Cloud Compute Engine VM (or similar) in the same region as where your Cloud Spanner database is. That will minimize network latency, and also reduce any randomness in network latency that might be introduced if you run from for example your local laptop.

Using the above strategy should bring a test query like the above down to around 5ms. Note that the very first query might be slower, as the first time that Cloud Spanner sees a query, it needs to parse and plan it. Subsequent executions will re-use the plan that has already been compiled and cached.

olavloite commented 6 months ago

@amb-sebastian-podgorski-pt Is this still an issue? If so, could you try the steps that I described in the previous comment to verify the actual latency for your query?

amb-sebastian-podgorski-pt commented 5 months ago

@olavloite it is better now but I think it was related to small Spanner and application resources on testing environment. While testing I tweak it and it is at the required level for our application. Thanks for you help. I think we can close this one.