Open o-shevchenko opened 1 week ago
I've created a simplified test to show the performance:

```kotlin
import com.google.cloud.bigquery.ConnectionSettings
import java.time.Duration
import java.time.Instant
import java.time.format.DateTimeFormatter

@Test
fun `test read`() {
    val sql =
        """
        SELECT *
        FROM `pr`
        """.trimIndent().replace("\n", " ")
    val connectionSettings = ConnectionSettings.newBuilder()
        .setRequestTimeout(300)
        .setUseReadAPI(true)
        .setMaxResults(5000)
        .setUseQueryCache(true)
        .build()
    val connection = bigQueryOptionsBuilder.build().service.createConnection(connectionSettings)
    val bqResult = connection.executeSelect(sql)
    val resultSet = bqResult.resultSet
    var n = 1
    var lastTime = Instant.now()
    while (++n < 1_000_000 && resultSet.next()) {
        if (n % 30_000 == 0) {
            val now = Instant.now()
            val duration = Duration.between(lastTime, now)
            println("ROW $n Time: ${duration.toMillis()} ms ${DateTimeFormatter.ISO_INSTANT.format(now)}")
            lastTime = now
        }
    }
}
```
```
ROW 30000 Time: 5516 ms 2024-11-14T12:35:54.354169Z
ROW 60000 Time: 11230 ms 2024-11-14T12:36:05.585005Z
ROW 90000 Time: 5645 ms 2024-11-14T12:36:11.230378Z
ROW 120000 Time: 5331 ms 2024-11-14T12:36:16.561915Z
ROW 150000 Time: 5458 ms 2024-11-14T12:36:22.019994Z
ROW 180000 Time: 5391 ms 2024-11-14T12:36:27.411807Z
```

That's ~5 seconds per 30,000 rows (≈5,400 rows/sec).
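One thing I'm also double-checking (this is an assumption on my part, based on reading the connection implementation rather than documented behavior): even with `setUseReadAPI(true)`, the connection may fall back to the regular REST path unless the result set crosses internal size thresholds. If I read the builder API correctly, those thresholds can be set explicitly:

```kotlin
// Assumption: setMinResultSize and setTotalToPageRowCountRatio exist on
// ConnectionSettings.Builder in 2.43.x and gate when the Storage Read API
// is actually used; the values below are illustrative, not recommendations.
val settings = ConnectionSettings.newBuilder()
    .setUseReadAPI(true)
    .setMinResultSize(10_000)          // switch to Read API once results exceed this many rows
    .setTotalToPageRowCountRatio(2)    // ...and total/page row ratio exceeds this
    .build()
```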
Related issue with benchmark: https://github.com/googleapis/java-bigquery/pull/3574
After fixing the test I've got the following results.
```
Benchmark                                             (rowLimit)  Mode  Cnt       Score       Error  Units
ConnImplBenchmark.iterateRecordsUsingReadAPI              500000  avgt    3   76549.893 ± 14496.839  ms/op
ConnImplBenchmark.iterateRecordsUsingReadAPI             1000000  avgt    3  154957.127 ± 25916.110  ms/op
ConnImplBenchmark.iterateRecordsWithBigQuery_Query        500000  avgt    3   82508.807 ± 17930.275  ms/op
ConnImplBenchmark.iterateRecordsWithBigQuery_Query       1000000  avgt    3  165717.219 ± 86960.648  ms/op
ConnImplBenchmark.iterateRecordsWithoutUsingReadAPI       500000  avgt    3   84504.175 ± 36823.590  ms/op
ConnImplBenchmark.iterateRecordsWithoutUsingReadAPI      1000000  avgt    3  165142.367 ± 99899.991  ms/op
```
That's not what we expected after reading the doc: https://cloud.google.com/blog/topics/developers-practitioners/introducing-executeselect-client-library-method-and-how-use-it/
Comparison with the chart estimates for 1,000,000 rows:
- Storage Read API: the chart shows ~50,000 rows/sec, but I measured 6,453 rows/sec.
- tabledata.list API: estimated at ~5,000 rows/sec; I got a similar result, 5,917 rows/sec.
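For reference, the throughput figures I quoted can be reproduced directly from the JMH scores above (rows divided by seconds per op):

```kotlin
// Throughput derived from the JMH avgt scores (ms/op) in the results above.
fun rowsPerSec(rows: Int, msPerOp: Double): Long = Math.round(rows / (msPerOp / 1000.0))

fun main() {
    // Read API, 1,000,000 rows: 154957.127 ms/op
    println(rowsPerSec(1_000_000, 154957.127)) // 6453 rows/sec
    // Without Read API (tabledata.list), 500,000 rows: 84504.175 ms/op
    println(rowsPerSec(500_000, 84504.175))    // 5917 rows/sec
}
```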
Is there anything I missed?
We use the executeSelect API to run SQL queries and read results from BigQuery, and we expected good performance based on the documentation. Instead, reading data using executeSelect is extremely slow: reading 100,000 rows takes 23,930 ms. Profiling showed no prominent hot spots where most of the time is spent. Are there any recent changes that might cause performance degradation for this API? Do you have a benchmark to understand what performance we should expect? Thanks!
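To put that in perspective against the blog post's chart (the ~50,000 rows/sec figure is taken from there; treat my arithmetic as a back-of-the-envelope check):

```kotlin
fun main() {
    val expectedSec = 100_000 / 50_000.0   // at the chart's ~50k rows/sec: 2.0 s for 100k rows
    val measuredSec = 23_930 / 1000.0      // observed: 23.93 s
    println(measuredSec / expectedSec)     // roughly 12x slower than the chart suggests
}
```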
Environment details
com.google.cloud:google-cloud-bigquery:2.43.3

Code example
See the test above.