googleapis / java-bigquerystorage

Apache License 2.0
61 stars 83 forks source link

Reading data using the executeSelect API is slow #2764

Open o-shevchenko opened 1 week ago

o-shevchenko commented 1 week ago

We use executeSelect API to run SQL query and read results from BigQuery. We expected a good speed based on

Reading data using executeSelectAPI is extremely slow. Reading of 100_000 rows takes 23930 ms. The profiling showed no prominent places where we spent most of the time.

Are there any recent changes that might cause performance degradation for such an API? Do you have a benchmark to understand what performance we should expect? Thanks!

Environment details

  1. com.google.cloud:google-cloud-bigquery:2.43.3
  2. Mac OS Sonoma M1
  3. Java version: 17

Code example

Mono.fromCallable { bigQueryOptionsBuilder.build().service }
            .flatMap { context ->
                val connectionSettings = ConnectionSettings.newBuilder()
                    .setRequestTimeout(10L)
                    .setUseReadAPI(true)
                    .setMaxResults(1000)
                    .setNumBufferedRows(1000)
                    .setUseQueryCache(true)
                    .build();
                val connection = context.createConnection(connectionSettings)
                val bqResult = connection.executeSelect(sql)
                val result = Flux.usingWhen(
                    Mono.just(bqResult.resultSet),
                    { resultSet -> resultSet.toFlux(bqResult.schema) },
                    { _ -> Mono.fromRunnable<Unit> { connection.close() } }
                )
                Mono.just(Data(result, bqResult.schema.toSchema()))
            }
            ...

fun ResultSet.toFlux(schema:Schema): Flux<DataRecord> {
    return Flux.generate<DataRecord> { sink ->
        if (next()) {
            sink.next(toDataRecord(schema))
        } else {
            sink.complete()
        }
    }
}
o-shevchenko commented 5 days ago

I've created a simplified test to show performance:

@Test
    fun `test read`() {
        val sql =
            """
            SELECT *
            FROM `pr`
            """.trimIndent().replace("\n", " ")
        val connectionSettings = ConnectionSettings.newBuilder()
            .setRequestTimeout(300)
            .setUseReadAPI(true)
            .setMaxResults(5000)
            .setUseQueryCache(true)
            .build()
        val connection = bigQueryOptionsBuilder.build().service.createConnection(connectionSettings)
        val bqResult = connection.executeSelect(sql)
        val resultSet = bqResult.resultSet

        var n = 1
        var lastTime = Instant.now()
        while (++n < 1_000_000 && resultSet.next()) {
            if (n % 30_000 == 0) {
                val now = Instant.now()
                val duration = Duration.between(lastTime, now)
                println("ROW $n Time: ${duration.toMillis()} ms ${DateTimeFormatter.ISO_INSTANT.format(now)}")
                lastTime = now
            }
        }
   }     
ROW 30000 Time: 5516 ms 2024-11-14T12:35:54.354169Z
ROW 60000 Time: 11230 ms 2024-11-14T12:36:05.585005Z
ROW 90000 Time: 5645 ms 2024-11-14T12:36:11.230378Z
ROW 120000 Time: 5331 ms 2024-11-14T12:36:16.561915Z
ROW 150000 Time: 5458 ms 2024-11-14T12:36:22.019994Z
ROW 180000 Time: 5391 ms 2024-11-14T12:36:27.411807Z

~5sec to read 30000 rows

o-shevchenko commented 4 days ago

Related issue with benchmark: https://github.com/googleapis/java-bigquery/pull/3574

o-shevchenko commented 4 days ago

After fixing the test I've got the following results.

Benchmark                                            (rowLimit)  Mode  Cnt       Score       Error  Units
ConnImplBenchmark.iterateRecordsUsingReadAPI             500000  avgt    3   76549.893 ± 14496.839  ms/op
ConnImplBenchmark.iterateRecordsUsingReadAPI            1000000  avgt    3  154957.127 ± 25916.110  ms/op
ConnImplBenchmark.iterateRecordsWithBigQuery_Query       500000  avgt    3   82508.807 ± 17930.275  ms/op
ConnImplBenchmark.iterateRecordsWithBigQuery_Query      1000000  avgt    3  165717.219 ± 86960.648  ms/op
ConnImplBenchmark.iterateRecordsWithoutUsingReadAPI      500000  avgt    3   84504.175 ± 36823.590  ms/op
ConnImplBenchmark.iterateRecordsWithoutUsingReadAPI     1000000  avgt    3  165142.367 ± 99899.991  ms/op

That's not what we expected after reading the doc: https://cloud.google.com/blog/topics/developers-practitioners/introducing-executeselect-client-library-method-and-how-use-it/ image

Comparison with Chart Estimates From the chart: 1,000,000 rows: Read Storage API: The speed on the chart is ~50,000 rows/sec, but I have 6,453 rows/sec. tabledata.list API: Estimated at 5,000 rows/sec and I've got similar result 5,917 rows/sec.

Is there anything I missed?