miraisolutions / sparkbq

Sparklyr extension package to connect to Google BigQuery
GNU General Public License v3.0

Table vs SQL query performance #2

Closed martinstuder closed 7 years ago

martinstuder commented 7 years ago

It looks like SQL queries are much slower than table queries.

martinstuder commented 7 years ago

In general (for both table and SQL queries), it seems to take a long time before any data transfer even starts, judging by the Spark master's network bytes graph when running on Dataproc in yarn-cluster mode. We should have a look at the Spark UI to see what is going on.

martinstuder commented 7 years ago

[two images]

demirelo commented 7 years ago

SQL queries are much faster than table queries (1s vs 11s), although both are very slow:

Table: first at BigQuerySQLContext.scala:112 (11s) + sql at :0 (11s)
SQL: first at BigQuerySQLContext.scala:112 (11s) + sql at NativeMethodAccessorImpl.java:0 (1s)

It seems that the data transfer does indeed take a lot of time before the queries are executed, and this time is not visible in the Spark UI.
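
One way to quantify time that does not appear in the Spark UI is to compare wall-clock time against the sum of the reported job durations. The following is only a rough sketch: the job durations are the ones quoted in this thread, while the helper names `timed` and `unaccounted_seconds` are made up for illustration and are not part of sparkbq or Spark.

```python
import time


def timed(fn):
    """Run fn() and return (result, elapsed wall-clock seconds)."""
    start = time.perf_counter()
    result = fn()
    return result, time.perf_counter() - start


def unaccounted_seconds(wall_clock_s, job_durations_s):
    """Wall-clock time not covered by any Spark job (e.g. waiting on data transfer)."""
    return wall_clock_s - sum(job_durations_s)


# Job durations (seconds) as quoted from the Spark UI in this thread.
spark_ui_jobs = {"table": [11, 11], "sql": [11, 1]}

# Total time the Spark UI attributes to each kind of query.
totals = {kind: sum(durations) for kind, durations in spark_ui_jobs.items()}
print(totals)  # {'table': 22, 'sql': 12}
```

Wrapping the actual read call in `timed(...)` and passing the elapsed time to `unaccounted_seconds` together with the Spark UI durations would give the share of the delay that happens outside any Spark job.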

demirelo commented 7 years ago

Table reference:

17/11/08 11:25:30 INFO bigquery.DynamicFileListRecordReader: Initializing DynamicFileListRecordReader with split 'InputSplit:: length:691 locations: [] toString(): gs://sbb/hadoop/tmp/bigquery/job_20171108112518_0089/shard-1/data-*.avro[691 estimated records]', task context 'TaskAttemptContext:: TaskAttemptID:attempt_20171108112518_0089_m_000001_0 Status:'

This might be related to the issue reported here: https://github.com/GoogleCloudPlatform/bigdata-interop/issues/23
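
The log line above can be picked apart to see how long after job creation the record reader was initialized. This is a sketch under two assumptions that the thread does not confirm: that the log timestamp uses Spark's default `yy/MM/dd HH:mm:ss` pattern, and that the 14 digits in the job ID encode the job creation time as `yyyyMMddHHmmss`. The function name `parse_split_log` is hypothetical.

```python
import re
from datetime import datetime

# The log line from this thread, cut off after the input split description.
LOG_LINE = (
    "17/11/08 11:25:30 INFO bigquery.DynamicFileListRecordReader: "
    "Initializing DynamicFileListRecordReader with split 'InputSplit:: "
    "length:691 locations: [] toString(): "
    "gs://sbb/hadoop/tmp/bigquery/job_20171108112518_0089/shard-1/"
    "data-*.avro[691 estimated records]'"
)

SPLIT_RE = re.compile(
    r"^(?P<ts>\d{2}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}) .*?"
    r"(?P<path>gs://\S+?)\[(?P<records>\d+) estimated records\]"
)


def parse_split_log(line):
    """Extract the shard path, record estimate and timestamps from a log line."""
    m = SPLIT_RE.search(line)
    if m is None:
        return None
    # Assumption: Spark's default log timestamp pattern (yy/MM/dd HH:mm:ss).
    log_ts = datetime.strptime(m.group("ts"), "%y/%m/%d %H:%M:%S")
    # Assumption: the job ID embeds its creation time as yyyyMMddHHmmss.
    job = re.search(r"job_(\d{14})_", m.group("path"))
    job_ts = datetime.strptime(job.group(1), "%Y%m%d%H%M%S") if job else None
    return {
        "path": m.group("path"),
        "records": int(m.group("records")),
        "log_ts": log_ts,
        "job_ts": job_ts,
        "gap_s": (log_ts - job_ts).total_seconds() if job_ts else None,
    }


info = parse_split_log(LOG_LINE)
print(info["records"], info["gap_s"])  # 691 12.0
```

If the job ID assumption holds, the reader was initialized about 12 seconds after the job was created, which is in the same range as the 11-second steps reported above.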

demirelo commented 7 years ago

The 11-second delay was due to a GCP setting, which we made configurable in a previous commit.

The slowdown of SQL vs. table queries stems from the fact that Spotify Spark-BigQuery initially converts the SQL query into a temporary table.
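
As a purely illustrative model of that explanation, the SQL path can be thought of as the table path with one extra step in front. The step names below are hypothetical labels, not identifiers from Spotify Spark-BigQuery; the export-to-GCS step is inferred from the `gs://` shard path in the log above.

```python
# Hypothetical step labels, for illustration only.
TABLE_STEPS = [
    "export table to GCS",
    "read exported files in Spark",
]

# Assumption: a SQL query is first stored as a temporary table,
# which is then read like any other table.
SQL_STEPS = ["run query into temporary table"] + TABLE_STEPS


def extra_steps(sql_steps, table_steps):
    """Steps that only the SQL path has to perform."""
    return [step for step in sql_steps if step not in table_steps]


print(extra_steps(SQL_STEPS, TABLE_STEPS))  # ['run query into temporary table']
```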