martinstuder closed this 7 years ago
In general (for both table and SQL queries), it seems to take a long time before any data transfer even starts (judging by the Spark master network bytes graph when running on Dataproc in yarn-cluster mode). We should look at the Spark UI to see what is going on.
SQL queries are much faster than table queries (1s vs 11s), although both are still very slow:
Table: first at BigQuerySQLContext.scala:112 (11s) + sql at
It seems the data transfer does indeed take a lot of time before the queries are executed, and that time is not visible in the Spark UI.
Table reference:
17/11/08 11:25:30 INFO bigquery.DynamicFileListRecordReader: Initializing DynamicFileListRecordReader with split 'InputSplit:: length:691 locations: [] toString(): gs://sbb/hadoop/tmp/bigquery/job_20171108112518_0089/shard-1/data-*.avro[691 estimated records]', task context 'TaskAttemptContext:: TaskAttemptID:attempt_20171108112518_0089_m_000001_0 Status:'
This might be related to the issue reported here: https://github.com/GoogleCloudPlatform/bigdata-interop/issues/23
It looks like SQL queries are much slower than table queries.