runt18 / google-bigquery

Automatically exported from code.google.com/p/google-bigquery

Very slow query: SELECT * FROM table without any condition #482

Status: Open · GoogleCodeExporter opened 8 years ago

GoogleCodeExporter commented 8 years ago
What steps will reproduce the problem?
1. Run SELECT * on a table with 130 FLOAT columns and 1 TIMESTAMP column, 193,127 rows (85 MiB)
2. It takes 45 seconds to get the result

What is the expected output? What do you see instead?
Today, loading the same table from MessagePack into a DataFrame takes 2 seconds.
I would expect results in < 5 seconds.

What version of the product are you using? On what operating system?
I'm using Chrome or Safari on a Mac.

Please provide any additional information below.
A few screenshots are available at:
https://github.com/GoogleCloudPlatform/datalab/issues/787

Original issue reported on code.google.com by vvesseli...@slb.com on 28 Mar 2016 at 4:44

GoogleCodeExporter commented 8 years ago
As suggested on Slack, I tested with a larger dataset (1017 MB): SELECT * on a table with 
282 FLOAT columns and 1 TIMESTAMP column, 2,529,829 rows. With the default options there 
were no results (Query Failed Error: Response too large to return. Consider setting 
allowLargeResults to true in your job configuration. For more details, see 
https://cloud.google.com/bigquery/querying-data#largequeryresults
Job ID: bquijob_49ee7e63_153be2b5e78).
With the Destination Table and Allow Large Results options enabled, the results were 
available after 71.1 seconds, about 14 MB/s versus < 2 MB/s with the smaller table.
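
For reference, a minimal sketch of what the job configuration the error message refers to could look like when submitted through the BigQuery v2 REST API via google-api-python-client. This is an assumption for illustration only; the project, dataset, and table names are placeholders, not the ones from this issue, and the web UI options above do the same thing:

    # Hedged sketch: the equivalent of enabling "Allow Large Results" plus a
    # destination table in the BigQuery v2 jobs API (all identifiers are placeholders).
    from googleapiclient.discovery import build

    bigquery = build('bigquery', 'v2')  # assumes application-default credentials
    job_body = {
        'configuration': {
            'query': {
                'query': 'SELECT * FROM [my_dataset.my_table]',
                'allowLargeResults': True,           # what the error message asks for
                'destinationTable': {                # required when allowLargeResults is set
                    'projectId': 'my-project',
                    'datasetId': 'my_dataset',
                    'tableId': 'my_table_copy',
                },
            }
        }
    }
    bigquery.jobs().insert(projectId='my-project', body=job_body).execute()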

Original comment by vvesseli...@slb.com on 28 Mar 2016 at 5:08

GoogleCodeExporter commented 8 years ago
Can you provide a job ID (including project ID) for the slow-but-successful 
query?

Note that if you just want to pull all the data from a table, the SELECT * 
query is superfluous--it's just making a copy of the table and then reading the 
copy. Try using Datalab's gcp.bigquery.Table(...).to_dataframe() instead.

Original comment by jcon...@google.com on 28 Mar 2016 at 5:34

GoogleCodeExporter commented 8 years ago
The other job ID is bquijob_595e8bdf_153a5f6e192 for the small table.

Is it possible to send the project ID in a private message?

For to_dataframe(), you can see the cost of moving 85 MB from BigQuery to a DataFrame 
here: 
https://github.com/GoogleCloudPlatform/datalab/issues/787

1a. MessagePack to DataFrame: 2.28054285049 seconds
1b. BigQuery to DataFrame: 493.655420065 seconds (45 seconds is the cost of the BigQuery 
query itself; the remaining 448 seconds is spent on how BigQuery and Datalab communicate 
with each other, over transfer protocols not well adapted to this)

My queries are dynamic and SELECT * is only one of the cases, but even selecting 5 columns 
from the large table takes too much time as well:
Query complete (24.4s elapsed, 72.7 MB processed), < 3 MB/s, job ID: 
bquijob_9f17eea_153be5b9c86

General observation: when the input rows equal the output rows, performance is very slow, 
~2.5 MB/s

Original comment by vvesseli...@slb.com on 28 Mar 2016 at 6:07

GoogleCodeExporter commented 8 years ago
Sorry for the delay getting back to you!

First of all, note that BigQuery queries always generate result tables, and the 
results are fetched by reading from those tables. So doing a SELECT * query is 
wasted work--you're basically making a copy of a table and reading from the 
copy. If you change your gcp.bigquery.Query("SELECT * FROM foo") to 
gcp.bigquery.Table("foo"), you can skip the slow part and just read the data. 
(Obviously that doesn't help if you want to do something more complex than 
SELECT *.)
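
A minimal sketch of that substitution, using only the gcp.bigquery names mentioned in this thread; 'my_dataset.my_table' is a placeholder table name:

    # Sketch only: same API names as in the comment above, placeholder table name.
    import gcp.bigquery as bq

    # Instead of running a query job that copies the whole table and reads the copy...
    # bq.Query('SELECT * FROM my_dataset.my_table')

    # ...read the source table directly and convert it to a pandas DataFrame.
    df = bq.Table('my_dataset.my_table').to_dataframe()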

As for the query itself, you've hit an unfortunate corner case in our system 
that is experiencing some performance problems. I'd suggest setting the 
allow_large_results parameter in the execute method, which will cause the query 
engine to use a different path when writing results, likely avoiding the 
current performance bottleneck.

http://googlecloudplatform.github.io/datalab/gcp.bigquery.html#gcp.bigquery.Query.execute
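
For example, a sketch of what that could look like; allow_large_results is the parameter named above, while the destination-table argument and its name are an assumption based on the linked Query.execute documentation (table names are placeholders):

    # Sketch only: allow_large_results is named in the comment above; table_name is
    # an assumed parameter for the destination table (see the linked docs).
    import gcp.bigquery as bq

    query = bq.Query('SELECT * FROM my_dataset.my_large_table')
    query.execute(table_name='my_dataset.my_large_table_results',  # assumed parameter
                  allow_large_results=True)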

Original comment by jcon...@google.com on 31 Mar 2016 at 10:01