As suggested on Slack, I tested with a larger dataset: a 1017 MB table with 282
FLOAT columns, 1 TIMESTAMP column, and 2,529,829 rows. A SELECT * query returned
no results with the default options (Query Failed Error: Response too large to
return. Consider setting allowLargeResults to true in your job configuration.
For more details, see https://cloud.google.com/bigquery/querying-data#largequeryresults
Job ID: bquijob_49ee7e63_153be2b5e78).
With the Destination Table and Allow Large Results options enabled, the results
were available after 71.1 seconds (~14 MB/s), versus < 2 MB/s with the smaller table.
Original comment by vvesseli...@slb.com
on 28 Mar 2016 at 5:08
Can you provide a job ID (including project ID) for the slow-but-successful
query?
Note that if you just want to pull all the data from a table, the SELECT *
query is superfluous--it's just making a copy of the table and then reading the
copy. Try using Datalab's gcp.bigquery.Table(...).to_dataframe() instead.
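For example, roughly (the table name here is a placeholder):

    import gcp.bigquery as bq

    # Placeholder name; substitute your own dataset.table.
    table = bq.Table('my_dataset.my_table')

    # Read the table directly into a pandas DataFrame,
    # without running (and paying for) a query job.
    df = table.to_dataframe()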
Original comment by jcon...@google.com
on 28 Mar 2016 at 5:34
The other job ID, for the small table, is bquijob_595e8bdf_153a5f6e192.
Is it possible to send the project ID in a private message?
For to_dataframe(), you can see the cost of moving 85 MB from BigQuery to a
DataFrame in https://github.com/GoogleCloudPlatform/datalab/issues/787:
1a. MessagePack to DataFrame: 2.28054285049 seconds
1b. BigQuery to DataFrame: 493.655420065 seconds (45 seconds is the cost of the
BigQuery query itself; the remaining 448 seconds is spent on the transfer
between BigQuery and Datalab, whose protocol is not well suited to this volume
of data)
My queries are dynamic, and SELECT * is only one case; even selecting 5 columns
from the large table takes too long:
Query complete (24.4s elapsed, 72.7 MB processed), < 3 MB/s, job ID:
bquijob_9f17eea_153be5b9c86
General observation: when the number of output rows equals the number of input
rows, performance is very slow (~2.5 MB/s).
Original comment by vvesseli...@slb.com
on 28 Mar 2016 at 6:07
Sorry for the delay getting back to you!
First of all, note that BigQuery queries always generate result tables, and the
results are fetched by reading from those tables. So doing a SELECT * query is
wasted work--you're basically making a copy of a table and reading from the
copy. If you change your gcp.bigquery.Query("SELECT * FROM foo") to
gcp.bigquery.Table("foo"), you can skip the slow part and just read the data.
(Obviously that doesn't help if you want to do something more complex than
SELECT *.)
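To make the swap concrete, a rough sketch with placeholder names:

    import gcp.bigquery as bq

    # Slow: running a query materializes a copy of the table,
    # which is then read back.
    q = bq.Query('SELECT * FROM [my_dataset.foo]')

    # Fast: read the table directly and skip the query job entirely.
    df = bq.Table('my_dataset.foo').to_dataframe()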
As for the query itself, you've hit an unfortunate corner case in our system
that is experiencing some performance problems. I'd suggest setting the
allow_large_results parameter in the execute method, which will cause the query
engine to use a different path when writing results, likely avoiding the
current performance bottleneck.
http://googlecloudplatform.github.io/datalab/gcp.bigquery.html#gcp.bigquery.Query.execute
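A minimal sketch of that suggestion (the query, parameter handling, and the
destination table name are illustrative assumptions; see the linked docs for
the exact signature):

    import gcp.bigquery as bq

    q = bq.Query('SELECT field1, field2 FROM [my_dataset.large_table]')

    # allow_large_results requires the results to be written to a
    # named destination table (the table name here is hypothetical).
    q.execute(table_name='my_dataset.tmp_results',
              allow_large_results=True)

    # Read the materialized results back as a DataFrame.
    df = bq.Table('my_dataset.tmp_results').to_dataframe()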
Original comment by jcon...@google.com
on 31 Mar 2016 at 10:01
Original issue reported on code.google.com by
vvesseli...@slb.com
on 28 Mar 2016 at 4:44