google-code-export / google-bigquery

Automatically exported from code.google.com/p/google-bigquery

Increasing load times #103

Closed GoogleCodeExporter closed 9 years ago

GoogleCodeExporter commented 9 years ago
Load times from Cloud Storage into BigQuery have grown increasingly long over 
the past few days.  Where loads used to take ~5 minutes, they now regularly 
take more than 20 minutes for datasets of the same size (around 63 MB 
compressed); 20 minutes was the maximum we allowed before retrying, so I'm not 
sure how much longer they would take on average without a retry.  We were 
originally hoping to have a full ETL completed in under an hour.  It is now 
taking on the order of an hour (with retries) just to load the data from Cloud 
Storage into BigQuery, above and beyond the time needed to prepare the data.  
It is possible that our account has been throttled for some reason, but the 
only throttle I am familiar with is the one triggered when more than 10 
queries per second are submitted.

Is the increasing load time due to resource contention?  Is it the result of a 
conscious decision?  Is it due to errors and retries in the BQ infrastructure?

It was not clear where to raise this issue.  It did not seem to belong on 
StackOverflow, but I could be mistaken.

Original issue reported on code.google.com by ewal...@pagerduty.com on 3 Jun 2014 at 9:15

GoogleCodeExporter commented 9 years ago
Hi, do you have some job ids for the load jobs that we could look into?

Original comment by thomasp...@google.com on 4 Jun 2014 at 9:48

GoogleCodeExporter commented 9 years ago
Thanks for getting back to me.  Here are two recent long-running load jobs:

  bqjob_r3093f8ddda320493_00000146685decfd_1   load       SUCCESS   04 Jun 12:32:20   0:20:49
  bqjob_r54ff203f8b39c4b_000001466907c091_1    load       SUCCESS   04 Jun 15:37:43   0:16:22

Perhaps there is a way to use the `bq` command to list failed jobs.  My ETL 
code just killed a job that had exceeded 30 minutes in duration.  (It's hard to 
capture the load ID while the `bq` command is running inside an ETL, as it 
assumes an interactive console and spams the log file with ^Ms and progress 
updates, necessitating the --quiet flag.)
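
Something along these lines might work for listing recent jobs 
non-interactively (a sketch; I haven't double-checked the exact flags):

  # List the most recent jobs, including failed ones
  bq ls -j -n 20

  # Show status and statistics for one of the jobs above
  bq show -j bqjob_r3093f8ddda320493_00000146685decfd_1

Passing an explicit --job_id to `bq load` would also let the ETL record the ID 
up front instead of scraping it from the interactive output.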

If you look into these jobs and see something suboptimal in the way the data 
are being prepared, let me know.

Original comment by ewal...@pagerduty.com on 4 Jun 2014 at 11:24

GoogleCodeExporter commented 9 years ago
s/load ID/job ID/

Original comment by ewal...@pagerduty.com on 4 Jun 2014 at 11:26

GoogleCodeExporter commented 9 years ago
Looking at the recent history of your load jobs configured similarly to the 
two above (~60 MB compressed file from Cloud Storage), the average execution 
time has been over 10 minutes. I have not observed any throttling on these 
load jobs, nor any extraordinary resource contention.

Although the source file is ~60 MB compressed, this expands to several GB 
uncompressed. One suggestion is to split your source file into smaller chunks 
if possible; you can specify multiple sourceUris in a single load job (see 
https://developers.google.com/bigquery/docs/reference/v2/jobs#configuration.load.sourceUris), 
which will help parallelize the load processing. However, I should note that 
splitting the source file into n files will not guarantee an n-times speedup, 
due to a number of factors such as fixed setup/teardown costs, load in the 
system, etc.
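
For example, the bq CLI accepts a comma-separated list of source URIs, which 
maps to sourceUris in the load configuration. A sketch (the chunk file names 
are placeholders):

  bq load --source_format CSV <dataset>.<table> \
      gs://<bucket>/chunk-00.csv.gz,gs://<bucket>/chunk-01.csv.gz \
      <schema>.json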

Original comment by bch...@google.com on 6 Jun 2014 at 12:43

GoogleCodeExporter commented 9 years ago
Thank you for the pointer.  What is a good size for the individual chunks 
(uncompressed)?

Original comment by ewal...@pagerduty.com on 6 Jun 2014 at 12:52

GoogleCodeExporter commented 9 years ago
I suggest splitting the file in half first to see whether that gets you into 
your target time range.
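
As a rough sketch (this assumes GNU coreutils split, and the file name is a 
placeholder), splitting on row boundaries before recompressing would look 
something like:

  # Decompress, split into two halves without breaking rows, recompress the pieces
  zcat <source>.tsv.gz > <source>.tsv
  split -n l/2 -d <source>.tsv <source>.part-
  gzip <source>.part-00 <source>.part-01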

Original comment by bch...@google.com on 6 Jun 2014 at 8:34

GoogleCodeExporter commented 9 years ago
Just following up to see if you have any additional questions.

I'm going to resolve this ticket, but please feel free to re-open if there is 
anything outstanding.

Original comment by bch...@google.com on 11 Jun 2014 at 10:51

GoogleCodeExporter commented 9 years ago
[deleted comment]
GoogleCodeExporter commented 9 years ago
Hi again -- just pinging you regarding our recent load times.  Our production 
deployment has been unmodified for the last few days, but as of today we have 
not had a successful load:

  http://i.imgur.com/P4YXO75.png

Here is a typical command that we're running that does not complete:

  % .python-env/bin/bq load --field_delimiter tab --source_format CSV --job_id load_1000102 <table> gs://<path>/event_extra_data-1-copy.tsv.gz google/bigquery/schemas/raw/event_extra_data.json
  Waiting on load_1000102 ... (1855s) Current status: RUNNING

(There's a job ID that you can use to look up the actual details.)  I'm not 
sure if our account is being throttled or if there is a general service 
disruption (or maybe something else).  The running load time for the command 
above is 1855 s (~ 30 min.).  I've seen times get as high as 4080 s before a 
command is canceled.  Hopefully we're not doing something wrong.  Any help you 
can provide in troubleshooting this is appreciated.  (I posted a comment 
earlier today and then took it down in order to avoid causing a wild goose 
chase while I collected data.)

Original comment by ewal...@pagerduty.com on 9 Jul 2014 at 1:32

GoogleCodeExporter commented 9 years ago
There was an incident today where some load jobs were progressing very slowly, 
creating a large backlog.  We've diagnosed and fixed the root cause and the 
backlog has mostly cleared by now: it should be fully gone within the next few 
hours.  New load jobs should be unaffected and progress at their normal pace.

Thanks for reporting the issue, and apologies if the slowness negatively 
affected you.

Original comment by dk...@google.com on 9 Jul 2014 at 4:06

GoogleCodeExporter commented 9 years ago
Thank you for getting back to me.  Glad to know we weren't doing something 
wrong.

Original comment by ewal...@pagerduty.com on 9 Jul 2014 at 9:58

GoogleCodeExporter commented 9 years ago
Hi -- we're seeing failing loads again.  Just wanted to check in with you to 
make sure we're not doing something wrong on our side (I assume we aren't, but 
I wanted to double-check).

http://i.imgur.com/7FedCNG.png

Original comment by ewal...@pagerduty.com on 31 Jul 2014 at 12:50

GoogleCodeExporter commented 9 years ago
Yep, we have a backlog of load jobs that should be starting to clear. Sorry for 
the inconvenience, but hopefully this will be wrapped up in an hour or two.

By the way, feel free to create new bugs for new issues, since adding comments 
to an already-fixed bug might cause your issue report to be overlooked.

Original comment by jcon...@google.com on 31 Jul 2014 at 6:39

GoogleCodeExporter commented 9 years ago
> Sorry for the inconvenience, but hopefully this will be wrapped up in an hour or two.

We love BigQuery, not least because it makes so many things possible.  But the 
service disruptions are pretty inconvenient.  They put my team in an awkward 
position when the freshness of the data in our dashboards falls behind by 
several hours, especially during the last few days of the month, when there's a 
big push to close deals.  Are there any plans to add capacity to accommodate 
these periodic spikes in heavy load jobs?

> By the way, feel free to create new bugs for new issues

I will plan to create a new issue, then, whenever we need a status check.  
Thanks for replying to this one.  Btw, do you guys have a status page somewhere?

Original comment by ewal...@pagerduty.com on 31 Jul 2014 at 3:53

GoogleCodeExporter commented 9 years ago
Understood. We know that these service disruptions are painful, and we're doing 
our best to keep things running smoothly. Apologies for the trouble!

In this case, the root cause turned out to be a misconfiguration that gave our 
service unexpectedly low priority for network bandwidth. Unfortunately we 
didn't discover the misconfiguration until network bandwidth became scarce 
yesterday.

We don't yet have a status page, but we'll look into getting one set up. In the 
meantime, don't hesitate to reach out to us proactively by filing a bug here 
when you think something's wrong. (Obviously we'll do our best to fix problems 
before you notice!)

By the way, have you considered using our streaming API? Bulk data loads don't 
have any strict latency guarantees, so if freshness is important to you, you 
might find the streaming API useful.
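
For reference, a streaming insert is a single tabledata.insertAll call per 
batch of rows, and each row can carry an insertId that we use for best-effort 
de-duplication over a short window. A rough sketch (placeholders throughout; 
the row fields are invented for illustration):

  curl -s -X POST \
      -H "Authorization: Bearer $ACCESS_TOKEN" \
      -H "Content-Type: application/json" \
      https://www.googleapis.com/bigquery/v2/projects/<project>/datasets/<dataset>/tables/<table>/insertAll \
      -d '{"rows": [{"insertId": "<unique-row-id>", "json": {"field_a": "value", "field_b": 42}}]}'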

Original comment by jcon...@google.com on 31 Jul 2014 at 4:05

GoogleCodeExporter commented 9 years ago
Thanks for the additional details.  I'm sure you guys work hard to keep things 
running smoothly.

> By the way, have you considered using our streaming API?

I'd love to use a low-latency API like the streaming API.  But we have very 
strict requirements on the accuracy of the numbers we show.   I'm concerned 
that if resource contention or retry logic on our side caused us to go beyond 
the short dedup period offered by the streaming API, we'd end up inserting 
duplicate rows.

Original comment by ewal...@pagerduty.com on 31 Jul 2014 at 4:14

GoogleCodeExporter commented 9 years ago
Got it. Thanks for the feedback!

Original comment by jcon...@google.com on 31 Jul 2014 at 4:30

GoogleCodeExporter commented 9 years ago
Also, sometimes we have to revise data that we have already published through 
the API, and I believe the streaming API is append-only.  I think there are 
some clever things you can do with an append-only table to achieve the same 
effect as updating existing rows, but we're not there yet.
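
For the record, the sort of thing I have in mind is a periodic "latest 
revision wins" rewrite into a clean table, roughly like the following (a 
sketch only; the column and table names are invented for illustration):

  bq query --destination_table=<dataset>.events_latest --replace \
      'SELECT event_id, payload, revised_at FROM (
         SELECT event_id, payload, revised_at,
                ROW_NUMBER() OVER (PARTITION BY event_id ORDER BY revised_at DESC) AS rn
         FROM [<dataset>.events_raw]) WHERE rn = 1'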

Original comment by ewal...@pagerduty.com on 31 Jul 2014 at 4:32