nickyua / google-bigquery-tools

Automatically exported from code.google.com/p/google-bigquery-tools

job does not really terminate when --max_bad_records reached, it continues on for a bit #5

Closed GoogleCodeExporter closed 8 years ago

GoogleCodeExporter commented 8 years ago
I'm filing this as a defect because the behavior just seems wrong. I am loading TSV
files with millions of lines via bq load, and processing such files takes a long time.
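
For context, the failing load is roughly a command of this shape (the dataset, table, and file names here are placeholders, not the originals):

    # Global flag --apilog captures the API traffic; --max_bad_records is a flag of
    # the load command itself. All names below are placeholders for illustration.
    bq --apilog=apilog.txt load --max_bad_records=0 \
        mydataset.mytable data.tsv ./schema.json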

I had a job that failed because "Too many errors encountered. Limit is: 0." It 
took a long time to receive this information.

However, looking at the --apilog output I see that the same error occurred several hundred
times, so it appears that the max_bad_records limit isn't checked until after all (or many)
errors have been counted, rather than as each error is encountered. Is that right?

Had my job failed on line 171 instead of continuing on to report line 20863 as 
invalid for the same reason, I could have corrected the error quickly and 
resubmitted the job.

This behavior was observed with BigQuery CLI v2.0.3 on an Ubuntu installation.

Here is a snippet from the --apilog output that demonstrates what is disturbing to
me:

    {
     "reason": "invalid",
     "location": "Line:20648 / Field:5",
     "message": "Value cannot be converted to expected type (check schema): field starts with: \u003cnull\u003e"
    },
    {
     "reason": "invalid",
     "location": "Line:20863 / Field:5",
     "message": "Value cannot be converted to expected type (check schema): field starts with: \u003cnull\u003e"
    },
    {
     "reason": "invalid",
     "message": "Too many errors encountered. Limit is: 0."
    }

These are two separate reasons to have failed the job; why did it even bother to
arrive at the second one if the bad-record limit is 0? Thanks.
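
As an aside, every reported location can be pulled out of the API log in one pass with something like this (assuming the log was written to a file named apilog.txt):

    # List each "Line:NNNN / Field:N" location reported in the log excerpted above.
    grep -o '"location": "Line:[0-9]* / Field:[0-9]*"' apilog.txt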

Original issue reported on code.google.com by net.equ...@gmail.com on 6 Apr 2012 at 5:13

GoogleCodeExporter commented 8 years ago
TL;DR: Yes, imports could be faster; however, stopping at the first bad record 
wouldn't make that happen.

So there are two issues here:

 * The actual time spent processing records is a tiny fraction of the total job time you see; much more is spent copying data and letting a job progress through the various parts of our system. Stopping at the first bad record wouldn't change the total job time in any appreciable way.
 * We purposely go past max_bad_records before returning the error_stream to you, up to some bound (which I believe is 100 when max_bad_records is 0). This is by design -- if you have a file with, say, 3 bad records, it's a little annoying to have to correct/upload/submit job/wait/see an error for each of those, when we could have given you all that information on the first go. As mentioned, this doesn't noticeably change runtime, so it seems like a win (see the example after this list).
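
A small illustration of the client-side upshot, reusing the placeholder names from the earlier example: because the error list can cover several bad records per attempt, you can either fix them all and resubmit once, or let the load succeed despite a few bad rows by raising the limit:

    # Tolerate up to 3 bad records instead of failing the job outright.
    bq load --max_bad_records=3 mydataset.mytable data.tsv ./schema.json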

I'm going to close this as "working as intended" -- feel free to reopen or 
transfer to bigquery-discuss if you want to know more.

Original comment by craigcitro@google.com on 6 Apr 2012 at 6:03