The number of rows in the table currently seems to be 51384 (both via SQL and
via the API). Could you please provide a bit more information on the API call
that you are performing that shows 57672 rows?
Original comment by epa...@google.com
on 13 Apr 2016 at 5:00
> The number of rows in the table currently seems to be 51384 (both via SQL
and via the API)
My colleague retried the load and removed the old table, since correct data
was required for the data analysis. That should explain why the count changed.
I asked him to copy and keep the bad table next time (we encounter this
phenomenon about three times per day).
> Could you please provide a bit more information on the API call that you are
performing that shows 57672 rows?
The body of the load job API call is like this:
https://github.com/embulk/embulk-output-bigquery/blob/7698ef320a78dd93ec79a512d518daec8865d3bc/lib/embulk/output/bigquery/bigquery_client.rb#L152-L170
body = {
  configuration: {
    load: {
      destination_table: {
        project_id: @project,
        dataset_id: @dataset,
        table_id: table,
      },
      schema: {
        fields: fields,
      },
      write_disposition: 'WRITE_APPEND',
      source_format: 'NEWLINE_DELIMITED_JSON',
      max_bad_records: 0,
      field_delimiter: nil,
      encoding: 'UTF-8',
      ignore_unknown_values: false,
      allow_quoted_newlines: false,
    }
  }
}
opts = {
  upload_source: path,
  content_type: 'application/octet-stream',
}
and we issue this load job in parallel against one temporary table.
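(Editor's note: for context, a minimal sketch of how such body/opts hashes can be handed to google-api-ruby-client. The client construction and authorization here are assumptions for illustration, not the plugin's actual setup code; `body` and `path` refer to the snippet above.)

require 'google/apis/bigquery_v2'
require 'googleauth'

# Assumed setup; the plugin's real client construction may differ.
client = Google::Apis::BigqueryV2::BigqueryService.new
client.authorization = Google::Auth.get_application_default(
  ['https://www.googleapis.com/auth/bigquery']
)

# insert_job takes the job body plus media-upload options, mirroring
# the body/opts hashes shown above. It returns the created Job, whose
# job_reference identifies the load for later inspection.
job = client.insert_job(
  @project,
  body,
  upload_source: path,
  content_type: 'application/octet-stream'
)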
Original comment by seo.naot...@dena.jp
on 13 Apr 2016 at 5:58
The get_table API call is just like this:
https://github.com/embulk/embulk-output-bigquery/blob/7698ef320a78dd93ec79a512d518daec8865d3bc/lib/embulk/output/bigquery/bigquery_client.rb#L380
Nothing special.
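(Editor's note: a one-line illustration of that call, assuming the same BigqueryService client as in the sketch above.)

# tables.get returns a Table object; num_rows is the row count the
# API reports, which is what the comparison above is based on.
result = client.get_table(@project, @dataset, table)
puts result.num_rows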
Original comment by seo.naot...@dena.jp
on 13 Apr 2016 at 6:01
I found some information in the table's history. I will provide an update
with more details on the morning of 04/13 PDT.
Original comment by epa...@google.com
on 13 Apr 2016 at 6:41
Occurred again. In this case, the correct number of rows was 3, but it became
6.
Load job ID:
job_nbtQdZBoT0PzW9cW46NVohP40H0 (only one job, because the number of inputs was small)
Copy job ID:
job_gg2CvFl9fEH7XrMhJ3fZPp7cl4E
Response.statistics for load job job_nbtQdZBoT0PzW9cW46NVohP40H0:
response.statistics: {
  :creation_time => "1460535126422",
  :start_time    => "1460535144895",
  :load => {
    :output_bytes     => "838",
    :output_rows      => "3",
    :input_files      => "1",
    :input_file_bytes => "412"
  },
  :end_time => "1460535155398"
}
Report from embulk-output-bigquery:
{"num_input_rows":3, "num_response_rows":3, "num_output_rows":6, "num_rejected_rows":-3}
Original comment by seo.naot...@dena.jp
on 13 Apr 2016 at 8:41
Looking at the first example (with 9 load jobs):
The job corresponding to loading 6288 rows (job_Quhqjcf1DuBQTYYXhWgfPXKjrek)
was submitted twice (possibly due to a retry?). The two job IDs are:
job_EWDq-z0DQ9Ho5FpkeSvbm3_Iz0Q
job_Quhqjcf1DuBQTYYXhWgfPXKjrek
If you would like loads to be idempotent, you can supply a job_id to the load
job.
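(Editor's note: a hedged sketch of what supplying a job_id could look like in the body hash from the earlier comment. The `job_reference` field is BigQuery's jobs.insert mechanism for this; the "embulk_load_" naming scheme and the MD5-over-chunk derivation are illustrative assumptions, not part of the plugin.)

require 'digest/md5'

# Derive a deterministic job_id from the input chunk, so a retried
# jobs.insert for the same chunk collides with the original job
# (BigQuery rejects the duplicate) instead of loading the rows twice.
job_id = "embulk_load_#{Digest::MD5.hexdigest("#{path}-#{table}")}"

body = {
  job_reference: {
    project_id: @project,
    job_id: job_id,
  },
  configuration: {
    # ... the load configuration shown in the earlier comment ...
  },
}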
Original comment by epa...@google.com
on 13 Apr 2016 at 5:46
Detailed advice about managing job retries can be found here:
https://cloud.google.com/bigquery/docs/managing_jobs_datasets_projects#managingjobs
Original comment by jcon...@google.com
on 13 Apr 2016 at 5:57
Hmm, looking at the DEBUG log of google-api-ruby-client, I could not find
job_EWDq-z0DQ9Ho5FpkeSvbm3_Iz0Q.
But thank you for the information; I will try generating and supplying a
job_id myself.
Original comment by seo.naot...@dena.jp
on 13 Apr 2016 at 6:33
Original issue reported on code.google.com by
seo.naot...@dena.jp
on 13 Apr 2016 at 4:21