Open person142 opened 1 year ago
We also observed this issue on my team. It led us to abandon tfio for querying a BigQuery table, because the gRPC "assertion failed" error was happening consistently.
It turns out that you can build a TensorFlow dataset with the native Google Cloud BigQuery client, and it does not suffer from this problem:
    from typing import Generator

    import tensorflow as tf
    from google.cloud import bigquery

    client = bigquery.Client()


    def bigquery_generator() -> Generator[dict[str, tf.Tensor], None, None]:
        """Yield the BigQuery query results as tensors."""
        # Run the query inside the generator: the row iterator returned by
        # result() can only be consumed once, so each pass needs a fresh one.
        rows = client.query("SELECT * FROM project_id.dataset.table_id").result()
        for row in rows:
            yield {
                "column1": tf.convert_to_tensor(row["column1"]),
                "column2": tf.convert_to_tensor(row["column2"]),
            }


    # The signature must mirror what the generator yields: a dict keyed by column name.
    tf_dataset = tf.data.Dataset.from_generator(
        bigquery_generator,
        output_signature={
            "column1": tf.TensorSpec(shape=(), dtype=tf.string),
            "column2": tf.TensorSpec(shape=(), dtype=tf.string),
        },
    )
Obviously, you will need to adapt the code to match the table data schema :)
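Since the failures reported in this thread show up hours into a job, the generator approach above may also want some tolerance for dropped connections. Here is a minimal sketch of one way to do that, not something taken from tfio or the BigQuery client: `resumable_rows` and its `open_rows(start_index)` callback are hypothetical names, and treating `ConnectionError` as the transient error type is an assumption you would need to adapt to whatever your client actually raises.

```python
import time
from typing import Callable, Iterable, Iterator, TypeVar

T = TypeVar("T")


def resumable_rows(
    open_rows: Callable[[int], Iterable[T]],
    max_retries: int = 3,
    backoff_seconds: float = 1.0,
    transient_errors: tuple[type[BaseException], ...] = (ConnectionError,),
) -> Iterator[T]:
    """Yield rows from open_rows(start_index), reopening after transient errors.

    open_rows is a hypothetical callback that returns an iterable of rows
    starting at the given zero-based offset.
    """
    yielded = 0   # number of rows already handed to the consumer
    failures = 0  # consecutive failed attempts since the last good row
    while True:
        try:
            for row in open_rows(yielded):
                yield row
                yielded += 1
                failures = 0  # progress was made, so reset the retry budget
            return  # source exhausted normally
        except transient_errors:
            failures += 1
            if failures > max_retries:
                raise  # give up and surface the original error
            # Linear backoff before reopening at the first undelivered row.
            time.sleep(backoff_seconds * failures)
```

The idea is that the generator passed to from_generator would iterate over `resumable_rows(...)` instead of over the raw row iterator. This only works if the source can be reopened at a stable offset, which depends on your query and is worth checking before relying on it.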
Any update on this issue? I am still facing this problem as of tensorflow-io version 0.37.1.
Hello team, we are looking to use tfio in production for training on BigQuery tables. Can anyone confirm whether this issue will be worked on?
The failure happens almost every time after the job has been running for 3-4 hours. I checked the gRPC debug logs; here is the dump from just before it fails. Hope it helps:
I0819 20:32:57.854680544 1127204 completion_queue.cc:764] cq_end_op_for_pluck(cq=0x707df4008120, tag=0x707e1dffa060, error="No Error", done=0x707f4120f53f, done_arg=0x707df400b4e0, storage=0x707df400b550)
I0819 20:32:57.854700572 1127204 completion_queue.cc:1298] RETURN_EVENT[0x707df4008120]: OP_COMPLETE: tag:0x707e1dffa060 OK
I0819 20:32:57.858648025 1127212 call.cc:1964] grpc_call_start_batch(call=0x707ea8008520, ops=0x707deeffbdc0, nops=1, tag=0x707deeffc060, reserved=(nil))
I0819 20:32:57.858673705 1127212 call.cc:1565] ops[0]: RECV_MESSAGE ptr=0x707deeffc088
I0819 20:32:57.858719982 1127212 completion_queue.cc:764] cq_end_op_for_pluck(cq=0x707ea8008190, tag=0x707deeffc060, error="No Error", done=0x707f4120f53f, done_arg=0x707ea800b560, storage=0x707ea800b5d0)
I0819 20:32:57.858736681 1127212 completion_queue.cc:1298] RETURN_EVENT[0x707ea8008190]: OP_COMPLETE: tag:0x707deeffc060 OK
I0819 20:32:57.862616068 1127212 call.cc:1964] grpc_call_start_batch(call=0x707e6000c760, ops=0x707deeffbdc0, nops=1, tag=0x707deeffc060, reserved=(nil))
I0819 20:32:57.862655671 1127212 call.cc:1565] ops[0]: RECV_MESSAGE ptr=0x707deeffc088
I0819 20:32:57.862723114 1127212 completion_queue.cc:764] cq_end_op_for_pluck(cq=0x707e60004500, tag=0x707deeffc060, error="No Error", done=0x707f4120f53f, done_arg=0x707e6000f7a0, storage=0x707e6000f810)
I0819 20:32:57.862748906 1127212 completion_queue.cc:1298] RETURN_EVENT[0x707e60004500]: OP_COMPLETE: tag:0x707deeffc060 OK
I0819 20:32:57.875126214 1127212 call.cc:1964] grpc_call_start_batch(call=0x707ed00086a0, ops=0x707deeffbdc0, nops=1, tag=0x707deeffc060, reserved=(nil))
I0819 20:32:57.875163922 1127212 call.cc:1565] ops[0]: RECV_MESSAGE ptr=0x707deeffc088
I0819 20:32:57.875238139 1127212 completion_queue.cc:764] cq_end_op_for_pluck(cq=0x707ed0008330, tag=0x707deeffc060, error="No Error", done=0x707f4120f53f, done_arg=0x707ed000b6e0, storage=0x707ed000b750)
I0819 20:32:57.875256129 1127212 completion_queue.cc:1298] RETURN_EVENT[0x707ed0008330]: OP_COMPLETE: tag:0x707deeffc060 OK
I0819 20:32:57.875265760 1127212 call.cc:1964] grpc_call_start_batch(call=0x707ed00086a0, ops=0x707deeffbda0, nops=1, tag=0x707deeffc040, reserved=(nil))
I0819 20:32:57.875274654 1127212 call.cc:1565] ops[0]: RECV_STATUS_ON_CLIENT metadata=0x707ed0001740 status=0x707deeffc070 details=0x707deeffc078
D0819 20:32:57.875358205 1127212 call.cc:733] set_final_status CLI
D0819 20:32:57.875381066 1127212 call.cc:734] {"created":"@1724099577.875340028","description":"Error received from peer ipv4:74.125.197.95:443","file":"external/com_github_grpc_grpc/src/core/lib/surface/call.cc","file_line":1056,"grpc_message":"Connection reset by peer","grpc_status":14}
I0819 20:32:57.875394529 1127212 completion_queue.cc:764] cq_end_op_for_pluck(cq=0x707ed0008330, tag=0x707deeffc040, error="No Error", done=0x707f4120f53f, done_arg=0x707dd071ea30, storage=0x707dd071eaa0)
I0819 20:32:57.875404808 1127212 completion_queue.cc:1298] RETURN_EVENT[0x707ed0008330]: OP_COMPLETE: tag:0x707deeffc040 OK
I0819 20:32:57.875576937 1127212 call.cc:1964] grpc_call_start_batch(call=0x707ed00086a0, ops=0x707deeffbdc0, nops=1, tag=0x707deeffc060, reserved=(nil))
I0819 20:32:57.875586053 1127212 call.cc:1565] ops[0]: RECV_MESSAGE ptr=0x707deeffc088
I0819 20:32:57.875595125 1127212 completion_queue.cc:764] cq_end_op_for_pluck(cq=0x707ed0008330, tag=0x707deeffc060, error="No Error", done=0x707f4120f53f, done_arg=0x707ed000b6e0, storage=0x707ed000b750)
I0819 20:32:57.875606287 1127212 completion_queue.cc:1298] RETURN_EVENT[0x707ed0008330]: OP_COMPLETE: tag:0x707deeffc060 OK
I0819 20:32:57.875615791 1127212 call.cc:1964] grpc_call_start_batch(call=0x707ed00086a0, ops=0x707deeffbda0, nops=1, tag=0x707deeffc040, reserved=(nil))
I0819 20:32:57.875623693 1127212 call.cc:1565] ops[0]: RECV_STATUS_ON_CLIENT metadata=0x707ed0001740 status=0x707deeffc070 details=0x707deeffc078
E0819 20:32:57.875628783 1127212 call_op_set.h:947] assertion failed: false
[1] 1126500 IOT instruction (core dumped)
Can someone please take a look? Life is easier with the tfio BQ connector.
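For anyone triaging similar dumps: the only non-OK status in the log above is the JSON payload on the call.cc:734 line, with grpc_status 14 (UNAVAILABLE in gRPC's status code table) and the message "Connection reset by peer". The assertion appears to fire after further RECV_MESSAGE and RECV_STATUS_ON_CLIENT batches are started on the same call (0x707ed00086a0) that had already received its final status. Below is a small sketch for pulling those error payloads out of a long debug log; `extract_grpc_errors` is a hypothetical helper, and it assumes the JSON payload sits at the end of the log line as it does in this dump.

```python
import json

# Subset of gRPC status codes; the numeric values are defined by gRPC.
GRPC_STATUS_NAMES = {
    0: "OK",
    1: "CANCELLED",
    2: "UNKNOWN",
    4: "DEADLINE_EXCEEDED",
    13: "INTERNAL",
    14: "UNAVAILABLE",
}


def extract_grpc_errors(log_text: str) -> list[dict]:
    """Return the JSON error payloads embedded in gRPC debug log lines."""
    errors = []
    for line in log_text.splitlines():
        start = line.find("{")
        if start == -1:
            continue  # no JSON payload on this line
        try:
            payload = json.loads(line[start:])
        except json.JSONDecodeError:
            continue  # a brace that is not part of a JSON payload
        if not isinstance(payload, dict):
            continue
        status = payload.get("grpc_status")
        errors.append(
            {
                "status": status,
                "status_name": GRPC_STATUS_NAMES.get(status, "UNRECOGNIZED"),
                "message": payload.get("grpc_message"),
                "description": payload.get("description"),
            }
        )
    return errors
```

Feeding the whole dump to this function should return a single entry for this log, which makes it easier to spot whether every crash is preceded by the same UNAVAILABLE status.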
@person142 Just curious: were you using prefetch in your dataset?
When using BigQueryClient, we get sporadic failures of the form: while streaming the data. I unfortunately can't share the code or dataset, but I can say that: