tensorflow / io

Dataset, streaming, and file system extensions maintained by TensorFlow SIG-IO
Apache License 2.0
706 stars 288 forks

Sporadic failures with `BigQueryClient` #1773

Open person142 opened 1 year ago

person142 commented 1 year ago

When using BigQueryClient, we get sporadic failures of the form:

E0225 00:45:10.643495675    3362 oauth2_credentials.cc:220]  oauth_fetch: {"created":"@1677285910.643393509","description":"Failed HTTP requests to all targets","file":"external/com_github_grpc_grpc/src/core/lib/http/httpcli.cc","file_line":202,"referenced_errors":[{"created":"@1677285910.640775967","description":"Failed HTTP/1 client request","file":"external/com_github_grpc_grpc/src/core/lib/http/httpcli.cc","file_line":113,"referenced_errors":[{"created":"@1677285910.640730717","description":"Failed to connect to remote host: FD Shutdown","file":"external/com_github_grpc_grpc/src/core/lib/iomgr/lockfree_event.cc","file_line":195,"os_error":"Timeout occurred","referenced_errors":[{"created":"@1677285910.640707592","description":"connect() timed out","file":"external/com_github_grpc_grpc/src/core/lib/iomgr/tcp_client_posix.cc","file_line":112}],"target_address":"ipv4:172.253.115.95:443"},{"created":"@1677285910.640954259","description":"Failed to connect to remote host: FD Shutdown","file":"external/com_github_grpc_grpc/src/core/lib/iomgr/lockfree_event.cc","file_line":195,"os_error":"Timeout occurred","referenced_errors":[{"created":"@1677285910.640944467","description":"connect() timed out","file":"external/com_github_grpc_grpc/src/core/lib/iomgr/tcp_client_posix.cc","file_line":112}],"target_address":"ipv4:172.253.122.95:443"},{"created":"@1677285910.641089675","description":"Failed to connect to remote host: FD Shutdown","file":"external/com_github_grpc_grpc/src/core/lib/iomgr/lockfree_event.cc","file_line":195,"os_error":"Timeout occurred","referenced_errors":[{"created":"@1677285910.641084800","description":"connect() timed out","file":"external/com_github_grpc_grpc/src/core/lib/iomgr/tcp_client_posix.cc","file_line":112}],"target_address":"ipv4:142.250.31.95:443"},{"created":"@1677285910.643110925","description":"Failed to connect to remote host: FD 
Shutdown","file":"external/com_github_grpc_grpc/src/core/lib/iomgr/lockfree_event.cc","file_line":195,"os_error":"Timeout occurred","referenced_errors":[{"created":"@1677285910.643100300","description":"connect() timed out","file":"external/com_github_grpc_grpc/src/core/lib/iomgr/tcp_client_posix.cc","file_line":112}],"target_address":"ipv4:142.251.111.95:443"},{"created":"@1677285910.643207800","description":"Failed to connect to remote host: FD Shutdown","file":"external/com_github_grpc_grpc/src/core/lib/iomgr/lockfree_event.cc","file_line":195,"os_error":"Timeout occurred","referenced_errors":[{"created":"@1677285910.643204009","description":"connect() timed out","file":"external/com_github_grpc_grpc/src/core/lib/iomgr/tcp_client_posix.cc","file_line":112}],"target_address":"ipv4:142.251.16.95:443"},{"created":"@1677285910.643301592","description":"Failed to connect to remote host: FD Shutdown","file":"external/com_github_grpc_grpc/src/core/lib/iomgr/lockfree_event.cc","file_line":195,"os_error":"Timeout occurred","referenced_errors":[{"created":"@1677285910.643296925","description":"connect() timed out","file":"external/com_github_grpc_grpc/src/core/lib/iomgr/tcp_client_posix.cc","file_line":112}],"target_address":"ipv4:142.251.163.95:443"},{"created":"@1677285910.643386550","description":"Failed to connect to remote host: FD Shutdown","file":"external/com_github_grpc_grpc/src/core/lib/iomgr/lockfree_event.cc","file_line":195,"os_error":"Timeout occurred","referenced_errors":[{"created":"@1677285910.643382384","description":"connect() timed out","file":"external/com_github_grpc_grpc/src/core/lib/iomgr/tcp_client_posix.cc","file_line":112}],"target_address":"ipv4:172.253.62.95:443"}]}]}
E0225 00:45:10.644405009    3354 call_op_set.h:947]          assertion failed: false
Aborted

while streaming the data. I unfortunately can't share the code or dataset, but I can say that:

perretv commented 1 year ago

We also observed this issue on my team. It led us to abandon tfio for querying BigQuery tables, because the gRPC `assertion failed` error occurred consistently. It turns out that you can build a TensorFlow dataset with the native Google Cloud BigQuery client, and it does not suffer from this problem:

from typing import Generator

import tensorflow as tf
from google.cloud import bigquery

client = bigquery.Client()

QUERY = "SELECT * FROM project_id.dataset.table_id"

def bigquery_generator() -> Generator[dict[str, tf.Tensor], None, None]:
    """Yield each BigQuery result row as a dict of tensors keyed by column name."""
    # Run the query inside the generator so that each pass over the dataset
    # starts from a fresh result iterator.
    rows = client.query(QUERY).result()
    for row in rows:
        yield {
            "column1": tf.convert_to_tensor(row["column1"]),
            "column2": tf.convert_to_tensor(row["column2"]),
        }

# output_signature must mirror the structure the generator yields:
# a dict with one TensorSpec per column, not a tuple.
tf_dataset = tf.data.Dataset.from_generator(
    bigquery_generator,
    output_signature={
        "column1": tf.TensorSpec(shape=(), dtype=tf.string),
        "column2": tf.TensorSpec(shape=(), dtype=tf.string),
    },
)

Obviously, you will need to adapt the code to match the table data schema :)
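Since the failures in this thread look like transient network errors, a generator-based pipeline can also be wrapped in a small retry loop. The following is only a sketch under stated assumptions: `resilient_rows` and `open_stream` are hypothetical names, the exception types worth retrying depend on the client library you use, and resuming from an offset only makes sense if the query returns rows in a stable order. It cannot help with the tfio reader itself, where the process aborts before Python sees an exception.

```python
import time
from typing import Callable, Iterator, Tuple, Type, TypeVar

T = TypeVar("T")


def resilient_rows(
    open_stream: Callable[[int], Iterator[T]],
    retry_on: Tuple[Type[BaseException], ...] = (ConnectionError,),
    max_retries: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Iterator[T]:
    """Yield rows from open_stream(offset), reopening it after transient errors.

    open_stream is a hypothetical callable that returns an iterator of rows
    starting at the given zero-based row offset.
    """
    offset = 0    # rows already handed to the consumer
    failures = 0  # consecutive failures since the last successful row
    while True:
        try:
            for row in open_stream(offset):
                yield row
                offset += 1
                failures = 0
            return  # stream finished normally
        except retry_on:
            failures += 1
            if failures > max_retries:
                raise  # give up and surface the original error
            # Exponential backoff: base_delay, 2 * base_delay, 4 * base_delay, ...
            sleep(base_delay * 2 ** (failures - 1))


if __name__ == "__main__":
    attempts = []

    def flaky_stream(offset: int) -> Iterator[str]:
        """Toy stream that fails once at row 2, then works."""
        attempts.append(offset)
        data = ["a", "b", "c", "d"]
        for i in range(offset, len(data)):
            if i == 2 and len(attempts) == 1:
                raise ConnectionError("connection reset by peer")
            yield data[i]

    print(list(resilient_rows(flaky_stream, sleep=lambda _: None)))
```

With the BigQuery generator above, `open_stream` would be a small function that runs the query and starts reading at the given row. That part is left out here because it depends on your query and table.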

nishprabhu commented 4 months ago

Any update on this issue? I am still facing this problem as of io version 0.37.1.

mvidyasagar-sc commented 3 months ago

Hello team, we are looking to use tfio in production for training on BQ tables. Can anyone confirm whether this issue will be worked on?

mvidyasagar-sc commented 3 months ago

The failure happens almost every time after the job has been running for 3-4 hours. I checked the gRPC debug logs; here is the dump from just before it fails. Hope it helps.

I0819 20:32:57.854680544 1127204 completion_queue.cc:764]    cq_end_op_for_pluck(cq=0x707df4008120, tag=0x707e1dffa060, error="No Error", done=0x707f4120f53f, done_arg=0x707df400b4e0, storage=0x707df400b550)
I0819 20:32:57.854700572 1127204 completion_queue.cc:1298]   RETURN_EVENT[0x707df4008120]: OP_COMPLETE: tag:0x707e1dffa060 OK
I0819 20:32:57.858648025 1127212 call.cc:1964]               grpc_call_start_batch(call=0x707ea8008520, ops=0x707deeffbdc0, nops=1, tag=0x707deeffc060, reserved=(nil))
I0819 20:32:57.858673705 1127212 call.cc:1565]               ops[0]: RECV_MESSAGE ptr=0x707deeffc088
I0819 20:32:57.858719982 1127212 completion_queue.cc:764]    cq_end_op_for_pluck(cq=0x707ea8008190, tag=0x707deeffc060, error="No Error", done=0x707f4120f53f, done_arg=0x707ea800b560, storage=0x707ea800b5d0)
I0819 20:32:57.858736681 1127212 completion_queue.cc:1298]   RETURN_EVENT[0x707ea8008190]: OP_COMPLETE: tag:0x707deeffc060 OK
I0819 20:32:57.862616068 1127212 call.cc:1964]               grpc_call_start_batch(call=0x707e6000c760, ops=0x707deeffbdc0, nops=1, tag=0x707deeffc060, reserved=(nil))
I0819 20:32:57.862655671 1127212 call.cc:1565]               ops[0]: RECV_MESSAGE ptr=0x707deeffc088
I0819 20:32:57.862723114 1127212 completion_queue.cc:764]    cq_end_op_for_pluck(cq=0x707e60004500, tag=0x707deeffc060, error="No Error", done=0x707f4120f53f, done_arg=0x707e6000f7a0, storage=0x707e6000f810)
I0819 20:32:57.862748906 1127212 completion_queue.cc:1298]   RETURN_EVENT[0x707e60004500]: OP_COMPLETE: tag:0x707deeffc060 OK
I0819 20:32:57.875126214 1127212 call.cc:1964]               grpc_call_start_batch(call=0x707ed00086a0, ops=0x707deeffbdc0, nops=1, tag=0x707deeffc060, reserved=(nil))
I0819 20:32:57.875163922 1127212 call.cc:1565]               ops[0]: RECV_MESSAGE ptr=0x707deeffc088
I0819 20:32:57.875238139 1127212 completion_queue.cc:764]    cq_end_op_for_pluck(cq=0x707ed0008330, tag=0x707deeffc060, error="No Error", done=0x707f4120f53f, done_arg=0x707ed000b6e0, storage=0x707ed000b750)
I0819 20:32:57.875256129 1127212 completion_queue.cc:1298]   RETURN_EVENT[0x707ed0008330]: OP_COMPLETE: tag:0x707deeffc060 OK
I0819 20:32:57.875265760 1127212 call.cc:1964]               grpc_call_start_batch(call=0x707ed00086a0, ops=0x707deeffbda0, nops=1, tag=0x707deeffc040, reserved=(nil))
I0819 20:32:57.875274654 1127212 call.cc:1565]               ops[0]: RECV_STATUS_ON_CLIENT metadata=0x707ed0001740 status=0x707deeffc070 details=0x707deeffc078
D0819 20:32:57.875358205 1127212 call.cc:733]                set_final_status CLI
D0819 20:32:57.875381066 1127212 call.cc:734]                {"created":"@1724099577.875340028","description":"Error received from peer ipv4:74.125.197.95:443","file":"external/com_github_grpc_grpc/src/core/lib/surface/call.cc","file_line":1056,"grpc_message":"Connection reset by peer","grpc_status":14}
I0819 20:32:57.875394529 1127212 completion_queue.cc:764]    cq_end_op_for_pluck(cq=0x707ed0008330, tag=0x707deeffc040, error="No Error", done=0x707f4120f53f, done_arg=0x707dd071ea30, storage=0x707dd071eaa0)
I0819 20:32:57.875404808 1127212 completion_queue.cc:1298]   RETURN_EVENT[0x707ed0008330]: OP_COMPLETE: tag:0x707deeffc040 OK
I0819 20:32:57.875576937 1127212 call.cc:1964]               grpc_call_start_batch(call=0x707ed00086a0, ops=0x707deeffbdc0, nops=1, tag=0x707deeffc060, reserved=(nil))
I0819 20:32:57.875586053 1127212 call.cc:1565]               ops[0]: RECV_MESSAGE ptr=0x707deeffc088
I0819 20:32:57.875595125 1127212 completion_queue.cc:764]    cq_end_op_for_pluck(cq=0x707ed0008330, tag=0x707deeffc060, error="No Error", done=0x707f4120f53f, done_arg=0x707ed000b6e0, storage=0x707ed000b750)
I0819 20:32:57.875606287 1127212 completion_queue.cc:1298]   RETURN_EVENT[0x707ed0008330]: OP_COMPLETE: tag:0x707deeffc060 OK
I0819 20:32:57.875615791 1127212 call.cc:1964]               grpc_call_start_batch(call=0x707ed00086a0, ops=0x707deeffbda0, nops=1, tag=0x707deeffc040, reserved=(nil))
I0819 20:32:57.875623693 1127212 call.cc:1565]               ops[0]: RECV_STATUS_ON_CLIENT metadata=0x707ed0001740 status=0x707deeffc070 details=0x707deeffc078
E0819 20:32:57.875628783 1127212 call_op_set.h:947]          assertion failed: false
[1]    1126500 IOT instruction (core dumped) 

Could someone please take a look? Life is easier with the tfio BQ connector.
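For anyone reading the dump above, the key line is the JSON payload logged right after `set_final_status`. A minimal sketch for decoding it follows. `final_status` and `GRPC_STATUS_NAMES` are hypothetical helper names, and the status table is only a subset of the standard gRPC codes. Status 14 is UNAVAILABLE, meaning the peer reset the connection. In the log, the assertion then appears to fire after a second `RECV_STATUS_ON_CLIENT` batch is started on the same call (0x707ed00086a0).

```python
import json
import re
from typing import Optional

# Subset of the standard gRPC status codes, for readability only.
GRPC_STATUS_NAMES = {
    0: "OK",
    1: "CANCELLED",
    4: "DEADLINE_EXCEEDED",
    13: "INTERNAL",
    14: "UNAVAILABLE",
}


def final_status(log_line: str) -> Optional[dict]:
    """Return the JSON payload embedded in a gRPC debug log line, or None."""
    match = re.search(r"\{.*\}", log_line)
    if match is None:
        return None
    try:
        return json.loads(match.group(0))
    except json.JSONDecodeError:
        return None


# The set_final_status payload line, copied from the dump above.
LINE = (
    'D0819 20:32:57.875381066 1127212 call.cc:734]                '
    '{"created":"@1724099577.875340028",'
    '"description":"Error received from peer ipv4:74.125.197.95:443",'
    '"file":"external/com_github_grpc_grpc/src/core/lib/surface/call.cc",'
    '"file_line":1056,"grpc_message":"Connection reset by peer","grpc_status":14}'
)

status = final_status(LINE)
if status is not None:
    name = GRPC_STATUS_NAMES.get(status["grpc_status"], "UNKNOWN CODE")
    print(name, "-", status["grpc_message"])
```

This only decodes what the log already says. It does not explain why the reader starts another batch on a call that has already received its final status.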

manish181192 commented 3 months ago

@person142 Just curious: were you using prefetch in your dataset?