bigquery/storage/managedwriter: better documentation for AppendRows usage across goroutines

jameshartig commented 1 year ago

Is your feature request related to a problem? Please describe. I'm trying to understand how to stream high-throughput data into the default stream. Should I be making a new ManagedStream for every single request and appending a single row? Should I instead make a single ManagedStream for the whole application instance and call AppendRows from each request? Is AppendRows safe for concurrent use? If it's not safe for concurrent use then should I have a single goroutine that bundles incoming rows and appends them in batches?

Describe the solution you'd like There needs to be more docs on usage of a ManagedStream in different cases and what you should and should not do. The existing Best Practices docs do not cover the Go client.

Describe alternatives you've considered Alternatives would be just picking a path and seeing if it breaks, but it'd be nice to get some guidance ahead of time.

Additional context We have a centralized service that will be handling hundreds or thousands of rows per second and we believe the Storage Write API is the best choice for us. We plan to have a single "leader" instance accepting all rows to append from various sources but it's not obvious how the implementation should work to actually use the ManagedWriter at scale.

jameshartig commented 1 year ago

@shollyman any update on this? Thanks!

derekperkins commented 1 year ago

There has been some discussion about this, resulting in some guidance in the Go docs. They don't fully cover your questions, but they provide some color on the issue.

Related issues:

jameshartig commented 1 year ago

@derekperkins I appreciate those links. While that's helpful from a connection standpoint and does at least answer that multiple writers can share the same connection, I'm really looking for answers to the following question:

Can I share a single writer across goroutines? For example, for each incoming HTTP request I want to put a row in a single table, can I have a global writer and within each response handler call AppendRows?

If it is NOT safe for concurrent use, then is there any guidance on how to use the managedwriter with a single table that we expect to be inserting lots of rows into from lots of concurrent requests? There's a quota limit of 1,000 connections in a region [1] but because of multiplexing that doesn't necessarily translate to 1,000 writer instances since theoretically many writers can share a single connection, correct? But I imagine if you did have 1,000 writer instances they probably can't all use a single connection because of HTTP/2 stream limits. So we need to be cognizant of the per-region limit but maybe we can get away with having a pool of writers that a request can pick from and then put back into when it's done.

[1] https://cloud.google.com/bigquery/quotas#write-api-limits

shollyman commented 1 year ago

Sorry, I haven't gotten to this one yet.

The basic answer is yes, you can share writers across goroutines. Sharing a writer while appending with offsets is probably not recommended, since write ordering becomes harder to manage, but those are effectively user errors that are communicated back from the service (the offset errors in the StorageError enum). Locking is present within the connection abstraction, as that's where we correlate requests and responses. This is currently modeled as a global FIFO queue per connection, but there's work to make per-destination FIFOs a thing, which should improve queue latencies.
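
To illustrate the shared-writer pattern described above, here is a minimal, stdlib-only sketch rather than real managedwriter code. The `rowAppender` interface, `fakeWriter`, and `appendConcurrently` are hypothetical names invented for this example. The interface deliberately simplifies the real `ManagedStream.AppendRows`, which also accepts append options and returns an `*AppendResult`. In real use the rows would be serialized protocol buffer messages, not the placeholder strings used here.

```go
package main

import (
	"context"
	"fmt"
	"sync"
)

// rowAppender is a hypothetical stand-in for the single method this sketch
// needs. It is NOT the managedwriter API: the real AppendRows also takes
// options and returns an *AppendResult.
type rowAppender interface {
	AppendRows(ctx context.Context, rows [][]byte) error
}

// fakeWriter is an illustrative in-memory writer that is safe for
// concurrent use, mimicking a writer shared by many goroutines.
type fakeWriter struct {
	mu   sync.Mutex
	rows int
}

func (f *fakeWriter) AppendRows(ctx context.Context, rows [][]byte) error {
	if err := ctx.Err(); err != nil {
		return err
	}
	f.mu.Lock()
	defer f.mu.Unlock()
	f.rows += len(rows)
	return nil
}

// total reports how many rows have been appended so far.
func (f *fakeWriter) total() int {
	f.mu.Lock()
	defer f.mu.Unlock()
	return f.rows
}

// appendConcurrently simulates n request handlers, each appending one row
// through the same shared writer. It returns the number of failed appends.
func appendConcurrently(w rowAppender, n int) int {
	ctx := context.Background()
	var (
		wg     sync.WaitGroup
		mu     sync.Mutex
		failed int
	)
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func(id int) {
			defer wg.Done()
			// Placeholder payload; a real row would be a serialized proto.
			row := []byte(fmt.Sprintf("row-%d", id))
			if err := w.AppendRows(ctx, [][]byte{row}); err != nil {
				mu.Lock()
				failed++
				mu.Unlock()
			}
		}(i)
	}
	wg.Wait()
	return failed
}

func main() {
	w := &fakeWriter{} // one writer for the whole process
	failed := appendConcurrently(w, 100)
	fmt.Printf("appended=%d failed=%d\n", w.total(), failed)
}
```

With a real `ManagedStream`, the same shape would presumably apply: create the stream once at startup and call `AppendRows` from each handler, checking each append's result for errors reported by the service.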

To fully put user considerations to bed, we likely need to support scaling up the connection and load balancing logic so that a single default stream can scale beyond a single connection. With that, multiplexing can handle more of the use cases transparently.

googleapis / google-cloud-go

bigquery/storage/managedwriter: better documentation for AppendRows usage across goroutines #8485