googleapis / nodejs-bigquery-storage

BigQuery Storage Node.js client

Write Disposition TRUNCATE for the Storage Write API #520

Closed BlakeHyde closed 3 weeks ago

BlakeHyde commented 3 weeks ago

I would like to be able to write to BigQuery using the Storage Write API with a writeDisposition of "truncate." I filed a support ticket for this with GCP support and was told that this seems to be a missing feature in the Node.js client, despite the behavior being documented for the Java client, with the suggestion that I file an issue here as a follow-up.

Is this a planned feature, or is there a constraint on the library design that prevents its implementation?

Thanks.

alvarowolfx commented 3 weeks ago

Hey @BlakeHyde, thanks for the report. What is your use case for that?

The BigQuery Storage Write API doesn't support any kind of WriteDisposition configuration; that's a Query/Load/Copy job configuration option in the BigQuery v2 API.

If you want to start a WriteStream with the BigQuery Storage Write API but truncate the table before doing so, you can do that with a BigQuery v2 query job (using a SQL TRUNCATE TABLE statement) and then start writing data, as sketched below.
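For illustration, a minimal sketch of that sequence, assuming the `@google-cloud/bigquery` client for the v2 query job; the project, dataset, and table names are placeholders:

```js
// Sketch: truncate the target table with a BigQuery v2 query job,
// then open a Storage Write API stream and append rows as usual.
const {BigQuery} = require('@google-cloud/bigquery');

async function truncateThenWrite() {
  const bigquery = new BigQuery();

  // Runs as a regular query job; resolves once the statement completes.
  await bigquery.query('TRUNCATE TABLE `my-project.my_dataset.my_table`');

  // ...then create a write stream with the Storage Write API
  // (e.g. via this library's managedwriter module) and write data.
}
```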

As for the article mentioned earlier, the author found out later in the same article (here) that the initial response was actually incorrect and that no such method exists in the Java SDK (again, because it's not a feature available on the API side). You can also check the Java SDK docs to see that a setWriteDisposition method doesn't exist.

BlakeHyde commented 3 weeks ago

Essentially, I want to use the Storage API for most use cases going forward, for consistency, because we have some jobs where the previous API is not acceptable due to its non-streaming "streaming" behavior. I also thought the existing API was, to an extent, deprecated; perhaps I'm mistaken about that.

For this specific project, I was hoping to stream a large amount of data into a table that, regardless of its large size, is newly calculated on a regular basis. In other words, there's no need to maintain existing data in the table over time. I can, of course, use a preceding job to delete data from the table, but I was hoping to keep the transactional characteristics of a single call with the PENDING mode set (sketched below).
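(For reference, the transactional behavior in question: with a PENDING-type stream, appended rows stay buffered until the stream is finalized and batch-committed, so the whole write lands atomically. A rough sketch following the pending-stream flow from this repo's samples; the function name and all project/dataset/table names are placeholders, and `protoDescriptor` is assumed to already match the table schema:)

```js
// Sketch: atomic write with a PENDING-type stream. Rows become visible
// only after finalize() + batchCommitWriteStream(). Placeholder names.
const {managedwriter} = require('@google-cloud/bigquery-storage');
const {WriterClient, JSONWriter} = managedwriter;

async function writePendingAtomically(rows, protoDescriptor) {
  const destinationTable =
    'projects/my-project/datasets/my_dataset/tables/my_table';
  const client = new WriterClient({projectId: 'my-project'});
  try {
    // Create a PENDING-type stream: rows are buffered until commit.
    const writeStream = await client.createWriteStreamFullResponse({
      streamType: managedwriter.PendingStream,
      destinationTable,
    });
    const connection = await client.createStreamConnection({
      streamId: writeStream.name,
    });
    const writer = new JSONWriter({connection, protoDescriptor});

    // Appended rows are not yet visible in the table.
    await writer.appendRows(rows).getResult();

    // Finalize the stream, then commit; the commit applies atomically.
    await connection.finalize();
    await client.batchCommitWriteStream({
      parent: destinationTable,
      writeStreams: [writeStream.name],
    });
  } finally {
    client.close();
  }
}
```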

One thing I'm unclear about, though, is the recommendation of the TRUNCATE TABLE call in particular. I've run into a problem using it separately before: the table then enters a "truncating" state of unpredictable length. Is there a way to overcome that?

alvarowolfx commented 3 weeks ago

@BlakeHyde I had a chat with some other engineers, and one interesting way to achieve that is to write data to a temporary table using the BQ Storage Write API and then run a LOAD job that truncates the target table with the contents of the temporary table. The LOAD job can be configured with a WriteDisposition that truncates the table; see the sketch below.
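As a sketch of that final step, assuming the `@google-cloud/bigquery` client: since a LOAD job reads from files, the table-to-table variant is expressed here as a query job over the temporary table with writeDisposition set to WRITE_TRUNCATE (a copy job with the same disposition would also work); all names are placeholders:

```js
// Sketch: after streaming rows into my_dataset.temp_table with the
// Storage Write API, atomically replace the target table's contents.
const {BigQuery} = require('@google-cloud/bigquery');

async function swapIntoTarget() {
  const bigquery = new BigQuery();
  const [job] = await bigquery.createQueryJob({
    query: 'SELECT * FROM `my-project.my_dataset.temp_table`',
    destination: bigquery.dataset('my_dataset').table('my_table'),
    writeDisposition: 'WRITE_TRUNCATE', // overwrite existing contents
  });
  await job.getQueryResults(); // wait for the job to complete
}
```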

The Spark Dataproc connector follows a similar pattern, though it uses a QUERY job with a MERGE SQL statement. For reference: https://github.com/GoogleCloudDataproc/spark-bigquery-connector/blob/14fa9b879b62a535c37c906ffb76386b8d144fef/bigquery-connector-common/src/main/java/com/google/cloud/bigquery/connector/common/BigQueryClient.java#L353

Do you think that would be a good fit for your use case?

BlakeHyde commented 3 weeks ago

Yeah, writing to a temporary table and then using a LOAD job will likely be a viable path. Thanks!

alvarowolfx commented 3 weeks ago

Cool @BlakeHyde, let me know if I can help more on that and we can reopen this issue.