googleapis / nodejs-bigquery

Node.js client for Google Cloud BigQuery: A fast, economical and fully-managed enterprise data warehouse for large-scale data analytics.
https://cloud.google.com/bigquery/
Apache License 2.0
466 stars 211 forks source link

Several issues with stateless queries #1399

Closed alex-statsig closed 1 month ago

alex-statsig commented 2 months ago

Very excited by https://github.com/googleapis/nodejs-bigquery/pull/1337, but ran into two main issues:

  1. The types seem to be incorrect for client.query(query, options) (and client.query(query)). The types indicate the return type is [ResultsArray, IJob] (note in particular that job is not nested under a key "job" for the second value), whereas in practice I'm seeing a return shape of [ResultsArray, {job: IJob}, JobStatistics]. I'm unsure if the signature will change, so would like to make sure the types/return value is correct before using this
  2. jobTimeoutMs cannot be set (only timeoutMs). Additionally, timeoutMs does not behave as expected. It seems like the request should max out at timeoutMs or some duration near that (perhaps RTT + timeoutMs). In reality, I'm observing it last for much longer if the job takes a while. It seems like the timeout is used for the initial jobsQuery request, then it is removed and the job continues to be awaited. This is critical to having some safety net for the maximum duration the fetch will take (jobTimeoutMs would be the preferred mechanism though).

Environment details

Steps to reproduce

  1. (Optionally) enable QUERY_PREVIEW_ENABLED env variable. I believe the issues reproduce without this too though For the first bug,
  2. Call client.query(jobOptions) (such as client.query({query: '<sql>'}))
  3. Observe that three elements are returned instead of two, and the shape is incorrect For the second bug,
  4. Call client.query(jobOptions, { timeoutMs: 1000 }) with a job that would take >10s to execute.
  5. Observe that the request hangs for >10s rather than aborting at 1s

Very excited to try out these preview features, but these are the issues blocking me from actually using it currently

alvarowolfx commented 2 months ago

hey @alex-statsig, thanks for the report.

For question 1:

For the type returned by the query method, for historical/backward compatibility reasons, the method signature shows that it can return both a Promise<QueryRowsResponse> (which has 3 positional parameters) or Promise<SimpleQueryRowsResponse> (which has only 2), and the signature has been the same for a while as far as I checked here with git blame.

It is indeed a bit confusion and it can either contain a Query configuration, which contains a IJob (return type is QueryRowsResponse) or it can contains just a Job directly (SimpleQueryRowsResponse). The more usual is going to be the QueryRowsResponse like you are seeing. SimpleQueryRowsResponse is returned on dryRun queries and when we have an error creating a query job. Again for historical reason has been like that and I wanted to remove SimpleQueryRowsResponse all together, but I was afraid of someone depending on that for error handling. Maybe I'll do that in the next coming breaking change scheduled for the next node version upgrade.

Weirdly enough, the 2 or 3 positional returned parameters was broken for some time, until we fixed it on PR #1365. You can see the integration test for the 3 positional parameters return with the query method here: https://github.com/googleapis/nodejs-bigquery/pull/1365/files#diff-5e68f218ffbdd4cae3bcb93aa62c0515f1fedff0f6c4fe40986d6ee9932b08a0R346

For question 2:

I'm going to investigate the timeout further here to make sure if is working properly. But to give more context, for jobTimeoutMs: the jobs.query endpoint basically does a job creation (or not with stateless queries) and get the query results on the same call. But sometimes that query job can take a longer time to run, which is where the timeoutMsconfiguration comes in, where it controls how long you want to way for the query job to be done, before the client side has to fallback to jobs.getQueryResults. And in that path, we are not allowed to control the BQ job timeout. The BQ Job timeout is how long do you want the job run before it is cancelled and this is not even precise, because as it's a distributed system, cancelling the query can take a bit longer than the configured timeout.

You can see details on jobTimeoutMs here in the docs: https://cloud.google.com/bigquery/docs/reference/rest/v2/Job and timeoutMs for jobs.query here: https://cloud.google.com/bigquery/docs/reference/rest/v2/jobs/query#queryrequest

But I'll investigate it here, because maybe the call to jobs.query it's indeed getting timed out, but if auto pagination is on, we are fallback'ing to fetch the results using jobs.getQueryResults. Which is a similar scenario that you reported that might be happening.

alex-statsig commented 2 months ago

Thanks so much for the detailed/quick reply!

For question 1: ...

Got it, yeah I mostly wanted to understand exactly what types I should expect (i.e. are SimpleQueryRowsResponse and QueryRowsResponse both expected viable outputs I should handle).

One further question on this though, the types indicate QueryRowsResponse will be returned if query is a string, whereas SimpleQueryRowsResponse is returned when query is a Query object. In my case, I'm passing in a Query object but getting back 3 positional arguments (what seems to be a QueryRowsResponse). So it's not just that the types are ambiguous, they seem to actually be incorrect from what I can tell.

In general my goal is to log statistics about the query after running it, but it's very confusing figuring out where I should grab the information needed to lookup the job statistics. I currently settled on something like const [rows, poorlyTypedJob] = await client.query(queryObj, options), followed by const realJob = (poorlyTypedJob as any).job ?? poorlyTypedJob; (since the types indicate poorlyTypedJob will be IJob, but in practice I'm seeing the job under poorlyTypedJob.job). I don't feel comfortable using such a hack in production though, but maybe I can just cast the original response to SimpleQueryRowsResponse | QueryRowsResponse (since you seem to be saying that's the expected return shape) and ensure I handle both of those correctly.

And in that path, we are not allowed to control the BQ job timeout.

Yeah my main concern is there's no way to control costs with this function (i.e. a runaway query could keep running for a while right?). But I guess this is independent of this driver, and more related to the underlying feature. I'm unsure why this would be still, but maybe because "it's a distributed system" it requires the overhead of a stateful job to do that?

But I'll investigate it here, because maybe the call to jobs.query it's indeed getting timed out, but if auto pagination is on, we are fallback'ing to fetch the results using jobs.getQueryResults. Which is a similar scenario that you reported that might be happening.

Yeah I think this is what I've observed; it seemed to me like "jobComplete" was false in the initial request due to the timeout, but then it automatically retried without the timeoutMs (logic here)

alvarowolfx commented 2 months ago

the types indicate QueryRowsResponse will be returned if query is a string, whereas SimpleQueryRowsResponse is returned when query is a Query object

This is not the expected behavior, you should get a QueryRowsResponse regardless of using a Query as object or just a SQL string. Do you have a repro case where you see a SimpleQueryRowsResponse that is not an error case like I described before ?

In general my goal is to log statistics about the query after running it

If you planning to use JobStatistics, you should not rely on the query method, as in some cases, there would not be even a BQ job, like in a case we have a stateless query ( for now is in preview, but we are planning to make it the default behavior). This sample would the right way to always have a Job: https://github.com/googleapis/nodejs-bigquery/blob/main/samples/query.js

alex-statsig commented 2 months ago

This is not the expected behavior, you should get a QueryRowsResponse regardless of using a Query as object or just a SQL string

The types here indicate otherwise though right? There is an explicit overload saying if you pass Query, the return type will be SimpleQueryRowsResponse (not QueryRowsResponse). If the type was SimpleQueryRowsResponse | QueryRowsResponse that would be fine, but it seems explicitly incorrect to me

It seems like I am getting QueryRowsResponse for a Query object, so I think you're correct about what is actually returned

If you planning to use JobStatistics, you should not rely on the query method, as in some cases, there would not be even a BQ job, like in a case we have a stateless query ( for now is in preview, but we are planning to make it the default behavior). This sample would the right way to always have a Job: https://github.com/googleapis/nodejs-bigquery/blob/main/samples/query.js

Got it, my understanding was you could still look up statistics for a jobless query (ex. the docs mention "Instead, it returns a queryId field which you can use to get insights about the query using the INFORMATION_SCHEMA.JOBS view"). I don't need JobStatistics, but if at all possible for cost-monitoring purposes I would like to track the slot usage from my application still. And it does seem like I'm able to do that still with client.query() and QUERY_PREVIEW_ENABLED but maybe I'm mis-using something edit: seems like the particular query I was testing must have not been eligible due to temp tables being used (I was testing a more expensive one to see how timeouts behaved), I see jobCreationReason "other". Testing on a simpler query, the jobstatistics doesn't work with the queryID.

alvarowolfx commented 2 months ago

hey @alex-statsig, I got a bit busy this week with other things and could not check yet the timeout issue yet, but I'll try to get to it today as early next week.

Got it, my understanding was you could still look up statistics for a jobless query (ex. the docs mention "Instead, it returns a queryId field which you can use to get insights about the query using the INFORMATION_SCHEMA.JOBS view"). I don't need JobStatistics, but if at all possible for cost-monitoring purposes I would like to track the slot usage from my application still. And it does seem like I'm able to do that still with client.query() and QUERY_PREVIEW_ENABLED but maybe I'm mis-using something

I think you're right, but I need to double check that, I think queryId were not showing up yet on that view, but I'll confirm that too.

alvarowolfx commented 1 month ago

hey @alex-statsig I've just opened https://github.com/googleapis/nodejs-bigquery/pull/1402 to fix the timeout issue when using jobs.query. I asked for some internal reviews, but we should get this out this week.