googleapis / nodejs-bigquery

Node.js client for Google Cloud BigQuery: A fast, economical and fully-managed enterprise data warehouse for large-scale data analytics.
https://cloud.google.com/bigquery/
Apache License 2.0
462 stars 209 forks source link

bigquery: support raw configuration #7

Closed stephenplusplus closed 6 years ago

stephenplusplus commented 6 years ago

From @c0b on March 19, 2017 22:6

The nodejs API is the most flexible (comparing to other language bindings like https://github.com/GoogleCloudPlatform/google-cloud-go/issues/554#issuecomment-286887199) that allows to pass in any key value pairs in the configuration.query object which may be in the REST API support but not in Client libraries, I worked around that in the past months for query parameters

https://googlecloudplatform.github.io/google-cloud-node/#/docs/bigquery/0.8.0/bigquery?method=startQuery

now am researching the bq command line has a label feature, it seems useful to manage when there are too many load / query jobs, I want to tag them into different labels,

But from the REST API Reference, the labels should be set on configuration object, so that it can apply to query load copy extract all kinds of jobs; when I tried to pass labels: { k1: 'v1', ... } to the startQuery it seems passed to within the configuration.query object, then server side seems ignoring that, then the returned job doesn't come with any labels

https://cloud.google.com/bigquery/docs/reference/rest/v2/jobs#configuration.query

An example from bq tool:

$ bq --apilog - --nosync --format prettyjson query --label k1:v1 '#standardSQL
          SELECT 1'

INFO:root:body: {"configuration": {"query": {"query": "#standardSQL\nSELECT 1"}, "labels": {"k1": "v1"}}, "jobReference": {"projectId": "XXXXX", "jobId": "bqjob_XXXXXXXXXXXXXXXXX"}}

INFO:root:{
 "kind": "bigquery#job",
 "etag": "\"..........................................\"",
 "id": "..............",
 "selfLink": "https://www.googleapis.com/bigquery/v2/projects/.........",
 "jobReference": {
  ...
 },
 "configuration": {
  "query": {
   "query": "#standardSQL\nSELECT 1",
   "destinationTable": {
      ...
   },
   "createDisposition": "CREATE_IF_NEEDED",
   "writeDisposition": "WRITE_TRUNCATE",
   "useLegacySql": false
  },
  "labels": {
   "k1": "v1"
  }
 },
 "status": {
  "state": "RUNNING"
 },
 "statistics": {
  "creationTime": "1489959702315",
  "startTime": "1489959702555"
 },
...

This command line bq tool seems the only one support labels on job right now, I have researched all GoogleCloudPlatform/google-cloud-node GoogleCloudPlatform/google-cloud-go GoogleCloudPlatform/google-cloud-python python still needs GoogleCloudPlatform/google-cloud-python#2931 to add support for labels on dataset / tables, but NodeJS support labels on dataset / table by nature of setMetadata method calling PATCH, doesn't need any special work in the client libraries

But for labels on Job, I believe the NodeJS client still require some changes to support; the Python and Go would require similar changes as well.

Copied from original issue: GoogleCloudPlatform/google-cloud-node#2107

stephenplusplus commented 6 years ago

From @lukesneeringer on March 20, 2017 16:15

Hello @c0b. Thanks for reporting. This seems feasible. We will look into it soon.

stephenplusplus commented 6 years ago

From @c0b on March 21, 2017 1:47

for better management of different jobs, to use a customized jobid is also something needed, from listJobs I can get the jobId(s) look like these, I can see bqjob_ are from bq command line, bquijob_ are from bigqueryUI, I want all queries from my nodejs app to start with some special naming pattern,

job_ZhgN2K65s_XT1Yq7ooMJhG0TGcQ
job_xAN6iV-dcfb1XnCbpHkOdJ542JY
bquijob_74a81cc0_15aea476e69
bquijob_3c27ca25_15aea468ab4
bquijob_50d0e7e0_15aea45b2dc
job_qGBPt_F45gotmov1sdKobAgTEjk
job_gqUriVZhciWVHrSokbQ2AGIkGXo
job_oRykICSs6Tc7gm4dOQdt3nyQSpc
bqjob_r1f987f469a3e2a33_0000015
bqjob_r56059b08bb57b6ee_0000015

from the REST Reference, both sync and async job can start query with a specified jobReference with customized jobId https://cloud.google.com/bigquery/docs/reference/rest/v2/jobs

I have tried put a jobReference object on the startQuery but it seems ignored?

  // jobReference need to be on top level of POST body object for job insert REST API
  jobReference: {
    projectId: 'my-project',
    jobId: 'myjob_' + Date.now().toString(36) + '_' + Math.random().toString(36).substr(2),
  },

this feature request could be another issue if you want me to file a separate issue#

stephenplusplus commented 6 years ago

From @lukesneeringer on March 21, 2017 19:32

Yes, please file distinct issues for distinct problems. :-)

stephenplusplus commented 6 years ago

From @lukesneeringer on March 21, 2017 19:46

As best as I can tell, the issue is that the labels are not actual query parameters, but they live one level up.

We take arbitrary query parameters as part of options.params here:

  if (options.params) {
    options.useLegacySql = false;
    options.parameterMode = is.array(options.params) ? 'positional' : 'named';

    if (options.parameterMode === 'named') {
      // Redacted; not relevant.
    } else {
      options.queryParameters = options.params
        .map(BigQuery.valueToQueryParameter_);
    }

...and then it looks like this in a request:

    // Create a job.
    self.request({
      method: 'POST',
      uri: '/queries',
      json: options
    }, responseHandler);

queryParameters is not the same as configuration, which is what I see in the JSON block that the command line tool sent. I am a little afraid to just shift the labels up to be a peer of queryParameters since I am not sure what the effect will be.

@jmuk Please advise.

stephenplusplus commented 6 years ago

From @jmuk on March 21, 2017 22:8

@lukesneeringer -- it seems you're referring query method, but I think the topic here is about startQuery; and indeed it only accepts query property for configuration as https://github.com/GoogleCloudPlatform/google-cloud-node/blob/master/packages/bigquery/src/index.js#L959

But anyhow, the point is about other properties of the configuration like labels will not be specified. I have little ideas how to specify labels through options -- it will depend on which of the other properties in the configurations we want to allow. Does anyone have ideas? If it's labels specific, we can special-handle options.labels.

It's probably better to ask @stephenplusplus and @callmehiphop for the help.

stephenplusplus commented 6 years ago

From @c0b on March 23, 2017 18:53

ok, so, on the line I made this local changes to make it work, for both #2107 and #2121 to move labels and jobReference to their correct location

https://github.com/GoogleCloudPlatform/google-cloud-node/blob/master/packages/bigquery/src/index.js#L963

➸ diff -U7 -rNp -- ./node_modules/@google-cloud/bigquery/src/index.js{.orig,}
--- ./node_modules/@google-cloud/bigquery/src/index.js.orig 2017-03-23 18:48:53.730426569 +0000
+++ ./node_modules/@google-cloud/bigquery/src/index.js  2017-03-23 18:25:54.893332089 +0000
@@ -957,14 +957,23 @@ BigQuery.prototype.startQuery = function

   var body = {
     configuration: {
       query: extend(true, defaults, options)
     }
   };

+  if (body.configuration.query.labels) {
+    body.configuration.labels = body.configuration.query.labels;
+    delete body.configuration.query.labels;
+  }
+  if (body.configuration.query.jobReference) {
+    body.jobReference = body.configuration.query.jobReference;
+    delete body.configuration.query.jobReference;
+  }
+
   this.request({
     method: 'POST',
     uri: '/jobs',
     json: body
   }, function(err, resp) {
     if (err) {
       callback(err, null, resp);

although a full PR should cover query and others as well.

stephenplusplus commented 6 years ago

From @lukesneeringer on March 23, 2017 20:12

I would almost be inclined to go the other direction -- iterate over the keys we know should be under configuration and move them out into a configuration dict. That approach would be more future-proof.

stephenplusplus commented 6 years ago

From @c0b on March 23, 2017 20:17

sure, that should be a better and ultimate solution, an interface like below?

startQuery(queryOptions, configurationOptions, postBodyOptions)

labels would on configurationOptions, and jobReference on the postBodyOptions ?

stephenplusplus commented 6 years ago

Let's force a string-first argument for the query, then an options object in the format of a configuration as described here:

var options = {
  destination: bigquery.dataset('higher_education').table('institutions'),
  labels: [...],
  query: {}
}

bigquery.startQuery('SELECT url FROM ....', options, callback)

options.destination is the only custom property we have, so just remove it from options and send everything else straight through, after assigning the query string to the right place. Not tested, but something like:

BigQuery.prototype.startQuery = function(query, options, callback) {
  var self = this;

  if (!query) {
    throw new Error('A SQL query string is required.');
  }

  options = options || {};

  var configuration = extend(true, {}, options);
  delete configuration.destination;

  var defaultConfiguration = {
    query: {
      query: query
    }
  };

  if (options.destination) {
    if (!(options.destination instanceof Table)) {
      throw new Error('Destination must be a Table object.');
    }
    defaultConfiguration.query.destinationTable = {
      datasetId: options.destination.dataset.id,
      projectId: options.destination.dataset.bigQuery.projectId,
      tableId: options.destination.id
    };
  }

  var body = {
    configuration: extend(true, defaultConfiguration, options)
  };

  this.request({
    method: 'POST',
    uri: '/jobs',
    json: body
  }, ...
};
stephenplusplus commented 6 years ago

From @lukesneeringer on March 27, 2017 21:59

@stephenplusplus The problem is that the labels should not live in the configuration object.

stephenplusplus commented 6 years ago

Are you sure? That's where it is in the docs: https://cloud.google.com/bigquery/docs/reference/rest/v2/jobs#configuration.labels

I thought the problem is it shouldn't live in configuration.query.

stephenplusplus commented 6 years ago

From @lukesneeringer on March 27, 2017 22:4

You are completely right and I am completely wrong. :-)

stephenplusplus commented 6 years ago

From @c0b on April 1, 2017 8:27

before you release the fixed @google-cloud/bigquery on npm, I'm using this diff as workaround for this #2107 #2150 and #2121 waiting your better approach

--- ./node_modules/@google-cloud/bigquery/src/index.js.orig 2017-03-23 11:48:53.730426569 -0700
+++ ./node_modules/@google-cloud/bigquery/src/index.js  2017-03-28 17:30:51.561750235 -0700
@@ -957,14 +957,27 @@ BigQuery.prototype.startQuery = function

   var body = {
     configuration: {
       query: extend(true, defaults, options)
     }
   };

+  if (body.configuration.query.dryRun) {
+    body.configuration.dryRun = body.configuration.query.dryRun;
+    delete body.configuration.query.dryRun;
+  }
+  if (body.configuration.query.labels) {
+    body.configuration.labels = body.configuration.query.labels;
+    delete body.configuration.query.labels;
+  }
+  if (body.configuration.query.jobReference) {
+    body.jobReference = body.configuration.query.jobReference;
+    delete body.configuration.query.jobReference;
+  }
+
   this.request({
     method: 'POST',
     uri: '/jobs',
     json: body
   }, function(err, resp) {
     if (err) {
       callback(err, null, resp);
stephenplusplus commented 6 years ago

That works, and this should work as well, so you don't have to modify the source code:

var bigquery = require('@google-cloud/bigquery')({...})

bigquery.interceptors.push({
  request: function(reqOpts) {
    if (reqOpts.uri === '/jobs') {
      var body = reqOpts.json;

      if (body.configuration.query.dryRun) {
        body.configuration.dryRun = body.configuration.query.dryRun;
        delete body.configuration.query.dryRun;
      }

      if (body.configuration.query.labels) {
        body.configuration.labels = body.configuration.query.labels;
        delete body.configuration.query.labels;
      }

      if (body.configuration.query.jobReference) {
        body.jobReference = body.configuration.query.jobReference;
        delete body.configuration.query.jobReference;
      }
    }

    return reqOpts;
  }
});
stephenplusplus commented 6 years ago

From @c0b on April 2, 2017 18:23

thanks for the advanced usage; I just read the Interceptors from https://googlecloudplatform.github.io/google-cloud-node/#/docs/google-cloud

callmehiphop commented 6 years ago

@stephenplusplus I think that BigQuery#createJob should resolve this issue. WDYT?

stephenplusplus commented 6 years ago

Yep, that sounds right to me!