mbhynes commented 1 year ago

Summary

This PR creates an adapter for Google BigQuery based on the bq CLI tool
It incorporates and builds on the commits in https://github.com/tpope/vim-dadbod/pull/118 (thanks @mmngreco!) and addresses the feedback provided by @tpope there

Issues Addressed

Implementation Notes

The adapter calls the bq query command and passes the buffer query to stdin

Questions / Notes for @tpope

(resolved) Context: `bq` authentication

Invocations of the bq CLI are typically authenticated using the user's [or agent's] current google cloud SDK credentials, either directly from variables in the shell environment or from local configuration files
This environment-based authentication means that there isn't typically a need to provide user/password credentials in a connection string when connecting to BigQuery, unlike many OLTP databases (and their existing dadbod adapters)
This was the root of the problem encountered previously in https://github.com/tpope/vim-dadbod/pull/118, raised in this review thread:
- IIUC the issue stemmed from the db#connect calls in which vim-dadbod assumes that some credentials need to be passed to the database client CLI to open an interactive session
- Since there isn't anything to pass in, this is problematic and causes BQ to read an empty file to the stdin (it's expecting a query, not interactive input)

(resolved) Duplicate `bq` Calls

To address the auth issue and still pass the query using the stdin, I've implemented a dummy auth input that passes select 1 to bq:
This means that there are 2 round trips made to BQ to execute a query (which does technically confirm that the user can authenticate), and since it's an OLAP database it's pretty damn slow to do this; each query, small or large, has 1-2 seconds overhead
All in all it means that simple queries are twice as slow as they need to be. The extra 2 seconds probably doesn't matter for long-running queries, but this could definitely be improved. However, I don't know how to improve it based on my understanding of the current codebase;
- If the auth step could be skipped, that would be ideal!
- However if skipping it requires using the #inputs method that provides a file, this will be problematic for bq, since it only accepts a comandline string or stdin; there is AFAIK no way to pass a file according to the docs

Downstream `vim-dadbod-ui` Usage

This also integrates with the dadbod-ui as per https://github.com/kristijanhusak/vim-dadbod-ui/pull/160

asdf8601 commented 1 year ago

Hey @mbhynes, thank you so much for your time on this, it's really appreciated (I couldn't find time to do it by my own) ;-)

from the user point of view, I've tried it and it works almost as expected! The only unexpected behaviour I found is when you call to DB without a query :DB bigquery://?format=sparse&use_legacy_sql=false then the whole screen get blank and seems frozen but a C-c returns everything back. To be honest not sure what others adapters do in the same situation.

mbhynes commented 1 year ago

Hey @mbhynes, thank you so much for your time on this, it's really appreciated (I couldn't find time to do it by my own) ;-)

No worries, thanks for giving me a starting point!

from the user point of view, I've tried it and it works almost as expected! The only unexpected behaviour I found is when you call to DB without a query :DB bigquery://?format=sparse&use_legacy_sql=false then the whole screen get blank and seems frozen but a C-c returns everything back.

Thanks for testing this; this is definitely an issue to fix for people who need legacy sql. It might have something to do with how I set a default dictionary of keys to merge with any user supplied values. I'll double check how the extend function works in vim (I may be too used to merging dictionaries in python and have introduced a bug)

tpope commented 1 year ago

Invocations of the bq CLI are typically authenticated using the user's [or agent's] current google cloud SDK credentials, either directly from variables in the shell environment or from local configuration files

This environment-based authentication means that there isn't typically a need to provide user/password credentials in a connection string when connecting to BigQuery, unlike many OLTP databases (and their existing dadbod adapters)

This is all well and good. Just to cover all our bases, when you say "typically", does that mean there's an exception?

This was the root of the problem encountered previously in enh(adapater): add bigquery support #118, raised in this review thread:

IIUC the issue stemmed from the db#connect calls in which vim-dadbod assumes that some credentials need to be passed to the database client CLI to open an interactive session

Since there isn't anything to pass in, this is problematic and causes BQ to read an empty file to the stdin (it's expecting a query, not interactive input)

Duplicate bq Calls

To address the auth issue and still pass the query using the stdin, I've implemented a dummy auth input that passes select 1 to bq:

This is indeed the solution intended by the existing design.

This means that there are 2 round trips made to BQ to execute a query (which does technically confirm that the user can authenticate), and since it's an OLAP database it's pretty damn slow to do this; each query, small or large, has 1-2 seconds overhead

All in all it means that simple queries are twice as slow as they need to be. The extra 2 seconds probably doesn't matter for long-running queries, but this could definitely be improved. However, I don't know how to improve it based on my understanding of the current codebase;

If the auth step could be skipped, that would be ideal!

However if skipping it requires using the #inputs method that provides a file, this will be problematic for bq, since it only accepts a comandline string or stdin; there is AFAIK no way to pass a file according to the docs

Before I dig into the rest of this, can you clarify: If BigQuery hypothetically did support a filename argument, and you could provide a #input method, would that solve the problem? If so, how? (I'm not seeing how it would change things, but if I'm wrong, that would absolutely inform how I proceed here.)

mbhynes commented 1 year ago

This is all well and good. Just to cover all our bases, when you say "typically", does that mean there's an exception?

I don't think so---at least not one that matters for this use case. As an aside, it's also common enough to use json credential files that store short-lived keys when applications interact with google cloud using service accounts. But even if a human user wants to authenticate and run queries as a service agent, they can still do the agent auth prior to running bq and use credentials in the environment.

Before I dig into the rest of this, can you clarify: If BigQuery hypothetically did support a filename argument, and you could provide a #input method, would that solve the problem? If so, how? (I'm not seeing how it would change things, but if I'm wrong, that would absolutely inform how I proceed here.)

No I don't believe so. I'm sorry to have implied that; looking again at the db.vim file and running a few tests I see that there's always a preliminary auth query sent to bigquery, so IIUC the current implementation will always send 2 queries.

Just as an aside, between a solution using #input or #interactive, passing the path of the tmpfile containing the query with --flagfile <filename> option does actually work as intended if the dummy auth select 1; is provided by #auth_input, i.e. if the adapter has the following 2 functions:

function! db#adapter#bigquery#auth_input() abort
  return 'select 1;'
endfunction

function! db#adapter#bigquery#input(url, in) abort
  return s:command_for_url(a:url, 'query') + ['--flagfile', a:in]
endfunction

However between using stdin or the --flagfile from the previous PR, I think stdin is much safer because it's very well documented in both the CLI and on the internet as the intended way to pass a query to bq if not specified as an inline string. The --flagfile option is documented as a way to provide commandline flag arguments in a file. It happens to allow a query in a file that doesn't begin with -- to be parsed as a string, but it's not explicitly documented to work this way...

mbhynes commented 1 year ago

from the user point of view, I've tried it and it works almost as expected! The only unexpected behaviour I found is when you call to DB without a query :DB bigquery://?format=sparse&use_legacy_sql=false then the whole screen get blank and seems frozen but a C-c returns everything back. To be honest not sure what others adapters do in the same situatio

@mmngreco I made the edits to remove the default to standard sql, and sample queries work for me using that adapter address! Could you please try this again on your setup to confirm it's fixed for you?

tpope commented 1 year ago

However between using stdin or the --flagfile from the previous PR, I think stdin is much safer

The only reason to bother with #input is if it offers a distinct advantage over stdin, for example, error messages that contain the filename, or if stdin isn't supported at all. I think we can safely rule it out here.

The first and most obvious purpose of the auth check call is to detect the need for a password and subsequently prompt for it if necessary. But it also serves a more general purpose of verifying we can access the database at all, and reporting an error if not. So the first question is, does bq have another way to check for successful access that's any faster? If so, we could slot that into the existing design with minimal disruption.

To get rid of the check entirely I need you to test what happens exactly when there's an error (e.g., not signed into Google Cloud SDK) and report back. Try both with the default #auth_pattern workflow, and then again with it short circuited, like this:

diff --git i/autoload/db.vim w/autoload/db.vim
index 2ba7ab9..04c7095 100644
--- i/autoload/db.vim
+++ w/autoload/db.vim
@@ -321,6 +321,7 @@ function! db#connect(url) abort
   if has_key(s:passwords, resolved)
     let url = substitute(resolved, '://[^:/@]*\zs@', ':'.db#url#encode(s:passwords[url]).'@', '')
   endif
+  return url
   let pattern = db#adapter#call(url, 'auth_pattern', [], 'auth\|login')
   let input = tempname()
   try

tpope commented 1 year ago

Hey @mbhynes, thank you so much for your time on this, it's really appreciated (I couldn't find time to do it by my own) ;-)

from the user point of view, I've tried it and it works almost as expected! The only unexpected behaviour I found is when you call to DB without a query :DB bigquery://?format=sparse&use_legacy_sql=false then the whole screen get blank and seems frozen but a C-c returns everything back. To be honest not sure what others adapters do in the same situation.

What normally happens when you don't give a query is it starts an interactive session. Try :DB sqlite: to see what I mean. Does BigQuery support anything like this?

mbhynes commented 1 year ago

What normally happens when you don't give a query is it starts an interactive session. Try :DB sqlite: to see what I mean. Does BigQuery support anything like this?

No, unfortunately not.

mbhynes commented 1 year ago

So the first question is, does bq have another way to check for successful access that's any faster? If so, we could slot that into the existing design with minimal disruption.

Possibly; there's a flag that tells bq to validate but not run a query. I can look at this and dig into the docs for bq or gcloud later to see if there's a way to do this that cuts down on the round-trip time.

To get rid of the check entirely I need you to test what happens exactly when there's an error (e.g., not signed into Google Cloud SDK) and report back. Try both with the default #auth_pattern workflow, and then again with it short circuited, like this:
diff --git i/autoload/db.vim w/autoload/db.vim
index 2ba7ab9..04c7095 100644
--- i/autoload/db.vim
+++ w/autoload/db.vim
@@ -321,6 +321,7 @@ function! db#connect(url) abort
   if has_key(s:passwords, resolved)
     let url = substitute(resolved, '://[^:/@]*\zs@', ':'.db#url#encode(s:passwords[url]).'@', '')
   endif
+  return url
   let pattern = db#adapter#call(url, 'auth_pattern', [], 'auth\|login')
   let input = tempname()
   try

Thanks for this suggestion; I'll take a look and get back to you with more info.

tpope commented 1 year ago

What normally happens when you don't give a query is it starts an interactive session. Try :DB sqlite: to see what I mean. Does BigQuery support anything like this?

No, unfortunately not.

If interactive sessions aren't supported, then #interactive should not be provided. Instead, rename it to #filter.

mbhynes commented 1 year ago

To get rid of the check entirely I need you to test what happens exactly when there's an error (e.g., not signed into Google Cloud SDK) and report back. Try both with the default #auth_pattern workflow, and then again with it short circuited, like this [...]

Hi Tim, thanks for suggesting this, and please find the results below:

Bad query with dummy #auth_input():
- In this setup I provided a query that will fail due to a simple authorization error, as I've provided a project_id that is invalid or to which I don't have access: :DB bigquery:///?format=sparse&project_id=bad select * fromregion-us.INFORMATION_SCHEMA.TABLES;
- The result is the expected error message from bq, displayed in red at the bottom of the window:
Bad query with the above diff applied:
- I applied the diff above with git apply and then ran the same query
- In this case, the error is still shown, but it's shown in the result buffer of the query

The second case is what happens normally off the head of this branch (without the diff applied) if I run an invalid query. Using a typo in the query on purpose, I get the same result as case 2 (with a different error message of course), which makes sense since the dummy auth select 1 query will have succeeded, allowing the 2nd query with the typo to run:

:DB bigquery:/// select * from `region-us`.INFORMATION_SCHEMA.TABLESs;

mbhynes commented 1 year ago

Possibly; there's a flag that tells bq to validate but not run a query. I can look at this and dig into the docs for bq or gcloud later to see if there's a way to do this that cuts down on the round-trip time.

This doesn't look promising; there's a flag to validate the query but bq query --dry_run 'select 1' takes just as long for me as as bq query 'select 1'

tpope commented 1 year ago

What about other commands, like bq show or bq list?

tpope commented 1 year ago

Unrelated to anything above, I would like to push back against the use of query parameters here. URLs are used by Dadbod to encapsulate connection parameters. They are extremely well suited to the purpose: user, password, host, and port map cleanly onto 4 common database parameters; the database name conceptually fits in the path field; and database-specific odds-and-ends can be slotted into arbitrary query parameters. BigQuery, lacking the big 4 parameters, is an outlier that doesn't really benefit from the use of a URL, but that's fine, so is SQLite and we still support that.

But the important part here is that URLs are for "connection parameters". That covers things like --api=https://example.com (side note: maybe host is a valid parameter after all?) and --ca_certificates_file=PATH. It absolutely does not cover --format=sparse. I think the Dadbod position would be if you want --format=sparse, you should put it in .bigqueryrc, and if you want it on a per-query basis, well, there isn't a provided solution for that. Sticking it in the query parameters goes against the grain because the whole design is built around reusing URLs from request to request (e.g., by saving them in variables), not constantly tweaking them for presentation reasons.

At most, I would be willing to tolerate allowing arbitrary query parameters for arbitrary command-line options, as this may indeed be better than nothing, but the examples in the documentation should be limited to officially sanctioned connection parameters.

My objections don't end there. BigQuery accepts 2 different kinds of command-line options: global options, and subcommand options. These are distinguished by their position either before or after the name of the subcommand. The adapter handles these two cases by accepting them all mixed together in the same namespace, and then teasing them apart with hard-coded a list of every single global option. I'm not even going to bother explaining why this is unacceptable. If we really must accept not just global but also subcommand options, then they should be in different namespaces. My proposal would be for subcommand options to follow the common [] practice for nested parameters, for example ?query[use_legacy_sql]=true. Although dropping subcommand options entirely seems like a reasonable approach too: None of them look like connection parameters to me.

mbhynes commented 1 year ago

What about other commands, like bq show or bq list?

Those commands will still have the extra few seconds overhead, and won't strictly speaking be able to guarantee that a subsequent query can be run successfully because of nature of IAM permissions for user accounts. A user may be able to successfully list a dataset's tables, but that doesn't necessarily mean that the user can query those tables or launch a job. The access control is fine-grained and depends on the policies of the org you're working in.

For more context too, it's pretty seldom that a user has to actually authenticate to bigquery. You might run gcloud auth login when you need to set the user account you're working with and obtain an access token, but unless you change that user, you probably won't change that token or re-run the auth for a long time. The token authentication is pretty different than authentication to a lot of OLTP dbs

mbhynes commented 1 year ago

Unrelated to anything above, I would like to push back against the use of query parameters here. URLs are used by Dadbod to encapsulate connection parameters. They are extremely well suited to the purpose: user, password, host, and port map cleanly onto 4 common database parameters; the database name conceptually fits in the path field; and database-specific odds-and-ends can be slotted into arbitrary query parameters. BigQuery, lacking the big 4 parameters, is an outlier that doesn't really benefit from the use of a URL, but that's fine, so is SQLite and we still support that.

But the important part here is that URLs are for "connection parameters". That covers things like --api=https://example.com (side note: maybe host is a valid parameter after all?) and --ca_certificates_file=PATH. It absolutely does not cover --format=sparse. I think the Dadbod position would be if you want --format=sparse, you should put it in .bigqueryrc, and if you want it on a per-query basis, well, there isn't a provided solution for that. Sticking it in the query parameters goes against the grain because the whole design is built around reusing URLs from request to request (e.g., by saving them in variables), not constantly tweaking them for presentation reasons.

At most, I would be willing to tolerate allowing arbitrary query parameters for arbitrary command-line options, as this may indeed be better than nothing, but the examples in the documentation should be limited to officially sanctioned connection parameters.

What do you mean by officially sanctioned ? As in, CLI options documented by e.g. bq query --help?

My objections don't end there. BigQuery accepts 2 different kinds of command-line options: global options, and subcommand options. These are distinguished by their position either before or after the name of the subcommand. The adapter handles these two cases by accepting them all mixed together in the same namespace, and then teasing them apart with hard-coded a list of every single global option. I'm not even going to bother explaining why this is unacceptable. If we really must accept not just global but also subcommand options, then they should be in different namespaces. My proposal would be for subcommand options to follow the common [] practice for nested parameters, for example ?query[use_legacy_sql]=true. Although dropping subcommand options entirely seems like a reasonable approach too: None of them look like connection parameters to me.

Sure, do you have an existing utility that parses the nested parts of the connection string already in the repo? If you wouldn't mind sharing an example usage if that exists I'd much appreciate it 🙏

If the namespaced parameters are used, then I can drop the global list for any global/subcommand matching, and users can still specify query-scoped parameters. I think it would be better to enable this functionality, as some of them are useful to have on an ad hoc basis, as the data you're working with might be better visualized in particular ways (e.g. getting json for a particular json-encoding column, which happens a fair bit in the wild...). Sound like a plan?

tpope commented 1 year ago

At most, I would be willing to tolerate allowing arbitrary query parameters for arbitrary command-line options, as this may indeed be better than nothing, but the examples in the documentation should be limited to officially sanctioned connection parameters.

What do you mean by officially sanctioned ? As in, CLI options documented by e.g. bq query --help?

I just mean docs should avoid examples that include non-connection parameters, like format.

My objections don't end there. BigQuery accepts 2 different kinds of command-line options: global options, and subcommand options. These are distinguished by their position either before or after the name of the subcommand. The adapter handles these two cases by accepting them all mixed together in the same namespace, and then teasing them apart with hard-coded a list of every single global option. I'm not even going to bother explaining why this is unacceptable. If we really must accept not just global but also subcommand options, then they should be in different namespaces. My proposal would be for subcommand options to follow the common [] practice for nested parameters, for example ?query[use_legacy_sql]=true. Although dropping subcommand options entirely seems like a reasonable approach too: None of them look like connection parameters to me.

Sure, do you have an existing utility that parses the nested parts of the connection string already in the repo? If you wouldn't mind sharing an example usage if that exists I'd much appreciate it pray

Nothing exists for this, but not too hard to throw something together:

function! s:build_cmd(subcmd, params) abort
  let global = []
  let sub = []
  for [k, v] in items(a:params)
    let match = matchlist(k, '^\([^[]\+\)\[\(.*\)\]$')
    if empty(match)
      call add(global, '--' . k . '=' .v)
    elseif match[1] ==# a:subcmd
      call add(sub, '--' . match[1] . '=' .match[2])
    endif
  endfor
  return ['bq'] + global + [a:subcmd] + sub
endfunction

Usage: let argv = s:build_cmd('query', url.params)

From earlier:

What normally happens when you don't give a query is it starts an interactive session. Try :DB sqlite: to see what I mean. Does BigQuery support anything like this?

No, unfortunately not.

What does bq shell do?

mbhynes commented 1 year ago

Hi Tim, sorry for the delay in getting back to you, haven't had a chance to come back to this during the week.

What do you mean by officially sanctioned ?

I just mean docs should avoid examples that include non-connection parameters, like format.

Sounds good, thanks for clarifying 👍

Sure, do you have an existing utility that parses the nested parts of the connection string already in the repo? If you wouldn't mind sharing an example usage if that exists I'd much appreciate it pray

Nothing exists for this, but not too hard to throw something together:

That's very much appreciated, thank you!

From earlier:

What normally happens when you don't give a query is it starts an interactive session. Try :DB sqlite: to see what I mean. Does BigQuery support anything like this?

No, unfortunately not.

What does bq shell do?

Well this is embarrassing, but I'm afraid I didn't even know bq shell existed ... I should have looked through the bq commands more closely, but was focused on the bq query subcommand a bit too myopically! 😓

Thank you for the bq shell suggestion; I've given it a quick test in this branch by removing the #auth_input and replacing the query with shell in #filter as;

function! db#adapter#bigquery#filter(url) abort
  return s:command_for_url(a:url, 'shell')
endfunction

This is pretty promising, because it:

🍏 it does successfully read the stdin to execute queries
🍏 it also runs other bq subcommands like bq ls
🍏 it doesn't have the extra 1-2 seconds round trip for the auth

The only disadvantage that I see is that the stdout from the bq shell command includes the console characters like the following (it has 1 extra line before and after the actual query output, as a well as a [project]>>> prompt that I've blurred from this screenshot):

All in all, that's a pretty minor drawback I think, and removing the extra select 1 query is way better for the user experience than a few extra output lines IMO. If there's an adapter function that applies formatting to the command output that would be a simple fix? I'm afraid nothing is jumping out at me in the existing adapters.

If you're OK with the extra lines in the output, I'll patch this PR and remove the global options as discussed above.

tpope commented 1 year ago

You can go back from #filter to #interactive, now that you have a command that works interactively.

Regarding the cruft in the output, have you tried bq --headless=true shell? bq --quiet=true shell? Looking for other options in bq shell --help?

mbhynes commented 1 year ago

Regarding the cruft in the output, have you tried bq --headless=true shell? bq --quiet=true shell? Looking for other options in bq shell --help?

I've gone through the globals documented in bq shell --help and afraid I haven't managed to get rid of the prompt message. In this case for the bq shell, I don't think there's a flag to silence it; I went looking for the source script on my system to see where it's coming from, and in my google-cloud-sdk it's from a stdout print line in google-cloud-sdk/platform/bq/bq.py.

(Found an unofficial copy of the bq.py here) as well for reference)

tpope commented 1 year ago

A few lines up suggests there's a --prompt option. Can that be used to silence the prompt? The welcome message is noisy, but it's the prompt that really makes things janky. (And the prompt would be much harder to strip out in a post-processing step.)

mbhynes commented 1 year ago

A few lines up suggests there's a --prompt option. Can that be used to silence the prompt?

I did try playing with the --prompt option, but no luck (actually I didn't even see any effect from changing it, even independent of dadbod with e.g. bq shell --prompt='notempty'; it still used my default gcp project as a prompt, e.g. project>. Based on the code screenshot above though, I don't think there's any way to get rid of the welcome message, since it's printed to stdout independently of the REPL loop that that prompt parameter is set for...

The welcome message is noisy, but it's the prompt that really makes things janky. (And the prompt would be much harder to strip out in a post-processing step.)

I tried with a few different project ids and it seems to be consistent in that it prints out [project]> to the stdout. A project identifier has to have only alphanumeric or hyphen characters (docs), so post-processing the default prompt [project]> rule could be manageable with a regex. WDYT? (Assuming of course you're OK with just documenting the limitation that this adapter wouldn't accept a custom --prompt if it does indeed work on other people's installations of bq)

tpope commented 1 year ago

I tried with a few different project ids and it seems to be consistent in that it prints out [project]> to the stdout. A project identifier has to have only alphanumeric or hyphen characters (docs), so post-processing the default prompt [project]> rule could be manageable with a regex. WDYT?

What about this format=sparse I've heard so much about? I assume that shows raw data starting in column 1? That creates the potential for a false positive. I wouldn't be surprised if other commands could also produce arbitrary output. (Keep in mind multiple commands are supported, so you can't just limit the replace to the second line.)

But even if we punt on cleaning up the output entirely, I still think shell is so much cleaner of a fit here that that's the direction to move in. I think it also eliminates the need for those nested parameters?

(Assuming of course you're OK with just documenting the limitation that this adapter wouldn't accept a custom --prompt if it does indeed work on other people's installations of bq)

I don't think we need to document a purely hypothetical caveat.

mbhynes commented 1 year ago

What about this format=sparse I've heard so much about? I assume that shows raw data starting in column 1? That creates the potential for a false positive. I wouldn't be surprised if other commands could also produce arbitrary output.

This just removes the tabular lines demarcating cells in the the output; for example: vs format=pretty:

I wouldn't be surprised if other commands could also produce arbitrary output. (Keep in mind multiple commands are supported, so you can't just limit the replace to the second line.)

Do you mean sending multiple commands to execute to bigquery? As in, something like bq ls; bq show some.table? I tried running something like that in bq shell to confirm that it won't be executed 👍. If you mean that dadbod can execute multiple lines though, then I'm not sure what the result would be; I'm afraid you'd have to give me an example of the input for that to test it. (Any ; in the input command to bq shell creates an error).

Just to close the loop in the scope of what we're talking about here for bq subcommands (ie not just bq query) executed using bq shell, no matter what command I execute I still get the welcome message printed to stdout first, with the next line starting with the [project]> prompt (and the last line being [project]> Goodbye.). Looks to me like it would be regular enough for regex post-processing in the known 1st, 2nd, and nth lines (although I don't think it's a major blocker if the output isn't cleaned up, at least not at first just to provide something for BQ users).

But even if we punt on cleaning up the output entirely, I still think shell is so much cleaner of a fit here that that's the direction to move in. I think it also eliminates the need for those nested parameters?

Agreed, and yes it does eliminate the need for nesting the subcommand arguments, because either the commands entered into bq shell are stripped-down in terms of options or the --option=value will be parsed by bigquery. For example, I can run bigquery:/// query --priority=BATCH select 1 in dadbod and get the expected result, plus some of the extra cruft expected from --priority=BATCH:

mbhynes commented 1 year ago

@tpope if you're happy with those changes then I'll update this PR accordingly:

remove hardcoding of the global optional parameters
remove any subcommand optional parameters
use #interactive() to wrap bq shell

I'll defer to you on any post-query formatting, as I think that touches the guts of vim-dadbod.

tpope commented 1 year ago

What about this format=sparse I've heard so much about? I assume that shows raw data starting in column 1? That creates the potential for a false positive. I wouldn't be surprised if other commands could also produce arbitrary output.

This just removes the tabular lines demarcating cells in the the output; for example:

Can you show me what the output actually looks like? Edit the text buffer to redact the project name if you must, but no blurring; I can't tell what's whitespace and punctuation when you blur, and whitespace is looking mighty relevant here. If the prompt has zero whitespace before it, and format=sparse has a single whitespace before it, then indeed, that should be unambiguous.

I wouldn't be surprised if other commands could also produce arbitrary output. (Keep in mind multiple commands are supported, so you can't just limit the replace to the second line.)

Do you mean sending multiple commands to execute to bigquery? As in, something like bq ls; bq show some.table? I tried running something like that in bq shell to confirm that it won't be executed +1. If you mean that dadbod can execute multiple lines though, then I'm not sure what the result would be; I'm afraid you'd have to give me an example of the input for that to test it. (Any ; in the input command to bq shell creates an error).

Create a new file, put two query commands on 2 different lines, and use :%DB bigquery://... to send both queries to bq shell.

@tpope if you're happy with those changes then I'll update this PR accordingly:

remove hardcoding of the global optional parameters

remove any subcommand optional parameters

use #interactive() to wrap bq shell

I'll defer to you on any post-query formatting, as I think that touches the guts of vim-dadbod.

I do not consider post-processing a hard requirement for merging. It might make sense to do it as a second commit in this PR, or it might make sense to do a second PR. Either way, you're clear to implement everything in that list. I think we have the basics pinned down there. I may request minor changes that should be easy to squash in.

mbhynes commented 1 year ago

Hi @tpope, my apologies for the delay in getting back to you.

I spent some time this evening updating this PR, and I'm afraid I don't think using the bq shell command is going to work out after all.

The main problem that I've encountered with it after doing further testing is that the bq shell treats newlines as demarcating executable statements, equivalent to how a semicolon would be expected to separate commands for other databases.

As an example, I created this query in a buffer:

SELECT repo_name, sum(watch_count)
FROM `bigquery-public-data.github_repos.sample_repos` 
group by 1
order by 2 desc
limit 10

and then executed it with :%DB bigquery://.

The result is the following:

Welcome to BigQuery! (Type help for more information.)
proj> Error in query string: Error processing job 'proj:bqjob_r520e17e04e018749ca5_1': Unrecognized name: repo_name
at [1:8]
proj> *** Unknown syntax: FROM `bigquery-public-data.github_repos.sample_repos`
proj> *** Unknown syntax: group by 1
proj> *** Unknown syntax: order by 2 desc
proj> *** Unknown syntax: limit 10
proj> Goodbye.

This is pretty unexpected and makes bq shell infeasible IMO---I can't imagine anyone wanting to edit any sizeable query on 1 line, wrapping or no wrapping... FWIW I tested that:

it's also problematic to type a multiline query within the interactive bq shell console
line escapes with trailing \ don't help (the \ is treated as a literal character).

mbhynes commented 1 year ago

Re your other questions, please see below (although I think this is moot because of the no-multiline query issue):

Can you show me what the output actually looks like? Edit the text buffer to redact the project name if you must, but no blurring; I can't tell what's whitespace and punctuation when you blur, and whitespace is looking mighty relevant here. If the prompt has zero whitespace before it, and format=sparse has a single whitespace before it, then indeed, that should be unambiguous.

For sure, please see some examples below (I've done this with a "multi-line" query, ie a query that has 2 statements on 2 separate lines to show the output).

Create a new file, put two query commands on 2 different lines, and use :%DB bigquery://... to send both queries to bq shell.

An example of this is below; File:

select 1, 'a' union all select 2, 'b'
select 'another'

Output of :%DB bigquery://?format=pretty (format=pretty is the default format; it may be unspecified)

Welcome to BigQuery! (Type help for more information.)
proj> +-----+-----+
| f0_ | f1_ |
+-----+-----+
|   1 | a   |
|   2 | b   |
+-----+-----+
proj> +---------+
|   f0_   |
+---------+
| another |
+---------+
proj> Goodbye.

Example output of :%DB bigquery://?format=csv:

Welcome to BigQuery! (Type help for more information.)
proj> f0_,f1_
1,a
2,b
proj> f0_
another
proj> Goodbye.

Example output of :%DB bigquery://?format=sparse

Welcome to BigQuery! (Type help for more information.)
proj>   f0_   f1_  
----- ----- 
1   a    
2   b    
proj>     f0_    
--------- 
another  
proj> Goodbye.

tpope commented 1 year ago

The main problem that I've encountered with it after doing further testing is that the bq shell treats newlines as demarcating executable statements, equivalent to how a semicolon would be expected to separate commands for other databases.

Given that you didn't even know about multi-line queries until I brought up :%DB, is this really a deal-breaker? That is, is passing multi-line SQL queries a popular use-case?

I'd also note that #massage() could potentially be used to replace the newlines. This is trivial if a blind 100% replacement is desired, and possibly still doable for a more complicated, selective replacement.

More thoughts to come.

tpope commented 1 year ago

Given the seeming importance of project_id and dataset_id, I think they should be first class entities in the URL. Since we're not going to be making use of the authority fields like hostname, the format that seems most natural to me would be bigquery:project_id:dataset_id, with bigquery:project_id and bigquery: being acceptable truncations. Query parameters could still be optionally appended for other options. Any reason this doesn't make sense?

(There's also a strong argument for putting api in the hostname/port fields, but that raises questions about what to do about http/https, which doesn't feel worth answering for a feature I would guess has pretty limited usage.)

Setting aside the major question of shell/query, I think that's the last noteworthy change needed to get this merged.

mbhynes commented 1 year ago

Given that you didn't even know about multi-line queries until I brought up :%DB, is this really a deal-breaker? That is, is passing multi-line SQL queries a popular use-case?

Re the popular use case, definitely. Remember bigquery is an OLAP data processing system, so the kinds of queries a typical user runs against it aren't the same as an admin inspecting records in an operational DB or dev updating a tables or records in a development database. The queries one typically writes against it are units of processing logic in data warehouses---they're hundreds of lines with multiple CTEs.

Re my ignorance, please bear with me & understand that I'm not a current dadbod user. However I do have to use the bigquery console occasionally and my typical CLI-based workflow would be to edit queries in files and then use bq query < file.sql to run longer ones, if I'm not using some other kind of tool like dbt that submits the queries. My interest in dadbod is actually a means to an end for a workflow I currently don't have---ideally I'd like to get https://github.com/kristijanhusak/vim-dadbod-ui set up to use bigquery so that I can cut out any interactions I have with the bigquery console.

I'd also note that #massage() could potentially be used to replace the newlines. This is trivial if a blind 100% replacement is desired, and possibly still doable for a more complicated, selective replacement.

More thoughts to come.

Something like this? I tried patching that in but haven't had it change anything when I run % DB on a buffer. If you have some advice here, I'd be happy to hear it, as it's a little hard to navigate the branching logic in db.vim for novice vimscripter to know where this is going awry 🙏

function! db#adapter#bigquery#massage(input) abort
  return substitute(a:input, "\n", " ", "g")
endfunction

Given the seeming importance of project_id and dataset_id, I think they should be first class entities in the URL. Since we're not going to be making use of the authority fields like hostname, the format that seems most natural to me would be bigquery:project_id:dataset_id, with bigquery:project_id and bigquery: being acceptable truncations. Query parameters could still be optionally appended for other options. Any reason this doesn't make sense?

So long as the unspecified truncations are allowed, I think that's an OK for the URL spec. The dataset_id part is probably not going to be too useful for most people unless they have very a specific warehouse structure:

When a user submits a job to BigQuery, they submit it in an execution project which is in general independent of the storage projects in which datasets/tables actually sit
The project_id that one configures when interacting with BQ is that execution project and doesn't have anything to do with the database/schema/table abstraction of other database systems. It's more comparable to a cost centre that you're allowed to bill for running your query. That query may then try to access data in any project/dataset/table, and the authorization is based on the credentials of the user/agent that submits the query (authorization to read data is independent of authorization to create a query)
If the query submitted references tables without the (storage) project/dataset prefix, then BQ will assume that the storage project is the current (execution) project, and can also use a --dataset_id passed on the command line for defaults. Some queries could be made simpler by passing the :dataset_id, but in practice it just depends on the way a user's organization has modelled their tables. Using :dataset_id wouldn't have helped in any company I've worked for, but maybe it would save someone else a few keystrokes

tpope commented 1 year ago

I forgot massage() is only applied to arguments, not to file contents. I think query is back to being the filter() frontrunner. Keep shell for interactive().

To get rid of the auth delay, you could make auth_input() return v:false, and then change the consumer of that function to short-circuit in that case.

If the namespaced parameters are used, then I can drop the global list for any global/subcommand matching, and users can still specify query-scoped parameters. I think it would be better to enable this functionality, as some of them are useful to have on an ad hoc basis, as the data you're working with might be better visualized in particular ways (e.g. getting json for a particular json-encoding column, which happens a fair bit in the wild...).

Since this is back to being relevant, let me just say what you say here is equally true of every database, yet the other adapters have gotten by without it just fine. Let's stick to global arguments.

mbhynes commented 1 year ago

Hi @tpope, thanks very much for all your time and review on this so far. I believe I've addressed everything above:

[x] query is back to being the filter() frontrunner
[x] Keep shell for interactive()
[x] Make auth_input() return v:false, and then change the consumer of that function to short-circuit in that case.
[x] Accept only global args & drop the query-scoped parameters

Would very much appreciate if you'd have time for another review round 🙏

mbhynes commented 1 year ago

Hi @tpope , hoping to bump this in your inbox for when you have some time 🙏

mbhynes commented 1 year ago

Hi @tpope , gentle bump on this when you have some time 🙏

mbhynes commented 1 year ago

Hi @tpope , wondering if you would have a chance to review this again?

tpope commented 1 year ago

Thanks for your patience, I haven't had much time lately. I guess my first thought is that you're documenting Dadbod's own API in some of your comments. There's an argument for that happening, but not inside the BigQuery adapter, or in this PR.

mbhynes commented 1 year ago

Thanks for your patience, I haven't had much time lately.

No worries, thanks for getting back to me 🙏

I guess my first thought is that you're documenting Dadbod's own API in some of your comments. There's an argument for that happening, but not inside the BigQuery adapter, or in this PR.

Sure, I dropped all the doc strings here if you prefer?

mbhynes commented 1 year ago

Hi @tpope I think I addressed your latest comments about the doc strings above; hoping you could put some 👀 on again?

mbhynes commented 1 year ago

Hi @tpope Wondering if you might have some time to review this again? I know it's been a while so trying to bump this on your stack

tpope commented 1 year ago

Okay I took my own crack at this docs. I'm thinking something like this is succinct yet sufficient:

BigQuery ~
>
    bigquery:[<project_id>[:<dataset_id>]][?...]
    bigquery:?project_id=<project_id>&dataset_id=<dataset_id>[&...]
<
The first form of the URL assumes your default dataset belongs to your default
project.  If it doesn't, use the second form.

Additional query parameters can be any valid global flag, for example,
`&disable_ssl_validation=true`.  Note that subcommand flags are not supported
as query parameters, but they can be specified in `~/.bigqueryrc`.

mbhynes commented 1 year ago

SG, I'll make an update to use that, hopefully this weekend

mbhynes commented 1 year ago

Hi @tpope , I've updated the docs as per your advice. Thanks for taking a look at this whenever you have time 🙏

simanga-dev commented 1 year ago

Thank you so much for this... God bless you.

tpope / vim-dadbod

Add BigQuery Adapter #125

Summary

Issues Addressed

Implementation Notes

Questions / Notes for @tpope

(resolved) Context: `bq` authentication

(resolved) Duplicate `bq` Calls

Downstream `vim-dadbod-ui` Usage

Duplicate `bq` Calls

tpope / vim-dadbod

Add BigQuery Adapter #125

Summary

Issues Addressed

Implementation Notes

Questions / Notes for @tpope

(resolved) Context: bq authentication

(resolved) Duplicate bq Calls

Downstream vim-dadbod-ui Usage

Duplicate bq Calls

(resolved) Context: `bq` authentication

(resolved) Duplicate `bq` Calls

Downstream `vim-dadbod-ui` Usage

Duplicate `bq` Calls