Closed mbhynes closed 1 year ago
Hey @mbhynes, thank you so much for your time on this, it's really appreciated (I couldn't find time to do it on my own) ;-)
From the user's point of view, I've tried it and it works almost as expected! The only unexpected behaviour I found is when you call DB without a query:
:DB bigquery://?format=sparse&use_legacy_sql=false
then the whole screen goes blank and seems frozen, but a C-c brings everything back. To be honest, I'm not sure what other adapters do in the same situation.
Hey @mbhynes, thank you so much for your time on this, it's really appreciated (I couldn't find time to do it on my own) ;-)
No worries, thanks for giving me a starting point!
From the user's point of view, I've tried it and it works almost as expected! The only unexpected behaviour I found is when you call DB without a query:
:DB bigquery://?format=sparse&use_legacy_sql=false
then the whole screen goes blank and seems frozen, but a C-c brings everything back.
Thanks for testing this; this is definitely an issue to fix for people who need legacy SQL. It might have something to do with how I set a default dictionary of keys to merge with any user-supplied values. I'll double check how the extend() function works in Vim (I may be too used to merging dictionaries in Python and have introduced a bug).
- Invocations of the bq CLI are typically authenticated using the user's [or agent's] current Google Cloud SDK credentials, either directly from variables in the shell environment or from local configuration files.
- This environment-based authentication means that there isn't typically a need to provide user/password credentials in a connection string when connecting to BigQuery, unlike many OLTP databases (and their existing dadbod adapters).
This is all well and good. Just to cover all our bases, when you say "typically", does that mean there's an exception?
This was the root of the problem encountered previously in enh(adapater): add bigquery support #118, raised in this review thread:
- IIUC the issue stemmed from the db#connect calls, in which vim-dadbod assumes that some credentials need to be passed to the database client CLI to open an interactive session.
- Since there isn't anything to pass in, this is problematic and causes BQ to read an empty file from stdin (it's expecting a query, not interactive input).

Duplicate bq Calls
- To address the auth issue and still pass the query using stdin, I've implemented a dummy auth input that passes select 1 to bq:
This is indeed the solution intended by the existing design.
- This means that there are 2 round trips made to BQ to execute a query (which does technically confirm that the user can authenticate), and since it's an OLAP database it's pretty damn slow to do this; each query, small or large, has 1-2 seconds overhead
All in all it means that simple queries are twice as slow as they need to be. The extra 2 seconds probably doesn't matter for long-running queries, but this could definitely be improved. However, I don't know how to improve it based on my understanding of the current codebase;
- If the auth step could be skipped, that would be ideal!
- However, if skipping it requires using the #input method that provides a file, this will be problematic for bq, since it only accepts a command-line string or stdin; there is AFAIK no way to pass a file according to the docs.
Before I dig into the rest of this, can you clarify: If BigQuery hypothetically did support a filename argument, and you could provide a #input method, would that solve the problem? If so, how? (I'm not seeing how it would change things, but if I'm wrong, that would absolutely inform how I proceed here.)
This is all well and good. Just to cover all our bases, when you say "typically", does that mean there's an exception?
I don't think so---at least not one that matters for this use case. As an aside, it's also common enough to use JSON credential files that store short-lived keys when applications interact with Google Cloud using service accounts. But even if a human user wants to authenticate and run queries as a service agent, they can still do the agent auth prior to running bq and use the credentials in the environment.
Before I dig into the rest of this, can you clarify: If BigQuery hypothetically did support a filename argument, and you could provide a #input method, would that solve the problem? If so, how? (I'm not seeing how it would change things, but if I'm wrong, that would absolutely inform how I proceed here.)
No, I don't believe so. I'm sorry to have implied that; looking again at the db.vim file and running a few tests, I see that there's always a preliminary auth query sent to BigQuery, so IIUC the current implementation will always send 2 queries.
Just as an aside, between a solution using #input or #interactive, passing the path of the tmpfile containing the query with the --flagfile <filename> option does actually work as intended if the dummy auth select 1; is provided by #auth_input, i.e. if the adapter has the following 2 functions:
function! db#adapter#bigquery#auth_input() abort
return 'select 1;'
endfunction
function! db#adapter#bigquery#input(url, in) abort
return s:command_for_url(a:url, 'query') + ['--flagfile', a:in]
endfunction
However, between using stdin or the --flagfile from the previous PR, I think stdin is much safer because it's very well documented in both the CLI and on the internet as the intended way to pass a query to bq if not specified as an inline string. The --flagfile option is documented as a way to provide command-line flag arguments in a file. It happens to allow a query in a file that doesn't begin with -- to be parsed as a string, but it's not explicitly documented to work this way...
From the user's point of view, I've tried it and it works almost as expected! The only unexpected behaviour I found is when you call DB without a query:
:DB bigquery://?format=sparse&use_legacy_sql=false
then the whole screen goes blank and seems frozen, but a C-c brings everything back. To be honest, I'm not sure what other adapters do in the same situation.
@mmngreco I made the edits to remove the default to standard sql, and sample queries work for me using that adapter address! Could you please try this again on your setup to confirm it's fixed for you?
However, between using stdin or the --flagfile from the previous PR, I think stdin is much safer
The only reason to bother with #input is if it offers a distinct advantage over stdin, for example, error messages that contain the filename, or if stdin isn't supported at all. I think we can safely rule it out here.
The first and most obvious purpose of the auth check call is to detect the need for a password and subsequently prompt for it if necessary. But it also serves a more general purpose of verifying we can access the database at all, and reporting an error if not. So the first question is, does bq have another way to check for successful access that's any faster? If so, we could slot that into the existing design with minimal disruption.
To get rid of the check entirely I need you to test what happens exactly when there's an error (e.g., not signed into Google Cloud SDK) and report back. Try both with the default #auth_pattern workflow, and then again with it short-circuited, like this:
diff --git i/autoload/db.vim w/autoload/db.vim
index 2ba7ab9..04c7095 100644
--- i/autoload/db.vim
+++ w/autoload/db.vim
@@ -321,6 +321,7 @@ function! db#connect(url) abort
if has_key(s:passwords, resolved)
let url = substitute(resolved, '://[^:/@]*\zs@', ':'.db#url#encode(s:passwords[url]).'@', '')
endif
+ return url
let pattern = db#adapter#call(url, 'auth_pattern', [], 'auth\|login')
let input = tempname()
try
Hey @mbhynes, thank you so much for your time on this, it's really appreciated (I couldn't find time to do it on my own) ;-)
From the user's point of view, I've tried it and it works almost as expected! The only unexpected behaviour I found is when you call DB without a query:
:DB bigquery://?format=sparse&use_legacy_sql=false
then the whole screen goes blank and seems frozen, but a C-c brings everything back. To be honest, I'm not sure what other adapters do in the same situation.
What normally happens when you don't give a query is it starts an interactive session. Try :DB sqlite: to see what I mean. Does BigQuery support anything like this?
What normally happens when you don't give a query is it starts an interactive session. Try :DB sqlite: to see what I mean. Does BigQuery support anything like this?
No, unfortunately not.
So the first question is, does bq have another way to check for successful access that's any faster? If so, we could slot that into the existing design with minimal disruption.
Possibly; there's a flag that tells bq to validate but not run a query. I can look at this and dig into the docs for bq or gcloud later to see if there's a way to do this that cuts down on the round-trip time.
To get rid of the check entirely I need you to test what happens exactly when there's an error (e.g., not signed into Google Cloud SDK) and report back. Try both with the default #auth_pattern workflow, and then again with it short circuited, like this:
diff --git i/autoload/db.vim w/autoload/db.vim
index 2ba7ab9..04c7095 100644
--- i/autoload/db.vim
+++ w/autoload/db.vim
@@ -321,6 +321,7 @@ function! db#connect(url) abort
if has_key(s:passwords, resolved)
let url = substitute(resolved, '://[^:/@]*\zs@', ':'.db#url#encode(s:passwords[url]).'@', '')
endif
+ return url
let pattern = db#adapter#call(url, 'auth_pattern', [], 'auth\|login')
let input = tempname()
try
Thanks for this suggestion; I'll take a look and get back to you with more info.
What normally happens when you don't give a query is it starts an interactive session. Try :DB sqlite: to see what I mean. Does BigQuery support anything like this?

No, unfortunately not.
If interactive sessions aren't supported, then #interactive should not be provided. Instead, rename it to #filter.
To get rid of the check entirely I need you to test what happens exactly when there's an error (e.g., not signed into Google Cloud SDK) and report back. Try both with the default #auth_pattern workflow, and then again with it short circuited, like this [...]
Hi Tim, thanks for suggesting this, and please find the results below:
Bad query with dummy #auth_input():
- I used a project_id that is invalid or to which I don't have access: :DB bigquery:///?format=sparse&project_id=bad select * from `region-us`.INFORMATION_SCHEMA.TABLES;
- This produces the error message from bq, displayed in red at the bottom of the window: [screenshot]

Bad query with the above diff applied:
- I applied the diff with git apply and then ran the same query: [screenshot]

The second case is what happens normally off the head of this branch (without the diff applied) if I run an invalid query. Using a typo in the query on purpose, I get the same result as case 2 (with a different error message of course), which makes sense since the dummy auth select 1 query will have succeeded, allowing the 2nd query with the typo to run:
:DB bigquery:/// select * from `region-us`.INFORMATION_SCHEMA.TABLESs;
Possibly; there's a flag that tells bq to validate but not run a query. I can look at this and dig into the docs for bq or gcloud later to see if there's a way to do this that cuts down on the round-trip time.
This doesn't look promising; there's a flag to validate the query, but bq query --dry_run 'select 1' takes just as long for me as bq query 'select 1'
What about other commands, like bq show or bq list?
Unrelated to anything above, I would like to push back against the use of query parameters here. URLs are used by Dadbod to encapsulate connection parameters. They are extremely well suited to the purpose: user, password, host, and port map cleanly onto 4 common database parameters; the database name conceptually fits in the path field; and database-specific odds-and-ends can be slotted into arbitrary query parameters. BigQuery, lacking the big 4 parameters, is an outlier that doesn't really benefit from the use of a URL, but that's fine, so is SQLite and we still support that.
But the important part here is that URLs are for "connection parameters". That covers things like --api=https://example.com (side note: maybe host is a valid parameter after all?) and --ca_certificates_file=PATH. It absolutely does not cover --format=sparse. I think the Dadbod position would be if you want --format=sparse, you should put it in .bigqueryrc, and if you want it on a per-query basis, well, there isn't a provided solution for that. Sticking it in the query parameters goes against the grain because the whole design is built around reusing URLs from request to request (e.g., by saving them in variables), not constantly tweaking them for presentation reasons.
At most, I would be willing to tolerate allowing arbitrary query parameters for arbitrary command-line options, as this may indeed be better than nothing, but the examples in the documentation should be limited to officially sanctioned connection parameters.
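For reference, the usual home for such defaults is ~/.bigqueryrc; going from the bq docs, global flags go at the top and command-scoped flags go under a [command] section. A minimal sketch (the particular flags below are just illustrative):

```
--format=sparse

[query]
--use_legacy_sql=false
```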
My objections don't end there. BigQuery accepts 2 different kinds of command-line options: global options, and subcommand options. These are distinguished by their position either before or after the name of the subcommand. The adapter handles these two cases by accepting them all mixed together in the same namespace, and then teasing them apart with a hard-coded list of every single global option. I'm not even going to bother explaining why this is unacceptable. If we really must accept not just global but also subcommand options, then they should be in different namespaces. My proposal would be for subcommand options to follow the common [] practice for nested parameters, for example ?query[use_legacy_sql]=true. Although dropping subcommand options entirely seems like a reasonable approach too: None of them look like connection parameters to me.
What about other commands, like bq show or bq list?
Those commands will still have the extra few seconds of overhead, and won't, strictly speaking, be able to guarantee that a subsequent query can be run successfully, because of the nature of IAM permissions for user accounts. A user may be able to successfully list a dataset's tables, but that doesn't necessarily mean that the user can query those tables or launch a job. The access control is fine-grained and depends on the policies of the org you're working in.
For more context too, it's pretty seldom that a user has to actually authenticate to BigQuery. You might run gcloud auth login when you need to set the user account you're working with and obtain an access token, but unless you change that user, you probably won't change that token or re-run the auth for a long time. The token authentication is pretty different from authentication to a lot of OLTP DBs.
Unrelated to anything above, I would like to push back against the use of query parameters here. URLs are used by Dadbod to encapsulate connection parameters. They are extremely well suited to the purpose: user, password, host, and port map cleanly onto 4 common database parameters; the database name conceptually fits in the path field; and database-specific odds-and-ends can be slotted into arbitrary query parameters. BigQuery, lacking the big 4 parameters, is an outlier that doesn't really benefit from the use of a URL, but that's fine, so is SQLite and we still support that.
But the important part here is that URLs are for "connection parameters". That covers things like --api=https://example.com (side note: maybe host is a valid parameter after all?) and --ca_certificates_file=PATH. It absolutely does not cover --format=sparse. I think the Dadbod position would be if you want --format=sparse, you should put it in .bigqueryrc, and if you want it on a per-query basis, well, there isn't a provided solution for that. Sticking it in the query parameters goes against the grain because the whole design is built around reusing URLs from request to request (e.g., by saving them in variables), not constantly tweaking them for presentation reasons.

At most, I would be willing to tolerate allowing arbitrary query parameters for arbitrary command-line options, as this may indeed be better than nothing, but the examples in the documentation should be limited to officially sanctioned connection parameters.
What do you mean by officially sanctioned? As in, CLI options documented by e.g. bq query --help?
My objections don't end there. BigQuery accepts 2 different kinds of command-line options: global options, and subcommand options. These are distinguished by their position either before or after the name of the subcommand. The adapter handles these two cases by accepting them all mixed together in the same namespace, and then teasing them apart with a hard-coded list of every single global option. I'm not even going to bother explaining why this is unacceptable. If we really must accept not just global but also subcommand options, then they should be in different namespaces. My proposal would be for subcommand options to follow the common [] practice for nested parameters, for example ?query[use_legacy_sql]=true. Although dropping subcommand options entirely seems like a reasonable approach too: None of them look like connection parameters to me.
Sure, do you have an existing utility that parses the nested parts of the connection string already in the repo? If you wouldn't mind sharing an example usage, if that exists, I'd much appreciate it 🙏
If the namespaced parameters are used, then I can drop the global list for any global/subcommand matching, and users can still specify query-scoped parameters. I think it would be better to enable this functionality, as some of them are useful to have on an ad hoc basis, as the data you're working with might be better visualized in particular ways (e.g. getting json output for a particular JSON-encoded column, which happens a fair bit in the wild...). Sound like a plan?
At most, I would be willing to tolerate allowing arbitrary query parameters for arbitrary command-line options, as this may indeed be better than nothing, but the examples in the documentation should be limited to officially sanctioned connection parameters.
What do you mean by officially sanctioned? As in, CLI options documented by e.g. bq query --help?
I just mean docs should avoid examples that include non-connection parameters, like format.
My objections don't end there. BigQuery accepts 2 different kinds of command-line options: global options, and subcommand options. These are distinguished by their position either before or after the name of the subcommand. The adapter handles these two cases by accepting them all mixed together in the same namespace, and then teasing them apart with a hard-coded list of every single global option. I'm not even going to bother explaining why this is unacceptable. If we really must accept not just global but also subcommand options, then they should be in different namespaces. My proposal would be for subcommand options to follow the common [] practice for nested parameters, for example ?query[use_legacy_sql]=true. Although dropping subcommand options entirely seems like a reasonable approach too: None of them look like connection parameters to me.

Sure, do you have an existing utility that parses the nested parts of the connection string already in the repo? If you wouldn't mind sharing an example usage, if that exists, I'd much appreciate it 🙏
Nothing exists for this, but not too hard to throw something together:
function! s:build_cmd(subcmd, params) abort
  " Split params into global flags and subcommand-scoped flags,
  " where subcommand flags use the nested form: subcmd[flag]=value.
  let global = []
  let sub = []
  for [k, v] in items(a:params)
    let match = matchlist(k, '^\([^[]\+\)\[\(.*\)\]$')
    if empty(match)
      call add(global, '--' . k . '=' . v)
    elseif match[1] ==# a:subcmd
      call add(sub, '--' . match[2] . '=' . v)
    endif
  endfor
  return ['bq'] + global + [a:subcmd] + sub
endfunction
Usage: let argv = s:build_cmd('query', url.params)
From earlier:
What normally happens when you don't give a query is it starts an interactive session. Try :DB sqlite: to see what I mean. Does BigQuery support anything like this?

No, unfortunately not.
What does bq shell do?
Hi Tim, sorry for the delay in getting back to you, haven't had a chance to come back to this during the week.
What do you mean by officially sanctioned?
I just mean docs should avoid examples that include non-connection parameters, like format.
Sounds good, thanks for clarifying!
Sure, do you have an existing utility that parses the nested parts of the connection string already in the repo? If you wouldn't mind sharing an example usage, if that exists, I'd much appreciate it 🙏
Nothing exists for this, but not too hard to throw something together:
That's very much appreciated, thank you!
From earlier:
What normally happens when you don't give a query is it starts an interactive session. Try :DB sqlite: to see what I mean. Does BigQuery support anything like this?

No, unfortunately not.

What does bq shell do?
Well this is embarrassing, but I'm afraid I didn't even know bq shell existed... I should have looked through the bq commands more closely, but was focused on the bq query subcommand a bit too myopically!

Thank you for the bq shell suggestion; I've given it a quick test in this branch by removing the #auth_input and replacing query with shell in #filter as:
function! db#adapter#bigquery#filter(url) abort
return s:command_for_url(a:url, 'shell')
endfunction
This is pretty promising, because it:
- removes the need for the dummy select 1 auth query, and
- works with other bq subcommands like bq ls

The only disadvantage that I see is that the stdout from the bq shell command includes console characters like the following (it has 1 extra line before and after the actual query output, as well as a [project]>>> prompt that I've blurred from this screenshot):
All in all, that's a pretty minor drawback I think, and removing the extra select 1 query is way better for the user experience than a few extra output lines IMO. If there's an adapter function that applies formatting to the command output, that would be a simple fix? I'm afraid nothing is jumping out at me in the existing adapters.
If you're OK with the extra lines in the output, I'll patch this PR and remove the global options as discussed above.
You can go back from #filter to #interactive, now that you have a command that works interactively.
Regarding the cruft in the output, have you tried bq --headless=true shell? bq --quiet=true shell? Looking for other options in bq shell --help?
Regarding the cruft in the output, have you tried bq --headless=true shell? bq --quiet=true shell? Looking for other options in bq shell --help?
I've gone through the globals documented in bq shell --help and I'm afraid I haven't managed to get rid of the prompt message. In this case for bq shell, I don't think there's a flag to silence it; I went looking for the source script on my system to see where it's coming from, and in my google-cloud-sdk it's from a stdout print line in google-cloud-sdk/platform/bq/bq.py.
(Found an unofficial copy of bq.py here as well, for reference.)
A few lines up suggests there's a --prompt option. Can that be used to silence the prompt? The welcome message is noisy, but it's the prompt that really makes things janky. (And the prompt would be much harder to strip out in a post-processing step.)
A few lines up suggests there's a --prompt option. Can that be used to silence the prompt?
I did try playing with the --prompt option, but no luck (actually I didn't even see any effect from changing it, even independent of dadbod with e.g. bq shell --prompt='notempty'; it still used my default GCP project as a prompt, e.g. project>). Based on the code screenshot above though, I don't think there's any way to get rid of the welcome message, since it's printed to stdout independently of the REPL loop that the prompt parameter is set for...
The welcome message is noisy, but it's the prompt that really makes things janky. (And the prompt would be much harder to strip out in a post-processing step.)
I tried with a few different project ids and it seems to be consistent in that it prints out [project]> to the stdout. A project identifier has to have only alphanumeric or hyphen characters (docs), so post-processing the default [project]> prompt could be manageable with a regex. WDYT? (Assuming of course you're OK with just documenting the limitation that this adapter wouldn't accept a custom --prompt, if it does indeed work on other people's installations of bq.)
I tried with a few different project ids and it seems to be consistent in that it prints out [project]> to the stdout. A project identifier has to have only alphanumeric or hyphen characters (docs), so post-processing the default [project]> prompt could be manageable with a regex. WDYT?
What about this format=sparse I've heard so much about? I assume that shows raw data starting in column 1? That creates the potential for a false positive. I wouldn't be surprised if other commands could also produce arbitrary output. (Keep in mind multiple commands are supported, so you can't just limit the replace to the second line.)
But even if we punt on cleaning up the output entirely, I still think shell is so much cleaner of a fit here that that's the direction to move in. I think it also eliminates the need for those nested parameters?
(Assuming of course you're OK with just documenting the limitation that this adapter wouldn't accept a custom --prompt, if it does indeed work on other people's installations of bq.)
I don't think we need to document a purely hypothetical caveat.
What about this format=sparse I've heard so much about? I assume that shows raw data starting in column 1? That creates the potential for a false positive. I wouldn't be surprised if other commands could also produce arbitrary output.
This just removes the tabular lines demarcating cells in the output; for example:

[screenshot of format=sparse output]

vs format=pretty:

[screenshot]
I wouldn't be surprised if other commands could also produce arbitrary output. (Keep in mind multiple commands are supported, so you can't just limit the replace to the second line.)
Do you mean sending multiple commands to execute to BigQuery? As in, something like bq ls; bq show some.table? I tried running something like that in bq shell to confirm that it won't be executed 👍. If you mean that dadbod can execute multiple lines though, then I'm not sure what the result would be; I'm afraid you'd have to give me an example of the input for that to test it. (Any ; in the input command to bq shell creates an error.)
Just to close the loop in the scope of what we're talking about here for bq subcommands (i.e. not just bq query) executed using bq shell: no matter what command I execute, I still get the welcome message printed to stdout first, with the next line starting with the [project]> prompt (and the last line being [project]> Goodbye.). Looks to me like it would be regular enough for regex post-processing of the known 1st, 2nd, and nth lines (although I don't think it's a major blocker if the output isn't cleaned up, at least not at first, just to provide something for BQ users).
But even if we punt on cleaning up the output entirely, I still think shell is so much cleaner of a fit here that that's the direction to move in. I think it also eliminates the need for those nested parameters?
Agreed, and yes it does eliminate the need for nesting the subcommand arguments, because either the commands entered into bq shell are stripped down in terms of options, or the --option=value will be parsed by BigQuery. For example, I can run bigquery:/// query --priority=BATCH select 1 in dadbod and get the expected result, plus some of the extra cruft expected from --priority=BATCH:
@tpope if you're happy with those changes then I'll update this PR accordingly:
- remove hardcoding of the global optional parameters
- remove any subcommand optional parameters
- use #interactive() to wrap bq shell

I'll defer to you on any post-query formatting, as I think that touches the guts of vim-dadbod.
What about this format=sparse I've heard so much about? I assume that shows raw data starting in column 1? That creates the potential for a false positive. I wouldn't be surprised if other commands could also produce arbitrary output.

This just removes the tabular lines demarcating cells in the output; for example:
Can you show me what the output actually looks like? Edit the text buffer to redact the project name if you must, but no blurring; I can't tell what's whitespace and punctuation when you blur, and whitespace is looking mighty relevant here. If the prompt has zero whitespace before it, and format=sparse has a single whitespace before it, then indeed, that should be unambiguous.
I wouldn't be surprised if other commands could also produce arbitrary output. (Keep in mind multiple commands are supported, so you can't just limit the replace to the second line.)
Do you mean sending multiple commands to execute to BigQuery? As in, something like bq ls; bq show some.table? I tried running something like that in bq shell to confirm that it won't be executed 👍. If you mean that dadbod can execute multiple lines though, then I'm not sure what the result would be; I'm afraid you'd have to give me an example of the input for that to test it. (Any ; in the input command to bq shell creates an error.)
Create a new file, put two query commands on 2 different lines, and use :%DB bigquery://... to send both queries to bq shell.
@tpope if you're happy with those changes then I'll update this PR accordingly:
- remove hardcoding of the global optional parameters
- remove any subcommand optional parameters
- use #interactive() to wrap bq shell

I'll defer to you on any post-query formatting, as I think that touches the guts of vim-dadbod.
I do not consider post-processing a hard requirement for merging. It might make sense to do it as a second commit in this PR, or it might make sense to do a second PR. Either way, you're clear to implement everything in that list. I think we have the basics pinned down there. I may request minor changes that should be easy to squash in.
Hi @tpope, my apologies for the delay in getting back to you.
I spent some time this evening updating this PR, and I'm afraid I don't think using the bq shell command is going to work out after all.

The main problem that I've encountered after doing further testing is that bq shell treats newlines as demarcating executable statements, equivalent to how a semicolon would be expected to separate commands for other databases.
As an example, I created this query in a buffer:
SELECT repo_name, sum(watch_count)
FROM `bigquery-public-data.github_repos.sample_repos`
group by 1
order by 2 desc
limit 10
and then executed it with :%DB bigquery://.
The result is the following:
Welcome to BigQuery! (Type help for more information.)
proj> Error in query string: Error processing job 'proj:bqjob_r520e17e04e018749ca5_1': Unrecognized name: repo_name
at [1:8]
proj> *** Unknown syntax: FROM `bigquery-public-data.github_repos.sample_repos`
proj> *** Unknown syntax: group by 1
proj> *** Unknown syntax: order by 2 desc
proj> *** Unknown syntax: limit 10
proj> Goodbye.
This is pretty unexpected and makes bq shell infeasible IMO---I can't imagine anyone wanting to edit any sizeable query on 1 line, wrapping or no wrapping...
FWIW I tested that:
- the bq shell console flags don't help
- line continuations with a trailing \ don't help (the \ is treated as a literal character)

Re your other questions, please see below (although I think this is moot because of the no-multiline query issue):
Can you show me what the output actually looks like? Edit the text buffer to redact the project name if you must, but no blurring; I can't tell what's whitespace and punctuation when you blur, and whitespace is looking mighty relevant here. If the prompt has zero whitespace before it, and format=sparse has a single whitespace before it, then indeed, that should be unambiguous.
For sure, please see some examples below (I've done this with a "multi-line" query, i.e. a file that has 2 statements on 2 separate lines, to show the output).
Create a new file, put two query commands on two different lines, and use `:%DB bigquery://...` to send both queries to `bq shell`. An example of this is below.

File:

```sql
select 1, 'a' union all select 2, 'b'
select 'another'
```
Output of `:%DB bigquery://?format=pretty` (`format=pretty` is the default format; it may be left unspecified):

```
Welcome to BigQuery! (Type help for more information.)
proj> +-----+-----+
| f0_ | f1_ |
+-----+-----+
|   1 | a   |
|   2 | b   |
+-----+-----+
proj> +---------+
|   f0_   |
+---------+
| another |
+---------+
proj> Goodbye.
```
Example output of `:%DB bigquery://?format=csv`:

```
Welcome to BigQuery! (Type help for more information.)
proj> f0_,f1_
1,a
2,b
proj> f0_
another
proj> Goodbye.
```
Example output of `:%DB bigquery://?format=sparse`:

```
Welcome to BigQuery! (Type help for more information.)
proj> f0_ f1_
----- -----
1 a
2 b
proj> f0_
---------
another
proj> Goodbye.
```
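As an aside on the prompt-ambiguity question: if the interactive prompt ever needed to be stripped from captured output, a rough post-processing sketch could look like the following (hypothetical; the prompt string varies by project, so this is not adapter code):

```shell
# Strip a leading "proj> " prompt from captured bq shell output.
# "proj" is a placeholder project name; a real filter would need to
# know (or detect) the actual prompt string.
printf '%s\n' 'proj> f0_,f1_' '1,a' '2,b' 'proj> Goodbye.' \
  | sed 's/^proj> //'
```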
The main problem that I've encountered with it after doing further testing is that `bq shell` treats newlines as demarcating executable statements, equivalent to how a semicolon would be expected to separate commands for other databases.
Given that you didn't even know about multi-line queries until I brought up `:%DB`, is this really a deal-breaker? That is, is passing multi-line SQL queries a popular use-case?
I'd also note that `#massage()` could potentially be used to replace the newlines. This is trivial if a blind 100% replacement is desired, and possibly still doable for a more complicated, selective replacement.
More thoughts to come.
Given the seeming importance of `project_id` and `dataset_id`, I think they should be first class entities in the URL. Since we're not going to be making use of the authority fields like hostname, the format that seems most natural to me would be `bigquery:project_id:dataset_id`, with `bigquery:project_id` and `bigquery:` being acceptable truncations. Query parameters could still be optionally appended for other options. Any reason this doesn't make sense?
(There's also a strong argument for putting `api` in the hostname/port fields, but that raises questions about what to do about `http`/`https`, which doesn't feel worth answering for a feature I would guess has pretty limited usage.)
Setting aside the major question of `shell`/`query`, I think that's the last noteworthy change needed to get this merged.
Given that you didn't even know about multi-line queries until I brought up `:%DB`, is this really a deal-breaker? That is, is passing multi-line SQL queries a popular use-case?
Re the popular use case: definitely. Remember BigQuery is an OLAP data processing system, so the kinds of queries a typical user runs against it aren't the same as an admin inspecting records in an operational DB or a dev updating tables or records in a development database. The queries one typically writes against it are units of processing logic in data warehouses: they're hundreds of lines with multiple CTEs.
Re my ignorance, please bear with me and understand that I'm not a current dadbod user. However, I do have to use the BigQuery console occasionally, and my typical CLI-based workflow would be to edit queries in files and then use `bq query < file.sql` to run longer ones, if I'm not using some other kind of tool like `dbt` that submits the queries. My interest in `dadbod` is actually a means to an end for a workflow I currently don't have: ideally I'd like to get https://github.com/kristijanhusak/vim-dadbod-ui set up to use BigQuery so that I can cut out any interactions I have with the BigQuery console.
I'd also note that `#massage()` could potentially be used to replace the newlines. This is trivial if a blind 100% replacement is desired, and possibly still doable for a more complicated, selective replacement.
More thoughts to come.
Something like this? I tried patching that in but haven't had it change anything when I run `:%DB` on a buffer. If you have some advice here, I'd be happy to hear it, as it's a little hard for a novice vimscripter to navigate the branching logic in `db.vim` and see where this is going awry.
```vim
" Replace all newlines with spaces so the whole buffer reads as one statement
function! db#adapter#bigquery#massage(input) abort
  return substitute(a:input, "\n", " ", "g")
endfunction
```
Given the seeming importance of project_id and dataset_id, I think they should be first class entities in the URL. Since we're not going to be making use of the authority fields like hostname, the format that seems most natural to me would be bigquery:project_id:dataset_id, with bigquery:project_id and bigquery: being acceptable truncations. Query parameters could still be optionally appended for other options. Any reason this doesn't make sense?
So long as the unspecified truncations are allowed, I think that's OK for the URL spec. The `dataset_id` part is probably not going to be too useful for most people unless they have a very specific warehouse structure:
- The `project_id` that one configures when interacting with BQ is the execution project and doesn't have anything to do with the database/schema/table abstraction of other database systems. It's more comparable to a cost centre that you're allowed to bill for running your query. That query may then try to access data in any project/dataset/table, and the authorization is based on the credentials of the user/agent that submits the query (authorization to read data is independent of authorization to create a query).
- The `dataset_id` is passed on the command line for defaults. Some queries could be made simpler by passing the `:dataset_id`, but in practice it just depends on the way a user's organization has modelled their tables. Using `:dataset_id` wouldn't have helped in any company I've worked for, but maybe it would save someone else a few keystrokes.

I forgot `massage()` is only applied to arguments, not to file contents. I think `query` is back to being the `filter()` frontrunner. Keep `shell` for `interactive()`.
To get rid of the auth delay, you could make `auth_input()` return `v:false`, and then change the consumer of that function to short-circuit in that case.
If the namespaced parameters are used, then I can drop the global list for any global/subcommand matching, and users can still specify query-scoped parameters. I think it would be better to enable this functionality, as some of them are useful to have on an ad hoc basis, as the data you're working with might be better visualized in particular ways (e.g. getting `json` for a particular JSON-encoded column, which happens a fair bit in the wild...).
Since this is back to being relevant, let me just say what you say here is equally true of every database, yet the other adapters have gotten by without it just fine. Let's stick to global arguments.
Hi @tpope, thanks very much for all your time and review on this so far. I believe I've addressed everything above:
- `query` is back to being the `filter()` frontrunner
- `shell` is kept for `interactive()`
- `auth_input()` returns `v:false`, and the consumer of that function short-circuits in that case

Would very much appreciate if you'd have time for another review round.
Hi @tpope, hoping to bump this in your inbox for when you have some time.
Hi @tpope, gentle bump on this when you have some time.
Hi @tpope, wondering if you would have a chance to review this again?
Thanks for your patience, I haven't had much time lately. I guess my first thought is that you're documenting Dadbod's own API in some of your comments. There's an argument for that happening, but not inside the BigQuery adapter, or in this PR.
Thanks for your patience, I haven't had much time lately.
No worries, thanks for getting back to me π
I guess my first thought is that you're documenting Dadbod's own API in some of your comments. There's an argument for that happening, but not inside the BigQuery adapter, or in this PR.
Sure, I've dropped all the doc strings here; is that what you prefer?
Hi @tpope, I think I addressed your latest comments about the doc strings above; hoping you could take another look.
Hi @tpope, wondering if you might have some time to review this again? I know it's been a while, so trying to bump this on your stack.
Okay, I took my own crack at the docs. I'm thinking something like this is succinct yet sufficient:
```
BigQuery ~
>
	bigquery:[<project_id>[:<dataset_id>]][?...]
	bigquery:?project_id=<project_id>&dataset_id=<dataset_id>[&...]
<
The first form of the URL assumes your default dataset belongs to your default
project. If it doesn't, use the second form.

Additional query parameters can be any valid global flag, for example,
`&disable_ssl_validation=true`. Note that subcommand flags are not supported
as query parameters, but they can be specified in `~/.bigqueryrc`.
```
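Purely to illustrate the proposed URL grammar, the components of the two forms can be split apart like so in shell (hypothetical sketch with my own variable names; the real adapter would do this parsing in Vim script):

```shell
# Decompose bigquery:[<project_id>[:<dataset_id>]][?...] into its parts.
# Sketch only: no validation, and the second (query-parameter) form would
# need separate handling of project_id=/dataset_id= keys.
url='bigquery:myproj:mydataset?format=sparse'
rest=${url#bigquery:}
params=''
case $rest in (*\?*) params=${rest#*\?}; rest=${rest%%\?*};; esac
project_id=${rest%%:*}
dataset_id=''
case $rest in (*:*) dataset_id=${rest#*:};; esac
echo "$project_id $dataset_id $params"
```

With the truncated forms (`bigquery:myproj` or `bigquery:`), `dataset_id` and `project_id` simply come out empty, matching the "acceptable truncations" in the spec above.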
SG, I'll make an update to use that, hopefully this weekend
Hi @tpope, I've updated the docs as per your advice. Thanks for taking a look at this whenever you have time.
Thank you so much for this... God bless you.
## Summary
- Adds a BigQuery adapter that invokes the `bq` CLI tool

## Issues Addressed

## Implementation Notes
- Uses the `bq query` command and passes the buffer query to `stdin`

## Questions / Notes for @tpope

### (resolved) Context: `bq` authentication
- Invocations of the `bq` CLI are typically authenticated using the user's [or agent's] current Google Cloud SDK credentials, either directly from variables in the shell environment or from local configuration files
- This environment-based authentication means that there isn't typically a need to provide `user/password` credentials in a connection string when connecting to BigQuery, unlike many OLTP databases (and their existing dadbod adapters)
- This is a problem for the `db#connect` calls in which `vim-dadbod` assumes that some credentials need to be passed to the database client CLI to open an interactive session; `bq` can't handle that input on `stdin` (it's expecting a query, not interactive input)

### (resolved) Duplicate `bq` Calls
To address the auth issue and still pass the query using `stdin`, I've implemented a dummy auth input that passes `select 1` to `bq`. This means that there are 2 round trips made to BQ to execute a query (which does technically confirm that the user can authenticate), and since it's an OLAP database it's pretty damn slow to do this; each query, small or large, has 1-2 seconds of overhead.
All in all it means that simple queries are twice as slow as they need to be. The extra 2 seconds probably doesn't matter for long-running queries, but this could definitely be improved. However, I don't know how to improve it based on my understanding of the current codebase; an `#inputs` method that provides a file would be problematic for `bq`, since it only accepts a command-line string or `stdin`; there is AFAIK no way to pass a file according to the docs.

### Downstream `vim-dadbod-ui` Usage
- Used downstream with `dadbod-ui` as per https://github.com/kristijanhusak/vim-dadbod-ui/pull/160