hafenkran / duckdb-bigquery

Integrates DuckDB with Google BigQuery, allowing direct querying and management of BigQuery datasets
MIT License

Add bigquery secret parsing for authentication #18

Open Kayrnt opened 2 months ago

Kayrnt commented 2 months ago

Closes #8

Still a work in progress.

Kayrnt commented 2 months ago

@hafenkran, I made some progress today: I've parsed the auth info and added it to the BigQuery client. I now need to add something like the following to all relevant GCP options:

auto options = google::cloud::Options{}
    .set<google::cloud::UnifiedCredentialsOption>(
        google::cloud::MakeServiceAccountCredentials(service_account_json));

before calling MakeBigQueryReadConnection (and probably in more places for the metadata calls). I'll try to deal with that by the end of the weekend.

Kayrnt commented 2 months ago

Alright, I finished the implementation of #9 as well (it was handy for testing the secret support too).

I only need to add integration tests.

hafenkran commented 2 months ago

Hey @Kayrnt, I'm back and just saw your PR. Thanks! I'll have a closer look later I guess. Any progress on the missing integration tests?

github-christophe-oudar commented 2 months ago

Not yet 🙄 Maybe this weekend, but I'm not sure yet how I'll do it, since it would require a JSON secret to be prepared... I guess you'd also need that secret set up in the GitHub Actions for the tests to pass, right?

I sneaked in the execution_project config as well; that one is pretty easy though 👍

hafenkran commented 2 months ago

The tests in the GitHub Actions are already running with my secrets. For the billing project case, I guess for now it is sufficient if you provide a test and use the require-env logic for the billing project. If this env variable isn't provided, the test gets skipped automatically.
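
For illustration, the pattern looks roughly like this in a .test file (the extension and variable names below are only illustrative, not the actual ones used in this repo):

require bigquery

require-env BIGQUERY_BILLING_PROJECT

# everything below only runs when BIGQUERY_BILLING_PROJECT is set;
# otherwise the whole test file is skipped automatically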

github-christophe-oudar commented 2 months ago

OK, I'll have a look later today or during the weekend; I have a busy backlog :)

hafenkran commented 2 months ago

No pressure

Kayrnt commented 2 months ago
===============================================================================
All tests passed (4 assertions in 1 test case)

All good on my end for those new tests. I didn't run all the others; I didn't have the time 🙈 Maybe another day (or you might want to check too!)

Kayrnt commented 2 months ago

@hafenkran, any update on reviewing this PR?

hafenkran commented 1 month ago

@Kayrnt: Now – finally! Thank you very much for your proposal and your effort in the implementation. I have to apologize for the delay in reviewing – I was unfortunately temporarily unavailable over the past few weeks. I have a few comments though:

BigQuery Secrets:

Execution Project:

Lastly, I’d appreciate it if future feature PRs are fully tested with some coverage. It’s really important to ensure thorough testing to avoid missing potential issues. Unfortunately, I won’t be able to take over the testing for you.

Hope this helps! ;)

Kayrnt commented 1 month ago

Hello @hafenkran,

Thank you for picking up the PR.

I had a different approach in mind, as we discussed earlier in the ticket. It wasn't about simply replacing a well-functioning method with another one, but rather about exploring alternatives on how auth could be used. I'm generally not a big fan of adding a wide range of features that may be redundant. I tend to prioritize simplicity to reduce maintenance as much as possible.

This feature introduces a way to authenticate as another service account from your own account (whether or not you have the right to impersonate it). It's a convenient way to test too. If you had another idea in mind, it wasn't clear from our discussion in #8 what you would do.

Have you considered any other approaches for authentication? We could use gcloud auth activate-service-account, but that would need disk access to write the JSON, whereas secrets can be used purely from SQL within a shell, or through environment variables.
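
For instance, the SQL-only flow could look roughly like this (a sketch only; the secret type and the service_account_json parameter name follow this PR, everything else is illustrative):

CREATE SECRET bq_secret (
    TYPE bigquery,
    service_account_json '{"type": "service_account", "project_id": "...", "private_key": "...", ...}'
);
ATTACH 'project=some_project' AS bq (TYPE bigquery);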

The test is quite minimal and doesn’t fully cover different scenarios. I also believe it’s important to include edge cases and situations that should fail, to ensure the implementation is robust.

Yes, it could do more, but I didn't want to overload the test, which should probably do the same things as the regular auth test, since the tests are already very slow. We could add failure cases like "invalid JSON" or "forbidden access", for instance.

I see that you've started working on the execution project, but unfortunately, the implementation is incomplete. For example, DML and DDL queries are not covered, and critical functions like bigquery_scan and others aren’t integrated.

Fair point, I didn't double-check all cases (especially DDL/DML) and wanted to have a first review to see if we agreed on the way to implement it.

E.g. bigquery_scan: The current implementation isn't designed to work with an execution project in its current form, because the execution project's data isn't read anywhere in the code, and the current way the bigquery_scan function operates doesn't allow passing an execution project. Instead, simple project/dataset/table strings are used. This is also why public datasets can't be accessed at the moment.

As you can see on #9, I only mentioned ATTACH support in the scope of that change, as I focused on the "storage approach" of the FDW. I personally feel those table functions are a bit awkward, but I understand you'd like to support them; that means we need to align on the approach, which we haven't done so far (e.g. adding optional parameters to the table function).
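
For context, the scan currently only receives the dotted project.dataset.table string (identifiers below are illustrative), so there is nowhere to thread an execution project through:

SELECT * FROM bigquery_scan('my_project.my_dataset.my_table');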

I also noticed that there’s a lack of proper tests for this part of the code.

It is tested as part of the new test file, but indeed I can add all kinds of statements to improve the coverage very quickly.

(Unfortunately, I need to take over this task right now since it has become critical for my current work).

So you're implementing that feature in another PR? I see you added billing_project_id in the latest PR, but it doesn't look like it's linked to anything yet.

Lastly, I’d appreciate it if future feature PRs are fully tested with some coverage.

Sure, what do you think about adding the BigQuery emulator to speed up testing? Test execution is very tedious against BigQuery.

It’s really important to ensure thorough testing to avoid missing potential issues. Unfortunately, I won’t be able to take over the testing for you.

I agree with you: good coverage is important. For instance, the coverage on ARRAYs of STRUCT is missing and it fails with Failed to cast value: Unimplemented type for cast (STRUCT(a BIGINT, b BIGINT) -> STRUCT(a BIGINT, b BIGINT)[]. It's probably worth opening an issue if it's expected to work. Let's agree on the test cases next time to avoid any misunderstanding.

Now, regarding this PR: how would you like to move forward? Would you prefer to take over the execution project part in another PR and rediscuss the secrets part?

hafenkran commented 1 month ago

Hey @Kayrnt, since it is already late, I'll be pretty short and concise.

Hello @hafenkran,

Thank you for picking up the PR.

I had a different approach in mind, as we discussed earlier in the ticket. It wasn't about simply replacing a well-functioning method with another one, but rather about exploring alternatives on how auth could be used. I'm generally not a big fan of adding a wide range of features that may be redundant. I tend to prioritize simplicity to reduce maintenance as much as possible.

This feature introduces a way to authenticate as another service account from your own account (whether or not you have the right to impersonate it). It's a convenient way to test too. If you had another idea in mind, it wasn't clear from our discussion in #8 what you would do.

In the ticket, I guess, I made clear that I don't see the benefit. I just didn't want to have a 1-to-1 replacement for the auth in #8 without any other added value.

Have you considered any other approaches for authentication? We could use gcloud auth activate-service-account, but that would need disk access to write the JSON, whereas secrets can be used purely from SQL within a shell, or through environment variables.

I also mentioned access tokens as an example in #8.

The test is quite minimal and doesn’t fully cover different scenarios. I also believe it’s important to include edge cases and situations that should fail, to ensure the implementation is robust.

Yes, it could do more, but I didn't want to overload the test, which should probably do the same things as the regular auth test, since the tests are already very slow. We could add failure cases like "invalid JSON" or "forbidden access", for instance.

All tests take about 2 minutes to execute for me. Is it different for you?

E.g. bigquery_scan: The current implementation isn't designed to work with an execution project in its current form, because the execution project's data isn't read anywhere in the code, and the current way the bigquery_scan function operates doesn't allow passing an execution project. Instead, simple project/dataset/table strings are used. This is also why public datasets can't be accessed at the moment.

As you can see on #9, I only mentioned ATTACH support in the scope of that change, as I focused on the "storage approach" of the FDW. I personally feel those table functions are a bit awkward, but I understand you'd like to support them; that means we need to align on the approach, which we haven't done so far (e.g. adding optional parameters to the table function).

Understood. However, for the scan it is important. Otherwise, there is no way to read from public datasets (AFAIK). For me personally, this is by far the most important use case.

I also noticed that there’s a lack of proper tests for this part of the code.

It is tested as part of the new test file, but indeed I can add all kinds of statements to improve the coverage very quickly.

Aah, I see. But the test is then doing no assertion in this direction. Is that right?

(Unfortunately, I need to take over this task right now since it has become critical for my current work).

So you're implementing that feature in another PR? I see you added billing_project_id in the latest PR, but it doesn't look like it's linked to anything yet.

Yes.

Lastly, I’d appreciate it if future feature PRs are fully tested with some coverage.

Sure, what do you think about adding the BigQuery emulator to speed up testing? Test execution is very tedious against BigQuery.

I already use the BigQuery emulator for testing. All files in test/sql/local/ are written for the emulator. Unfortunately, the emulator isn't BigQuery and it doesn't behave like BigQuery. Types are sometimes treated differently; NULLs are different; the Storage API behaves differently and doesn't support all modes. I also had to implement a couple of bugfixes for the emulator myself to get the basics up and running. If you want, I can provide a docker image.

It’s really important to ensure thorough testing to avoid missing potential issues. Unfortunately, I won’t be able to take over the testing for you.

I agree with you: good coverage is important. For instance, the coverage on ARRAYs of STRUCT is missing and it fails with Failed to cast value: Unimplemented type for cast (STRUCT(a BIGINT, b BIGINT) -> STRUCT(a BIGINT, b BIGINT)[]. It's probably worth opening an issue if it's expected to work. Let's agree on the test cases next time to avoid any misunderstanding.

Thanks a lot!

Now, regarding this PR: how would you like to move forward? Would you prefer to take over the execution project part in another PR and rediscuss the secrets part?

Tomorrow I'll work on the execution project thing in a different PR.

Kayrnt commented 1 month ago

In the ticket, I guess, I made clear that I don't see the benefit. I just didn't want to have a 1-to-1 replacement for the auth in https://github.com/hafenkran/duckdb-bigquery/issues/8 without any other added value. I also mentioned access tokens as an example in https://github.com/hafenkran/duckdb-bigquery/issues/8.

As far as I understand, you can get an access token for a given account, but only if your currently authenticated account has access to it. In my current production setup, I have some cases where I have access to the service account JSON but no rights to impersonate it directly. Also, this approach should allow having multiple accounts logged in for multiple ATTACHes. If you were to add access tokens, I'm not sure what you would like to see: the name of the account to impersonate in the connection string? Then I'm not sure how that fits in the current authentication system (while using the JSON is straightforward as a simple option on the Google client). It's also convenient for me to test the extension locally with some SA without redirecting my defaults to another JSON before running duckdb.

All tests take about 2 minutes to execute for me. Is it different for you?

Well, last time I tried, it didn't even make it halfway through in 20 minutes 🤔 I could try again, but it was very slow 🤯

Understood. However, for the scan it is important. Otherwise, there is no way to read from public datasets (AFAIK). For me personally, this is by far the most important use case.

Couldn't we just ATTACH the bigquery-public-data project and use it?
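
i.e. something along these lines (a sketch of the idea only; whether this works without a separate execution/billing project is exactly the open question):

ATTACH 'project=bigquery-public-data' AS bq_public (TYPE bigquery, READ_ONLY);
SELECT * FROM bq_public.samples.shakespeare LIMIT 10;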

Aah, I see. But the test is then doing no assertion in this direction. Is that right?

That one is tricky: how do you assert that it's using a specific project, other than preventing (permission-wise) the "default" project from being used to start jobs? Since we don't have access to INFORMATION_SCHEMA so far, I'm not sure how you'd like to ensure it's covered beyond checking that it indeed works (though it could just be ignored...).

I already use the BigQuery emulator for testing. All files in test/sql/local/ are written for the emulator. Unfortunately, the emulator isn't BigQuery and it doesn't behave like BigQuery. Types are sometimes treated differently; Nulls are different; Storage API behaves differently and doesn't support all modes. I also had to implement a couple of bugfixes for the emulator myself to get the basics up and running. If you want, I can provide a docker image.

Yes, that would be great! I feel like it would be so much faster for development. I understand it's not the same as production ("dupes" are what they are 😉). Maybe documenting the role of local vs bigquery tests in the test README could be helpful! Maybe I didn't read enough documentation, but it felt like:

  • The tests are running in serial mode (no concurrency)

  • It's not possible to run a single test (so I just delete the other one temporarily if I want to run a specific one)

Thanks a lot!

Here's my create table:

CREATE OR REPLACE TABLE `project.dataset.test` AS
SELECT STRUCT(1 as a, 2 as b) test;

select test from `project.dataset.test`;

Tomorrow I'll work on the execution project thing in a different PR.

Good luck with that!

hafenkran commented 1 month ago

Again, in view of the current time, I'll keep this short.

As far as I understand, you can get an access token for a provided account but only if your currently authenticated account has access to it. In my current production setup, I have some cases where I have access to the service account JSON but no rights directly JSON. Also having that approaches should allow to have multiple accounts logged in for multiple ATTACH. If you were to add access token, I'm not sure what you would like to see: the name of the account to impersonate in the connection string? Then I'm not sure how it fits in the current authentication system (while the usage of the JSON is straightforward as a simple option on the google client). It's also convenient for me test the extension locally on some SA without redirecting my defaults to another JSON before running duckdb.

Maybe I'll drop a comment tomorrow.

All tests take about 2 minutes to execute for me. Is it different for you?

Well, last time I tried, it didn't even make it halfway through in 20 minutes 🤔 I could try again, but it was very slow 🤯

I checked the build pipelines and all test runs take about 1.5 to 2.5 minutes. The BQ datasets I use are also empty - one per distribution.

Couldn't we just ATTACH the bigquery-public-data project and use it?

Aah, I see. But the test is then doing no assertion in this direction. Is that right?

That one is tricky: how do you assert that it's using a specific project, other than preventing (permission-wise) the "default" project from being used to start jobs? Since we don't have access to INFORMATION_SCHEMA so far, I'm not sure how you'd like to ensure it's covered beyond checking that it indeed works (though it could just be ignored...).

I have a couple of options in mind: trying out catch; a bigquery_list_jobs function; returning additional info from bigquery_execute.

Yes, that would be great! I feel like it would be so much faster for development. I understand it's not the same as production ("dupes" are what they are 😉). Maybe documenting the role of local vs bigquery tests in the test README could be helpful! Maybe I didn't read enough documentation, but it felt like:

Tomorrow I can check.

  • The tests are running in serial mode (no concurrency)

I don't see it as necessary for the tests to run in parallel.

  • It's not possible to run a single test (so I just delete the other one temporarily if I want to run a specific one)

It is possible to run a single one: ./build/debug/test/unittest test/sql/bigquery/whatever_test_you_like.test

Thanks a lot!

Here's my create table:

CREATE OR REPLACE TABLE `project.dataset.test` AS
SELECT STRUCT(1 as a, 2 as b) test;

select test from `project.dataset.test`;

Tomorrow I'll work on the execution project thing in a different PR.

Good luck with that!

Thanks!

Kayrnt commented 1 month ago

I checked the build pipelines and all test runs take about 1.5 to 2.5 minutes. The BQ datasets I use are also empty - one per distribution.

Right, the problem might be that I reuse a dataset that already contains a fair share of tables 🤔 Filling the catalog is then very slow.

I have a couple of options in mind: trying out catch; a bigquery_list_jobs function; returning additional info from bigquery_execute.

Yes, that could work as long as you're OK with introducing that kind of change 👍

I don't see it as necessary for the tests to run in parallel.

Well if it's faster it's always great 😄

It is possible to run a single one: ./build/debug/test/unittest test/sql/bigquery/whatever_test_you_like.test

Aha, I should have tried that 😓 I was just expecting it to be documented if it was "that simple", but I didn't see it in the Makefile 😅

hafenkran commented 1 month ago

Hey, last week I gave you the adjusted version of the bigquery-emulator. Any problems so far?

I have a couple of options in mind: trying out catch; a bigquery_list_jobs function; returning additional info from bigquery_execute.

Yes, that could work as long as you're OK with introducing that kind of change 👍

Catch is already supported. I also implemented bigquery_list_jobs in another branch quite some time ago.

I don't see it as necessary for the tests to run in parallel.

Well if it's faster it's always great 😄

I just adopted standards here :/

It is possible to run a single one: ./build/debug/test/unittest test/sql/bigquery/whatever_test_you_like.test

Aha, I should have tried that 😓 I was just expecting it to be documented if it was "that simple", but I didn't see it in the Makefile 😅

I don't know. I took it from the Makefile.

As far as I understand, you can get an access token for a given account, but only if your currently authenticated account has access to it. In my current production setup, I have some cases where I have access to the service account JSON but no rights to impersonate it directly. Also, this approach should allow having multiple accounts logged in for multiple ATTACHes. If you were to add access tokens, I'm not sure what you would like to see: the name of the account to impersonate in the connection string? Then I'm not sure how that fits in the current authentication system (while using the JSON is straightforward as a simple option on the Google client). It's also convenient for me to test the extension locally with some SA without redirecting my defaults to another JSON before running duckdb.

Theoretically, one could use --impersonate-service-account=.. with tokens, but that's not really the point here. I think the questions are a bit off-track. My intent was just to get more out of this than just offering another JSON-based auth.

Kayrnt commented 1 month ago

Hey, last week I gave you the adjusted version of the bigquery-emulator. Any problems so far?

That's next on my list; I've been busy with other OSS contributions and blog writing. I'm planning to POC the emulator at work too, so it's definitely something I'll look at soon.

I just adopted standards here :/

Sure, I'm just used to having a lot of stuff running in parallel, but it's not a problem.

I don't know. I took it from the Makefile.

Yeah, I just assumed it only took a directory; I didn't test passing a file directly to the executable instead of expecting a make command that reads a file as an argument (probably worth adding to the CI tools repo, by the way!).

Theoretically, one could use --impersonate-service-account=.. with tokens, but that's not really the point here. I think the questions are a bit off-track. My intent was just to get more out of this than just offering another JSON-based auth.

OK, do you have a target in mind? I haven't thoroughly planned a design to support everything, but my initial guess was to start with the JSON and then support keys other than service_account_json to match the other auth approaches.