turbot / steampipe-plugin-csv

Use SQL to instantly query data from CSV files. Open source CLI. No DB required.
https://hub.steampipe.io/plugins/turbot/csv
Apache License 2.0

CSV@private S3: automatic resolution of the region #57

Closed. ajoga closed this issue 1 year ago

ajoga commented 1 year ago

Is your feature request related to a problem? Please describe.

The documentation states:

Make sure that region is configured in the config. If not set in the config, region will be fetched from the standard environment variable AWS_REGION.

I'm confused as to why this is needed as the region is in the hostname of the S3 path to the file or folder.

The documentation suggests the use of AWS profiles, so anyone with CSV files in two regions would have to configure two different AWS profiles for Steampipe, which is at odds with most other tooling that uses AWS credentials.

The documentation also suggests passing the region, but this is not in the example and I can't get it to work; passing a region or aws_region parameter is not recognized:

$ steampipe query
Welcome to Steampipe v0.19.3
For more information, type .help
Warning: failed to start plugin 'hub.steampipe.io/plugins/turbot/csv@latest': failed to get directory specified by the source s3::https://XXXXXXX.s3.eu-west-1.amazonaws.com/XXXX.csv?aws_profile=aa&aws_region=eu-west-1: error downloading 'https://XXXXXXX.s3.eu-west-1.amazonaws.com/XXXXXXX.csv?aws_profile=aa&aws_region=eu-west-1': MissingRegion: could not find region configuration

Setting the region corresponding to the bucket location in ~/.aws/credentials (region=eu-west-1) works.

Setting an incorrect region in ~/.aws/credentials yields this error upon Steampipe invocation: BucketRegionError: incorrect region, the bucket is not in 'eu-central-1' region

Describe the solution you'd like

I think this feature should not require a region to be given at all. In the worst case, the region can be parsed from the hostname: https://docs.aws.amazon.com/AmazonS3/latest/userguide/access-bucket-intro.html
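As an illustration of the hostname parsing suggested here, the following is a minimal Python sketch. The helper name `region_from_s3_url` and the regular expression are assumptions made for this example; this is not code from Steampipe, the CSV plugin, or go-getter, and it only covers the common `amazonaws.com` hostname forms.

```python
import re
from typing import Optional
from urllib.parse import urlparse

# Hypothetical helper for illustration only; not part of Steampipe or go-getter.
# Recognizes three hostname forms:
#   <bucket>.s3.<region>.amazonaws.com   (virtual-hosted style)
#   s3.<region>.amazonaws.com            (path style)
#   s3-<region>.amazonaws.com            (legacy dash form)
_S3_HOST = re.compile(
    r"^(?:(?P<bucket>.+)\.)?s3[.-](?P<region>[a-z]{2}(?:-[a-z]+)+-\d+)\.amazonaws\.com$"
)


def region_from_s3_url(url: str) -> Optional[str]:
    """Return the region embedded in an S3 hostname, or None if there is none."""
    # Drop the "s3::" forcing prefix used in the paths shown in this issue.
    if url.startswith("s3::"):
        url = url[len("s3::"):]
    # URLs without a scheme have no hostname here and yield None.
    host = urlparse(url).hostname or ""
    match = _S3_HOST.match(host)
    return match.group("region") if match else None
```

A hostname without a region, such as `s3.amazonaws.com`, returns `None`, so a caller would still need a fallback (for example the profile's region or `AWS_REGION`).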

Versions: plugin csv 0.7.0, Steampipe v0.19.3

cbruno10 commented 1 year ago

Hey @ajoga , can you please share some examples of paths you've tried in your csv.spc and what Steampipe returned once you entered steampipe query? For troubleshooting around paths, we've found that stating the config files and outcomes is helpful due to the number of possibilities.

For some more background on how we use the URLs passed into paths, we pass these URLs into the GetSourceFiles function from the Steampipe Plugin SDK, which eventually uses the hashicorp/go-getter library.

From their README, we've followed their examples, e.g., https://github.com/hashicorp/go-getter#s3-bucket-examples. They don't explicitly mention how they resolve regions, but looking at their code, it does look like they try to get the region from parsing the URL, and then they use the standard AWS Go SDK to list objects.

ajoga commented 1 year ago

Hey @cbruno10, sure!

I tried this config line : paths = [ "s3::https://XXXXXXX1-XXXXXXX.s3.eu-west-1.amazonaws.com/XXXXXXX.csv?aws_region=eu-west-1&aws_profile=aa" ]

I'd rather not publish the hostname, but in case it matters, it includes [a-z], a digit and a dash.

Interestingly, starting steampipe query no longer outputs any error, for some reason. However, in ~/.steampipe/logs/plugin-2023-04-05.log I see:

2023-04-05 16:35:25.247 UTC [ERROR] steampipe-plugin-csv.plugin: [ERROR] csv.csvList: failed to fetch absolute path="failed to get directory specified by the source s3::https://XXXXXXX1-XXXXXXX.s3.eu-west-1.amazonaws.com/XXXXXXX.csv?aws_profile=aa&aws_region=eu-west-1: error downloading 'https://XXXXXXX1-XXXXXXX.s3.eu-west-1.amazonaws.com/XXXXXXX.csv?aws_profile=aa&aws_region=eu-west-1': MissingRegion: could not find region configuration" path="s3::https://XXXXXXX1-XXXXXXX.s3.eu-west-1.amazonaws.com/XXXXXXX.csv?aws_region=eu-west-1&aws_profile=aa"
2023-04-05 16:35:25.261 UTC [WARN]  failed to set connection config: failed to get directory specified by the source s3::https://XXXXXXX1-XXXXXXX.s3.eu-west-1.amazonaws.com/XXXXXXX.csv?aws_profile=aa&aws_region=eu-west-1: error downloading 'https://XXXXXXX1-XXXXXXX.s3.eu-west-1.amazonaws.com/XXXXXXX.csv?aws_profile=aa&aws_region=eu-west-1': MissingRegion: could not find region configuration

With a csv.spc containing paths = [ "s3::https://XXXXXXX1-XXXXXXX.s3.eu-west-1.amazonaws.com/XXXXXXX.csv?region=eu-west-1&aws_profile=aa" ], the errors are the same.

Note that the csv.spc has no other paths specified, and all other config keys are left at their defaults (i.e., commented out). No other plugins are enabled either, as I reproduced this issue in a new VM.

ajoga commented 1 year ago

which eventually uses the hashicorp/go-getter library.

@cbruno10, it looks like this issue in their project is related: https://github.com/hashicorp/go-getter/issues/393 ...

I'm unable to read the Go code, but if the parsing of the URL indeed only goes through the S3Path method, then this could be the source of my main concern: having to specify a region explicitly because I used the virtual-hosted style.

So I tried to use the path-style config, with no success.

With paths = [ "s3::https://s3.eu-west-1.amazonaws.com/XXXX1-XXXXX/XXXXX.csv"] and the region set in the credentials file, steampipe query returns: Warning: failed to start plugin 'hub.steampipe.io/plugins/turbot/csv@latest': failed to get directory specified by the source s3::https://s3.eu-west-1.amazonaws.com/XXXXXXXXXXXXXXXX-XXXXXXXXXXXXXXXX/XXXXXXXXXXXXXXXX.csv: error downloading 'https://s3.eu-west-1.amazonaws.com/XXXXXXXXXXXXXXXX-XXXXXXXXXXXXXXXX/XXXXXXXXXXXXXXXX.csv': InvalidBucketName: The specified bucket is not valid. status code: 400, request id: XXXXXXXXXXXXXXXX, host id: XXXXXXXXXXXXXXXX=

Exchanging the dot for a dash after https://s3, as in paths = [ "s3::https://s3-eu-west-1.amazonaws.com/XXXXXXXXXXXXXXXX-XXXXXXXXXXXXXXXX/XXXXXXXXXXXXXXXX.csv?aws_profile=aa"], with the region set in the credentials file, works!

However, paths = [ "s3::https://s3-eu-west-1.amazonaws.com/XXXXXXXXXXXXXXXX-XXXXXXXXXXXXXXXX/XXXXXXXXXXXXXXXX.csv?aws_profile=aa&region=eu-west-1"] with no region in the credentials file yields the following error on steampipe query invocation: Warning: failed to start plugin 'hub.steampipe.io/plugins/turbot/csv@latest': failed to get directory specified by the source s3::https://s3-eu-west-1.amazonaws.com/XXXXXXXXXXXXXXXX-XXXXXXXXXXXXXXXX/XXXXXXXXXXXXXXXX.csv?aws_profile=aa&region=eu-west-1: error downloading 'https://s3-eu-west-1.amazonaws.com/XXXXXXXXXXXXXXXX-XXXXXXXXXXXXXXXX/XXXXXXXXXXXXXXXX.csv?aws_profile=aa&region=eu-west-1': MissingRegion: could not find region configuration

cbruno10 commented 1 year ago

@ajoga I don't think I need the specific hostname, so what you sent over is sufficient.

Looking at the combination of query parameters in https://github.com/hashicorp/go-getter#s3-s3 and the sections below, it seems like aws_profile and region are not intended to be used together, and region should only be used with the aws_access_key_id and aws_access_key_secret params.

I also found this issue which has some working and non-working examples according to the issue author, https://github.com/hashicorp/go-getter/issues/387.

Do any of the examples in the issue above work for your use case?

ajoga commented 1 year ago

Hi @cbruno10

path: paths = [ "xxx123-xxxxx.s3.amazonaws.com/AAAaaaa.csv?aws_profile=aa"]
plugin-*.log content upon steampipe query call: 2023-04-06 07:22:41.292 UTC [ERROR] steampipe-plugin-csv.plugin: [ERROR] csv.csvList: failed to fetch absolute path="failed to get directory specified by the source xxx123-xxxxx.s3.amazonaws.com/AAAaaaa.csv?aws_profile=aa: error downloading 'https://s3.amazonaws.com/xxx123-xxxxx/AAAaaaa.csv?aws_profile=aa': MissingRegion: could not find region configuration" path=xxx123-xxxxx.s3.amazonaws.com/AAAaaaa.csv?aws_profile=aa \n 2023-04-06 07:22:41.302 UTC [WARN] failed to set connection config: failed to get directory specified by the source xxx123-xxxxx.s3.amazonaws.com/AAAaaaa.csv?aws_profile=aa: error downloading 'https://s3.amazonaws.com/xxx123-xxxxx/AAAaaaa.csv?aws_profile=aa': MissingRegion: could not find region configuration
note: makes some sense, but this path style should resolve to us-east-1

path: paths = [ "xxx123-xxxxx.eu-west-1.s3.amazonaws.com/AAAaaaa.csv?aws_profile=aa"]
plugin-*.log content upon steampipe query call: 2023-04-06 07:30:18.327 UTC [ERROR] steampipe-plugin-csv.plugin: [ERROR] csv.csvList: failed to fetch absolute path="failed to get directory specified by the source xxx123-xxxxx.eu-west-1.s3.amazonaws.com/AAAaaaa.csv?aws_profile=aa: URL is not a valid S3 URL" path=xxx123-xxxxx.eu-west-1.s3.amazonaws.com/AAAaaaa.csv?aws_profile=aa \n 2023-04-06 07:30:18.340 UTC [WARN] failed to set connection config: failed to get directory specified by the source xxx123-xxxxx.eu-west-1.s3.amazonaws.com/AAAaaaa.csv?aws_profile=aa: URL is not a valid S3 URL
note: ???

path: paths = [ "s3.eu-west-1.amazonaws.com/xxx123-xxxxx/AAAaaaa.csv?aws_profile=aa"]
plugin-*.log content upon steampipe query call: 2023-04-06 07:48:52.432 UTC [ERROR] steampipe-plugin-csv.plugin: [ERROR] csv.csvList: failed to fetch absolute path="failed to get directory specified by the source s3.eu-west-1.amazonaws.com/xxx123-xxxxx/AAAaaaa.csv?aws_profile=aa: error downloading 'https://eu-west-1.amazonaws.com/s3/xxx123-xxxxx/AAAaaaa.csv?aws_profile=aa': MissingRegion: could not find region configuration" path=s3.eu-west-1.amazonaws.com/xxx123-xxxxx/AAAaaaa.csv?aws_profile=aa \n 2023-04-06 07:48:52.445 UTC [WARN] failed to set connection config: failed to get directory specified by the source s3.eu-west-1.amazonaws.com/xxx123-xxxxx/AAAaaaa.csv?aws_profile=aa: error downloading 'https://eu-west-1.amazonaws.com/s3/xxx123-xxxxx/AAAaaaa.csv?aws_profile=aa': MissingRegion: could not find region configuration
note: ???

path: paths = [ "s3://s3.amazonaws.com/xxx123-xxxxx/AAAaaaa.csv?aws_profile=aa"]
plugin-*.log content upon steampipe query call: 2023-04-06 07:41:43.493 UTC [ERROR] steampipe-plugin-csv.plugin: [ERROR] csv.csvList: failed to fetch absolute path="failed to get directory specified by the source s3://s3.amazonaws.com/xxx123-xxxxx/AAAaaaa.csv?aws_profile=aa: error downloading 's3://s3.amazonaws.com/xxx123-xxxxx/AAAaaaa.csv?aws_profile=aa': MissingRegion: could not find region configuration" path=s3://s3.amazonaws.com/xxx123-xxxxx/AAAaaaa.csv?aws_profile=aa \n 2023-04-06 07:41:43.523 UTC [WARN] failed to set connection config: failed to get directory specified by the source s3://s3.amazonaws.com/xxx123-xxxxx/AAAaaaa.csv?aws_profile=aa: error downloading 's3://s3.amazonaws.com/xxx123-xxxxx/AAAaaaa.csv?aws_profile=aa': MissingRegion: could not find region configuration
note: makes some sense, but this path style should resolve to us-east-1

path: paths = [ "s3://s3-eu-west-1.amazonaws.com/xxx123-xxxxx/AAAaaaa.csv?aws_profile=aa"]
plugin-*.log content upon steampipe query call: 2023-04-06 07:38:50.159 UTC [ERROR] steampipe-plugin-csv.plugin: [ERROR] csv.csvList: failed to fetch absolute path="failed to get directory specified by the source s3://s3-eu-west-1.amazonaws.com/xxx123-xxxxx/AAAaaaa.csv?aws_profile=aa: error downloading 's3://s3-eu-west-1.amazonaws.com/xxx123-xxxxx/AAAaaaa.csv?aws_profile=aa': MissingRegion: could not find region configuration" path=s3://s3-eu-west-1.amazonaws.com/xxx123-xxxxx/AAAaaaa.csv?aws_profile=aa \n 2023-04-06 07:38:50.196 UTC [WARN] failed to set connection config: failed to get directory specified by the source s3://s3-eu-west-1.amazonaws.com/xxx123-xxxxx/AAAaaaa.csv?aws_profile=aa: error downloading 's3://s3-eu-west-1.amazonaws.com/xxx123-xxxxx/AAAaaaa.csv?aws_profile=aa': MissingRegion: could not find region configuration
note: ???

path: paths = [ "s3::https://s3.amazonaws.com/xxx123-xxxxx/AAAaaaa.csv?aws_profile=aa"]
plugin-*.log content upon steampipe query call: 2023-04-06 07:43:58.893 UTC [ERROR] steampipe-plugin-csv.plugin: [ERROR] csv.csvList: failed to fetch absolute path="failed to get directory specified by the source s3::https://s3.amazonaws.com/xxx123-xxxxx/AAAaaaa.csv?aws_profile=aa: error downloading 'https://s3.amazonaws.com/xxx123-xxxxx/AAAaaaa.csv?aws_profile=aa': MissingRegion: could not find region configuration" path=s3::https://s3.amazonaws.com/xxx123-xxxxx/AAAaaaa.csv?aws_profile=aa \n 2023-04-06 07:43:58.920 UTC [WARN] failed to set connection config: failed to get directory specified by the source s3::https://s3.amazonaws.com/xxx123-xxxxx/AAAaaaa.csv?aws_profile=aa: error downloading 'https://s3.amazonaws.com/xxx123-xxxxx/AAAaaaa.csv?aws_profile=aa': MissingRegion: could not find region configuration
note: makes some sense, but this path style should resolve to us-east-1

path: paths = [ "s3::https://s3-eu-west-1.amazonaws.com/xxx123-xxxxx/AAAaaaa.csv?aws_profile=aa"]
plugin-*.log content upon steampipe query call: 2023-04-06 07:47:21.064 UTC [ERROR] steampipe-plugin-csv.plugin: [ERROR] csv.csvList: failed to fetch absolute path="failed to get directory specified by the source s3::https://s3-eu-west-1.amazonaws.com/xxx123-xxxxx/AAAaaaa.csv?aws_profile=aa: error downloading 'https://s3-eu-west-1.amazonaws.com/xxx123-xxxxx/AAAaaaa.csv?aws_profile=aa': MissingRegion: could not find region configuration" path=s3::https://s3-eu-west-1.amazonaws.com/xxx123-xxxxx/AAAaaaa.csv?aws_profile=aa \n 2023-04-06 07:47:21.081 UTC [WARN] failed to set connection config: failed to get directory specified by the source s3::https://s3-eu-west-1.amazonaws.com/xxx123-xxxxx/AAAaaaa.csv?aws_profile=aa: error downloading 'https://s3-eu-west-1.amazonaws.com/xxx123-xxxxx/AAAaaaa.csv?aws_profile=aa': MissingRegion: could not find region configuration
note: ???

The replacements of CRLF with \n are mine, to allow formatting in a table.

From these tests it feels like if aws_profile is specified, then the logic for URL parsing to determine the region is not used at all.

But overall, from an end-user perspective, all these URI schemes seem odd: for an S3 object, the AWS console gives two URI schemes to access it, and neither of them is among the examples of the go-getter library:

Screenshot 2023-04-06 100305

(bucket name in green, object name in red)
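For readers without the screenshot, the sketch below builds the two forms meant here, assuming they are the "S3 URI" and the virtual-hosted-style "Object URL" described in the AWS documentation linked earlier in this issue. The function name and the sample bucket, region, and key are hypothetical.

```python
from urllib.parse import quote


def console_style_uris(bucket: str, region: str, key: str) -> dict:
    """Build the two object addresses assumed to be shown by the AWS console."""
    return {
        # S3 URI: scheme, bucket name, then the object key.
        "s3_uri": f"s3://{bucket}/{key}",
        # Object URL: virtual-hosted style, with the region in the hostname.
        "object_url": f"https://{bucket}.s3.{region}.amazonaws.com/{quote(key, safe='/')}",
    }


uris = console_style_uris("my-bucket", "eu-west-1", "CSVs/data.csv")
print(uris["s3_uri"])
print(uris["object_url"])
```

Note that the region appears only in the Object URL; the S3 URI carries no region at all, so a tool given that form has to obtain the region some other way.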

github-actions[bot] commented 1 year ago

This issue is stale because it has been open 60 days with no activity. Remove stale label or comment or this will be closed in 30 days.

cbruno10 commented 1 year ago

Hey @ajoga , sorry for the long response time!

I had another dive into this issue and what formats were supported by go-getter, but I walked away even less confident than when I started.

Most of our examples and testing in https://hub.steampipe.io/plugins/turbot/csv#accessing-a-private-bucket were based off of https://github.com/hashicorp/go-getter#s3-bucket-examples, but we admittedly ran into some questions/issues along the way:

We didn't find their documentation very helpful, and it seems like other users of that package also had questions around what format to use, e.g., https://github.com/hashicorp/go-getter/issues/387.

In our SDK, we do some S3 path handling, but I'm not sure this is the cause of the errors, as in the plugin log error messages you've included above, the URLs look like they match what you have in your paths config argument.

If we have any examples or key information missing in our doc based off of your tests, can you please raise a PR adding this information (which we could then push to other plugin docs that support retrieving files with go-getter)?

I think performing exhaustive tests is a bit of a black hole, given the lack of guidance from the go-getter package and the large number of possible URL and query param combinations, so it may be better to provide examples that work consistently in our docs.

If you have any other questions or thoughts, please let us know!

Subhajit97 commented 1 year ago

@ajoga As a continuation of @cbruno10's message above, you can also refer to our unit tests, where we have a few different path formats defined which you may try out. Thanks!

cbruno10 commented 1 year ago

@Subhajit97 From the unit tests you linked, are any of those worth adding into our docs?

Subhajit97 commented 1 year ago

@cbruno10 IIRC, in our docs, we prefer a consistent path format that works with both private and public buckets. The above unit tests mostly target a public S3 bucket so that we can test them; go-getter fails for some of them if the bucket is private.

We can add a few of them to show the formats that go-getter supports, for example:

cbruno10 commented 1 year ago

@Subhajit97 Do those 2 formats work with private S3 buckets? If so, how would I pass in authentication information, e.g., the profile name?

Subhajit97 commented 1 year ago

@cbruno10, both of the formats mentioned above work with private S3 buckets. For authentication, the profile name can be given in the paths, as defined in the plugin docs.

For example:

connection "csv" {
  plugin = "csv"

  paths = [
    "deletebucket12092023.s3-us-east-1.amazonaws.com/CSVs//*.csv?aws_profile=default"
  ]
}
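
For CSV files spread over several buckets or regions, entries in this format can be generated rather than typed by hand. The following Python sketch only reproduces the string layout of the example above; the helper name and its parameters are hypothetical, and whether a given entry actually resolves still depends on go-getter and on the credentials in the named profile.

```python
def csv_plugin_s3_path(bucket: str, region: str, prefix: str,
                       profile: str, pattern: str = "*.csv") -> str:
    """Format a paths entry like the one in the config example above."""
    # Layout: <bucket>.s3-<region>.amazonaws.com/<prefix>//<pattern>?aws_profile=<profile>
    return f"{bucket}.s3-{region}.amazonaws.com/{prefix}//{pattern}?aws_profile={profile}"


# Example: one entry per (bucket, region) pair, all using the same profile.
buckets = [("deletebucket12092023", "us-east-1"), ("another-bucket", "eu-west-1")]
for name, region in buckets:
    print(f'    "{csv_plugin_s3_path(name, region, "CSVs", "default")}",')
```

The second bucket name is a placeholder; only the first entry corresponds to a path shown in this thread.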

Screenshot 2023-09-12 at 8 15 40 PM

cbruno10 commented 1 year ago

Hey @ajoga , thanks again for doing some extensive testing earlier.

For now, we'd recommend using one of the formats that @Subhajit97 had mentioned in https://github.com/turbot/steampipe-plugin-csv/issues/57#issuecomment-1705249483 along with AWS profile credentials.

In terms of figuring out which S3 URL formats go-getter accepts, we found it difficult because of the lack of documentation and examples in the go-getter repository and related docs. Exhaustively testing them all (as you've done) is time-consuming and produces some unexpected results and errors, so in general we don't rely on exhaustive testing and instead use the known working formats (which usually include the region in the URL).

If these formats do not work for you though, please let us know and we can dig into these specifically.

Thanks!