hrbrmstr / sergeant

Tools to Transform and Query Data with 'Apache' 'Drill'
https://hrbrmstr.github.io/sergeant/

Add example of accessing S3 files #25

Open hrbrmstr opened 5 years ago

hrbrmstr commented 5 years ago

DRILL-6662 (https://issues.apache.org/jira/browse/DRILL-6662) makes it possible to use non-hardcoded credentials, so it finally makes sense to add some examples of how to query S3 data.

davidski commented 5 years ago

If your example could include referencing a specific regional endpoint, that would be A++ good. My first attempt at getting IAM roles and a regionally scoped bucket call to work failed and I've yet to go back and make another attempt.

hrbrmstr commented 5 years ago

I gave it a quick try the day Drill 1.15.0 came out but didn't go back to it.

davidski commented 5 years ago

Looks like I just needed to come back to this. Got this working on a us-west-2 S3 endpoint with the following (excessively verbose) storage configuration:

{
  "type": "file",
  "connection": "s3a://cloudy-mccloudface",
  "config": {
    "fs.s3a.aws.credentials.provider": "com.amazonaws.auth.InstanceProfileCredentialsProvider",
    "fs.s3a.endpoint": "s3.us-west-2.amazonaws.com"
  },
  "workspaces": {
    "tmp": {
      "location": "/tmp",
      "writable": true,
      "defaultInputFormat": null,
      "allowAccessOutsideWorkspace": false
    },
    "root": {
      "location": "/",
      "writable": false,
      "defaultInputFormat": null,
      "allowAccessOutsideWorkspace": false
    },
    "csvs": {
      "location": "/csvs",
      "writable": false,
      "defaultInputFormat": null,
      "allowAccessOutsideWorkspace": false
    }
  },
  "formats": {
    "psv": {
      "type": "text",
      "extensions": [
        "tbl"
      ],
      "delimiter": "|"
    },
    "csv": {
      "type": "text",
      "extensions": [
        "csv"
      ],
      "delimiter": ","
    },
    "tsv": {
      "type": "text",
      "extensions": [
        "tsv"
      ],
      "delimiter": "\t"
    },
    "parquet": {
      "type": "parquet"
    },
    "json": {
      "type": "json",
      "extensions": [
        "json"
      ]
    },
    "avro": {
      "type": "avro"
    },
    "sequencefile": {
      "type": "sequencefile",
      "extensions": [
        "seq"
      ]
    },
    "csvh": {
      "type": "text",
      "extensions": [
        "csvh"
      ],
      "extractHeader": true,
      "delimiter": ","
    }
  },
  "enabled": true
}

A bit too shagged out to write this up properly right now, so dumping the config as a reminder for later.
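
For anyone landing here before a proper write-up exists, here is a rough sketch of how the config above might be trimmed down and pushed to a running Drill instance over its REST API. Only the `connection`, credentials provider, and endpoint values come from the config above; the plugin name `s3`, the `http://localhost:8047` URL, the example file path, and the helper names (`build_s3_plugin`, `register_plugin`, `run_query`) are assumptions for illustration. Python is used simply because it keeps the sketch self-contained, and it has not been tested against a real bucket.

```python
import json
import urllib.request


def build_s3_plugin(bucket, region):
    """Return a trimmed storage config for an S3 bucket on a regional endpoint."""
    return {
        "type": "file",
        "connection": f"s3a://{bucket}",
        "config": {
            # Pick up credentials from the EC2 instance profile (IAM role)
            # instead of hardcoding keys in the plugin config.
            "fs.s3a.aws.credentials.provider":
                "com.amazonaws.auth.InstanceProfileCredentialsProvider",
            # Regional endpoint, following the us-west-2 value used above.
            "fs.s3a.endpoint": f"s3.{region}.amazonaws.com",
        },
        "workspaces": {
            "root": {
                "location": "/",
                "writable": False,
                "defaultInputFormat": None,
                "allowAccessOutsideWorkspace": False,
            },
        },
        "formats": {
            "csvh": {
                "type": "text",
                "extensions": ["csvh"],
                "extractHeader": True,
                "delimiter": ",",
            },
            "parquet": {"type": "parquet"},
        },
        "enabled": True,
    }


def post_json(url, payload):
    """POST a JSON payload and return the decoded JSON response."""
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read().decode("utf-8"))


def register_plugin(drill_url, name, config):
    """Create or update a storage plugin via Drill's /storage/{name}.json endpoint."""
    return post_json(f"{drill_url}/storage/{name}.json",
                     {"name": name, "config": config})


def run_query(drill_url, sql):
    """Submit a SQL query via Drill's /query.json endpoint."""
    return post_json(f"{drill_url}/query.json",
                     {"queryType": "SQL", "query": sql})


# Hypothetical usage (assumes Drill is listening on localhost:8047 and that
# the bucket contains csvs/example.csvh):
#   cfg = build_s3_plugin("cloudy-mccloudface", "us-west-2")
#   register_plugin("http://localhost:8047", "s3", cfg)
#   run_query("http://localhost:8047",
#             "SELECT * FROM s3.root.`csvs/example.csvh` LIMIT 10")
```

Once the plugin is registered under some name, it should be queryable from sergeant like any other Drill storage plugin; the REST calls are just one way to get the config in place without going through the web UI.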