harrystech / arthur-redshift-etl

ELT Code for your Data Warehouse

Getting-started guides? #238


bhtucker commented 4 years ago

Summary

I'm trying to set up a fresh project and wonder if there are any templates for the 'sibling' repo. (I have the fortunate position of vaguely remembering how this should work, and still I'm stuck!)

By banging my head against the validator, I eventually came up with a dummy warehouse config (it passes the validator but is otherwise useless):

{
  "arthur_settings": {},
  "data_warehouse": {},
  "type_maps": {},
  "object_store": {
    "s3": {
      "bucket_name": "load-bucket",
      "iam_role": "arn:aws:iam::123:role/NotARole"
    }
  },
  "resources": {
    "key_name": "my-fake-ssh-key",
    "VPC": {
      "region": "us-east-1",
      "account": "123",
      "name": "MyVPC",
      "public_subnet": "PublicSubnet",
      "whitelist_security_group": "sg-123"
    },
    "DataPipeline": {
      "role": "NotARole"
    },
    "EC2": {
      "instance_type": "m5.4xlarge",
      "image_id": "",
      "public_security_group": "foobar",
      "iam_instance_profile": "instanceprofile"
    },
    "EMR": {
      "master": {
        "instance_type": "m5.4xlarge",
        "managed_security_group": "foobar"
      },
      "core": {
        "instance_type": "m5.4xlarge",
        "managed_security_group": "foobar"
      },
      "release_label": "emr-5.29.0"
    }
  },
  "etl_events": {}
}

Now I need to set up my prefix, with e.g. bootstrapping scripts as well as sync output. I guess this is upload_env.sh?

Anyway, if I'm missing existing assets, I'd love to use them -- and if not, it would be good to know, so I can write down what I do!

Details

At the moment I'm just trying to use extract.

Labels

I don't think I have 'edit' rights on the labels

tvogels01 commented 4 years ago

Just for testing, I created a config directory inside the arthur-redshift-etl directory. Then I built a minimal set of config files.

Here's a PR that makes this easier: #241

mkdir config
export DATA_WAREHOUSE_CONFIG=`pwd`/config

cp etc/aws_template.yaml config/aws.yaml
cp etc/warehouse_template.yaml config/warehouse.yaml
cp etc/credentials.sh.template config/credentials.sh

Take a look at the templates. The aws.yaml config file needs to be updated based on outputs from the CloudFormation stack. Looking at the config file you posted, it just might be easier than you remember. We've made some improvements.
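
As a sketch, assuming your stack is named arthur-etl (a placeholder), the AWS CLI can pull those outputs for you:

# List the outputs of the CloudFormation stack (stack name is a placeholder).
aws cloudformation describe-stacks --stack-name arthur-etl --query 'Stacks[0].Outputs' --output table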

Starting Arthur now:

bin/run_arthur.sh

This will show some settings. Take a look at the rest:

arthur.py settings

And now make sure that S3 has the ETL code:

upload_env.sh 

Without changes to the template this fails, of course, until you've updated aws.yaml.

Finding bucket name and prefix in configuration...

An error occurred (AccessDenied) when calling the ListObjectsV2 operation: Access Denied
Check whether the bucket "object-store" exists and you have access to it!
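
If you see this even after updating aws.yaml, a quick sanity check is to hit the bucket directly with the AWS CLI ("object-store" below is just the bucket name from the error message; use yours):

# Verify that the bucket exists and is reachable with your credentials.
aws s3api head-bucket --bucket object-store

# Or list its contents.
aws s3 ls s3://object-store/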

Then create a table design file:

arthur.py bootstrap_sources webapp

This also fails because you need a credentials file with the connection strings; see the prompts in config/credentials.sh.
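
For what it's worth, here's a hypothetical sketch of what the finished file tends to look like; the variable names and connection string are placeholders, so follow the prompts in the actual template:

# config/credentials.sh -- values below are placeholders, not real settings.
# Connection strings follow the usual form: protocol://user:password@host:port/database
export DATA_WAREHOUSE_ADMIN="postgres://admin:secret@warehouse.example.com:5439/dev"
export DATA_WAREHOUSE_ETL="postgres://etl:secret@warehouse.example.com:5439/dev"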

After you've set up the credentials (with connection strings), don't forget to copy the file to S3.
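
Something along these lines does the copy; the bucket and prefix are placeholders, so use the values from your own config:

# Upload the credentials file next to the rest of the ETL code (paths are placeholders).
aws s3 cp config/credentials.sh s3://object-store/your-prefix/config/credentials.sh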

Once you have a design file, upload the local schemas to S3:

arthur.py sync --deploy-config
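
To double-check that everything landed, you can list the prefix (again, bucket, prefix, and folder layout here are placeholders):

# Show what ended up under your prefix (paths are placeholders).
aws s3 ls s3://object-store/your-prefix/ --recursive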

And now run one of:

arthur.py extract
install_extraction_pipeline.sh

Let me know which hurdles you encounter and I'll try to get them resolved.

bhtucker commented 4 years ago

Thank you for the guidance!

bhtucker commented 4 years ago

This worked great. The one hiccup: credentials.sh seems to be uploaded out-of-band, right? Neither upload_env.sh nor arthur.py sync ends up copying it?

tvogels01 commented 4 years ago

Yes, unfortunately. You'll have to create and upload the credentials.sh file manually.