pydata / parallel-tutorial

Parallel computing in Python tutorial materials

Spark S3 access #13

Closed mrocklin closed 7 years ago

mrocklin commented 7 years ago

I suspect that this is because we have yet to include credentials. Still, reporting it here just in case.

rdd = sc.textFile("s3a://githubarchive-data/2015-01-01-*.json.gz")
rdd.take(2)

Py4JJavaError: An error occurred while calling o25.partitions.
: com.amazonaws.services.s3.model.AmazonS3Exception: Status Code: 403, AWS Service: Amazon S3, AWS Request ID: 2EEACD7FB2D731F3, AWS Error Code: InvalidAccessKeyId, AWS Error Message: The AWS Access Key Id you provided does not exist in our records., S3 Extended Request ID: GZakDXNA+ZbAdi7xp24fiIxWJEZIatmAi4P2sCg+w5+XbidUGy88hcGZ/ZwNDq/8GZGwycJUS6g=

quasiben commented 7 years ago

Resolved by supplying proper credentials.
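
The thread does not record how the credentials were supplied. For anyone hitting the same `InvalidAccessKeyId` error, here is a minimal sketch of one common approach: passing the keys to the `s3a://` filesystem through the `spark.hadoop.fs.s3a.access.key` and `spark.hadoop.fs.s3a.secret.key` properties. It assumes the keys live in the `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY` environment variables and that the Hadoop S3A connector is already on the classpath. The helper names `s3a_conf_from_env` and `make_context` are hypothetical, not part of the tutorial.

```python
import os


def s3a_conf_from_env(env=os.environ):
    """Build Spark properties that hand AWS keys to the s3a:// filesystem.

    Assumes the keys are stored in the standard AWS environment variables.
    """
    try:
        access_key = env["AWS_ACCESS_KEY_ID"]
        secret_key = env["AWS_SECRET_ACCESS_KEY"]
    except KeyError as missing:
        # Fail early with a readable message instead of a 403 from S3 later.
        raise RuntimeError(f"missing AWS credential variable: {missing}") from None
    return {
        "spark.hadoop.fs.s3a.access.key": access_key,
        "spark.hadoop.fs.s3a.secret.key": secret_key,
    }


def make_context(app_name="s3a-example"):
    """Create a SparkContext configured with the credentials above."""
    # Imported lazily so the helper above can be used without Spark installed.
    from pyspark import SparkConf, SparkContext

    conf = SparkConf().setAppName(app_name)
    for key, value in s3a_conf_from_env().items():
        conf.set(key, value)
    return SparkContext(conf=conf)
```

With the variables exported, `sc = make_context()` followed by the `sc.textFile(...)` call from the original report should then authenticate, provided the keys are allowed to read the bucket.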