Closed antonysouthworth-halter closed 4 months ago
Just tested this locally in our dbt setup, here's an example of what the CloudTrail log would look like:
{
"eventVersion": "1.09",
"userIdentity": {
"type": "AssumedRole",
"principalId": "REDACTED",
"arn": "REDACTED",
"accountId": "REDACTED",
"accessKeyId": "REDACTED",
"sessionContext": {
"sessionIssuer": {
"type": "Role",
"principalId": "REDACTED",
"arn": "REDACTED",
"accountId": "REDACTED",
"userName": "REDACTED"
},
"attributes": {
"creationDate": "2024-07-11T22:21:56Z",
"mfaAuthenticated": "REDACTED"
}
}
},
"eventTime": "2024-07-11T22:24:44Z",
"eventSource": "athena.amazonaws.com",
"eventName": "StartQueryExecution",
"awsRegion": "ap-southeast-2",
"sourceIPAddress": "REDACTED",
"userAgent": "Boto3/1.34.69 md/Botocore#1.34.69 ua/2.0 os/macos#22.6.0 md/arch#x86_64 lang/python#3.9.19 md/pyimpl#CPython cfg/retry-mode#standard Botocore/1.34.69 PyAthena/3.8.2 dbt-athena-community/1.8.0rc1 QueryTraceId=9ed7073d-2b8d-430c-a7f2-981e1814f1f1",
"requestParameters": {
"queryString": "***OMITTED***",
"clientRequestToken": "REDACTED",
"queryExecutionContext": {
"database": "REDACTED",
"catalog": "awsdatacatalog"
},
"resultConfiguration": {
"outputLocation": "REDACTED"
},
"workGroup": "REDACTED"
},
"responseElements": {
"queryExecutionId": "REDACTED"
},
"requestID": "REDACTED",
"eventID": "REDACTED",
"readOnly": false,
"eventType": "AwsApiCall",
"managementEvent": true,
"recipientAccountId": "REDACTED",
"eventCategory": "Management",
"tlsDetails": {
"tlsVersion": "TLSv1.3",
"cipherSuite": "TLS_AES_128_GCM_SHA256",
"clientProvidedHostHeader": "athena.ap-southeast-2.amazonaws.com"
}
}
Lots there of course, here's the important bits:
{
"eventVersion": "1.09",
"eventTime": "2024-07-11T22:24:44Z",
"eventSource": "athena.amazonaws.com",
"eventName": "StartQueryExecution",
"userAgent": "Boto3/1.34.69 md/Botocore#1.34.69 ua/2.0 os/macos#22.6.0 md/arch#x86_64 lang/python#3.9.19 md/pyimpl#CPython cfg/retry-mode#standard Botocore/1.34.69 PyAthena/3.8.2 dbt-athena-community/1.8.0rc1 QueryTraceId=9ed7073d-2b8d-430c-a7f2-981e1814f1f1",
"requestParameters": {
"queryString": "***OMITTED***",
"workGroup": "REDACTED"
},
"responseElements": {
"queryExecutionId": "REDACTED"
},
"requestID": "REDACTED",
"eventID": "REDACTED"
}
QueryTraceId=9ed7073d-2b8d-430c-a7f2-981e1814f1f1
injected into the User-Agent.
It took me a while to get started trying to contribute to this project since there's no clear instructions I could find and I have never used hatch
or even really much pyproject.toml
stuff before; so adding a basic CONTRIBUTING.md doc too to hopefully help others in future.
So...... I don't know why I thought S3 GETs have this User-Agent string in them, must have got my wires crossed somewhere. Closing.
https://github.com/dbt-athena/dbt-athena/issues/686
Context here is if you are spending a large amount of money on S3 GET requests it can be quite difficult to track down exactly which query is originating the requests. Yes, you can get pretty close with:
But these are all quite indirect. Instead I propose injecting some unique identifier into the User-Agent header for each StartQueryExecution call. That way, the User-Agent string now uniquely identifies a StartQueryExecution request and will be passed along to the GetObject requests, allowing us to associate GetObject requests with specific query executions.
Yes, this is janky as shit and not really a smart way of doing it and just generally probably considered _ab_use of the User-Agent header.... but there doesn't seem to really be another alternative so....
You can use the User-Agent and responseElements of the CloudTrail log line for the StartQueryExecution call to associate the GETs with an Athena QueryExecutionId, which can in turn be used to look up the QueryText (not incuded in CloudTrail logs, instead available from athena:GetQueryExecution).
TODO: