laughingman7743 / PyAthena

PyAthena is a Python DB API 2.0 (PEP 249) client for Amazon Athena.
MIT License
464 stars 105 forks source link

Trace S3 GET requests back to Athena queries. #552

Closed antonysouthworth-halter closed 4 months ago

antonysouthworth-halter commented 4 months ago

https://github.com/dbt-athena/dbt-athena/issues/686

Context here is if you are spending a large amount of money on S3 GET requests it can be quite difficult to track down exactly which query is originating the requests. Yes, you can get pretty close with:

But these are all quite indirect. Instead I propose injecting some unique identifier into the User-Agent header for each StartQueryExecution call. That way, the User-Agent string now uniquely identifies a StartQueryExecution request and will be passed along to the GetObject requests, allowing us to associate GetObject requests with specific query executions.

Yes, this is janky as shit and not really a smart way of doing it and just generally probably considered _ab_use of the User-Agent header.... but there doesn't seem to really be another alternative so....

You can use the User-Agent and responseElements of the CloudTrail log line for the StartQueryExecution call to associate the GETs with an Athena QueryExecutionId, which can in turn be used to look up the QueryText (not incuded in CloudTrail logs, instead available from athena:GetQueryExecution).

TODO:

antonysouthworth-halter commented 4 months ago

Just tested this locally in our dbt setup, here's an example of what the CloudTrail log would look like:

{
    "eventVersion": "1.09",
    "userIdentity": {
        "type": "AssumedRole",
        "principalId": "REDACTED",
        "arn": "REDACTED",
        "accountId": "REDACTED",
        "accessKeyId": "REDACTED",
        "sessionContext": {
            "sessionIssuer": {
                "type": "Role",
                "principalId": "REDACTED",
                "arn": "REDACTED",
                "accountId": "REDACTED",
                "userName": "REDACTED"
            },
            "attributes": {
                "creationDate": "2024-07-11T22:21:56Z",
                "mfaAuthenticated": "REDACTED"
            }
        }
    },
    "eventTime": "2024-07-11T22:24:44Z",
    "eventSource": "athena.amazonaws.com",
    "eventName": "StartQueryExecution",
    "awsRegion": "ap-southeast-2",
    "sourceIPAddress": "REDACTED",
    "userAgent": "Boto3/1.34.69 md/Botocore#1.34.69 ua/2.0 os/macos#22.6.0 md/arch#x86_64 lang/python#3.9.19 md/pyimpl#CPython cfg/retry-mode#standard Botocore/1.34.69 PyAthena/3.8.2 dbt-athena-community/1.8.0rc1 QueryTraceId=9ed7073d-2b8d-430c-a7f2-981e1814f1f1",
    "requestParameters": {
        "queryString": "***OMITTED***",
        "clientRequestToken": "REDACTED",
        "queryExecutionContext": {
            "database": "REDACTED",
            "catalog": "awsdatacatalog"
        },
        "resultConfiguration": {
            "outputLocation": "REDACTED"
        },
        "workGroup": "REDACTED"
    },
    "responseElements": {
        "queryExecutionId": "REDACTED"
    },
    "requestID": "REDACTED",
    "eventID": "REDACTED",
    "readOnly": false,
    "eventType": "AwsApiCall",
    "managementEvent": true,
    "recipientAccountId": "REDACTED",
    "eventCategory": "Management",
    "tlsDetails": {
        "tlsVersion": "TLSv1.3",
        "cipherSuite": "TLS_AES_128_GCM_SHA256",
        "clientProvidedHostHeader": "athena.ap-southeast-2.amazonaws.com"
    }
}

Lots there of course, here's the important bits:

{
    "eventVersion": "1.09",
    "eventTime": "2024-07-11T22:24:44Z",
    "eventSource": "athena.amazonaws.com",
    "eventName": "StartQueryExecution",
    "userAgent": "Boto3/1.34.69 md/Botocore#1.34.69 ua/2.0 os/macos#22.6.0 md/arch#x86_64 lang/python#3.9.19 md/pyimpl#CPython cfg/retry-mode#standard Botocore/1.34.69 PyAthena/3.8.2 dbt-athena-community/1.8.0rc1 QueryTraceId=9ed7073d-2b8d-430c-a7f2-981e1814f1f1",
    "requestParameters": {
        "queryString": "***OMITTED***",
        "workGroup": "REDACTED"
    },
    "responseElements": {
        "queryExecutionId": "REDACTED"
    },
    "requestID": "REDACTED",
    "eventID": "REDACTED"
}

QueryTraceId=9ed7073d-2b8d-430c-a7f2-981e1814f1f1 injected into the User-Agent.

antonysouthworth-halter commented 4 months ago

It took me a while to get started trying to contribute to this project since there's no clear instructions I could find and I have never used hatch or even really much pyproject.toml stuff before; so adding a basic CONTRIBUTING.md doc too to hopefully help others in future.

antonysouthworth-halter commented 4 months ago

So...... I don't know why I thought S3 GETs have this User-Agent string in them, must have got my wires crossed somewhere. Closing.