nih-cfde / icc-eval-core

(WIP) Tools for collecting and reporting CFDE metrics
https://cfde-eval.netlify.app/

Capability to ingest (google) analytics "centrally" #14

Closed: seandavi closed this issue 2 months ago

seandavi commented 4 months ago

Just so we capture what we can. Obviously nothing's ready for primetime just yet, but....

vincerubinetti commented 3 months ago

Assuming you meant this for our dashboard itself, and not any webapps associated with CF projects. That'd be on them to add their own analytics, though we can have this issue track adding a submission process and ingesting their analytics tokens.

vincerubinetti commented 3 months ago

Look into credential technicalities. Do we need to get a token for every analytics account? Can each analytics user just invite us (a single account) to view their analytics, and thus grant our API token access to their data? Identity federate with github repo?

https://github.com/seandavi/ghactions-gcp-example

seandavi commented 2 months ago

Authentication

(HT to ChatGPT, but this jibes with what I do in practice, too)

Two methods:

  1. Service account and key file
  2. Service account and workload identity federation to GitHub (more secure)

Service Account and Key File

Here's how to do it:

1. Go to Google Cloud Console

2. Create or Select a Project

3. Enable the Google Analytics API

4. Create a Service Account

5. Assign Roles to the Service Account

6. Create a JSON Key for the Service Account

7. Add the Service Account to Google Analytics

8. Use the Service Account in Your Application
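
As a concrete example of step 8, here's a minimal sketch using the google-analytics-data Python package; the key file name and property ID are placeholders:

```python
# Minimal sketch: query a GA4 property with a service account key file.
# Assumes `pip install google-analytics-data`; "key.json" is the key
# created in step 6 and "properties/123456789" is a placeholder.
from google.analytics.data_v1beta import BetaAnalyticsDataClient
from google.analytics.data_v1beta.types import (
    DateRange,
    Dimension,
    Metric,
    RunReportRequest,
)

# Authenticate with the JSON key rather than Application Default Credentials
client = BetaAnalyticsDataClient.from_service_account_file("key.json")

request = RunReportRequest(
    property="properties/123456789",  # placeholder GA4 property ID
    dimensions=[Dimension(name="country")],
    metrics=[Metric(name="activeUsers")],
    date_ranges=[DateRange(start_date="30daysAgo", end_date="today")],
)

for row in client.run_report(request).rows:
    print(row.dimension_values[0].value, row.metric_values[0].value)
```

This only works if the service account's email was also added to the Analytics property (step 7); the key alone grants no Analytics access.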

Workload Identity Federation

Here's how to set up Workload Identity Federation for authenticating a GitHub Actions workflow to access Google Cloud services:

1. Set Up a Workload Identity Pool

  1. Create a Workload Identity Pool:

    • Go to the Google Cloud Console.
    • Navigate to IAM & Admin > Workload Identity Pools.
    • Click Create Pool.
    • Provide a Name and Description.
    • Select AWS, OIDC, or SAML for the Identity provider type.
    • Click Create to create the pool.
  2. Create a Provider for GitHub:

    • In the Workload Identity Pool, click Add Provider.
    • Select OIDC for the provider type.
    • Enter the following details:
      • Issuer URL: https://token.actions.githubusercontent.com
      • Audience: You can leave this blank (it defaults to the issuer URL).
      • Attribute mapping: map google.subject to assertion.sub and attribute.repository to assertion.repository (the binding in the next step matches on attribute.repository).
    • Click Create.

2. Bind the Identity Pool to Your Service Account

  1. Create a Service Account Binding:
    • Go to IAM & Admin > Service Accounts.
    • Select the service account you want to use.
    • Click on the Permissions tab, then Add Member.
    • Add the following principal: principalSet://iam.googleapis.com/{your-pool-id}/attribute.repository/{your-github-repo}
      • Replace {your-pool-id} with the full Workload Identity Pool resource name, in the format projects/{project-number}/locations/global/workloadIdentityPools/{pool-id}.
      • Replace {your-github-repo} with your GitHub repository (e.g., your-org/your-repo).
    • Assign the Workload Identity User role (roles/iam.workloadIdentityUser), which lets the GitHub workflow impersonate the service account. (Read access to the Analytics data itself is granted inside Google Analytics by adding the service account to the property, as in step 7 above, not via IAM roles.)
    • Click Save.

3. Configure GitHub Actions to Use Workload Identity Federation

  1. Set Up GitHub Secrets:

    • Go to your GitHub repository settings.
    • Under Secrets and variables > Actions, add the following secrets:
      • GCP_PROJECT_ID: Your Google Cloud project ID.
      • GCP_WORKLOAD_IDENTITY_PROVIDER: The full name of the Workload Identity Provider, in the format projects/{project-number}/locations/global/workloadIdentityPools/{pool-id}/providers/{provider-id}.
      • GCP_SERVICE_ACCOUNT_EMAIL: The email of the service account you want to impersonate.
  2. Update Your GitHub Actions Workflow:

    • In your GitHub Actions workflow file (e.g., .github/workflows/your-workflow.yml), add steps to authenticate using Workload Identity Federation. Note that the job needs the id-token: write permission, or GitHub won't issue the OIDC token that the auth action exchanges:

```yaml
jobs:
  your-job:
    runs-on: ubuntu-latest
    permissions:
      contents: read
      id-token: write # required for Workload Identity Federation
    steps:
      - name: Checkout code
        uses: actions/checkout@v2

      - name: Authenticate to Google Cloud
        uses: google-github-actions/auth@v1
        with:
          workload_identity_provider: ${{ secrets.GCP_WORKLOAD_IDENTITY_PROVIDER }}
          service_account: ${{ secrets.GCP_SERVICE_ACCOUNT_EMAIL }}

      - name: Set up Google Cloud SDK
        uses: google-github-actions/setup-gcloud@v1
        with:
          project_id: ${{ secrets.GCP_PROJECT_ID }}

      - name: Run your scraper
        run: python your_script.py
```

The google-github-actions/auth action will authenticate using Workload Identity Federation without needing a key file. It will impersonate the specified service account and allow you to access Google Cloud services.
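
For reference, a minimal sketch of what your_script.py could look like under this setup, assuming the google-analytics-data Python package; the property ID is a placeholder. The auth step exports Application Default Credentials to the runner, so the client needs no key file:

```python
# Minimal sketch of a GA4 scraper run after google-github-actions/auth.
# The client picks up the Application Default Credentials that the auth
# step exported; "properties/123456789" is a placeholder property ID.
from google.analytics.data_v1beta import BetaAnalyticsDataClient
from google.analytics.data_v1beta.types import DateRange, Metric, RunReportRequest

client = BetaAnalyticsDataClient()  # no key file; uses federated credentials

request = RunReportRequest(
    property="properties/123456789",  # placeholder GA4 property ID
    metrics=[Metric(name="activeUsers"), Metric(name="newUsers")],
    date_ranges=[DateRange(start_date="2024-07-01", end_date="2024-08-01")],
)
response = client.run_report(request)
for row in response.rows:
    print([v.value for v in row.metric_values])
```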

vincerubinetti commented 2 months ago

Thanks for this. For now I'm doing a key file for simplicity. I tested pushing an inactive key file to the repo, and it was properly picked up by GitHub's secret protection and denied, so at least we are protected against that. I'll post more discussion/details soon.

vincerubinetti commented 2 months ago

More discussion:

It seems like workload identity federation is a bit of a hassle to set up locally (this is based on Reddit threads, so it could be wrong). For now, I definitely wanted something I could run and iterate on rapidly, so I opted for a key. At the moment, I've only hooked it up to the Greene Lab's Preprint Similarity Search analytics property. If the key is leaked for whatever reason, whoever has it will only get read access to that (very minimal and low-consequence) data.

As we get more properties hooked up to this, I think we'll definitely want to move to the more secure method, unless we can somehow add protections that make the key method safer.

But for now, while we're still iterating and figuring out what data we need, hopefully the key method is okay.

vincerubinetti commented 2 months ago

Discussion of v3 (universal) analytics:

As of July 1st 2024, v3 was shut down: https://support.google.com/analytics/answer/11583528?hl=en

v3 properties were removed from the UI, and (I didn't realize this) the v3 API is no longer accessible either. I tried to find some v3 property I could access to see if I could still export a backup of the data or something, but couldn't find one. As far as I can tell, if someone didn't export/back up their v3 data before this date, it is now lost (perhaps short of contacting Google directly and asking for it).

It seems safe to assume that if we wanted v3 data, it'd have to be manually sent to us from each DCC as CSV files or something. It might also be the case that this backup data would be inconsistent between DCCs, e.g. one property only has total user metrics saved and another property only has unique users.

As such, we might want to just say that this is not worth the effort.

seandavi commented 2 months ago

I believe there was a CFDE-wide push to migrate to v4 last year. We won't know for sure until we start to ask, but NIH seems to be expecting GA data, so I suspect that most have made the switch.

As for workload identity, that isn't a requirement; it was more for information. I agree with your approach of simply using and protecting a key file that has limited capabilities anyway.

vincerubinetti commented 2 months ago

Yeah, I'm trying to predict the consequences of that key leaking. A bad actor (let's say one of the three of us owners suddenly becomes evil) could 1) get read-only access to all the CFDE analytics data and 2) do a ton of API requests.

For 1, hopefully it would only be a short window of time before we realize, revoke the key, and generate a new one. I'd imagine a lot of DCCs wouldn't be happy about this data being leaked. This might be a stretch, but perhaps we could argue that, in the rare event that this happened, the analytics data should be public anyway? It should all be anonymized data, I believe? And are all the DCCs required to be fully transparent and open source by nature of being NIH funded?

For 2, I'd think/hope that, if Google noticed suspicious activity from a particular service account, they'd just limit/suspend that account, and it wouldn't have any impact on all the DCC properties that had granted access to that service account. So that would just be an "us" problem.

seandavi commented 2 months ago

@vincerubinetti I followed your excellent instructions and was able to add the service account email to my github.io site analytics. See what you find.

vincerubinetti commented 2 months ago

Here's the data the script was able to pull from your property:

```json
{
  "property": "properties/347971380",
  "propertyName": "seandavi.github.io - GA4",
  "coreProject": "",
  "overTime": {
    "dateRanges": [
      { "startDate": "2023-12-01", "endDate": "2024-01-01" },
      { "startDate": "2024-01-01", "endDate": "2024-02-01" },
      { "startDate": "2024-02-01", "endDate": "2024-03-01" },
      { "startDate": "2024-03-01", "endDate": "2024-04-01" },
      { "startDate": "2024-04-01", "endDate": "2024-05-01" },
      { "startDate": "2024-05-01", "endDate": "2024-06-01" },
      { "startDate": "2024-06-01", "endDate": "2024-07-01" },
      { "startDate": "2024-07-01", "endDate": "2024-08-01" },
      { "startDate": "2024-08-01", "endDate": "2024-09-01" }
    ],
    "metrics": [
      { "metric": "activeUsers", "values": [0, 50, 82, 7, 120, 18, 15, 159, 36] },
      { "metric": "newUsers", "values": [0, 50, 79, 5, 119, 16, 16, 159, 29] },
      { "metric": "engagedSessions", "values": [0, 39, 57, 7, 65, 15, 9, 399, 44] }
    ]
  },
  "topContinents": {
    "byActiveUsers": { "Americas": 435, "Europe": 23, "Asia": 11, "(not set)": 2, "Africa": 1 },
    "byNewUsers": { "Americas": 434, "Europe": 24, "Asia": 11, "(not set)": 2, "Africa": 1 },
    "byEngagedSessions": { "Americas": 615, "Europe": 13, "Asia": 4, "Africa": 1 }
  },
  "topCountries": {
    "byActiveUsers": { "United States": 432, "Singapore": 5, "France": 4, "India": 4, "United Kingdom": 4 },
    "byNewUsers": { "United States": 431, "Singapore": 5, "France": 4, "India": 4, "United Kingdom": 4 },
    "byEngagedSessions": { "United States": 614, "Italy": 4, "India": 3, "United Kingdom": 3, "France": 2 }
  },
  "topRegions": {
    "byActiveUsers": { "California": 130, "Virginia": 48, "Washington": 32, "Wyoming": 26, "Colorado": 24 },
    "byNewUsers": { "California": 124, "Virginia": 47, "Washington": 32, "Wyoming": 26, "Colorado": 23 },
    "byEngagedSessions": { "California": 391, "Maryland": 39, "Virginia": 34, "Colorado": 24, "North Carolina": 17 }
  },
  "topCities": {
    "byActiveUsers": { "Irvine": 79, "Ashburn": 27, "Cheyenne": 26, "(not set)": 25, "Moses Lake": 23 },
    "byNewUsers": { "Irvine": 78, "Ashburn": 27, "Cheyenne": 26, "(not set)": 23, "Moses Lake": 23 },
    "byEngagedSessions": { "Irvine": 333, "Los Angeles": 24, "Richmond": 24, "Ellicott City": 19, "Bethesda": 12 }
  },
  "topLanguages": {
    "byActiveUsers": { "English": 460, "Chinese": 3, "French": 2, "Spanish": 2, "Hungarian": 1 },
    "byNewUsers": { "English": 461, "Chinese": 3, "French": 2, "Spanish": 2, "Hungarian": 1 },
    "byEngagedSessions": { "English": 622, "Chinese": 4, "French": 2, "Hungarian": 1, "Italian": 1 }
  },
  "topDevices": {
    "byActiveUsers": { "desktop": 366, "mobile": 106 },
    "byNewUsers": { "desktop": 367, "mobile": 106 },
    "byEngagedSessions": { "desktop": 581, "mobile": 51 }
  },
  "topOSes": {
    "byActiveUsers": { "Windows": 237, "Macintosh": 121, "iOS": 92, "Android": 14, "Linux": 8 },
    "byNewUsers": { "Windows": 238, "Macintosh": 121, "iOS": 92, "Android": 14, "Linux": 8 },
    "byEngagedSessions": { "Windows": 377, "Macintosh": 198, "iOS": 42, "Android": 9, "Linux": 7 }
  }
}
```

The core project field is empty, meaning I couldn't find the core project associated with the property. Did you do that part of the readme? The property needs the core project number to show up in the dashboard UI and PDFs.

seandavi commented 2 months ago

I just wanted to test the authentication piece, but I've just added a Key Event with the CONNECT project. See how that flies.

seandavi commented 2 months ago

By the way, this "deliverable" of collecting GA4 data centrally is something that I suspect many NIH ICs and POs will be interested to hear about.

vincerubinetti commented 2 months ago

Just pushed a re-run update to main. Check it out here: https://cfde-eval.netlify.app/core-project/U54OD036472

seandavi commented 2 months ago

I'd call this one done, closed by your last PR, #22.

cgreene commented 2 months ago

I think this is going to be exciting to our program folks at NIH!