Closed by seandavi 2 months ago
Assuming you meant this for our dashboard itself, and not any webapps associated with CF projects. That'd be on them to add their own analytics, though this issue can track adding a submission process and ingest of their analytics tokens.
Look into credential technicalities. Do we need to get a token for every analytics account? Can each analytics user just invite us (a single account) to view their analytics, thus granting our API token access to their data? Could we federate identity with the GitHub repo?
(HT to ChatGPT, but this jibes with what I do in practice, too.)
Here's how to set up Workload Identity Federation to authenticate a GitHub Actions workflow to Google Cloud services:
1. **Create a Workload Identity Pool.**
2. **Create a Provider for GitHub**, using `https://token.actions.githubusercontent.com` as the issuer (OIDC) URL. Then grant your service account access to identities matching:
   `principalSet://iam.googleapis.com/{your-pool-id}/attribute.repository/{your-github-repo}`
   replacing `{your-pool-id}` with the Workload Identity Pool ID and `{your-github-repo}` with your GitHub repository (e.g. `your-org/your-repo`).
3. **Set up GitHub Secrets:**
   - `GCP_PROJECT_ID`: your Google Cloud project ID.
   - `GCP_WORKLOAD_IDENTITY_PROVIDER`: the full name of the Workload Identity Provider, in the format `projects/{project-number}/locations/global/workloadIdentityPools/{pool-id}/providers/{provider-id}`.
   - `GCP_SERVICE_ACCOUNT_EMAIL`: the email of the service account you want to impersonate.
4. **Update your GitHub Actions workflow.** In your workflow file (e.g. `.github/workflows/your-workflow.yml`), add steps to authenticate using Workload Identity Federation:

```yaml
jobs:
  your-job:
    runs-on: ubuntu-latest
    permissions:
      contents: read
      id-token: write # required to mint the OIDC token for federation
    steps:
      - name: Checkout code
        uses: actions/checkout@v2

      - name: Authenticate to Google Cloud
        uses: google-github-actions/auth@v1
        with:
          workload_identity_provider: ${{ secrets.GCP_WORKLOAD_IDENTITY_PROVIDER }}
          service_account: ${{ secrets.GCP_SERVICE_ACCOUNT_EMAIL }}

      - name: Set up Google Cloud SDK
        uses: google-github-actions/setup-gcloud@v1
        with:
          project_id: ${{ secrets.GCP_PROJECT_ID }}

      - name: Run your scraper
        run: python your_script.py
```
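For reference, the `GCP_WORKLOAD_IDENTITY_PROVIDER` secret can be assembled from its components. A small sketch of building and sanity-checking that resource name (the project number, pool ID, and provider ID below are placeholders, not our real values):

```python
import re

def provider_resource_name(project_number: str, pool_id: str, provider_id: str) -> str:
    """Build the full Workload Identity Provider resource name expected by google-github-actions/auth."""
    return (
        f"projects/{project_number}/locations/global/"
        f"workloadIdentityPools/{pool_id}/providers/{provider_id}"
    )

def is_valid_provider_name(name: str) -> bool:
    """Loose shape check: numeric project number, global location, simple IDs."""
    pattern = (
        r"projects/\d+/locations/global/"
        r"workloadIdentityPools/[\w-]+/providers/[\w-]+"
    )
    return re.fullmatch(pattern, name) is not None

# Placeholder values:
name = provider_resource_name("123456789", "github-pool", "github-provider")
print(name)
print(is_valid_provider_name(name))  # → True
```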
The `google-github-actions/auth` action will authenticate using Workload Identity Federation without needing a key file. It impersonates the specified service account, allowing you to access Google Cloud services.
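Once authenticated, the scraper script can call the GA4 Data API. This is not necessarily what our script does, just a sketch of the flattening step typically needed after a `runReport` call; the client itself is omitted and the data below is made up, shaped like an API response:

```python
def flatten_report(dimension_headers, metric_headers, rows):
    """Flatten a GA4 runReport-style response into a list of dicts.

    dimension_headers/metric_headers are lists of names; rows is a list of
    (dimension_values, metric_values) pairs, mirroring the response shape.
    """
    records = []
    for dim_values, met_values in rows:
        record = dict(zip(dimension_headers, dim_values))
        record.update(zip(metric_headers, [int(v) for v in met_values]))
        records.append(record)
    return records

# Made-up example data:
records = flatten_report(
    ["country"],
    ["activeUsers"],
    [(["United States"], ["123"]), (["Canada"], ["45"])],
)
print(records)
# → [{'country': 'United States', 'activeUsers': 123}, {'country': 'Canada', 'activeUsers': 45}]
```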
Thanks for this. For now I'm using a key file for simplicity. I tested pushing an inactive key file to the repo, and it was properly picked up by GitHub's secret scanning and blocked, so at least we're protected against that. I'll post more discussion/details soon.
More discussion:
It seems like Workload Identity Federation is a bit of a hassle to set up locally (this is based on Reddit threads, so it could be wrong). For now, I definitely wanted something I could run and iterate on rapidly, so I opted for a key. At the moment, I've only hooked it up to the Greene Lab's Preprint Similarity Search analytics property. If the key is leaked for whatever reason, they'll only have read access to that (very minimal and low-consequence) data.
As we get more properties hooked up to this, I think we'll definitely want to move to the more secure method, unless we can somehow add protections that make the key method safer.
But for now, while we're still iterating and figuring out what data we need, hopefully the key method is okay.
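Since we're going the key-file route for now, here's a minimal sketch of sanity-checking a service-account key before using it. The field names come from the standard JSON key format; the example key below is fake:

```python
import json

REQUIRED_FIELDS = {"type", "project_id", "private_key", "client_email"}

def check_service_account_key(key_json: str) -> str:
    """Parse a service-account key and return its client email, or raise with a clear message."""
    key = json.loads(key_json)
    missing = REQUIRED_FIELDS - key.keys()
    if missing:
        raise ValueError(f"key file missing fields: {sorted(missing)}")
    if key["type"] != "service_account":
        raise ValueError(f"expected a service_account key, got type={key['type']!r}")
    return key["client_email"]

# Fake, inactive example key (never commit a real one):
fake_key = json.dumps({
    "type": "service_account",
    "project_id": "example-project",
    "private_key": "-----BEGIN PRIVATE KEY-----\n...\n-----END PRIVATE KEY-----\n",
    "client_email": "scraper@example-project.iam.gserviceaccount.com",
})
print(check_service_account_key(fake_key))
# → scraper@example-project.iam.gserviceaccount.com
```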
Discussion of v3 (Universal) Analytics:
As of July 1, 2024, v3 was shut down: https://support.google.com/analytics/answer/11583528?hl=en
v3 properties were removed from the UI, and (I didn't realize this) the v3 API is no longer accessible either. I tried to find a v3 property I could access, to see if I could still export a backup of its data, but couldn't find one. As far as I can tell, if someone didn't export/back up their v3 data before that date, it is now lost (short of contacting Google directly and asking for it).
It seems safe to assume that, if we wanted v3 data, it would have to be sent to us manually from each DCC as CSV files or the like. This backup data might also be inconsistent between DCCs, e.g. one property might only have total user metrics saved while another only has unique users.
As such, we might want to just say that this is not worth the effort.
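If we ever did accept v3 backups, the inconsistency problem might look like this in practice. A hypothetical sketch of mapping differently-named metric columns from per-DCC CSV exports onto one schema; all column names and values here are made up:

```python
# Hypothetical per-DCC column aliases mapping onto one common "users" metric:
METRIC_ALIASES = {
    "Total Users": "users",
    "Unique Users": "users",
    "ga:users": "users",
}

def harmonize_row(row: dict) -> dict:
    """Rename known metric aliases to the common schema, dropping unknown columns."""
    out = {}
    for column, value in row.items():
        if column in METRIC_ALIASES:
            out[METRIC_ALIASES[column]] = int(value)
        elif column.lower() in ("date", "month"):
            out["period"] = value
    return out

print(harmonize_row({"date": "2023-06", "Total Users": "1024"}))
# → {'period': '2023-06', 'users': 1024}
```

Even this toy version shows the catch: "Total Users" and "Unique Users" collapse to the same field, so the harmonized numbers wouldn't mean the same thing across DCCs, which is part of why this may not be worth the effort.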
I believe there was a CFDE-wide push to migrate to v4 last year. We won't know for sure until we start to ask, but NIH seems to be expecting GA data, so I suspect that most have made the switch.
As for workload identity, that isn't a requirement. More for information. I agree with your approach of simply using and protecting a key file that has limited capabilities anyway.
Yeah, I'm trying to predict the consequences of that key leaking. A bad actor (let's say one of the three of us owners suddenly becomes evil) could 1) get read-only access to all the CFDE analytics data and 2) make a ton of API requests.
For 1, this would hopefully just be a short amount of time before we realize, revoke the key, and generate a new one. I'd imagine a lot of DCCs wouldn't be happy about this data being leaked. This might be a stretch, but perhaps we could argue that, in the rare event that this happened, the analytics data should be public anyway? It should all be anonymized data I believe? And are all the DCCs required to be fully transparent and open source by nature of being NIH funded?
For 2, I'd think/hope that, if Google noticed suspicious activity from a particular service account, they'd just limit/suspend that account, and it wouldn't have any impact on all the DCC properties that had granted access to that service account. So that would just be an "us" problem.
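On point 2, we could also cap our own request volume client-side, so a leaked key can't be used through our code paths to hammer the API. A toy sketch; the limit value is made up:

```python
class RequestBudget:
    """Simple client-side cap: allow at most `limit` requests per run."""

    def __init__(self, limit: int):
        self.limit = limit
        self.used = 0

    def allow(self) -> bool:
        """Return True and consume one unit of budget, or False if exhausted."""
        if self.used >= self.limit:
            return False
        self.used += 1
        return True

budget = RequestBudget(limit=3)
results = [budget.allow() for _ in range(5)]
print(results)  # → [True, True, True, False, False]
```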
@vincerubinetti I followed your excellent instructions and was able to add the service account email to my github.io site analytics. See what you find.
Here's the data the script was able to pull from your property:
The core project field is empty, meaning I couldn't find the project associated with it. Did you do that part of the readme? It'll need the project number to show up in the dashboard UI and PDFs.
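For the core project field, a small sketch of validating the project-number format the dashboard expects. The pattern is my reading of award numbers like U54OD036472 (activity code, IC code, 6-digit serial), so treat it as an assumption:

```python
import re

# Activity code (letter + 2 digits), IC code (2 letters), 6-digit serial.
# Pattern is an assumption based on examples like U54OD036472.
CORE_PROJECT_PATTERN = re.compile(r"[A-Z]\d{2}[A-Z]{2}\d{6}")

def is_core_project_number(value: str) -> bool:
    """True if the string looks like an NIH core project number."""
    return CORE_PROJECT_PATTERN.fullmatch(value) is not None

print(is_core_project_number("U54OD036472"))  # → True
print(is_core_project_number(""))             # → False
```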
I just wanted to test the authentication piece, but I've just added a Key Event with the CONNECT project. See how that flies.
By the way, this "deliverable" of collecting GA4 data centrally is something that I suspect many NIH ICs and POs will be interested to hear about.
Just pushed a re-run update to main. Check it out here: https://cfde-eval.netlify.app/core-project/U54OD036472
I'd call this one done, closed by your last PR, #22.
I think this is going to be exciting to our program folks at NIH!
Just so we capture what we can. Obviously nothing's ready for primetime just yet, but....