populationgenomics / seqr

web-based analysis tool for rare disease genomics
GNU Affero General Public License v3.0

Staging environment - IGV API calls returning 403 forbidden #227

Closed: EddieLF closed this issue 3 months ago

EddieLF commented 3 months ago

Describe the bug

When viewing CRAMs through IGV.js in seqr, the IGV API streams the CRAM data from the bucket using a gcloud auth access token. For the seqr-staging deployment, the service accounts should have access to data stored in the test buckets. However, when we try to access CRAMs in the test buckets through IGV.js, we see an "Access forbidden" error with a 403 status code.
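For context, here is a minimal sketch of the request path involved. The helper names are illustrative, not seqr's actual API: the endpoint mints an OAuth access token from the VM's active gcloud credentials and forwards a ranged read to the GCS HTTP endpoint. A successful ranged read returns 206 Partial Content; the 403 below surfaces from this GCS request.

# Hypothetical sketch of an IGV proxy read; stream_gs_object and
# get_access_token are illustrative names, not seqr's real functions.
import subprocess
import requests

def get_access_token():
    # Mint an OAuth2 token from the VM's active gcloud credentials.
    return subprocess.check_output(
        ['gcloud', 'auth', 'print-access-token'], text=True
    ).strip()

def stream_gs_object(gs_path, byte_range=None):
    # gs://bucket/obj -> https://storage.googleapis.com/bucket/obj
    url = gs_path.replace('gs://', 'https://storage.googleapis.com/', 1)
    headers = {'Authorization': f'Bearer {get_access_token()}'}
    if byte_range:
        # A Range header makes GCS answer 206 Partial Content on success.
        headers['Range'] = 'bytes={}-{}'.format(*byte_range)
    resp = requests.get(url, headers=headers, stream=True)
    resp.raise_for_status()  # the 403 Forbidden in this issue surfaces here
    return resp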

From the GCP compute engine logs:

"timestamp": "2024-07-03 02:30:10,174", 
"severity": "WARNING", 
"httpRequest": {
    "requestMethod": "GET", 
    "requestUrl": "http://seqr-staging.populationgenomics.org.au/api/project/Project_GUID/igv_track/gs://cpg-dataset-test/cram/CPGxxxxx.cram?someRandomSeed=0.ndrkvah12bi", 
    "status": 403, 
    "responseSize": null, 
    ...
    "referer": "https://seqr-staging.populationgenomics.org.au/project/Project_GUID/project_page", 
    "protocol": "HTTP/1.1"
}, 
"user": "edward.formaini@populationgenomics.org.au"
...
"message": "Forbidden: /api/project/Project_GUID/igv_track/gs://cpg-dataset-test/cram/CPGxxxxx.cram"

Interestingly, the response preview in the developer console indicates that the seqr-prod service account is in use here.

Access denied.
seqr-prod@seqr-308602.iam.gserviceaccount.com does not have storage.objects.get access to the Google Cloud Storage object. Permission 'storage.objects.get' denied on resource (or it may not exist).

Link to page(s) where bug is occurring

Try viewing the reads for this family. The error does not appear immediately; it appears once you select a region of the genome to view and zoom in far enough that the API call is made to stream the CRAM. [screenshot attached]

Scope of the bug

This seems to impact all projects in the staging deployment when we try to load reads from the test buckets. However, when loading reads from the main bucket in the staging deployment, the issue does not occur. It seems that because the staging environment is using the seqr-prod service account, it can access files in the main buckets but not the test buckets. See this family, which has a read file loaded from the main bucket.

In the production deployment of seqr, this error is not seen. The requests are formulated in the same way, but return success with partial content (206), e.g.:

"httpRequest": {
    "requestMethod": "GET", 
    "requestUrl": "http://seqr.populationgenomics.org.au/api/project/Project_GUID/igv_track/gs://cpg-dataset-main/cram/CPGxxxxx.cram?someRandomSeed=0.ndrkvah12bi", 
    "status": 206, 
    ...
}, 
EddieLF commented 3 months ago

Update on this:

The behaviour has become even stranger. Revisiting this an hour after drafting the initial issue, I went back to the staging deployment and found that the CRAM from main, which was previously working, is now returning 403 Forbidden.

Meanwhile, the CRAMs from test, which were previously returning 403 Forbidden, are now returning 206 Partial Content!

Looking at the response preview in the developer console, the GET requests are now using the seqr-staging service account instead of the seqr-prod account?

Access denied.
seqr-staging@seqr-308602.iam.gserviceaccount.com does not have storage.objects.get access to the Google Cloud Storage object. Permission 'storage.objects.get' denied on resource (or it may not exist).

Reviewing the compute engine logs, the staging VM did NOT restart at any point between these two interactions. Yet it seems as though the VM sometimes uses the seqr-prod service account and at other times the seqr-staging account, leading to inconsistent access to the buckets depending on whether the data is in test or main.

illusional commented 3 months ago

@EddieLF Oh my god, I just found out what's happening. It's caching the access token in Redis, and we share the Redis instance across the prod and staging deployments. The cache key is the same across instances, hence the token is cached for all copies of seqr that share that same Redis instance.

https://github.com/populationgenomics/seqr/blob/12024b4f41e021a3b26208a745483b29855750a9/seqr/views/apis/igv_api.py#L237-L245
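For anyone following along, the shape of the problem is roughly this. This is a paraphrased sketch of the pattern at the linked lines, not the verbatim source:

# Sketch only: the token is looked up in Redis under a fixed key before
# gcloud is consulted, so whichever deployment caches a token first wins
# for every seqr copy sharing this Redis instance.
import subprocess

GS_STORAGE_ACCESS_CACHE_KEY = 'gs_storage_access_cache_entry'  # identical in prod and staging

def get_gs_access_token(redis_client):
    cached = redis_client.get(GS_STORAGE_ACCESS_CACHE_KEY)
    if cached is not None:
        return cached.decode()  # may be the *other* environment's token
    # Fallback: mint a token with this VM's own service account...
    token = subprocess.check_output(
        ['gcloud', 'auth', 'print-access-token'], text=True
    ).strip()
    # ...and cache it where every deployment sharing this Redis will find it.
    redis_client.set(GS_STORAGE_ACCESS_CACHE_KEY, token, ex=3600)
    return token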

Options: either set up a separate Redis instance for staging, or namespace the cache key per deployment (e.g. prefix it with the DEPLOYMENT_TYPE setting, as discussed below).

This is not where I was expecting the problem to be. I checked persistent disk storage (because disk names matched across instance templates), recreated instance templates, and deployed new instance groups, all to no avail...

EddieLF commented 3 months ago

@illusional nice job figuring this out so quick! 💯

I had some idea that it could have been due to the load balancing, but nothing made sense considering the GSA_KEY was fundamentally different in the dev and prod environments. However, the fact that Redis is checked for the access token first, before falling back to the gcloud auth bash command, explains why that difference did not matter.

I'm not too knowledgeable about Redis, so I'm not sure what's best here. If setting up another Redis instance is easy enough, then that could be fine. As you mention, we could also add a prefix to each key based on the DEPLOYMENT_TYPE setting, which will be 'prod' or 'dev' for these environments.

So we might want to redefine the global variable in igv_api.py as:

from settings import DEPLOYMENT_TYPE

GS_STORAGE_ACCESS_CACHE_KEY = DEPLOYMENT_TYPE + '_gs_storage_access_cache_entry'

Is it easy enough to update the cached keys in Redis so that the prod and dev SA tokens are stored separately, accessible under the keys 'prod_gs_storage_access_cache_entry' and 'dev_gs_storage_access_cache_entry'?
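If it helps, here's a quick illustration (hypothetical connection details and token values) of the two namespaced keys coexisting in the shared Redis without clobbering each other:

# Illustrative only: once keys are namespaced by deployment, prod and dev
# tokens live side by side in the same Redis instance.
import redis

r = redis.Redis(host='localhost', port=6379)  # stand-in for the shared instance

r.set('prod_gs_storage_access_cache_entry', 'ya29.prod-token', ex=3600)
r.set('dev_gs_storage_access_cache_entry', 'ya29.dev-token', ex=3600)

assert r.get('prod_gs_storage_access_cache_entry') == b'ya29.prod-token'
assert r.get('dev_gs_storage_access_cache_entry') == b'ya29.dev-token'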