ncihtan / htan-portal

The HTAN Data Portal
https://humantumoratlas.org
9 stars 11 forks source link

Use BigQuery tables to pull releasable entity lists #534

Open clarisse-lau opened 1 year ago

clarisse-lau commented 1 year ago

Currently the portal script get_syn_data.py takes a number of json files as input to indicate entity release status. I have pulled information from all of the below json files into two BigQuery tables: htan-dcc.released.entities and htan-dcc.released.metadata .

These tables will be used as a single source of truth for file-level release status for other use cases (including loading BQ/ISB-CGC metadata tables, and CDS metadata template generation), and I am hoping the portal script can also be updated to use these tables rather than a growing number of lists.

The tables are intended to replace the following json files:

It also pulls released IDC entities from data/idc-imaging-assets.json

Going forward, data release scripts will append synapse IDs to these tables (e.g. adding rows with Data_Release = 'Release 4.0') rather than generating additional release4_include.json, etc files.

Screenshot 2023-08-10 at 9 53 03 AM Screenshot 2023-08-10 at 9 53 25 AM

clarisse-lau commented 1 year ago

Below are examples of how the tables can be directly queried using Python. I'm happy to help with configuration, minting a service account key file, etc as needed - authentication instructions can be found on Google Cloud Docs

from google.cloud import bigquery
client = bigquery.Client()

include_release_ids = list(client.query("""
   SELECT entityId FROM `htan-dcc.released.entities` 
   WHERE Data_Release IS NOT NULL
""").result().to_dataframe()['entityId'])

# release3_synapse_metadata.json # table also contains manifestVersion
# SELECT Manifest_Id FROM `htan-dcc.released.metadata` 
# WHERE Data_Release = 'Release 3.0'

# dbgap_release_all.json 
# SELECT entityId FROM `htan-dcc.released.entities` 
# WHERE CDS_Release IS NOT NULL
onursumer commented 1 year ago

It's safe to remove release#_xx.json from the repo now, but looks like we still rely on data/dbgap_release_all.json, dbgap_img_release2.json, and idc-imaging-assets.json.

https://github.com/ncihtan/htan-portal/blob/97fbb2d50ec85c2a634991737e441643605ea608/data/processSynapseJSON.ts#L27-L29

Currently we only get released.entities_v4 and released.metadata_v4 from big query. I think we will need to define separate queries to get the remaining data if we want to remove those file from the repo as well.

clarisse-lau commented 1 year ago

Thanks @onursumer and @inodb for switching us over to the BQ tables for HTAN!

Just to clarify- information from dbgap_release1.json, dbgap_img_release2.json, dbgap_release_all.json files are included in the released.entities ( = released.entities_v4) table.

As shown in one of the example queries above, the CDS_Release column indicates a file's release on CDS/dbGaP.

# dbgap_release_all.json 
# SELECT entityId FROM `htan-dcc.released.entities` 
# WHERE CDS_Release IS NOT NULL

But since you're pulling the full table into entities_v4.csv, you can probably use this file rather than needing to define another query?