Open clarisse-lau opened 1 year ago
Below are examples of how the tables can be directly queried using Python. I'm happy to help with configuration, minting a service account key file, etc as needed - authentication instructions can be found on Google Cloud Docs
from google.cloud import bigquery
client = bigquery.Client()
include_release_ids = list(client.query("""
SELECT entityId FROM `htan-dcc.released.entities`
WHERE Data_Release IS NOT NULL
""").result().to_dataframe()['entityId'])
# release3_synapse_metadata.json # table also contains manifestVersion
# SELECT Manifest_Id FROM `htan-dcc.released.metadata`
# WHERE Data_Release = 'Release 3.0'
# dbgap_release_all.json
# SELECT entityId FROM `htan-dcc.released.entities`
# WHERE CDS_Release IS NOT NULL
It's safe to remove release#_xx.json
from the repo now, but looks like we still rely on data/dbgap_release_all.json
, dbgap_img_release2.json
, and idc-imaging-assets.json
.
Currently we only get released.entities_v4
and released.metadata_v4
from big query. I think we will need to define separate queries to get the remaining data if we want to remove those file from the repo as well.
Thanks @onursumer and @inodb for switching us over to the BQ tables for HTAN!
Just to clarify- information from dbgap_release1.json, dbgap_img_release2.json, dbgap_release_all.json files are included in the released.entities
( = released.entities_v4
) table.
As shown in one of the example queries above, the CDS_Release column indicates a file's release on CDS/dbGaP.
# dbgap_release_all.json
# SELECT entityId FROM `htan-dcc.released.entities`
# WHERE CDS_Release IS NOT NULL
But since you're pulling the full table into entities_v4.csv, you can probably use this file rather than needing to define another query?
Currently the portal script get_syn_data.py takes a number of json files as input to indicate entity release status. I have pulled information from all of the below json files into two BigQuery tables:
htan-dcc.released.entities
andhtan-dcc.released.metadata
.These tables will be used as a single source of truth for file-level release status for other use cases (including loading BQ/ISB-CGC metadata tables, and CDS metadata template generation), and I am hoping the portal script can also be updated to use these tables rather than a growing number of lists.
The tables are intended to replace the following json files:
It also pulls released IDC entities from
data/idc-imaging-assets.json
Going forward, data release scripts will append synapse IDs to these tables (e.g. adding rows with Data_Release = 'Release 4.0') rather than generating additional release4_include.json, etc files.