Closed junjun-zhang closed 6 years ago
Query to extract all specimen IDs from donor docs:
curl -XGET "http://192.168.0.189:9200/icgc26-27/_search" -H 'Content-Type: application/json' -d'
{
  "query": {
    "bool": {
      "must": [
        {
          "match": {
            "_type": "donor"
          }
        },
        {
          "match": {
            "_summary.repository": "CGHub"
          }
        }
      ]
    }
  },
  "_source": "specimen.specimen_id"
}' | jq '.hits | .hits | .[]._source.specimen | .[].specimen_id?'
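For anyone post-processing the response in Python rather than with jq, the same extraction could be sketched as below. This is only an illustration: the function name `extract_specimen_ids` and the sample IDs are made up, and the response shape is assumed from the jq filter above.

```python
def extract_specimen_ids(response):
    """Collect specimen_id values from an Elasticsearch search response."""
    ids = []
    for hit in response.get("hits", {}).get("hits", []):
        for specimen in hit.get("_source", {}).get("specimen", []):
            # Skip entries without a specimen_id (jq would emit null for these).
            if isinstance(specimen, dict) and "specimen_id" in specimen:
                ids.append(specimen["specimen_id"])
    return ids


# Hypothetical response, shaped like the output of the query above.
sample_response = {
    "hits": {
        "hits": [
            {"_source": {"specimen": [{"specimen_id": "SP0001"},
                                      {"specimen_id": "SP0002"}]}},
            {"_source": {"specimen": [{"specimen_id": "SP0003"}]}},
        ]
    }
}

print(extract_specimen_ids(sample_response))
```

Unlike the jq output, the values here are plain strings with no surrounding quotes, so no stripping is needed afterwards.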
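import requests

url = 'https://api.gdc.cancer.gov/v0/legacy/cases/ids?query='
with open('filtered-ids.txt') as f:
    ids = f.readlines()
# Open in text mode ('w'), since the IDs are written as str
with open('gdc-ids.txt', 'w') as f:
    # Look up the GDC case ID for each of the first 10 specimen IDs
    for line in ids[:10]:
        # Strip the surrounding quotes and newline from each line
        response = requests.get(url + line.strip('"\n'))
        f.write("%s\n" % response.json()["data"]["hits"][0]["id"])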
import requests

api_url = 'https://api.gdc.cancer.gov/v0/legacy/cases/ids?query='
# Legacy archive search URL; {0} is replaced with the GDC case ID
portal_url = "https://portal.gdc.cancer.gov/legacy-archive/search/f?filters=%7B%22op%22:%22and%22,%22content%22:%5B%7B%22op%22:%22in%22,%22content%22:%7B%22field%22:%22cases.case_id%22,%22value%22:%5B%22{0}%22%5D%7D%7D%5D%7D"
limit = 10
with open('filtered-ids.txt') as f:
    ids = f.readlines()
# Open in text mode ('w'), since the rows are written as str
with open('gdc-ids.txt', 'w') as f:
    for line in ids[:limit]:
        # Strip the surrounding quotes and newline from each line
        specimen_id = line.strip('"\n')
        # Take the first matching GDC case for this specimen ID
        case_id = requests.get(api_url + specimen_id).json()["data"]["hits"][0]["id"]
        f.write("%s, %s\n" % (specimen_id, portal_url.format(case_id)))
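Both scripts index `["data"]["hits"][0]` directly, which raises `IndexError` when a lookup returns no hits. A defensive variant could look like the sketch below; the helper name `first_case_id` and the sample payloads are hypothetical, and the `data.hits[].id` layout is taken only from how the scripts above read the response.

```python
def first_case_id(payload):
    """Return the id of the first hit, or None if the lookup found nothing."""
    hits = payload.get("data", {}).get("hits", [])
    if not hits:
        return None
    return hits[0].get("id")


# Hypothetical payloads mirroring the structure the scripts above rely on.
found = {"data": {"hits": [{"id": "b5e37b9b-6264-4cc8-9b44-1d8ac0692d6c"}]}}
missing = {"data": {"hits": []}}

print(first_case_id(found))
print(first_case_id(missing))
```

A caller could then skip or log the IDs that come back as `None` instead of aborting the whole run.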
Case ID is Donor Id ...
GET icgc26-27/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "match": {
            "_type": "donor"
          }
        },
        {
          "match": {
            "_summary.repository": "CGHub"
          }
        }
      ]
    }
  },
  "_source": "donor_id"
}
The final solution will be a front-end update only. The URL format for the GDC Legacy portal is as follows:
https://portal.gdc.cancer.gov/legacy-archive/search/f?filters={"op":"and","content":[{"op":"in","content":{"field":"cases.submitter_id","value":["###_DONOR_ID__###"]}}]}
Discussion with @junjun-zhang :
When creating the link, always apply the criteria data_format in (BAM, FASTQ) and data_category is (Raw Sequencing Data). For each link, append the case ID (submitter ID) and the analysis categories linked to that sample. This will cause some overlap between samples, but that is OK.
Example: https://dcc.icgc.org/donors/DO37425 sample TCGA-EE-A2M5-06A-12D-A18Y-02 leads to:
https://portal.gdc.cancer.gov/legacy-archive/search/f?filters={"op":"and","content":[{"op":"in","content":{"field":"files.data_format","value":["BAM","FASTQ"]}},{"op":"in","content":{"field":"files.data_category","value":["Raw sequencing data"]}},{"op":"in","content":{"field":"cases.case_id","value":["b5e37b9b-6264-4cc8-9b44-1d8ac0692d6c"]}},{"op":"in","content":{"field":"files.experimental_strategy","value":["WGS"]}}]}
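As a rough illustration of how such a link could be assembled, here is a sketch that builds the same filter structure and percent-encodes it. The actual fix is a front-end change, so this Python version is only a model of the logic; the names `in_filter` and `build_gdc_link` are hypothetical, and the field names and values are copied from the example URL above.

```python
import json
from urllib.parse import quote

PORTAL_SEARCH = "https://portal.gdc.cancer.gov/legacy-archive/search/f?filters="


def in_filter(field, values):
    """Build one {"op": "in"} clause for a field and its allowed values."""
    return {"op": "in", "content": {"field": field, "value": list(values)}}


def build_gdc_link(case_id, strategies):
    """Combine the fixed criteria with the per-sample case ID and strategies."""
    filters = {
        "op": "and",
        "content": [
            # Fixed criteria applied to every link
            in_filter("files.data_format", ["BAM", "FASTQ"]),
            in_filter("files.data_category", ["Raw sequencing data"]),
            # Per-sample criteria
            in_filter("cases.case_id", [case_id]),
            in_filter("files.experimental_strategy", strategies),
        ],
    }
    # Compact JSON, then percent-encode everything outside the unreserved set
    compact = json.dumps(filters, separators=(",", ":"))
    return PORTAL_SEARCH + quote(compact, safe="")


link = build_gdc_link("b5e37b9b-6264-4cc8-9b44-1d8ac0692d6c", ["WGS"])
print(link)
```

The sketch encodes every reserved character, whereas the URL in the example leaves the braces and quotes raw. Both forms should decode to the same filter JSON.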
On staging: https://staging.dcc.icgc.org/donors/DO37425
@lepsalex Junjun and I both checked it! No issues with testing - we are good to go ahead.
At the bottom of the Donor page we list all samples and the sequence data associated with them. As CGHub was shut down long ago, we need to point to the data repository that replaces it, i.e., GDC.
To link to GDC, we will use TCGA's aliquot UUID, which is preserved in GDC and can be used to search for the associated sequence files. The aliquot UUIDs are included in the TCGA submission in the ssm_m metadata file.