Replace 'Sequence Repo' label for CGHub by URL link to GDC legacy portal

icgc-dcc / dcc-portal

Data portal for exploring and accessing data

https://dcc.icgc.org/

Replace 'Sequence Repo' label for CGHub by URL link to GDC legacy portal #498

Closed junjun-zhang closed 6 years ago

junjun-zhang commented 6 years ago

At the bottom of the Donor page we list all samples and the sequence data associated with them. As CGHub was shut down long ago, we need to point to the data repository that replaced it, i.e., GDC.

To link to GDC, we will use TCGA's aliquot UUID, which is preserved in GDC and can be used to search for the associated sequence files. The aliquot UUIDs are included in the TCGA submission, in the ssm_m metadata file.

lepsalex commented 6 years ago

Query to extract all specimen IDs from donor docs:

curl -XGET "http://192.168.0.189:9200/icgc26-27/_search" -H 'Content-Type: application/json' -d'
{
  "query": {
    "bool": {
      "must": [
        {
          "match": {
            "_type": "donor"
          }
        },
        {
          "match": {
            "_summary.repository": "CGHub"
          }
        }
      ]
    }
  },
  "_source": "specimen.specimen_id"
}'

lepsalex commented 6 years ago

jq '.hits | .hits | .[]._source.specimen | .[].specimen_id?'
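For anyone without jq, roughly the same extraction can be sketched in Python. This is only a sketch, assuming the response has the usual Elasticsearch shape (hits.hits[]._source.specimen[]); the function name `extract_specimen_ids` is hypothetical, and the sample response and its IDs are made up for illustration.

```python
def extract_specimen_ids(response):
    """Collect specimen_id values from an Elasticsearch search response.

    Roughly mirrors the jq filter: walk hits.hits[], then each
    _source.specimen[] entry. Unlike jq, entries without a
    specimen_id are skipped rather than emitted as null.
    """
    ids = []
    for hit in response.get("hits", {}).get("hits", []):
        for specimen in hit.get("_source", {}).get("specimen", []):
            specimen_id = specimen.get("specimen_id")
            if specimen_id is not None:
                ids.append(specimen_id)
    return ids


# Made-up response for illustration only; real IDs come from the index.
sample_response = {
    "hits": {
        "hits": [
            {"_source": {"specimen": [{"specimen_id": "SP-A"}, {"specimen_id": "SP-B"}]}},
            {"_source": {"specimen": [{"specimen_id": "SP-C"}, {}]}},
        ]
    }
}

print(extract_specimen_ids(sample_response))
```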

lepsalex commented 6 years ago

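import requests

# GDC legacy API lookup; the ID to resolve is appended to this URL
url = 'https://api.gdc.cancer.gov/v0/legacy/cases/ids?query='

with open('filtered-ids.txt') as f:
    ids = f.readlines()

# Open in text mode ('w'): we write str, not bytes
with open('gdc-ids.txt', 'w') as f:
    # Trial run on the first 10 IDs only
    for x in ids[:10]:
        # Strip the surrounding quotes and newline left by the jq output
        hits = requests.get(url + x.strip('"\n')).json()["data"]["hits"]
        f.write("%s\n" % hits[0]["id"])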

lepsalex commented 6 years ago

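import requests

# GDC legacy API lookup; the ID to resolve is appended to this URL
api_url = 'https://api.gdc.cancer.gov/v0/legacy/cases/ids?query='
# Legacy portal search URL; {0} is replaced by the GDC case ID
portal_url = "https://portal.gdc.cancer.gov/legacy-archive/search/f?filters=%7B%22op%22:%22and%22,%22content%22:%5B%7B%22op%22:%22in%22,%22content%22:%7B%22field%22:%22cases.case_id%22,%22value%22:%5B%22{0}%22%5D%7D%7D%5D%7D"
limit = 10

with open('filtered-ids.txt') as f:
    ids = f.readlines()

# Open in text mode ('w'): we write str, not bytes
with open('gdc-ids.txt', 'w') as f:
    for line in ids[:limit]:
        # Strip the surrounding quotes and newline left by the jq output
        specimen_id = line.strip('"\n')
        case_id = requests.get(api_url + specimen_id).json()["data"]["hits"][0]["id"]
        # One "<specimen id>, <portal link>" pair per line
        f.write("%s, %s\n" % (specimen_id, portal_url.format(case_id)))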

lepsalex commented 6 years ago

Case ID is Donor Id ...

GET icgc26-27/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "match": {
            "_type": "donor"
          }
        },
        {
          "match": {
            "_summary.repository": "CGHub"
          }
        }
      ]
    }
  },
  "_source": "donor_id"
}
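This query and the earlier specimen query differ only in the `_source` field. A minimal sketch of a shared builder, assuming nothing beyond the two query bodies in this thread; the function name `build_donor_query` and its parameters are hypothetical:

```python
import json


def build_donor_query(source_field, repository="CGHub"):
    """Build the search body used in this thread.

    Matches donor documents whose _summary.repository equals
    `repository` and returns only `source_field` from each hit.
    """
    return {
        "query": {
            "bool": {
                "must": [
                    {"match": {"_type": "donor"}},
                    {"match": {"_summary.repository": repository}},
                ]
            }
        },
        "_source": source_field,
    }


# Body for the donor ID query; pass "specimen.specimen_id" for the earlier one.
print(json.dumps(build_donor_query("donor_id"), indent=2))
```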

lepsalex commented 6 years ago

The final solution will be a front-end update only. The URL format for the GDC Legacy portal is as follows:

https://portal.gdc.cancer.gov/legacy-archive/search/f?filters={"op":"and","content":[{"op":"in","content":{"field":"cases.submitter_id","value":["###_DONOR_ID__###"]}}]}
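The actual change is in the front end, but the URL format above can be sketched in Python for checking links by hand. This is a sketch under assumptions: the helper name `legacy_portal_url` is hypothetical, and the example value `TCGA-EE-A2M5` is only an assumed submitter ID used for illustration.

```python
import json
from urllib.parse import quote

LEGACY_SEARCH = "https://portal.gdc.cancer.gov/legacy-archive/search/f?filters="


def legacy_portal_url(donor_id):
    """Return a GDC legacy portal search URL filtered on cases.submitter_id."""
    filters = {
        "op": "and",
        "content": [
            {
                "op": "in",
                "content": {"field": "cases.submitter_id", "value": [donor_id]},
            }
        ],
    }
    # Compact separators keep the URL short; quote() percent-encodes the
    # braces, brackets and double quotes, leaving ':' and ',' as they are.
    encoded = quote(json.dumps(filters, separators=(",", ":")), safe=":,")
    return LEGACY_SEARCH + encoded


# Assumed example ID, for illustration only.
print(legacy_portal_url("TCGA-EE-A2M5"))
```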

rosibaj commented 6 years ago

Discussion with @junjun-zhang :

When the link is created, always apply the criteria data_format in (BAM, FASTQ) and data_category in (Raw sequencing data). For each link, append to these the case ID (submitter ID) and the analysis categories that are linked to that sample. This will lead to some overlap for some samples, but that is OK.

Example: https://dcc.icgc.org/donors/DO37425 sample TCGA-EE-A2M5-06A-12D-A18Y-02 leads to:

https://portal.gdc.cancer.gov/legacy-archive/search/f?filters={"op":"and","content":[{"op":"in","content":{"field":"files.data_format","value":["BAM","FASTQ"]}},{"op":"in","content":{"field":"files.data_category","value":["Raw sequencing data"]}},{"op":"in","content":{"field":"cases.case_id","value":["b5e37b9b-6264-4cc8-9b44-1d8ac0692d6c"]}},{"op":"in","content":{"field":"files.experimental_strategy","value":["WGS"]}}]}
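The filter in the example link can be assembled from the fixed criteria plus the per-sample values. A minimal Python sketch, not the front-end implementation; the names `sample_sequence_url` and `in_clause` are hypothetical, and the field names are taken from the example URL above.

```python
import json
from urllib.parse import quote

LEGACY_SEARCH = "https://portal.gdc.cancer.gov/legacy-archive/search/f?filters="


def in_clause(field, values):
    """One {"op": "in"} clause of the portal filter."""
    return {"op": "in", "content": {"field": field, "value": list(values)}}


def sample_sequence_url(case_id, strategies):
    """Build the legacy portal link for one sample.

    The first two clauses are always applied; the case ID and the
    experimental strategies vary per sample.
    """
    filters = {
        "op": "and",
        "content": [
            in_clause("files.data_format", ["BAM", "FASTQ"]),
            in_clause("files.data_category", ["Raw sequencing data"]),
            in_clause("cases.case_id", [case_id]),
            in_clause("files.experimental_strategy", strategies),
        ],
    }
    # Percent-encode the JSON so the link is safe to paste anywhere.
    return LEGACY_SEARCH + quote(json.dumps(filters, separators=(",", ":")), safe=":,")


# Values from the DO37425 example above.
print(sample_sequence_url("b5e37b9b-6264-4cc8-9b44-1d8ac0692d6c", ["WGS"]))
```

Decoding the `filters` parameter of the result gives back the same JSON as in the example link.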

lepsalex commented 6 years ago

On staging: https://staging.dcc.icgc.org/donors/DO37425

rosibaj commented 6 years ago

@lepsalex Junjun and I both checked it! No issues with testing - we are good to go ahead.