icgc-dcc / dcc-portal

Data portal for exploring and accessing data
https://dcc.icgc.org/

PCAWG file missing metadata #514

Closed junjun-zhang closed 5 years ago

junjun-zhang commented 6 years ago

See the screenshot below. I'm not sure what makes this file special: it is a PCAWG file, but for some reason it has no Data Type or Analysis Software assigned.

image

Also, if you follow the link (https://dcc.icgc.org/repositories/files/FI206869) to the File, it gives you a page with lots of missing fields.

rtisma commented 6 years ago

If you inspect that page, you can find the Portal API response for the file request, which is shown at the bottom of this comment. The first obvious problem is the empty dataCategorization and analysisMethod fields. If you go to the song-analysis-read API and enter the studyId LIRI-JP and the id fe1bd829-6d0f-45c3-8594-85d4d5d13d49, you can see that alignmentTool and libraryStrategy are properly defined in SONG, so this could be a dcc-repository issue. I will look into why dcc-repository is not indexing this analysis properly even though its record is complete.

ReadAnalysis fe1bd829-6d0f-45c3-8594-85d4d5d13d49

Curl command

curl -X GET --header 'Accept: application/json' 'https://song.cancercollaboratory.org/studies/LIRI-JP/analysis/fe1bd829-6d0f-45c3-8594-85d4d5d13d49'

Response

{
  "analysisType": "sequencingRead",
  "info": {},
  "analysisId": "fe1bd829-6d0f-45c3-8594-85d4d5d13d49",
  "study": "LIRI-JP",
  "analysisState": "PUBLISHED",
  "sample": [
    {
      "info": {},
      "sampleId": "SA560659",
      "specimenId": "SP98913",
      "sampleSubmitterId": "RK007_Cancer",
      "sampleType": "RNA",
      "specimen": {
        "info": {},
        "specimenId": "SP98913",
        "donorId": "DO45097",
        "specimenSubmitterId": "RK007_C01",
        "specimenClass": "Tumour",
        "specimenType": "Primary tumour - solid tissue"
      },
      "donor": {
        "donorId": "DO45097",
        "donorSubmitterId": "RK007",
        "studyId": "LIRI-JP",
        "donorGender": "male",
        "info": {}
      }
    }
  ],
  "file": [
    {
      "info": {},
      "objectId": "5b91da7b-379d-5beb-9d2f-953795e9024c",
      "analysisId": "fe1bd829-6d0f-45c3-8594-85d4d5d13d49",
      "fileName": "PCAWG.5f66b9a6-e1f3-11e4-8f30-aabb917fcde1.STAR.v1.bam",
      "studyId": "LIRI-JP",
      "fileSize": 5208225488,
      "fileType": "BAM",
      "fileMd5sum": "50c9d46fced666fc3b8b45c6a004bb96",
      "fileAccess": "controlled"
    },
    {
      "info": {},
      "objectId": "dccb107d-7d98-5ba2-8ba1-0179a480eb03",
      "analysisId": "fe1bd829-6d0f-45c3-8594-85d4d5d13d49",
      "fileName": "fe1bd829-6d0f-45c3-8594-85d4d5d13d49.xml",
      "studyId": "LIRI-JP",
      "fileSize": 22598,
      "fileType": "XML",
      "fileMd5sum": "eb388a103459b2e8029aa9b953e76930",
      "fileAccess": "controlled"
    }
  ],
  "experiment": {
    "analysisId": "fe1bd829-6d0f-45c3-8594-85d4d5d13d49",
    "aligned": true,
    "alignmentTool": "STAR",
    "insertSize": -1,
    "libraryStrategy": "RNA-Seq",
    "pairedEnd": false,
    "referenceGenome": "GRCh37",
    "info": {}
  }
}

Portal Response

{
   "id":"FI206869",
   "objectId":"5b91da7b-379d-5beb-9d2f-953795e9024c",
   "access":"controlled",
   "study":[
      "PCAWG"
   ],
   "dataCategorization":{

   },
   "dataBundle":{
      "dataBundleId":"fe1bd829-6d0f-45c3-8594-85d4d5d13d49"
   },
   "fileCopies":[
      {
         "repoDataBundleId":"EGAZ00001312244",
         "repoFileId":"EGAF00001719363",
         "repoDataSetIds":[
            "EGAD00001003547"
         ],
         "repoCode":"ega",
         "repoOrg":"EGA",
         "repoName":"EGA - Hinxton",
         "repoType":"EGA",
         "repoCountry":"UK",
         "repoBaseUrl":"http://ega.ebi.ac.uk/ega/",
         "repoDataPath":"",
         "repoMetadataPath":"/rest/download/v2/metadata/",
         "indexFile":{
            "id":"FI720618",
            "objectId":"1249a65b-2d4b-5610-a981-e4a8604f862a",
            "fileName":"fe1bd829-6d0f-45c3-8594-85d4d5d13d49/PCAWG.5f66b9a6-e1f3-11e4-8f30-aabb917fcde1.STAR.v1.bam.bai",
            "fileFormat":"BAI",
            "fileMd5sum":"e4028e4c7c17eae21d0892f49ddeaa95"
         },
         "fileName":"PCAWG.5f66b9a6-e1f3-11e4-8f30-aabb917fcde1.STAR.v1.bam",
         "fileFormat":"BAM",
         "fileMd5sum":"b43a6e975058f941251a7dad583e25cc",
         "fileSize":5208225488,
         "lastModified":1501787598
      },
      {
         "repoDataBundleId":"fe1bd829-6d0f-45c3-8594-85d4d5d13d49",
         "repoFileId":"5b91da7b-379d-5beb-9d2f-953795e9024c",
         "repoDataSetIds":[

         ],
         "repoCode":"aws-virginia",
         "repoOrg":"AWS",
         "repoName":"AWS - Virginia",
         "repoType":"S3",
         "repoCountry":"US",
         "repoBaseUrl":"https://s3-external-1.amazonaws.com/",
         "repoDataPath":"/oicr.icgc/data/5b91da7b-379d-5beb-9d2f-953795e9024c",
         "repoMetadataPath":"/oicr.icgc.meta/metadata/dccb107d-7d98-5ba2-8ba1-0179a480eb03",
         "indexFile":{

         },
         "fileName":"PCAWG.5f66b9a6-e1f3-11e4-8f30-aabb917fcde1.STAR.v1.bam",
         "fileFormat":"BAM",
         "fileMd5sum":"50c9d46fced666fc3b8b45c6a004bb96",
         "fileSize":5208225488,
         "lastModified":1473578198
      },
      {
         "repoDataBundleId":"fe1bd829-6d0f-45c3-8594-85d4d5d13d49",
         "repoFileId":"5b91da7b-379d-5beb-9d2f-953795e9024c",
         "repoDataSetIds":[

         ],
         "repoCode":"collaboratory",
         "repoName":"Collaboratory - Toronto",
         "repoType":"S3",
         "repoCountry":"CA",
         "repoBaseUrl":"https://www.cancercollaboratory.org:9080/",
         "repoDataPath":"/oicr.icgc/data/5b91da7b-379d-5beb-9d2f-953795e9024c",
         "repoMetadataPath":"/oicr.icgc.meta/metadata/dccb107d-7d98-5ba2-8ba1-0179a480eb03",
         "indexFile":{

         },
         "fileName":"PCAWG.5f66b9a6-e1f3-11e4-8f30-aabb917fcde1.STAR.v1.bam",
         "fileFormat":"BAM",
         "fileMd5sum":"50c9d46fced666fc3b8b45c6a004bb96",
         "fileSize":5208225488
      }
   ],
   "donors":[
      {
         "donorId":"DO45097",
         "primarySite":"Liver",
         "projectCode":"LIRI-JP",
         "study":"PCAWG",
         "sampleId":[
            "SA560659"
         ],
         "specimenId":[
            "SP98913"
         ],
         "specimenType":[
            "Primary tumour - solid tissue"
         ],
         "submittedDonorId":"RK007",
         "submittedSampleId":[
            "RK007_Cancer"
         ],
         "submittedSpecimenId":[
            "RK007_C01"
         ]
      }
   ],
   "referenceGenome":{
      "genomeBuild":"GRCh37",
      "referenceName":"hs37d5",
      "downloadUrl":"ftp://ftp.sanger.ac.uk/pub/project/PanCancer/genome.fa.gz"
   },
   "analysisMethod":{

   }
}
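
The comparison between the two responses above can be sketched in Python. This is only an illustration, not code from the portal or dcc-repository: the helper name `empty_fields` is hypothetical, and the two dictionaries are trimmed copies of the responses shown above.

```python
def empty_fields(record):
    """Return the top-level keys whose value is None or an empty dict, list, or string."""
    return sorted(
        key
        for key, value in record.items()
        if value is None
        or (isinstance(value, (dict, list, str)) and len(value) == 0)
    )


# Trimmed copy of the Portal response for FI206869.
portal_file = {
    "id": "FI206869",
    "objectId": "5b91da7b-379d-5beb-9d2f-953795e9024c",
    "access": "controlled",
    "study": ["PCAWG"],
    "dataCategorization": {},
    "dataBundle": {"dataBundleId": "fe1bd829-6d0f-45c3-8594-85d4d5d13d49"},
    "analysisMethod": {},
}

# Trimmed copy of the SONG "experiment" block for the same analysis.
song_experiment = {
    "alignmentTool": "STAR",
    "libraryStrategy": "RNA-Seq",
    "referenceGenome": "GRCh37",
}

print(empty_fields(portal_file))      # fields the portal left empty
print(empty_fields(song_experiment))  # nothing is empty on the SONG side
```

On these trimmed records the helper reports dataCategorization and analysisMethod as empty for the portal and nothing for SONG, which matches what is visible by eye in the full responses.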

junjun-zhang commented 6 years ago

@rtisma any update on this?

There seem to be more files like this (was 1, now 83); see the screenshot below: image

rtisma commented 6 years ago

@junjun-zhang please review the link below https://dcc.icgc.org/repositories?filters=%7B%22file%22:%7B%22study%22:%7B%22is%22:%5B%22PCAWG%22%5D%7D,%22software%22:%7B%22is%22:%5B%22_missing%22%5D%7D%7D%7D&files=%7B%22from%22:1,%22size%22:25%7D

There seem to be 3000+ files now, using the same filter as in your description. I still see the data type and software missing for https://dcc.icgc.org/repositories/files/FI206869

so I will continue looking into that.
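
For anyone who wants to read the filter without opening the portal, the query string in the link above can be decoded with the Python standard library. This is a small sketch; the helper name `decode_params` is made up for illustration, and it assumes each query parameter carries a single JSON value, as it does in this link.

```python
import json
from urllib.parse import parse_qs, urlsplit

# The repository search link from this comment, copied verbatim.
URL = (
    "https://dcc.icgc.org/repositories?filters=%7B%22file%22:%7B%22study%22:"
    "%7B%22is%22:%5B%22PCAWG%22%5D%7D,%22software%22:%7B%22is%22:"
    "%5B%22_missing%22%5D%7D%7D%7D&files=%7B%22from%22:1,%22size%22:25%7D"
)


def decode_params(url):
    """Percent-decode each query parameter and parse its value as JSON."""
    query = parse_qs(urlsplit(url).query)
    return {name: json.loads(values[0]) for name, values in query.items()}


params = decode_params(URL)
print(json.dumps(params["filters"], indent=2))
print(params["files"])
```

The decoded `filters` value selects files whose study is PCAWG and whose software is `_missing`; `files` only controls paging.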

rosibaj commented 6 years ago

Discussion in planning: Looking at the portal, files with the correct data types are held in GNOS repositories. Metadata was being inherited from the GNOS metadata. Two GNOS repositories have since been removed, and the metadata associated with them is no longer being carried into the portal.

Solution: May need to add data. Kevin to determine!

rtisma commented 6 years ago

Issue seems to be https://github.com/icgc-dcc/dcc-repository/blob/develop/dcc-repository-client/src/main/java/org/icgc/dcc/repository/client/core/RepositoryFileCombiner.java#L141

In the merge step of the dcc-repository-client, the order of the files dictates which value is used. If one repo has a different value than the other repos for the same file and that repo is processed first, its value is the one that gets merged. This is a byproduct of dcc-repository having no conflict resolution.
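
To make the order dependence concrete, here is a toy sketch in Python. It is not the RepositoryFileCombiner code: the function names, the `software` values, and the "first non-empty value" rule are all assumptions chosen for illustration; only the repo codes are taken from the Portal response earlier in the thread.

```python
def merge_first_wins(copies, field):
    """Order-dependent merge: whatever the first copy holds is the result."""
    return copies[0].get(field)


def merge_first_non_empty(copies, field):
    """One possible conflict-resolution rule: skip copies with no value."""
    for copy in copies:
        value = copy.get(field)
        if value:
            return value
    return None


# Hypothetical per-repo records for one file; the values are invented.
copies = [
    {"repoCode": "ega", "software": None},
    {"repoCode": "aws-virginia", "software": "STAR"},
    {"repoCode": "collaboratory", "software": "STAR"},
]

print(merge_first_wins(copies, "software"))                  # depends on order
print(merge_first_wins(list(reversed(copies)), "software"))  # different result
print(merge_first_non_empty(copies, "software"))             # order no longer hides the value
```

With the first-wins rule the outcome changes when the list is reversed, while the non-empty rule gives the same answer either way. A real fix would also need to decide what to do when two repos hold different non-empty values, which this sketch does not address.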

rtisma commented 5 years ago

As another example, https://dcc.icgc.org/repositories/files/FI815862

is missing the software field. After looking at the associated analysis object from SONG:

curl -X GET --header 'Accept: application/json' 'https://song.cancercollaboratory.org/studies/RECA-EU/analysis/d6ea916a-609e-42fd-be75-571cdce4f592'

I see that software is also missing in SONG, and since song-aws and song-collab both lack the field, it's clear why the field is missing.