Closed junjun-zhang closed 5 years ago
When visiting that page, if you inspect it, you can find the Portal API response for the file request, which is shown at the bottom of this comment. The first obvious thing is the empty dataCategorization and analysisMethod fields. By going to the song-analysis-read API and entering the studyId LIRI-JP and the id fe1bd829-6d0f-45c3-8594-85d4d5d13d49, you can observe that alignmentTool and libraryStrategy are properly defined in SONG, meaning this could be a dcc-repository issue. I will take a look at why dcc-repository is not properly indexing this analysis even though its record is complete.
curl -X GET --header 'Accept: application/json' 'https://song.cancercollaboratory.org/studies/LIRI-JP/analysis/fe1bd829-6d0f-45c3-8594-85d4d5d13d49'
{
"analysisType": "sequencingRead",
"info": {},
"analysisId": "fe1bd829-6d0f-45c3-8594-85d4d5d13d49",
"study": "LIRI-JP",
"analysisState": "PUBLISHED",
"sample": [
{
"info": {},
"sampleId": "SA560659",
"specimenId": "SP98913",
"sampleSubmitterId": "RK007_Cancer",
"sampleType": "RNA",
"specimen": {
"info": {},
"specimenId": "SP98913",
"donorId": "DO45097",
"specimenSubmitterId": "RK007_C01",
"specimenClass": "Tumour",
"specimenType": "Primary tumour - solid tissue"
},
"donor": {
"donorId": "DO45097",
"donorSubmitterId": "RK007",
"studyId": "LIRI-JP",
"donorGender": "male",
"info": {}
}
}
],
"file": [
{
"info": {},
"objectId": "5b91da7b-379d-5beb-9d2f-953795e9024c",
"analysisId": "fe1bd829-6d0f-45c3-8594-85d4d5d13d49",
"fileName": "PCAWG.5f66b9a6-e1f3-11e4-8f30-aabb917fcde1.STAR.v1.bam",
"studyId": "LIRI-JP",
"fileSize": 5208225488,
"fileType": "BAM",
"fileMd5sum": "50c9d46fced666fc3b8b45c6a004bb96",
"fileAccess": "controlled"
},
{
"info": {},
"objectId": "dccb107d-7d98-5ba2-8ba1-0179a480eb03",
"analysisId": "fe1bd829-6d0f-45c3-8594-85d4d5d13d49",
"fileName": "fe1bd829-6d0f-45c3-8594-85d4d5d13d49.xml",
"studyId": "LIRI-JP",
"fileSize": 22598,
"fileType": "XML",
"fileMd5sum": "eb388a103459b2e8029aa9b953e76930",
"fileAccess": "controlled"
}
],
"experiment": {
"analysisId": "fe1bd829-6d0f-45c3-8594-85d4d5d13d49",
"aligned": true,
"alignmentTool": "STAR",
"insertSize": -1,
"libraryStrategy": "RNA-Seq",
"pairedEnd": false,
"referenceGenome": "GRCh37",
"info": {}
}
}
Portal API response for the file request (FI206869):
{
"id":"FI206869",
"objectId":"5b91da7b-379d-5beb-9d2f-953795e9024c",
"access":"controlled",
"study":[
"PCAWG"
],
"dataCategorization":{
},
"dataBundle":{
"dataBundleId":"fe1bd829-6d0f-45c3-8594-85d4d5d13d49"
},
"fileCopies":[
{
"repoDataBundleId":"EGAZ00001312244",
"repoFileId":"EGAF00001719363",
"repoDataSetIds":[
"EGAD00001003547"
],
"repoCode":"ega",
"repoOrg":"EGA",
"repoName":"EGA - Hinxton",
"repoType":"EGA",
"repoCountry":"UK",
"repoBaseUrl":"http://ega.ebi.ac.uk/ega/",
"repoDataPath":"",
"repoMetadataPath":"/rest/download/v2/metadata/",
"indexFile":{
"id":"FI720618",
"objectId":"1249a65b-2d4b-5610-a981-e4a8604f862a",
"fileName":"fe1bd829-6d0f-45c3-8594-85d4d5d13d49/PCAWG.5f66b9a6-e1f3-11e4-8f30-aabb917fcde1.STAR.v1.bam.bai",
"fileFormat":"BAI",
"fileMd5sum":"e4028e4c7c17eae21d0892f49ddeaa95"
},
"fileName":"PCAWG.5f66b9a6-e1f3-11e4-8f30-aabb917fcde1.STAR.v1.bam",
"fileFormat":"BAM",
"fileMd5sum":"b43a6e975058f941251a7dad583e25cc",
"fileSize":5208225488,
"lastModified":1501787598
},
{
"repoDataBundleId":"fe1bd829-6d0f-45c3-8594-85d4d5d13d49",
"repoFileId":"5b91da7b-379d-5beb-9d2f-953795e9024c",
"repoDataSetIds":[
],
"repoCode":"aws-virginia",
"repoOrg":"AWS",
"repoName":"AWS - Virginia",
"repoType":"S3",
"repoCountry":"US",
"repoBaseUrl":"https://s3-external-1.amazonaws.com/",
"repoDataPath":"/oicr.icgc/data/5b91da7b-379d-5beb-9d2f-953795e9024c",
"repoMetadataPath":"/oicr.icgc.meta/metadata/dccb107d-7d98-5ba2-8ba1-0179a480eb03",
"indexFile":{
},
"fileName":"PCAWG.5f66b9a6-e1f3-11e4-8f30-aabb917fcde1.STAR.v1.bam",
"fileFormat":"BAM",
"fileMd5sum":"50c9d46fced666fc3b8b45c6a004bb96",
"fileSize":5208225488,
"lastModified":1473578198
},
{
"repoDataBundleId":"fe1bd829-6d0f-45c3-8594-85d4d5d13d49",
"repoFileId":"5b91da7b-379d-5beb-9d2f-953795e9024c",
"repoDataSetIds":[
],
"repoCode":"collaboratory",
"repoName":"Collaboratory - Toronto",
"repoType":"S3",
"repoCountry":"CA",
"repoBaseUrl":"https://www.cancercollaboratory.org:9080/",
"repoDataPath":"/oicr.icgc/data/5b91da7b-379d-5beb-9d2f-953795e9024c",
"repoMetadataPath":"/oicr.icgc.meta/metadata/dccb107d-7d98-5ba2-8ba1-0179a480eb03",
"indexFile":{
},
"fileName":"PCAWG.5f66b9a6-e1f3-11e4-8f30-aabb917fcde1.STAR.v1.bam",
"fileFormat":"BAM",
"fileMd5sum":"50c9d46fced666fc3b8b45c6a004bb96",
"fileSize":5208225488
}
],
"donors":[
{
"donorId":"DO45097",
"primarySite":"Liver",
"projectCode":"LIRI-JP",
"study":"PCAWG",
"sampleId":[
"SA560659"
],
"specimenId":[
"SP98913"
],
"specimenType":[
"Primary tumour - solid tissue"
],
"submittedDonorId":"RK007",
"submittedSampleId":[
"RK007_Cancer"
],
"submittedSpecimenId":[
"RK007_C01"
]
}
],
"referenceGenome":{
"genomeBuild":"GRCh37",
"referenceName":"hs37d5",
"downloadUrl":"ftp://ftp.sanger.ac.uk/pub/project/PanCancer/genome.fa.gz"
},
"analysisMethod":{
}
}
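To make the gaps in the Portal response easier to spot, here is a minimal Python sketch that walks a response and lists every field whose value is an empty object. The helper name `empty_fields` and the trimmed `portal_file` dictionary are illustrative only; the dictionary keeps just a few fields from the response above and is not the full payload.

```python
def empty_fields(doc, prefix=""):
    """Return dotted paths of all fields whose value is an empty dict."""
    paths = []
    if isinstance(doc, dict):
        for key, value in doc.items():
            path = f"{prefix}.{key}" if prefix else key
            if value == {}:
                paths.append(path)
            else:
                paths.extend(empty_fields(value, path))
    elif isinstance(doc, list):
        for index, item in enumerate(doc):
            paths.extend(empty_fields(item, f"{prefix}[{index}]"))
    return paths


# Trimmed copy of the Portal response for FI206869 (most fields omitted).
portal_file = {
    "id": "FI206869",
    "dataCategorization": {},
    "fileCopies": [
        {"repoCode": "ega", "indexFile": {"id": "FI720618"}},
        {"repoCode": "aws-virginia", "indexFile": {}},
        {"repoCode": "collaboratory", "indexFile": {}},
    ],
    "analysisMethod": {},
}

for path in empty_fields(portal_file):
    print(path)
```

On the trimmed input this reports dataCategorization and analysisMethod, plus the empty indexFile entries of the two S3 copies.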
@rtisma any update on this?
It seems there are more files of this kind (was 1, now 83); see screenshot below:
@junjun-zhang please review the link below: https://dcc.icgc.org/repositories?filters=%7B%22file%22:%7B%22study%22:%7B%22is%22:%5B%22PCAWG%22%5D%7D,%22software%22:%7B%22is%22:%5B%22_missing%22%5D%7D%7D%7D&files=%7B%22from%22:1,%22size%22:25%7D
There seem to be 3000+ files now using the same filter as in your description. I still see the data type and software missing for https://dcc.icgc.org/repositories/files/FI206869, so I will continue looking into that.
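For readers who want to see what that link is filtering on, the following Python sketch decodes the query string using only the standard library. The function name `decode_portal_query` is made up for this illustration; the URL is the one quoted in this comment.

```python
import json
from urllib.parse import parse_qs, urlparse

url = (
    "https://dcc.icgc.org/repositories?filters=%7B%22file%22:%7B%22study%22:"
    "%7B%22is%22:%5B%22PCAWG%22%5D%7D,%22software%22:%7B%22is%22:"
    "%5B%22_missing%22%5D%7D%7D%7D&files=%7B%22from%22:1,%22size%22:25%7D"
)


def decode_portal_query(url):
    """Percent-decode each query parameter and parse its value as JSON."""
    query = parse_qs(urlparse(url).query)
    return {name: json.loads(values[0]) for name, values in query.items()}


decoded = decode_portal_query(url)
print(json.dumps(decoded["filters"], indent=2))
```

The decoded filter selects files whose study is PCAWG and whose software value is `_missing`.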
Discussion in planning: Looking at the portal, files with the correct data types are held in GNOS repositories. Metadata was being inherited from GNOS metadata; 2 GNOS repositories have been removed, and the metadata associated with them is no longer being carried into the portal.
Solution: May need to add data. Kevin to determine!
The issue seems to be at https://github.com/icgc-dcc/dcc-repository/blob/develop/dcc-repository-client/src/main/java/org/icgc/dcc/repository/client/core/RepositoryFileCombiner.java#L141
In the merge step of the dcc-repository-client, the order of the files dictates which value will be used. If one repo has a different value than the other repos for the same file, and that repo is processed first, its value will be merged. This is a byproduct of dcc-repository having no conflict resolution.
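The order dependence described here can be illustrated with a small Python sketch. This is not the actual RepositoryFileCombiner code (which is Java); the function names, the repo names `repo-a` and `repo-b`, and the field values are all hypothetical and chosen only to show the effect. The second function shows one possible conflict-resolution rule, not a proposed fix.

```python
def merge_first_wins(records, field):
    """Order-dependent merge: the first record processed decides the value."""
    return records[0].get(field, {})


def merge_prefer_non_empty(records, field):
    """One possible rule: take the first non-empty value, whatever the order."""
    for record in records:
        value = record.get(field)
        if value:
            return value
    return {}


# Hypothetical records for the same file coming from two repositories.
with_metadata = {"repoCode": "repo-a", "analysisMethod": {"software": "STAR"}}
without_metadata = {"repoCode": "repo-b", "analysisMethod": {}}

for order in ([with_metadata, without_metadata], [without_metadata, with_metadata]):
    codes = [record["repoCode"] for record in order]
    print(codes, "first wins:", merge_first_wins(order, "analysisMethod"))
    print(codes, "prefer non-empty:", merge_prefer_non_empty(order, "analysisMethod"))
```

With first-wins merging, the result is empty whenever the repository lacking the metadata happens to be processed first; the second rule returns the populated value in both orders.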
As another example, https://dcc.icgc.org/repositories/files/FI815862 is missing the software field. After looking at the associated analysis object from SONG:
curl -X GET --header 'Accept: application/json' 'https://song.cancercollaboratory.org/studies/RECA-EU/analysis/d6ea916a-609e-42fd-be75-571cdce4f592'
I see that software is also missing in SONG, and since song-aws and song-collab both don't have the field, it's clear why the field is missing.
See the screenshot below. Not sure what made this file special, it's a PCAWG file, but it does not have Data Type or Analysis Software assigned for some reason. Also, if you follow the link (https://dcc.icgc.org/repositories/files/FI206869) to the file, it gives you a page with lots of missing fields.