Done Condition (What do we need? Why do we need it? Keep this as small as possible!)
The DDP files are missing ~10000 patients, who therefore have no vital status or demographics. To the extent possible, ensure the data in the DDP folder is complete for patients delivered to GENIE. New upload deadline: Aug 29.
Overall issue: Missing DDP data to use to generate GENIE
GENIE is a subset of msk_solid_heme but currently uses mskimpact/ddp files to generate the genie patient list, so by nature it's missing DDP data for the other 3 studies that are a part of msk_solid_heme (HEMEPACT, ARCHER, and ACCESS).
The subset-impact-data.sh script doesn't write out patients that aren't in the DDP files (specifically the naaccr file). Because the DDP file in use is "missing" a lot of patients, curators had to change the script's behavior and have been generating GENIE data from an unmerged bug fix branch (https://github.com/knowledgesystems/cmo-pipelines/pull/819 - point 3 is the relevant point).
mskimpact is missing ~400 patients in ddp_naaccr and ddp_vital_status files - is this a reasonable amount for "VIP" patients?
Also a note: mskimpact/ddp/ddp_age.txt is not being updated anymore. Updating it was removed in this PR: https://github.com/knowledgesystems/cmo-pipelines/pull/1020 when we moved age at sequencing to DDP. We should delete this file if it's not being updated or used.
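To put a number on the "missing" patients above, a quick check along these lines could help. This is only a sketch: it assumes both the clinical patient file and the DDP files are tab-delimited with a PATIENT_ID column and optional '#' metadata lines at the top, which should be confirmed against the real files. The function names are hypothetical and not part of cmo-pipelines.

```python
import csv
import io


def read_patient_ids(handle, id_column="PATIENT_ID"):
    """Return the set of patient IDs in a tab-delimited file.

    Assumption: lines starting with '#' are metadata and can be skipped,
    and the first remaining line is the column header.
    """
    rows = (line for line in handle if not line.startswith("#"))
    reader = csv.DictReader(rows, delimiter="\t")
    return {row[id_column] for row in reader if row.get(id_column)}


def missing_from_ddp(clinical_handle, ddp_handle):
    """Patients present in the clinical file but absent from the DDP file."""
    return read_patient_ids(clinical_handle) - read_patient_ids(ddp_handle)
```

Running this once per DDP file (e.g. ddp_naaccr and ddp_vital_status) against the study's patient list would give the exact set of patients to review, rather than an approximate count.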
Steps
[x] Modify DDP pipeline code to see if we can write out ddp data for the studies HEMEPACT, ARCHER, and ACCESS
[x] If above is possible, may need code to merge the ddp folders when creating msk_solid_heme
[x] Remove ddp_age file from dmp repo (confirm it's not being used)
[x] Confirm with curators exactly how genie data is being generated (on what server, what code branch, etc) - is the genie data being generated with the age at sequencing fix if the unmerged branch is being used to generate data?
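For the step about merging the ddp folders when creating msk_solid_heme, the merge could look roughly like the sketch below. It is not the actual pipeline code. It assumes each study's DDP file is tab-delimited with a PATIENT_ID header column, and that when a patient appears in more than one study the first file listed wins (an assumption that would need to be agreed on). Function names and the "NA" fill value are hypothetical.

```python
import csv
import io


def merge_ddp_files(handles, id_column="PATIENT_ID"):
    """Merge rows from several DDP files, keeping one row per patient.

    Assumption: the first file that mentions a patient wins.
    Returns (fieldnames, rows) with columns in first-seen order.
    """
    fieldnames = []
    merged = {}
    for handle in handles:
        reader = csv.DictReader(handle, delimiter="\t")
        for name in reader.fieldnames or []:
            if name not in fieldnames:
                fieldnames.append(name)
        for row in reader:
            merged.setdefault(row[id_column], row)
    return fieldnames, list(merged.values())


def write_ddp_file(handle, fieldnames, rows):
    """Write merged rows as tab-delimited text, filling absent columns with NA."""
    writer = csv.DictWriter(
        handle,
        fieldnames=fieldnames,
        delimiter="\t",
        restval="NA",
        extrasaction="ignore",
        lineterminator="\n",
    )
    writer.writeheader()
    writer.writerows(rows)
```

Passing the mskimpact file first would preserve today's values for existing patients and only add rows for patients that appear solely in HEMEPACT, ARCHER, or ACCESS.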
Technical Description (How are we going to achieve the above)
Open question: only mskimpact appears to have a ddp/ folder - why is this? Does this data not exist for the other studies or do we just not store it? Ex: https://github.com/knowledgesystems/cmo-pipelines/blob/5f2bf4b837f601b3ca43137e188f70f3d5d429ce/ddp/ddp_pipeline/src/main/java/org/mskcc/cmo/ks/ddp/pipeline/SuppNaaccrMappingsWriter.java#L71, https://github.com/knowledgesystems/cmo-pipelines/blob/5f2bf4b837f601b3ca43137e188f70f3d5d429ce/ddp/ddp_pipeline/src/main/java/org/mskcc/cmo/ks/ddp/pipeline/SuppVitalStatusWriter.java#L71C22-L71C39
Potential Issues
Dependencies
Technical Requirements
Outside People/Teams
Changes