Trace DDP Provided Data for GENIE Patients

averyniceday commented 1 year ago

Done Condition (What do we need? Why do we need it? Keep this as small as possible!)

The DDP files are missing ~10000 patients, so those patients have no vital status or demographics. To the extent possible, ensure the data in the DDP folder is complete for all patients delivered to GENIE. The new upload deadline is Aug 29.

Overall issue: the DDP data used to generate GENIE is incomplete

GENIE is a subset of msk_solid_heme but currently uses mskimpact/ddp files to generate the genie patient list, so by nature it's missing DDP data for the other 3 studies that are a part of msk_solid_heme (HEMEPACT, ARCHER, and ACCESS).
All of these studies do DDP fetch for the same attributes but only mskimpact writes out these files to a ddp/ folder - why is this? Does this data not exist for the other studies or do we just not store it? Ex: https://github.com/knowledgesystems/cmo-pipelines/blob/5f2bf4b837f601b3ca43137e188f70f3d5d429ce/ddp/ddp_pipeline/src/main/java/org/mskcc/cmo/ks/ddp/pipeline/SuppNaaccrMappingsWriter.java#L71, https://github.com/knowledgesystems/cmo-pipelines/blob/5f2bf4b837f601b3ca43137e188f70f3d5d429ce/ddp/ddp_pipeline/src/main/java/org/mskcc/cmo/ks/ddp/pipeline/SuppVitalStatusWriter.java#L71C22-L71C39
The subset-impact-data.sh script doesn't write out patients that aren't in the DDP files (specifically the naaccr file). Because the DDP file in use is "missing" a lot of patients, curators had to change the script's behavior and have been generating genie data from an unmerged bug fix branch (https://github.com/knowledgesystems/cmo-pipelines/pull/819, where point 3 is the relevant point).
mskimpact is missing ~400 patients in ddp_naaccr and ddp_vital_status files - is this a reasonable amount for "VIP" patients?
Also a note: mskimpact/ddp/ddp_age.txt is no longer being updated. The update was removed in this PR: https://github.com/knowledgesystems/cmo-pipelines/pull/1020 when we moved age at sequencing to DDP. We should delete this file if it's not being updated or used.
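
To put a number on the gaps above, a check along these lines could list which patients in a study's patient list have no row in a given DDP file. This is only a rough sketch: it assumes both files are tab-delimited with a `PATIENT_ID` column and that metadata lines start with `#`. The function names and the sample IDs are made up for illustration and are not taken from cmo-pipelines.

```python
import csv
import io


def read_patient_ids(handle, id_column="PATIENT_ID"):
    """Collect patient IDs from a tab-delimited file, skipping '#' metadata lines.

    The column name is an assumption; adjust it to the real file header.
    """
    data_lines = (line for line in handle if not line.startswith("#"))
    reader = csv.DictReader(data_lines, delimiter="\t")
    return {row[id_column] for row in reader if row.get(id_column)}


def missing_from_ddp(clinical_handle, ddp_handle):
    """Return the sorted patient IDs present in the clinical file but absent from the DDP file."""
    return sorted(read_patient_ids(clinical_handle) - read_patient_ids(ddp_handle))


# Made-up sample data standing in for a clinical patient file and a DDP file.
clinical = io.StringIO(
    "#Patient Identifier\tSex\n"
    "PATIENT_ID\tSEX\n"
    "P-0000001\tFemale\n"
    "P-0000002\tMale\n"
    "P-0000003\tFemale\n"
)
ddp = io.StringIO(
    "PATIENT_ID\tVITAL_STATUS\n"
    "P-0000001\tALIVE\n"
    "P-0000003\tDECEASED\n"
)
print(missing_from_ddp(clinical, ddp))
```

Running the same check per study (mskimpact, HEMEPACT, ARCHER, ACCESS) would show whether the missing patients are concentrated in the studies that don't write a ddp/ folder.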

Steps

[x] Modify DDP pipeline code to see if we can write out ddp data for the studies HEMEPACT, ARCHER, and ACCESS
[x] If above is possible, may need code to merge the ddp folders when creating msk_solid_heme
[x] Remove ddp_age file from dmp repo (confirm it's not being used)
[x] Confirm with curators exactly how genie data is being generated (on what server, what code branch, etc) - is the genie data being generated with the age at sequencing fix if the unmerged branch is being used to generate data?
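
For the folder-merge step, one possible shape is sketched below, assuming each study writes the same DDP file with an identical tab-delimited header and a `PATIENT_ID` column. The helper name `merge_ddp_tables` and the first-occurrence-wins rule for duplicate patients are illustrative choices, not what the linked PRs necessarily do.

```python
import csv
import io


def merge_ddp_tables(handles, id_column="PATIENT_ID"):
    """Merge per-study DDP tables into one, keeping the first row seen per patient.

    Assumes every non-empty input shares the same header; raises if they differ.
    """
    header = None
    merged = {}  # dicts keep insertion order, so output order follows input order
    for handle in handles:
        reader = csv.DictReader(handle, delimiter="\t")
        if reader.fieldnames is None:
            continue  # empty file, nothing to merge
        if header is None:
            header = reader.fieldnames
        elif reader.fieldnames != header:
            raise ValueError(f"header mismatch: {reader.fieldnames} != {header}")
        for row in reader:
            merged.setdefault(row[id_column], row)
    return header, list(merged.values())


# Made-up sample data standing in for two studies' vital status files.
mskimpact = io.StringIO("PATIENT_ID\tVITAL_STATUS\nP-0000001\tALIVE\n")
hemepact = io.StringIO(
    "PATIENT_ID\tVITAL_STATUS\nP-0000001\tALIVE\nP-0000004\tDECEASED\n"
)
header, rows = merge_ddp_tables([mskimpact, hemepact])
print(header, [row["PATIENT_ID"] for row in rows])
```

If a patient can legitimately have different values across studies, the duplicate rule would need to be decided explicitly rather than defaulting to the first file read.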

Technical Description (How are we going to achieve the above)

Potential Issues

Dependencies

Technical Requirements

Outside People/Teams

Changes

callachennault commented 1 year ago

https://github.com/knowledgesystems/cmo-pipelines/pull/1065 https://github.mskcc.org/knowledgesystems/dmp-2023/pull/15