knowledgesystems / pipelines-scrum

Repository for tracking uncategorizable issues related to backend pipelines work

Investigate difference between case lists and genomic profile sample counts #1274

Closed callachennault closed 2 months ago

callachennault commented 2 months ago

Done Condition (What do we need? Why do we need it? Keep this as small as possible!)

Technical Description (How are we going to achieve the above)

Ambar from AstraZeneca reached out about a discrepancy in the MSK-IMPACT data. There is a difference between the numbers in the Genomic Profile Sample Counts chart and the Case Lists chart for CNA data and SV data (examples below).

[Images: image.png, image.png]

Potential Issues

Dependencies

Technical Requirements

Outside People/Teams

Changes

callachennault commented 2 months ago

Case lists: Samples representing cases with actual events within the portal.

Genomic profile sample counts: Any sample that our system has designated as having a profile or analysis run on it. There is no guarantee that a sample that was profiled will have an event.

The discrepancy between the CNA case list (106,519) and the CNA genomic profile sample count (106,529) is consistent with the logic above: there are more profiled samples than samples with events.

The discrepancy between the SV case list (107,110) and the SV genomic profile sample count (106,529) is slightly different, because here the case list is the larger number. There is extra logic in the case lists script that uses the sequenced_samples header in the MAF file as the case list for SV data. This means the SV case list count reflects all mutation data within the portal, not just SV data. TODO: confirm this logic with curation.
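To make the direction of each gap explicit, here is a minimal sketch that only does the arithmetic on the four counts quoted above. The variable names and the `gap` helper are made up for illustration and are not part of the pipelines code.

```python
# Counts quoted in this comment (samples).
cna_case_list = 106_519
cna_profiled = 106_529
sv_case_list = 107_110
sv_profiled = 106_529


def gap(profiled, case_list):
    """Profiled samples minus case-list samples.

    Positive: more profiled samples than case-list samples (expected,
    since a profiled sample is not guaranteed to have an event).
    Negative: the case list is larger than the profiled count.
    """
    return profiled - case_list


print("CNA gap:", gap(cna_profiled, cna_case_list))  # 10
print("SV gap:", gap(sv_profiled, sv_case_list))     # -581
```

The CNA gap is positive, which matches the definitions. The SV gap is negative, which is what the sequenced_samples logic would explain.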

When looking into the SV case lists discrepancy, we also noticed that there is a difference between the number of samples in the sequenced_samples header and the actual samples present in the MAF file and the data gene matrix file. We diffed these files and requeued the missing samples on 4/15.
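A rough sketch of that kind of diff is below. It is not the script we actually ran. It assumes the MAF starts with a `#sequenced_samples:` comment line listing whitespace-separated sample IDs, that MAF rows carry a `Tumor_Sample_Barcode` column, and that the gene matrix file has a `SAMPLE_ID` column. Comparing the header against the union of the two files is also an assumption. All function names are hypothetical.

```python
import csv


def read_sequenced_samples(maf_text):
    """Collect sample IDs from '#sequenced_samples:' comment lines at the top of a MAF."""
    samples = set()
    for line in maf_text.splitlines():
        if not line.startswith("#"):
            break  # comment header is over
        if line.startswith("#sequenced_samples:"):
            samples.update(line.split(":", 1)[1].split())
    return samples


def read_column(text, column):
    """Distinct non-empty values of one column in tab-delimited text, skipping '#' lines."""
    rows = [ln for ln in text.splitlines() if ln and not ln.startswith("#")]
    reader = csv.DictReader(rows, delimiter="\t")
    return {row[column] for row in reader if row.get(column)}


def diff_samples(maf_text, matrix_text):
    """Compare header samples with samples actually present in the two files."""
    header = read_sequenced_samples(maf_text)
    present = (read_column(maf_text, "Tumor_Sample_Barcode")
               | read_column(matrix_text, "SAMPLE_ID"))
    return {
        "header_only": sorted(header - present),
        "missing_from_header": sorted(present - header),
    }


# Toy data, invented for illustration only.
maf = (
    "#sequenced_samples: S1 S2 S3\n"
    "Hugo_Symbol\tTumor_Sample_Barcode\n"
    "TP53\tS1\n"
    "KRAS\tS4\n"
)
matrix = (
    "SAMPLE_ID\tmutations\n"
    "S1\tPANEL_A\n"
    "S2\tPANEL_A\n"
    "S4\tPANEL_A\n"
)
result = diff_samples(maf, matrix)
print(result)
```

With the toy data, S3 shows up only in the header and S4 is in the files but missing from the header.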

callachennault commented 2 months ago

https://mskconfluence.mskcc.org/display/CDSI/Genomic+Profile+Sample+Counts+vs+Case+Lists TODO cleanup