knowledgesystems / pipelines-scrum

Repository for tracking uncategorizable issues related to backend pipelines work

Duplicate variants in MSK_Sophia_LungPts_cBio_mutations_extended - 7-27-23.txt #1214

Closed n1zea144 closed 5 months ago

n1zea144 commented 7 months ago

Done Condition (What do we need? Why do we need it? Keep this as small as possible!)

Understand why duplicates got into the MAF. Update the pipeline to remove duplicates from the delivered MAF.

Technical Description (How are we going to achieve the above)

Sophia claims there are 2556 duplicate variants in the mutations extended file. The following Unix sort/uniq commands identify 2213 duplicates:

sort -k 1,1 -k 5,5 -k 6,6 -k 7,7 < ~/tmp/sophia-lung-cohort-maf.txt > ~/tmp/sophia-lung-cohort-maf-sorted.txt
uniq -d ~/tmp/sophia-lung-cohort-maf-sorted.txt > ~/tmp/sophia-lung-cohort-maf-duplicates.txt

(sophia-lung-cohort-maf.txt is a renamed copy of MSK_Sophia_LungPts_cBio_mutations_extended - 7-27-23.txt)

sophia-lung-cohort-maf-duplicates.txt
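
For reference, the sort/uniq check above can be wrapped into two small helpers. This is only a sketch, not the fix that went into the pipeline: the function names find_exact_duplicates and remove_exact_duplicates are hypothetical, and both assume the first line of the file is the column header. Note that uniq -d only reports lines that are identical in every column, so records that share a position but differ in some other column are not counted. That is possibly one reason the two counts above differ, but it has not been verified here.

```shell
# Hypothetical helper: print each data line that occurs more than once
# (one copy per duplicated line). Assumes line 1 is the column header,
# so it is left out of the comparison.
find_exact_duplicates() {
  tail -n +2 "$1" | LC_ALL=C sort | uniq -d
}

# Hypothetical helper: drop repeated lines, keeping the first occurrence
# of each line and preserving the original row order (header included).
remove_exact_duplicates() {
  awk '!seen[$0]++' "$1"
}
```

Usage, with the file name from this issue: find_exact_duplicates ~/tmp/sophia-lung-cohort-maf.txt | wc -l gives the number of distinct duplicated lines, and remove_exact_duplicates ~/tmp/sophia-lung-cohort-maf.txt writes a de-duplicated copy to standard output.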

Potential Issues

Dependencies

Technical Requirements

Outside People/Teams

Changes

callachennault commented 6 months ago

https://github.com/knowledgesystems/cmo-pipelines/pull/1099