Open abhi18av opened 2 years ago
Hi @shaze and @jeantristanb
I have been working on a DAG for the QC workflow and I've used the mermaid diagram for this purpose. I have relied on the mermaid.live to edit it and have a live preview, in terms of customizability we can use the colors/styling later on.
I am still giving it the finishing touches, especially w.r.t the channel names between the processes however, we can already see that we can already split the QC_REPORT
out as a separate sub-workflow.
In addition, many of the channels which needed to be duplicated (due to DSL1 limitations) can be removed.
The plan is that we embed the DSL2 modules (which @Mxrcon is working on) and use the DAG as a guide for the time being.
graph LR
subgraph SAMPLESHEET_AND_MD5
FIXME_1 -- sample_sheet_ch --> 2[2.samplesheet]
3[3.noSamplesheet]
FIXME_2 -- checked_input.inpmd5ch --> 1[1.inMD5]
end
subgraph QC_PROCESSES
2 -. poor_gc10_ch .-> 23[23.removeQCindivs]
3 -. poor_gc10_ch .-> 23[23.removeQCindivs]
checkedInput((checkedInput)) -- bim_ch --> 4--> 4.[getDuplicateMarkers]
4[4.getDuplicateMarkers] -- duplicates_ch --> 5[5.removeDuplicateSNPs]
checkedInput((checkedInput)) -- raw_ch --> 5[5.removeDuplicateSNPs]
5[5.removeDuplicateSNPs]-- qc1D_ch --> 6[6.getX] -- X_chr_ch --> 7[7.analyzeX]
5[5.removeDuplicateSNPs]-- ind_miss_ch1 --> 10[10.generateIndivMissingnessPlot]
5[5.removeDuplicateSNPs]-- snp_miss_ch --> 9[9.generateSnpMissingnessPlot]
5[5.removeDuplicateSNPs]-- qc1B_ch --> 8[8.identifyIndivDiscSexinfo] -- hwe_stats_ch --> 13[13.showHWEStats]
5[5.removeDuplicateSNPs]-- qc1C_ch --> 11[11.getInitMAF] -- init_freq_ch --> 12[12.showInitMAF]
5[5.removeDuplicateSNPs]-- qc1_ch --> 14[14.removeQCPhase1]
5[5.removeDuplicateSNPs]-- ind_miss_ch2 --> 19[19.findRelatedIndiv]
7[7.analyzeX] -- x_analy_res_ch --> 33[33.batchProc]
8[8.identifyIndivDiscSexInfo] -- failed_sex_ch1 --> 23[23.removeQCIndivs]
8[8.identifyIndivDiscSexInfo] --> 33[33.batchProc]
14[14.removeQCPhase1] -- qc2A_ch --> 15[15.compPCA]
14[14.removeQCPhase1] -- qc2B_ch --> 17[17.pruneForIBDLD]
14[14.removeQCPhase1] -- qc2B_ch --> 18[18.pruneForIBD]
14[14.removeQCPhase1] -- qc2C_ch --> 20[20.calculateSampleHeterozygosity]
14[14.removeQCPhase1] -- qc2D_ch --> 23[23.removeQCIndivs]
15[15.compPCA]-- pcares --> 16[16.drawPCA]
15[15.compPCA]--> 18[18.pruneForIBD] --> 19[19.findRelatedIndiv]
15[15.compPCA]--> 33[33.batchProc]
17 .-> 33[33.batchProc]
18 .-> 33[33.batchProc]
19[19.findRelatedIndiv] --> 23[23.removeQCIndivs]
19[19.findRelatedIndiv] --> 33[33.batchProc]
20[20.calculateSampleHeterozygosity] --> 21[21.generateMissHetPlot]
20[20.calculateSampleHeterozygosity] --> 22[22.getBadIndivsMissingHet]
22[22.getBadIndivsMissingHet] --> 23[23.removeQCIndivs]
23[23.removeQCIndivs] --> 24[24.calculateSnpSkewStatus]
23[23.removeQCIndivs] --> 27[27.removeSkewSnps]
24[24.calculateSnpSkewStatus] --> 25[25.generateDifferentialMissingnessPlot]
24[24.calculateSnpSkewStatus] --> 26[26.findSnpExtremeDifferentialMissingness]
24[24.calculateSnpSkewStatus] --> 30[30.findHWEofSNPs] --> 31[31.generateHwePlot]
26[26.findSnpExtremeDifferentialMissingness] --> 27[27.removeSkewSnps]
27[27.removeSkewSnps] --> 28[28.calculateMaf]
27[27.removeSkewSnps] --> 32[32.outMD5]
28[28.calculateMaf] --> 29[29.generateMafPlot]
end
subgraph QC_REPORT
1 -- report_inpmd5_ch --> 34[34.produceReports]
5[5.removeDuplicateSNPs]--> 34[34.produceReports]
8[8.identifyIndivDiscSexInfo] --> 34[34.produceReports]
%% style 8 fill:#f9f,stroke:#333,stroke-width:4px
9 --> 34[34.produceReports]
10 --> 34[34.produceReports]
12 --> 34[34.produceReports]
13 --> 34[34.produceReports]
14[14.removeQCPhase1]--> 34[34.produceReports]
21[21.generateMissHetPlot] --> 34[34.produceReports]
22[22.getBadIndivsMissingHet] --> 34[34.produceReports]
25[25.generateDifferentialMissingnessPlot] --> 34[34.produceReports]
26[26.findSnpExtremeDifferentialMissingness] --> 34[34.produceReports]
29[29.generateMafPlot] --> 34[34.produceReports]
31[31.generateHwePlot] --> 34[34.produceReports]
32[32.outMD5] --> 34[34.produceReports]
33[33.batchProc] --> 34[34.produceReports]
end
The code for the diagram is mentioned below
graph LR
subgraph SAMPLESHEET_AND_MD5
FIXME_1 -- sample_sheet_ch --> 2[2.samplesheet]
3[3.noSamplesheet]
FIXME_2 -- checked_input.inpmd5ch --> 1[1.inMD5]
end
subgraph QC_PROCESSES
2 -. poor_gc10_ch .-> 23[23.removeQCindivs]
3 -. poor_gc10_ch .-> 23[23.removeQCindivs]
checkedInput((checkedInput)) -- bim_ch --> 4--> 4.[getDuplicateMarkers]
4[4.getDuplicateMarkers] -- duplicates_ch --> 5[5.removeDuplicateSNPs]
checkedInput((checkedInput)) -- raw_ch --> 5[5.removeDuplicateSNPs]
5[5.removeDuplicateSNPs]-- qc1D_ch --> 6[6.getX] -- X_chr_ch --> 7[7.analyzeX]
5[5.removeDuplicateSNPs]-- ind_miss_ch1 --> 10[10.generateIndivMissingnessPlot]
5[5.removeDuplicateSNPs]-- snp_miss_ch --> 9[9.generateSnpMissingnessPlot]
5[5.removeDuplicateSNPs]-- qc1B_ch --> 8[8.identifyIndivDiscSexinfo] -- hwe_stats_ch --> 13[13.showHWEStats]
5[5.removeDuplicateSNPs]-- qc1C_ch --> 11[11.getInitMAF] -- init_freq_ch --> 12[12.showInitMAF]
5[5.removeDuplicateSNPs]-- qc1_ch --> 14[14.removeQCPhase1]
5[5.removeDuplicateSNPs]-- ind_miss_ch2 --> 19[19.findRelatedIndiv]
7[7.analyzeX] -- x_analy_res_ch --> 33[33.batchProc]
8[8.identifyIndivDiscSexInfo] -- failed_sex_ch1 --> 23[23.removeQCIndivs]
8[8.identifyIndivDiscSexInfo] --> 33[33.batchProc]
14[14.removeQCPhase1] -- qc2A_ch --> 15[15.compPCA]
14[14.removeQCPhase1] -- qc2B_ch --> 17[17.pruneForIBDLD]
14[14.removeQCPhase1] -- qc2B_ch --> 18[18.pruneForIBD]
14[14.removeQCPhase1] -- qc2C_ch --> 20[20.calculateSampleHeterozygosity]
14[14.removeQCPhase1] -- qc2D_ch --> 23[23.removeQCIndivs]
15[15.compPCA]-- pcares --> 16[16.drawPCA]
15[15.compPCA]--> 18[18.pruneForIBD] --> 19[19.findRelatedIndiv]
15[15.compPCA]--> 33[33.batchProc]
17 .-> 33[33.batchProc]
18 .-> 33[33.batchProc]
19[19.findRelatedIndiv] --> 23[23.removeQCIndivs]
19[19.findRelatedIndiv] --> 33[33.batchProc]
20[20.calculateSampleHeterozygosity] --> 21[21.generateMissHetPlot]
20[20.calculateSampleHeterozygosity] --> 22[22.getBadIndivsMissingHet]
22[22.getBadIndivsMissingHet] --> 23[23.removeQCIndivs]
23[23.removeQCIndivs] --> 24[24.calculateSnpSkewStatus]
23[23.removeQCIndivs] --> 27[27.removeSkewSnps]
24[24.calculateSnpSkewStatus] --> 25[25.generateDifferentialMissingnessPlot]
24[24.calculateSnpSkewStatus] --> 26[26.findSnpExtremeDifferentialMissingness]
24[24.calculateSnpSkewStatus] --> 30[30.findHWEofSNPs] --> 31[31.generateHwePlot]
26[26.findSnpExtremeDifferentialMissingness] --> 27[27.removeSkewSnps]
27[27.removeSkewSnps] --> 28[28.calculateMaf]
27[27.removeSkewSnps] --> 32[32.outMD5]
28[28.calculateMaf] --> 29[29.generateMafPlot]
end
subgraph QC_REPORT
1 -- report_inpmd5_ch --> 34[34.produceReports]
5[5.removeDuplicateSNPs]--> 34[34.produceReports]
8[8.identifyIndivDiscSexInfo] --> 34[34.produceReports]
%% style 8 fill:#f9f,stroke:#333,stroke-width:4px
9 --> 34[34.produceReports]
10 --> 34[34.produceReports]
12 --> 34[34.produceReports]
13 --> 34[34.produceReports]
14[14.removeQCPhase1]--> 34[34.produceReports]
21[21.generateMissHetPlot] --> 34[34.produceReports]
22[22.getBadIndivsMissingHet] --> 34[34.produceReports]
25[25.generateDifferentialMissingnessPlot] --> 34[34.produceReports]
26[26.findSnpExtremeDifferentialMissingness] --> 34[34.produceReports]
29[29.generateMafPlot] --> 34[34.produceReports]
31[31.generateHwePlot] --> 34[34.produceReports]
32[32.outMD5] --> 34[34.produceReports]
33[33.batchProc] --> 34[34.produceReports]
end
The goal of initial workflow (re-)design in DSL2 is to reach feature parity with the QC workflow of h3agwas (published https://github.com/h3abionet/h3agwas/releases/tag/v3.2.0)