nf-core / gwas

UNDER CONSTRUCTION: A pipeline for Genome Wide Association Studies
https://nf-co.re/gwas
MIT License
22 stars 17 forks source link

Reach feature parity with QC workflow of h3agwas #29

Open abhi18av opened 2 years ago

abhi18av commented 2 years ago

The goal of initial workflow (re-)design in DSL2 is to reach feature parity with the QC workflow of h3agwas (published https://github.com/h3abionet/h3agwas/releases/tag/v3.2.0)

abhi18av commented 2 years ago

Hi @shaze and @jeantristanb

I have been working on a DAG for the QC workflow and I've used the mermaid diagram for this purpose. I have relied on the mermaid.live to edit it and have a live preview, in terms of customizability we can use the colors/styling later on.

I am still giving it the finishing touches, especially w.r.t the channel names between the processes however, we can already see that we can already split the QC_REPORT out as a separate sub-workflow.

In addition, many of the channels which needed to be duplicated (due to DSL1 limitations) can be removed.

The plan is that we embed the DSL2 modules (which @Mxrcon is working on) and use the DAG as a guide for the time being.

graph LR
    subgraph SAMPLESHEET_AND_MD5

    FIXME_1 -- sample_sheet_ch --> 2[2.samplesheet]

    3[3.noSamplesheet]

    FIXME_2 -- checked_input.inpmd5ch --> 1[1.inMD5]

    end

    subgraph QC_PROCESSES

    2 -. poor_gc10_ch .-> 23[23.removeQCindivs]
    3 -. poor_gc10_ch .-> 23[23.removeQCindivs]

    checkedInput((checkedInput)) -- bim_ch --> 4--> 4.[getDuplicateMarkers]
    4[4.getDuplicateMarkers] -- duplicates_ch --> 5[5.removeDuplicateSNPs]

    checkedInput((checkedInput)) -- raw_ch --> 5[5.removeDuplicateSNPs]
    5[5.removeDuplicateSNPs]-- qc1D_ch --> 6[6.getX] -- X_chr_ch --> 7[7.analyzeX]
    5[5.removeDuplicateSNPs]-- ind_miss_ch1 --> 10[10.generateIndivMissingnessPlot]
    5[5.removeDuplicateSNPs]-- snp_miss_ch --> 9[9.generateSnpMissingnessPlot]
    5[5.removeDuplicateSNPs]-- qc1B_ch --> 8[8.identifyIndivDiscSexinfo] -- hwe_stats_ch --> 13[13.showHWEStats]
    5[5.removeDuplicateSNPs]-- qc1C_ch --> 11[11.getInitMAF] -- init_freq_ch --> 12[12.showInitMAF]
    5[5.removeDuplicateSNPs]-- qc1_ch --> 14[14.removeQCPhase1]
    5[5.removeDuplicateSNPs]-- ind_miss_ch2 --> 19[19.findRelatedIndiv]

    7[7.analyzeX] -- x_analy_res_ch --> 33[33.batchProc]

    8[8.identifyIndivDiscSexInfo] -- failed_sex_ch1 --> 23[23.removeQCIndivs]
    8[8.identifyIndivDiscSexInfo] --> 33[33.batchProc]

    14[14.removeQCPhase1] -- qc2A_ch --> 15[15.compPCA]
    14[14.removeQCPhase1] -- qc2B_ch --> 17[17.pruneForIBDLD]
    14[14.removeQCPhase1] -- qc2B_ch --> 18[18.pruneForIBD]
    14[14.removeQCPhase1] -- qc2C_ch --> 20[20.calculateSampleHeterozygosity]
    14[14.removeQCPhase1] -- qc2D_ch --> 23[23.removeQCIndivs]

    15[15.compPCA]-- pcares --> 16[16.drawPCA]
    15[15.compPCA]--> 18[18.pruneForIBD] --> 19[19.findRelatedIndiv]
    15[15.compPCA]--> 33[33.batchProc]

    17 .-> 33[33.batchProc]
    18 .-> 33[33.batchProc]

    19[19.findRelatedIndiv] --> 23[23.removeQCIndivs]
    19[19.findRelatedIndiv] --> 33[33.batchProc]

    20[20.calculateSampleHeterozygosity] --> 21[21.generateMissHetPlot]
    20[20.calculateSampleHeterozygosity] --> 22[22.getBadIndivsMissingHet]

    22[22.getBadIndivsMissingHet] --> 23[23.removeQCIndivs]

    23[23.removeQCIndivs] --> 24[24.calculateSnpSkewStatus]
    23[23.removeQCIndivs] --> 27[27.removeSkewSnps]

    24[24.calculateSnpSkewStatus] --> 25[25.generateDifferentialMissingnessPlot]
    24[24.calculateSnpSkewStatus] --> 26[26.findSnpExtremeDifferentialMissingness]
    24[24.calculateSnpSkewStatus] --> 30[30.findHWEofSNPs] --> 31[31.generateHwePlot]

    26[26.findSnpExtremeDifferentialMissingness] --> 27[27.removeSkewSnps]

    27[27.removeSkewSnps] --> 28[28.calculateMaf]
    27[27.removeSkewSnps] --> 32[32.outMD5]

    28[28.calculateMaf] --> 29[29.generateMafPlot]

    end

    subgraph QC_REPORT
    1 -- report_inpmd5_ch --> 34[34.produceReports]
    5[5.removeDuplicateSNPs]--> 34[34.produceReports]
    8[8.identifyIndivDiscSexInfo] --> 34[34.produceReports]
    %% style 8 fill:#f9f,stroke:#333,stroke-width:4px

    9 --> 34[34.produceReports]
    10 --> 34[34.produceReports]
    12 --> 34[34.produceReports]
    13 --> 34[34.produceReports]
    14[14.removeQCPhase1]--> 34[34.produceReports]
    21[21.generateMissHetPlot] --> 34[34.produceReports]
    22[22.getBadIndivsMissingHet] --> 34[34.produceReports]
    25[25.generateDifferentialMissingnessPlot] --> 34[34.produceReports]
    26[26.findSnpExtremeDifferentialMissingness] --> 34[34.produceReports]
    29[29.generateMafPlot] --> 34[34.produceReports]
    31[31.generateHwePlot] --> 34[34.produceReports]
    32[32.outMD5] --> 34[34.produceReports]
    33[33.batchProc] --> 34[34.produceReports]

    end

The code for the diagram is mentioned below

graph LR
    subgraph SAMPLESHEET_AND_MD5

    FIXME_1 -- sample_sheet_ch --> 2[2.samplesheet]

    3[3.noSamplesheet]

    FIXME_2 -- checked_input.inpmd5ch --> 1[1.inMD5]

    end

    subgraph QC_PROCESSES

    2 -. poor_gc10_ch .-> 23[23.removeQCindivs]
    3 -. poor_gc10_ch .-> 23[23.removeQCindivs]

    checkedInput((checkedInput)) -- bim_ch --> 4--> 4.[getDuplicateMarkers]
    4[4.getDuplicateMarkers] -- duplicates_ch --> 5[5.removeDuplicateSNPs]

    checkedInput((checkedInput)) -- raw_ch --> 5[5.removeDuplicateSNPs]
    5[5.removeDuplicateSNPs]-- qc1D_ch --> 6[6.getX] -- X_chr_ch --> 7[7.analyzeX]
    5[5.removeDuplicateSNPs]-- ind_miss_ch1 --> 10[10.generateIndivMissingnessPlot]
    5[5.removeDuplicateSNPs]-- snp_miss_ch --> 9[9.generateSnpMissingnessPlot]
    5[5.removeDuplicateSNPs]-- qc1B_ch --> 8[8.identifyIndivDiscSexinfo] -- hwe_stats_ch --> 13[13.showHWEStats]
    5[5.removeDuplicateSNPs]-- qc1C_ch --> 11[11.getInitMAF] -- init_freq_ch --> 12[12.showInitMAF]
    5[5.removeDuplicateSNPs]-- qc1_ch --> 14[14.removeQCPhase1]
    5[5.removeDuplicateSNPs]-- ind_miss_ch2 --> 19[19.findRelatedIndiv]

    7[7.analyzeX] -- x_analy_res_ch --> 33[33.batchProc]

    8[8.identifyIndivDiscSexInfo] -- failed_sex_ch1 --> 23[23.removeQCIndivs]
    8[8.identifyIndivDiscSexInfo] --> 33[33.batchProc]

    14[14.removeQCPhase1] -- qc2A_ch --> 15[15.compPCA]
    14[14.removeQCPhase1] -- qc2B_ch --> 17[17.pruneForIBDLD]
    14[14.removeQCPhase1] -- qc2B_ch --> 18[18.pruneForIBD]
    14[14.removeQCPhase1] -- qc2C_ch --> 20[20.calculateSampleHeterozygosity]
    14[14.removeQCPhase1] -- qc2D_ch --> 23[23.removeQCIndivs]

    15[15.compPCA]-- pcares --> 16[16.drawPCA]
    15[15.compPCA]--> 18[18.pruneForIBD] --> 19[19.findRelatedIndiv]
    15[15.compPCA]--> 33[33.batchProc]

    17 .-> 33[33.batchProc]
    18 .-> 33[33.batchProc]

    19[19.findRelatedIndiv] --> 23[23.removeQCIndivs]
    19[19.findRelatedIndiv] --> 33[33.batchProc]

    20[20.calculateSampleHeterozygosity] --> 21[21.generateMissHetPlot]
    20[20.calculateSampleHeterozygosity] --> 22[22.getBadIndivsMissingHet]

    22[22.getBadIndivsMissingHet] --> 23[23.removeQCIndivs]

    23[23.removeQCIndivs] --> 24[24.calculateSnpSkewStatus]
    23[23.removeQCIndivs] --> 27[27.removeSkewSnps]

    24[24.calculateSnpSkewStatus] --> 25[25.generateDifferentialMissingnessPlot]
    24[24.calculateSnpSkewStatus] --> 26[26.findSnpExtremeDifferentialMissingness]
    24[24.calculateSnpSkewStatus] --> 30[30.findHWEofSNPs] --> 31[31.generateHwePlot]

    26[26.findSnpExtremeDifferentialMissingness] --> 27[27.removeSkewSnps]

    27[27.removeSkewSnps] --> 28[28.calculateMaf]
    27[27.removeSkewSnps] --> 32[32.outMD5]

    28[28.calculateMaf] --> 29[29.generateMafPlot]

    end

    subgraph QC_REPORT
    1 -- report_inpmd5_ch --> 34[34.produceReports]
    5[5.removeDuplicateSNPs]--> 34[34.produceReports]
    8[8.identifyIndivDiscSexInfo] --> 34[34.produceReports]
    %% style 8 fill:#f9f,stroke:#333,stroke-width:4px

    9 --> 34[34.produceReports]
    10 --> 34[34.produceReports]
    12 --> 34[34.produceReports]
    13 --> 34[34.produceReports]
    14[14.removeQCPhase1]--> 34[34.produceReports]
    21[21.generateMissHetPlot] --> 34[34.produceReports]
    22[22.getBadIndivsMissingHet] --> 34[34.produceReports]
    25[25.generateDifferentialMissingnessPlot] --> 34[34.produceReports]
    26[26.findSnpExtremeDifferentialMissingness] --> 34[34.produceReports]
    29[29.generateMafPlot] --> 34[34.produceReports]
    31[31.generateHwePlot] --> 34[34.produceReports]
    32[32.outMD5] --> 34[34.produceReports]
    33[33.batchProc] --> 34[34.produceReports]

    end