uab-cgds-worthey / quac

🦆 Quality Control of WGS and exome samples 🦆
https://quac.readthedocs.io
GNU General Public License v3.0

Set up system/smoke testing to check that the workflow runs successfully end-to-end. #4

Closed ManavalanG closed 1 year ago

ManavalanG commented 3 years ago

In GitLab by @ManavalanG on Mar 16, 2021, 23:06

ManavalanG commented 3 years ago

Attempt 1. Using output bams/vcfs from the small variant caller pipeline test datasets:

verifybamid exits with an error as it couldn't find any markers in that dataset.

Attempt 2. Using test bam file from verifybamid repo:

Our ref genome uses chr20 contig naming, but the bam file uses 20. The bam is also aligned to b37. verifybamid fails with the error "No reads found in any of the regions, exit!".
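A naming mismatch like this can be caught before running verifybamid. Below is a minimal sketch, assuming the bam header text has already been obtained (e.g. with samtools view -H) and the reference contig names are known. The function names and the example header are hypothetical and only for illustration; they are not part of QuaC.

```python
def contig_names_from_sam_header(header_text):
    """Return contig names found in the @SQ lines of a SAM/BAM header."""
    names = []
    for line in header_text.splitlines():
        if not line.startswith("@SQ"):
            continue
        for field in line.split("\t")[1:]:
            if field.startswith("SN:"):
                names.append(field[3:])
    return names


def naming_style(names):
    """Classify contig naming as 'chr-prefixed', 'bare', 'mixed' or 'empty'."""
    if not names:
        return "empty"
    prefixed = sum(name.startswith("chr") for name in names)
    if prefixed == len(names):
        return "chr-prefixed"
    if prefixed == 0:
        return "bare"
    return "mixed"


def shared_contigs(bam_names, ref_names):
    """Contigs present under the same name in both the bam and the reference."""
    return sorted(set(bam_names) & set(ref_names))


# Hypothetical header resembling a b37-style bam (contig named "20").
bam_header = "@HD\tVN:1.6\tSO:coordinate\n@SQ\tSN:20\tLN:63025520\n"
ref_names = ["chr20"]  # assumed naming of the reference in use

bam_names = contig_names_from_sam_header(bam_header)
print(naming_style(bam_names), naming_style(ref_names))
print("shared contigs:", shared_contigs(bam_names, ref_names))
```

An empty list of shared contigs means no read can fall in any reference region, which is consistent with the error seen here.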

Attempt 3. Use a small region of NA12878 bam

Cmd used: samtools view -b -h /data/project/worthey_lab/projects/experimental_pipelines/mana/test_tools/wgs/small_var_caller/data/NA12878/bam/NA12878.bam "chr20:59993-3653078" > extracted.bam

verifybamid exits with the error "Insufficient Available markers". Note that verifybamid testing uses just chr20 as the ref genome to get around this issue.

Attempt 4. Use a sub-sample of 40x NA12878 bam

Sub-sample 40x NA12878 bam to small-ish fraction.

Failed. Tested verifybamid with bams subsampled at various levels (0.01%, 0.1%, 0.5%). It ran successfully with the 0.5% bam but failed with the others due to "Insufficient Available markers". The 0.5% bam works, but it is 660 MB in size! PS: the full GRCh38 reference genome was used as the reference here.
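The subsampling tool is not stated above. As a sketch, assuming samtools view -s was used (its argument is SEED.FRACTION, i.e. integer part is the seed and the decimal part is the fraction of reads to keep), the commands for the three levels could be built as below. The helper names, the seed of 42 and the file names are hypothetical.

```python
def subsample_arg(fraction, seed=42):
    """Build the SEED.FRACTION value expected by `samtools view -s`."""
    if not 0 < fraction < 1:
        raise ValueError("fraction must be between 0 and 1 (exclusive)")
    # Keep up to 6 decimal places and drop trailing zeros: 0.005 -> "005".
    digits = f"{fraction:.6f}".split(".")[1].rstrip("0")
    if not digits:
        raise ValueError("fraction is too small to represent with 6 decimals")
    return f"{seed}.{digits}"


def subsample_command(in_bam, out_bam, percent, seed=42):
    """Return the samtools command (as an argument list) for a given percent."""
    fraction = percent / 100
    return [
        "samtools", "view", "-b",
        "-s", subsample_arg(fraction, seed),
        "-o", out_bam,
        in_bam,
    ]


# Levels tried above; input and output names are placeholders.
for percent in (0.01, 0.1, 0.5):
    cmd = subsample_command("NA12878.bam", f"NA12878_{percent}pct.bam", percent)
    print(" ".join(cmd))
```

The returned list can be passed to subprocess.run once the paths point at real files.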

ManavalanG commented 3 years ago

Solution adopted:

Note that the above two datasets are not connected (i.e. the vcf is not derived from the bam file used), but this is acceptable for current testing purposes.

The above steps are codified in a script - https://gitlab.rc.uab.edu/center-for-computational-genomics-and-data-science/sciops/pipelines/quac/-/blob/qc_under_one_umbrella/.test/setup_test_datasets.sh

ManavalanG commented 3 years ago

The Somalier vcf doesn't have a sample column, and this leads to an error when it is used with bcftools stats. So I switched to a test dataset provided by bcftools instead.
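A quick way to spot such a sites-only vcf is to look at the #CHROM header line: per the VCF format, sample names are the columns after FORMAT (the 9th column). A minimal sketch follows; the function name and the two example headers are made up for illustration.

```python
def sample_names(vcf_text):
    """Return the sample names listed on the #CHROM header line of a vcf."""
    for line in vcf_text.splitlines():
        if line.startswith("#CHROM"):
            columns = line.split("\t")
            # Columns 1-8 are fixed, column 9 is FORMAT, the rest are samples.
            return columns[9:]
    raise ValueError("no #CHROM header line found")


# Hypothetical headers: one sites-only, one with a single sample.
sites_only = (
    "##fileformat=VCFv4.2\n"
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n"
)
with_sample = (
    "##fileformat=VCFv4.2\n"
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tNA12878\n"
)

for label, text in (("sites-only", sites_only), ("with sample", with_sample)):
    names = sample_names(text)
    print(label, "->", names if names else "no sample column")
```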

ManavalanG commented 3 years ago

QuaC also needs test QC outputs for fastq (and a sample rename config), which are created by the small variant caller pipeline. This was achieved by running the small variant caller pipeline on its test datasets with some modifications. Steps are described in https://gitlab.rc.uab.edu/center-for-computational-genomics-and-data-science/sciops/pipelines/quac/-/blob/qc_under_one_umbrella/.test/README.md

ManavalanG commented 3 years ago

Capture region bed files are needed to support exome samples. Having unrelated bam and vcf files, each covering different genomic regions, could cause trouble. So I switched to new test bams and vcfs, which were derived from NA12878 by subsampling in the same genomic region. A test capture-regions bed file was also created.
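How the test bed file was made and which region it covers are not stated here. As a sketch of one way to keep a bed file consistent with a samtools-style region, the helper below converts a 1-based inclusive region string into a 0-based half-open BED line. The function name is hypothetical, and the region from the Attempt 3 command is used purely as an example.

```python
def region_to_bed(region):
    """Convert 'contig:start-end' (1-based, inclusive) to a BED line.

    BED coordinates are 0-based and half-open, so only the start shifts by one.
    """
    contig, sep, span = region.rpartition(":")
    if not sep or not contig:
        raise ValueError(f"expected 'contig:start-end', got {region!r}")
    start_str, dash, end_str = span.partition("-")
    if not dash:
        raise ValueError(f"expected 'contig:start-end', got {region!r}")
    start = int(start_str.replace(",", ""))
    end = int(end_str.replace(",", ""))
    if start < 1 or end < start:
        raise ValueError(f"invalid coordinates in {region!r}")
    return f"{contig}\t{start - 1}\t{end}"


# Example only: region taken from the Attempt 3 command above.
print(region_to_bed("chr20:59993-3653078"))
```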