Closed by ManavalanG 1 year ago
Attempt 1. Using output bams/vcfs from the small var caller pipeline test datasets:
verifybamid exits with an error as it couldn't find any markers in that dataset.
Attempt 2. Using the test bam file from the verifybamid repo:
Our ref genome uses "chr20" but the bam file has "20" as contig naming. Also, the bam is aligned to b37. verifybamid fails with the error "No reads found in any of the regions, exit!".
Attempt 3. Use a small region of NA12878 bam
Cmd used:
samtools view -b -h /data/project/worthey_lab/projects/experimental_pipelines/mana/test_tools/wgs/small_var_caller/data/NA12878/bam/NA12878.bam "chr20:59993-3653078" > extracted.bam
verifybamid exits with the error "Insufficient Available markers". Note that verifybamid's own testing uses just chr20 as the ref genome to get around this issue.
Attempt 4. Use a sub-sample of the 40x NA12878 bam
Sub-sample the 40x NA12878 bam to a small-ish fraction.
Failed. Tested verifybamid with bams subsampled at various levels (0.01%, 0.1%, 0.5%). It ran successfully with the 0.5% bam, but failed with the others due to "Insufficient Available markers". The 0.5% bam works, but it is 660 MB in size, which is too large for a test dataset. PS - The full GRCh38 reference genome was used as the reference here.
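The sub-sampling in attempt 4 could be scripted roughly as below. This is only a sketch: the exact commands used are not recorded here, and the helper name subsample_arg and the seed value are made up for illustration. It relies on `samtools view -s`, which takes a single SEED.FRACTION value (integer part is the seed, fractional part is the fraction of reads to keep).

```shell
#!/usr/bin/env bash
# Hypothetical helper: build the SEED.FRACTION argument for `samtools view -s`
# from a seed and a percentage (percent must be > 0 and < 100).
subsample_arg() {
    local seed="$1" percent="$2"
    awk -v seed="$seed" -v pct="$percent" 'BEGIN {
        frac = sprintf("%.6f", pct / 100)        # e.g. 0.5 -> "0.005000"
        printf "%d%s\n", seed, substr(frac, 2)   # drop the leading "0"
    }'
}

# Example use (assumes samtools is installed and NA12878.bam exists):
# for pct in 0.01 0.1 0.5; do
#     samtools view -b -s "$(subsample_arg 42 "$pct")" NA12878.bam > "NA12878.${pct}pct.bam"
# done
```

Fixing the seed keeps the sub-sampled bams reproducible between runs.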
Solution adopted:
Bam: Use the test bam file provided by verifybamid. Prep it to use the "chr" prefix for contigs and then change the sample name in its header as desired.
VCF: Use the test vcf file provided by somalier. Prep it to use the "chr" prefix for contigs.
Note that the above two datasets are not connected (i.e. the vcf is not derived from the bam file used), but this is acceptable for current testing purposes.
The above steps are codified into a script - https://gitlab.rc.uab.edu/center-for-computational-genomics-and-data-science/sciops/pipelines/quac/-/blob/qc_under_one_umbrella/.test/setup_test_datasets.sh
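For readers without access to that script, the bam header prep could look something like the sketch below. It is not taken from setup_test_datasets.sh; the function name rewrite_sam_header and the file names are placeholders. It edits SAM header text only, adding the "chr" prefix to @SQ contig names and replacing the SM tag on @RG lines.

```shell
#!/usr/bin/env bash
# Hypothetical helper: rewrite SAM header text read from stdin.
#   - add a "chr" prefix to contig names on @SQ lines that lack one
#   - set the sample name (SM tag) on @RG lines to the first argument
rewrite_sam_header() {
    local sample="$1"
    awk -v sample="$sample" 'BEGIN { FS = OFS = "\t" }
    /^@SQ/ {
        for (i = 2; i <= NF; i++)
            if ($i ~ /^SN:/ && $i !~ /^SN:chr/) sub(/^SN:/, "SN:chr", $i)
    }
    /^@RG/ {
        for (i = 2; i <= NF; i++)
            if ($i ~ /^SM:/) $i = "SM:" sample
    }
    { print }'
}

# Example use (assumes samtools is installed; file names are placeholders):
# samtools view -H test.bam | rewrite_sam_header my_sample > new_header.sam
# samtools reheader new_header.sam test.bam > test.chr.bam
```

Since bam records refer to contigs by index into the header, replacing the header is enough to rename them. One caveat: a plain prefix turns b37's "MT" into "chrMT", whereas GRCh38 names it "chrM", so the mitochondrial contig would need separate handling if it matters for the test.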
The somalier vcf doesn't have a sample column, and this leads to an error when using it with "bcftools stats". So switched to a test dataset provided by bcftools instead.
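A quick way to catch this kind of vcf before it reaches the pipeline is to count the sample columns on the #CHROM header line. The sketch below is illustrative only; vcf_sample_count is a made-up helper name, and it expects uncompressed VCF text on stdin.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the number of sample columns in a VCF read from stdin.
# The #CHROM line has 8 fixed columns, then FORMAT, then one column per sample.
vcf_sample_count() {
    awk -F '\t' '
        /^#CHROM/ { n = NF - 9; if (n < 0) n = 0; print n; found = 1; exit }
        END { if (!found) print 0 }'
}

# Example use (assumes bcftools is installed; file name is a placeholder):
# bcftools view -h test.vcf.gz | vcf_sample_count
```

`bcftools query -l test.vcf.gz | wc -l` should give the same count when bcftools is available.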
QuaC also needs test QC outputs for fastq (and a sample rename config), which get created by the small var caller pipeline. This was achieved by running the small variant caller pipeline using its test datasets with some modifications. Steps are described in https://gitlab.rc.uab.edu/center-for-computational-genomics-and-data-science/sciops/pipelines/quac/-/blob/qc_under_one_umbrella/.test/README.md
Capture region bed files are needed to support exome samples. Unrelated bam and vcf files that cover different genomic regions could cause trouble. So I switched to new test bams and vcfs, which were derived from NA12878 by subsampling in the same genomic region. A test capture-regions bed file was also created.
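To keep the bam, vcf and bed file on the same coordinates, a region string can be converted into a BED line as sketched below. The region actually used for the final test data is not stated here, so the one from attempt 3 is reused purely as an example, and region_to_bed is a made-up helper name. Samtools-style regions are 1-based and inclusive, while BED is 0-based and half-open, so only the start shifts by one.

```shell
#!/usr/bin/env bash
# Hypothetical helper: convert "contig:start-end" (1-based, inclusive)
# into a tab-separated BED line (0-based start, half-open end).
# Assumes the contig name itself contains no ":" or "-".
region_to_bed() {
    echo "$1" | awk -F '[:-]' '{ printf "%s\t%d\t%d\n", $1, $2 - 1, $3 }'
}

# Example use (file names are placeholders; inputs must be indexed):
# region_to_bed "chr20:59993-3653078" > capture_regions.bed
# samtools view -b -h NA12878.bam "chr20:59993-3653078" > test.bam
# bcftools view -R capture_regions.bed NA12878.vcf.gz -Oz -o test.vcf.gz
```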
In GitLab by @ManavalanG on Mar 16, 2021, 23:06