vgteam / vg

tools for working with genome variation graphs
https://biostars.org/tag/vg/

Merging alignment graphs vs merging BAMs #4214

Open dbrami opened 8 months ago

dbrami commented 8 months ago

Hi, I'm not getting any response on the BioStars VG forum, so I figured I'd post here:

I'm aligning my reads to the reference graph using the same process as found [here](https://github.com/google/deepvariant/blob/r1.6/docs/deepvariant-vg-case-study.md):

    time ${DATA_DIR}/vg giraffe --progress \
    --read-group "ID:1 LB:lib1 SM:HG003 PL:illumina PU:unit1" \
    --sample "HG003" \
    -o BAM --ref-paths ${DATA_DIR}/GRCh38.path_list.txt \
    -P -L 3000 \
    -f ${DATA_DIR}/HG003.novaseq.pcr-free.35x.R1.fastq.gz \
    -f ${DATA_DIR}/HG003.novaseq.pcr-free.35x.R2.fastq.gz \
    -Z ${DATA_DIR}/hprc-v1.1-mc-grch38.gbz \
    --kff-name ${DATA_DIR}/HG003.fq.kff \
    --haplotype-name ${DATA_DIR}/hprc-v1.1-mc-grch38.hapl \
    -t $(nproc) > reads.unsorted.bam

Here are my questions:

  1. Is there any difference or benefit in merging multiple graphs from multiple read pairs versus running "samtools merge" on the resulting BAMs (after sorting the alignments)? If so, what is the command for merging graphs?
  2. Can you clarify whether we should be removing duplicates? At which point should this occur? Which tool is recommended: Picard MarkDuplicates?

Thank you.

jeizenga commented 8 months ago

  1. As far as I can tell, there is only one graph involved here. What do you mean by "merging graphs"?
  2. If you have a PCR step in your library prep, any standard duplicate removal should work for the BAM output. MarkDuplicates is fine.

dbrami commented 8 months ago

Thanks for the response.

  1. If I have technical replicates and wish to add more depth before variant calling, I would want to combine the aligned BAM files. I'm asking whether there are any accuracy gains from merging graphs from technical replicates versus merging BAMs after converting the graph alignments to BAM.
  2. Yes - thanks for confirming.

jeizenga commented 8 months ago

It sounds like you are actually describing merging the reads before mapping. Is that correct? There is no second graph to be merged here. If I'm correct that you're referring to reads, then it shouldn't matter much. There might be a very small improvement in accuracy if you map the replicates separately and then combine the BAMs, because that will allow each replicate to estimate its own fragment length distribution during read mapping. I would expect the difference to be tiny though.
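
For the "map the replicates separately and then combine the BAMs" route, a minimal sketch of the combining step might look like the following. It assumes each replicate has already been mapped with its own `vg giraffe -o BAM` run, that `samtools` is on the `PATH`, and that the file names (`rep1.bam`, `rep2.bam`, `merged.bam`), the thread count of 4, and the `run` / `merge_replicates` helpers are all hypothetical and not part of vg or samtools. Setting `DRY_RUN=1` only prints the commands so they can be checked before running.

```shell
#!/usr/bin/env bash
# Sketch only: coordinate-sort each replicate's BAM, then merge and index.
# File names, thread count, and helper names are illustrative assumptions.
set -euo pipefail

# Print the command when DRY_RUN=1, otherwise execute it.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Usage: merge_replicates OUT.bam REP1.bam REP2.bam [...]
merge_replicates() {
  local out="$1"; shift
  local sorted=()
  local bam s
  for bam in "$@"; do
    # giraffe writes unsorted BAM, so sort each replicate first.
    s="${bam%.bam}.sorted.bam"
    run samtools sort -@ 4 -o "$s" "$bam"
    sorted+=("$s")
  done
  # samtools merge expects coordinate-sorted inputs by default.
  run samtools merge -@ 4 "$out" "${sorted[@]}"
  run samtools index "$out"
}
```

Calling `merge_replicates merged.bam rep1.bam rep2.bam` would then produce a single sorted, indexed BAM. If the replicates should be treated as one sample downstream, it is probably worth giving each mapping run a distinct `ID` in `--read-group` while keeping the same `SM`, and running duplicate marking (for example MarkDuplicates, as mentioned above) on the merged file if the library prep included a PCR step.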