vgteam / vg

tools for working with genome variation graphs
https://biostars.org/tag/vg/

Mapping paired end reads with vg giraffe: "Falling back on single-end mapping" #4276

Open evcurran opened 2 months ago

evcurran commented 2 months ago

Hello,

I've been mapping paired-end reads to a graph built with minigraph-cactus:

vg giraffe --progress --fragment-mean $frag_mean --fragment-stdev $frag_stdev -Z ${inDir}/arenosa_pg.d2.gbz -m ${inDir}/arenosa_pg.d2.min -d ${inDir}/arenosa_pg.d2.dist -f $fq1 -f $fq2 -t 16 > ${outdir}/${line}.gam

The job finishes with the following, including the warning that single-end mapping was used:

Using fragment length estimate: 149.056 +/- 10.3994
warning[vg::giraffe]: Cannot cluster reads with a fragment distance smaller than read distance
                      Fragment length distribution: mean=149.056, stdev=10.3994
                      Fragment distance limit: 169, read distance limit: 201
warning[vg::giraffe]: Falling back on single-end mapping
Mapped 37134034 reads across 16 threads in 1538.31 seconds with 0.0203757 additional single-threaded seconds.
Mapping speed: 1508.72 reads per second per thread
Used 24396.3 CPU-seconds (including output).
Achieved 1522.12 reads per CPU-second (including output)
Used 201059975612237 CPU instructions (not including output).
Mapping slowness: 5.41444 M instructions per read at 8241.4 M mapping instructions per inclusive CPU-second
Memory footprint: 23.9501 GB

Here are the stats of the resulting gam file:

Total alignments: 31443472
Total primary: 31443472
Total secondary: 0
Total aligned: 28039881
Total perfect: 8795394
Total gapless (softclips allowed): 23925064
Total paired: 31443472
Total properly paired: 0
Alignment score: mean 138.204, median 151, stdev 30.6938, max 161 (7133288 reads)
Mapping quality: mean 39.8665, median 60, stdev 25.1088, max 60 (15262285 reads)
Insertions: 5320436 bp in 2103474 read events
Deletions: 9863023 bp in 5339307 read events
Substitutions: 44040586 bp in 44040586 read events
Softclips: 210513256 bp in 4883684 read events
Total time: 18285.9 seconds
Speed: 1719.54 reads/second

If I continue processing this data with vg pack and vg call (I plan on looking at SVs in a VCF), should I be worried about accuracy, given that there are no properly paired reads? Are there any additional steps I should be including?
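
For reference, the downstream commands I have in mind look roughly like this (just a sketch: the -Q 5 MAPQ filter is a placeholder, and I'm not sure yet whether vg call can take the GBZ directly):

# compute per-base read support from the alignments
vg pack -x ${inDir}/arenosa_pg.d2.gbz -g ${outdir}/${line}.gam -Q 5 -o ${outdir}/${line}.pack -t 16
# genotype variants (including SVs embedded in the graph) from that support
vg call ${inDir}/arenosa_pg.d2.gbz -k ${outdir}/${line}.pack -t 16 > ${outdir}/${line}.vcf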

Thanks for any help with this, Emma

jeizenga commented 2 months ago

In general, it's better to let vg giraffe estimate the fragment length distribution on its own rather than specifying it at the command line. I strongly suspect the parameters you provided don't match the true fragment length distribution. In my experience, the observed standard deviations are usually at least 50 bp. Also, unless you have highly fragmented data (e.g. ancient DNA), your fragment lengths are probably longer than your read length (which appears to be 151). How did you arrive at the values you gave it?
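
As a rough check (this assumes the fragment distance limit is about the mean plus two standard deviations, which matches the 169 reported in your warning), the values you supplied put that limit well below the ~201 bp read distance limit, which is why giraffe fell back to single-end mapping:

awk 'BEGIN { mean = 149.056; stdev = 10.3994; print mean + 2*stdev }'    # ~169.9, below the read distance limit of 201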

evcurran commented 2 months ago

Hi, thanks for your quick response!

I see; I've confused fragment length and read length here. I was letting giraffe estimate the fragment length distribution, but I have been having problems with giraffe jobs running slowly and eventually hanging with no progress. Some of my samples completed alignment, but I'm struggling with the rest. While trying to find a solution to this, I saw a suggestion to specify the fragment length distribution.

jeizenga commented 2 months ago

To help you with that, I'd have to know more about the graph construction and mapping pipeline you're running. Could you copy the commands?

evcurran commented 2 months ago

I constructed my graph with minigraph-cactus from one chromosome-level assembly and 34 unscaffolded assemblies (a primary and an alt assembly from each of 17 samples) using the following:

cactus-pangenome ./jobstore arenosa_seqfile_update.txt --outDir ${pgDir} --outName arenosa_pg_allIndexes --reference Aare --filter 2 --haplo --giraffe clip filter --viz --odgi --chrom-vg clip filter --chrom-og --permissiveContigFilter --gbz clip filter full --gfa clip full --vcf --logFile ./arenosa_pg_allIndexes.log --binariesMode singularity --refContigs scaffold_1 scaffold_2 scaffold_3 scaffold_4 scaffold_5 scaffold_6 scaffold_7 scaffold_8 

I then mapped paired-end reads to the 'd2' filtered graph using the following:

$vg giraffe -Z ${inDir}/arenosa_pg.d2.gbz -m ${inDir}/arenosa_pg.d2.min -d ${inDir}/arenosa_pg.d2.dist -f $fq1 -f $fq2 -t 8 > ${outdir}/${line}.gam

Some jobs completed, others stalled. Here's an example of the gam stats from a sample that successfully ran:

Total alignments: 27654834
Total primary: 27654834
Total secondary: 0
Total aligned: 27008484
Total perfect: 6726705
Total gapless (softclips allowed): 24682699
Total paired: 27654834
Total properly paired: 19942108
Alignment score: mean 89.5906, median 121, stdev 67.7648, max 160 (5669101 reads)
Mapping quality: mean 20.6108, median 2, stdev 25.2075, max 60 (6516165 reads)
Insertions: 2583430 bp in 1127200 read events
Deletions: 4989208 bp in 2820754 read events
Substitutions: 25572674 bp in 25572674 read events
Softclips: 1398492172 bp in 21046554 read events
Total time: 103069 seconds
Speed: 268.314 reads/second

Thanks!

jeizenga commented 2 months ago

I'm not sure if this is the cause of the slowdown, but those statistics suggest a wildly repetitive pangenome graph. More than half of your reads have MAPQ ≤ 2, and the primary alignment scores seem low as well. Is that expected with the species that you're studying?
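
If you want to see the MAPQ distribution directly rather than just the summary, something like this should work (it can be slow on a large GAM, and mapping_quality is omitted from the JSON when it is 0, hence the // 0 default):

vg view -a ${outdir}/${line}.gam | jq -r '.mapping_quality // 0' | sort -n | uniq -c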

evcurran commented 2 months ago

Perhaps something has gone wrong during graph construction, as it shouldn't be extremely repetitive. What is it in the stats that suggests a repetitive graph? Is it the low mapping quality? How do you interpret primary alignment scores? Is this the 'Alignment score' distribution, and is it based on read length? Is there a description somewhere of the differences between the various stats, e.g. 'Total alignments' vs. 'Total aligned', and 'Total paired' vs. 'Total properly paired'? Sorry for all my questions!

Here are some stats from another gam file, from a different sample aligned to the same graph, which appears to have much higher mapping quality. Is this closer to what I should be aiming for?

Total alignments: 83186564
Total primary: 83186564
Total secondary: 0
Total aligned: 70184776
Total perfect: 24706816
Total gapless (softclips allowed): 61168167
Total paired: 83186564
Total properly paired: 68453422
Alignment score: mean 132.785, median 150, stdev 37.9442, max 161 (16490059 reads)
Mapping quality: mean 39.5819, median 60, stdev 26.463, max 60 (41425218 reads)
Insertions: 12301934 bp in 4598768 read events
Deletions: 22375927 bp in 12315834 read events
Substitutions: 99440665 bp in 99440665 read events
Softclips: 579269976 bp in 11161033 read events
Total time: 175220 seconds
Speed: 474.754 reads/second

Many thanks!

jeizenga commented 1 month ago

It's primarily the MAPQs that suggest repetition. The primary reason for low MAPQ is that there is more than one equally good mapping location. That happens to some extent in every genome, but having a median of 2 is unusual. That means we have less than 50% confidence in more than 50% of the reads.
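
For reference, MAPQ is -10 * log10 of the probability that the reported location is wrong, so MAPQ 3 is roughly the 50% confidence mark and MAPQ 2 works out to about 37%:

awk 'BEGIN { mapq = 2; printf "P(correct) = %.2f\n", 1 - 10^(-mapq/10) }'    # ~0.37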

Scores are a little harder to interpret, but vg's default match bonus is 1 per base, and the mismatch penalty is 4 per base. Because short read sequencing tends to be pretty accurate, most reads get a score that's similar to the read length. It looks like you have 150 bp reads. Your median read has score ~120, which would imply ~6 mismatches. That comes to a 4% combined error+polymorphism rate, which is high.
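
As a back-of-the-envelope check (ignoring indels and soft clips), each mismatch costs 5 points relative to a perfect match: the lost +1 bonus plus the -4 penalty.

awk 'BEGIN { len = 150; score = 120; mism = (len - score) / 5; printf "%.0f mismatches, ~%.0f%% of the read\n", mism, 100 * mism / len }'    # prints: 6 mismatches, ~4% of the read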

For the terminology: 'Total alignments' counts every alignment record in the GAM, mapped or not; 'Total aligned' counts the reads that actually mapped to the graph; 'Total paired' counts reads that were mapped as part of a read pair; and 'Total properly paired' counts reads whose partner mapped nearby in the expected orientation, consistent with the fragment length distribution.

The stats from your other GAM seem much closer to typical expectations, which suggests that the problem probably isn't the graph. Maybe it would be worth doing some sequencing QC?
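
If it helps, something like fastp (or FastQC) on the raw FASTQs would show per-base quality, adapter content, and duplication for that sample; file names here are placeholders:

fastp -i sample_R1.fastq.gz -I sample_R2.fastq.gz -o trimmed_R1.fastq.gz -O trimmed_R2.fastq.gz --html sample_fastp.html --json sample_fastp.json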

evcurran commented 1 month ago

OK, that's really helpful information, thank you!

Is it possible that the quality of the short reads I'm trying to align is what was making the jobs so slow or causing them to stall, or could that be a separate issue?

Thanks!

jeizenga commented 1 month ago

Could be possible. I generally wouldn't expect the slow-down to be dramatic. However, I did notice that the higher-error reads you showed before mapped at about half the speed of the lower-error reads. Another thing to check is whether the file size of the output is growing over time. If not, that would suggest that it's hanging, or perhaps completely bogged down by a small number of reads. If it is growing, that would look more like an overall speed issue.
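
For example, checking the output size every minute or so would tell you which case you're in (path is a placeholder):

watch -n 60 ls -lh ${outdir}/${line}.gam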