vgteam / vg

tools for working with genome variation graphs
https://biostars.org/tag/vg/
Other
1.08k stars 191 forks source link

Mapping very short reads with vg giraffe #3998

Open gamerino opened 1 year ago

gamerino commented 1 year ago

Hi vg team!

I'm trying to use vg giraffe for mapping very short paired-end reads (36bp) against a pangenome.

I tried running this command line

> vg giraffe -M 10 -x ${panIdxPath}/${panPre}.xg -g ${panIdxPath}/${panPre}.gg \
-H ${panIdxPath}/${panPre}.gbwt -L 2000 -m ${panIdxPath}/${panPre}.min  \
-d ${panIdxPath}/${panPre}.dist -p -f ${fastq1} -f ${fastq2} -t 30 1> ${outPre}.gam \
--report-name ${log} 2> ${log}_run.log

The program didn't produce an error but it returned single-end alignments since it failed to pair reads.

warning[vg::giraffe]: Encountered 100000 ambiguously-paired reads before finding enough
                      unambiguously-paired reads to learn fragment length distribution. Are you sure
                      your reads are paired and your graph is not a hairball?
warning[vg::giraffe]: Finalizing fragment length distribution before reaching maximum sample size
                      mapped 0 reads single ended with 100000 pairs of reads left unmapped
                      mean: 0, stdev: 1
warning[vg::giraffe]: Cannot cluster reads with a fragment distance smaller than read distance
                      Fragment length distribution: mean=0, stdev=1
                      Fragment distance limit: 2, read distance limit: 200
warning[vg::giraffe]: Falling back on single-end mapping
Using fragment length estimate: 0 +/- 1

I tried re-running vg giraffe setting different values for some parameters I understood are related to read length:

but in all cases, I obtained the same results.

I also tried to manually set --fragment-mean 86 and --fragment-stdev 80. In this case, warning messages didn't appear in the log file, instead it reports the defined values were used.

Using fragment length estimate: 86 +/- 80

However, when I checked the stats (samtools flagstats) for the bam file obtained with vg surject, the percentage of properly paired reads is 0%

> vg surject -b -i -t 30 -x ${panIdxPath}/${panPre}.xg ${prefix}.gam 1> ${prefix}.bam
> samtools flagstat -@30 ${prefix}.bam

326331274 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 secondary
0 + 0 supplementary
0 + 0 duplicates
0 + 0 mapped (0.00% : N/A)
326331274 + 0 paired in sequencing
163165637 + 0 read1
163165637 + 0 read2
0 + 0 properly paired (0.00% : N/A)
0 + 0 with itself and mate mapped
0 + 0 singletons (0.00% : N/A)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)

I read the documentation and wiki page but I didn't find a way to fix it. I also tried to look for a parameter for reducing the seed value (default 29) but I didn't find it.

Is there any limitation in the vg giraffe implementation for the minimum read length for paired-end experiments? I ran it using 76PE and 101PE and it worked fine. If it is not, can you please recommend me alternatives for being able to get alignments for properly paired reads for this experiment?

Thank you for your help!

Gabriela

jltsiren commented 1 year ago

By default, Giraffe uses k = 29, w = 11 minimizers. If the reads are shorter than the minimizer window (k+w-1 bp), there won't be any seeds. You can adjust the minimizer parameters by building a new minimizer index. For example:

vg minimizer -p -t 16 -k 21 -w 11 -d graph.dist -g graph.gbwt -o graph.min graph.gg
gamerino commented 1 year ago

Hi @jltsiren! Thank you!

I was able to build this new minimizer index. Now, for running vg giraffe for my PE 36bp reads do you recommend using default parameter values or considering some of the alternatives I was trying before like:

Thanks in advance!

jltsiren commented 1 year ago

I would try the default parameters first. Nobody has much experience with using Giraffe with very short reads, but I'd expect that if any issues arise, they will be about not having enough seeds.

You may have to try different minimizer parameters if the fraction of mapped reads is lower than expected.

gamerino commented 11 months ago

Hi @jltsiren!

Thanks for your suggestions. I created a new minimizers index with -k 15 -w 5 and then used it to map my PE 36bp reads running this command:

vg giraffe -M 10 -x ${panIdxPath}/${panPre}.xg -g ${panIdxPath}/${panPre}.gg -H ${panIdxPath}/${panPre}.gbwt \
-L 2000 -m ${panIdxPath}/${panPre}_k15w5.min -d ${panIdxPath}/${panPre}.dist -p -f ${fastq1} -f ${fastq2} -t 40 \
-D 86 1> ${outPre}.gam --report-name ${log} 2> ${log}_run.log

It worked for most replicates but for one of them, I'm having an error in the *run.log report.

Guessing that hprc-v1.0-mc-chm13-minaf.0.1.giraffe.gbz is Giraffe GBZ
Preparing Indexes
Loading Minimizer Index
Loading GBZ
Loading Distance Index v2
Paging in Distance Index v2
Initializing MinimizerMapper
Loading and initialization: 195.263 seconds
Of which Distance Index v2 paging: 6.36955 seconds
Mapping reads to "-" (GAM)
--watchdog-timeout 10
--max-multimaps 10
--hit-cap 10
--hard-hit-cap 500
--score-fraction 0.9
--max-min 500
--num-bp-per-min 1000
--distance-limit 86
--max-extensions 800
--max-alignments 8
--cluster-score 50
--pad-cluster-score 20
--cluster-coverage 0.3
--extension-score 1
--extension-set 20
--rescue-attempts 15
--max-fragment-length 2000
--paired-distance-limit 2
--rescue-subgraph-size 4
--rescue-seed-limit 100
--chaining-cluster-distance 100
--precluster-connection-coverage-threshold 0.3
--min-precluster-connections 10
--max-precluster-connections 50
--max-lookback-bases 100
--min-lookback-items 1
--lookback-item-hard-cap 15
--chain-score-threshold 100
--min-chains 1
--chain-min-score 100
--max-chain-connection 100
--max-tail-length 100
--max-dp-cells 16777216
--rescue-algorithm dozeu
Using fragment length estimate: 75.6621 +/- 40.152
vg: src/dozeu_interface.cpp:126: vg::DozeuInterface::graph_pos_s vg::DozeuInterface::calculate_max_position(const vg::DozeuInterface::OrderedGraph&, const vg::DozeuInterface::graph_pos_s&, size_t, bool, const std::vector<const dz_forefront_s*>&): Assertion `forefronts.at(max_node_index)->mcap != nullptr' failed.
━━━━━━━━━━━━━━━━━━━━
Crash report for vg v1.48.0 "Gallipoli"
Stack trace (most recent call last) in thread 181304:
#20   Object "", at 0xffffffffffffffff, in 
#19   Object "/homes/gmerino/anaconda3/envs/vg/bin/vg", at 0x1f8f7f2, in __clone
#18   Object "/homes/gmerino/anaconda3/envs/vg/bin/vg", at 0x1492898, in start_thread
#17   Object "/homes/gmerino/anaconda3/envs/vg/bin/vg", at 0x1ec3ffd, in gomp_thread_start
#16   Object "/homes/gmerino/anaconda3/envs/vg/bin/vg", at 0x118d73f, in unsigned long vg::io::paired_for_each_parallel_after_wait<vg::Alignment>(std::function<bool (vg::Alignment&, vg::Alignment&)>, std::function<void (vg::Alignment&, vg::Alignment&)>, std::function<bool ()>, unsigned long) [clone ._omp_fn.0]
#15   Object "/homes/gmerino/anaconda3/envs/vg/bin/vg", at 0x1ec68d7, in gomp_team_barrier_wait_end
#14   Object "/homes/gmerino/anaconda3/envs/vg/bin/vg", at 0x1ebdf9b, in gomp_barrier_handle_tasks
#13   Object "/homes/gmerino/anaconda3/envs/vg/bin/vg", at 0x118d8ea, in unsigned long vg::io::paired_for_each_parallel_after_wait<vg::Alignment>(std::function<bool (vg::Alignment&, vg::Alignment&)>, std::function<void (vg::Alignment&, vg::Alignment&)>, std::function<bool ()>, unsigned long) [clone ._omp_fn.1]
#12   Object "/homes/gmerino/anaconda3/envs/vg/bin/vg", at 0xd83600, in std::_Function_handler<void (vg::Alignment&, vg::Alignment&), main_giraffe(int, char**)::{lambda()#1}::operator()() const::{lambda(vg::Alignment&, vg::Alignment&)#6}>::_M_invoke(std::_Any_data const&, vg::Alignment&, vg::Alignment&)
#11   Object "/homes/gmerino/anaconda3/envs/vg/bin/vg", at 0xeeeb94, in vg::MinimizerMapper::map_paired(vg::Alignment&, vg::Alignment&, std::vector<std::pair<vg::Alignment, vg::Alignment>, std::allocator<std::pair<vg::Alignment, vg::Alignment> > >&)
#10   Object "/homes/gmerino/anaconda3/envs/vg/bin/vg", at 0xee958b, in vg::MinimizerMapper::map_paired(vg::Alignment&, vg::Alignment&)
#9    Object "/homes/gmerino/anaconda3/envs/vg/bin/vg", at 0xedd4ba, in void vg::MinimizerMapper::process_until_threshold_c<double>(unsigned long, std::function<double (unsigned long)> const&, std::function<bool (unsigned long, unsigned long)> const&, double, unsigned long, unsigned long, vg::LazyRNG&, std::function<bool (unsigned long)> const&, std::function<void (unsigned long)> const&, std::function<void (unsigned long)> const&) const [clone .constprop.0]
#8    Object "/homes/gmerino/anaconda3/envs/vg/bin/vg", at 0xef0ae8, in vg::MinimizerMapper::map_paired(vg::Alignment&, vg::Alignment&)::{lambda(unsigned long)#16}::operator()(unsigned long) const
#7    Object "/homes/gmerino/anaconda3/envs/vg/bin/vg", at 0xeefacc, in vg::MinimizerMapper::attempt_rescue(vg::Alignment const&, vg::Alignment&, vg::VectorView<vg::MinimizerMapper::Minimizer> const&, bool)
#6    Object "/homes/gmerino/anaconda3/envs/vg/bin/vg", at 0xfb63d2, in vg::Aligner::align_xdrop(vg::Alignment&, handlegraph::HandleGraph const&, std::vector<handlegraph::handle_t, std::allocator<handlegraph::handle_t> > const&, std::vector<vg::MaximalExactMatch, std::allocator<vg::MaximalExactMatch> > const&, bool, unsigned short) const
#5    Object "/homes/gmerino/anaconda3/envs/vg/bin/vg", at 0x121d9ab, in vg::DozeuInterface::align(vg::Alignment&, handlegraph::HandleGraph const&, std::vector<handlegraph::handle_t, std::allocator<handlegraph::handle_t> > const&, std::vector<vg::MaximalExactMatch, std::allocator<vg::MaximalExactMatch> > const&, bool, signed char, unsigned short)
#4    Object "/homes/gmerino/anaconda3/envs/vg/bin/vg", at 0x121b9d1, in vg::DozeuInterface::calculate_max_position(vg::DozeuInterface::OrderedGraph const&, vg::DozeuInterface::graph_pos_s const&, unsigned long, bool, std::vector<dz_forefront_s const*, std::allocator<dz_forefront_s const*> > const&)
#3    Object "/homes/gmerino/anaconda3/envs/vg/bin/vg", at 0x1ee7555, in __assert_fail
#2    Object "/homes/gmerino/anaconda3/envs/vg/bin/vg", at 0x5bfed7, in __assert_fail_base.cold
#1    Object "/homes/gmerino/anaconda3/envs/vg/bin/vg", at 0x5c0007, in abort
#0    Object "/homes/gmerino/anaconda3/envs/vg/bin/vg", at 0x149611b, in raise
ERROR: Signal 6 occurred. VG has crashed. Visit https://github.com/vgteam/vg/issues/new/choose to report a bug.
Please include this entire error log in your bug report!
━━━━━━━━━━━━━━━━━━━━
error[MinimizerMapper::faster_cap]: Minimizers seem impossible to disrupt in region 16 17 0 6
━━━━━━━━━━━━━━━━━━━━
Crash report for vg v1.48.0 "Gallipoli"
Stack trace (most recent call last) in thread 181310:
#16   Object "", at 0xffffffffffffffff, in 

Can you please indicate what is the source of this error?

Thanks for your help!

IsaacDiaz026 commented 10 months ago

I am running into this same issue, have you been able to figure it out?