giraffe alignment - Githubissues

userzxyz commented 3 years ago

Hello,

I have used giraffe for alignment using:

vg autoindex --workflow giraffe -g chr1.gfa -p pangenome
vg giraffe -H pangenome.giraffe.gbwt -g pangenome.gg -mpangenome.min -d pangenome.dist -f sample_R1_paired.fastq.gz -f sample_R2_paired.fastq.gz >mapped.gam
vg stats -a mapped.gam

With the stats, I got:

Total alignments: 33850102
Total primary: 33850102
Total secondary: 0
Total aligned: 275374
Total perfect: 165786
Total gapless (softclips allowed): 261643
Total paired: 33850102
Total properly paired: 18
Insertions: 18382 bp in 8489 read events
Deletions: 25545 bp in 9845 read events
Substitutions: 492971 bp in 492971 read events
Softclips: 2098177 bp in 52728 read events

My concern is the Total properly paired: 18. Can it be that low because I used gfa for only one chromosome?

Additionally, I tried finding how to use this alignment file for downstream analysis specifically variant calling. For vg map, I used:

vg surject -x chr1.xg -b chr1.gam > chr1.bam
vg pack -x chr1.xg -g chr1.gam -Q 5 -o aln.pack
vg call chr1.xg -k aln.pack > chr1_calls.vcf

I could not find the documentation for something similar with giraffe output. Thank you!

jeizenga commented 3 years ago

If you used a graph for one chromosome with raw shotgun sequencing data, you can get a lot of weird behavior because the majority of reads will have nowhere good to map. My first guess would be that Giraffe is not learning a good fragment length distribution because of the interchromosomal mappings, which is affecting its ability to identify properly paired reads.

For variant calling, you can use a GAM from Giraffe the same way as a GAM from VG map.

userzxyz commented 3 years ago

Thank you! I used:

vg convert -b pangenome.giraffe.gbwt -x pangenome.gg > pangenome.xg
vg surject -x pangenome.xg -b mapped.gam > output.bam

I got this error: error:[vg::get_sequence_dictionary] No non-alt-allele paths available in the graph!

jeizenga commented 3 years ago

Most of our graph formats label the path of the reference sequence, which is required for vg surject so that it can identify the mapped location on the reference. It seems that the gbwt/gg pipeline doesn't preserve that information (GBWT path storage is a bit different since it's actually storing a large collection of paths).

One option would be to make the XG directly from the GFA file using vg convert. @jltsiren are there any pitfalls to avoid with making the XG and GBWTGraph separately? It looked like we might have run into some problems in https://github.com/vgteam/vg/issues/3315.

userzxyz commented 3 years ago

@jeizenga Thank you! I tried:

vg convert -g chr1.gfa -x pangenome.gg > pangenome_1.xg
vg surject -x pangenome_1.xg -b mapped.gam > mapped.bam
vg pack -x pangenome_1.xg -g mapped.gam -Q 5 -o aln.pack
vg call pangenome_1.xg -k aln.pack > mapped.vcf

However, I got this warning for surject command line: warning[vg::Surjector]: Refusing to perform very large alignment against 160412 bp strand split subgraph for read sample.169050; suppressing further warnings.

jltsiren commented 3 years ago

I'm not sure what vg autoindex actually does if you give it GFA input and want Giraffe indexes. I believe GBWTGraph construction and XG construction will interpret GFA segments in the same way if no segment is longer than 1024 bp and all segments have either numerical or non-numerical names. If there are long segments or both numerical and non-numerical names, there will be many subtle edge cases. In general, I would recommend parsing the GFA file only once and using the results as the ground truth.

The ideal situation is that you have a GFA file with reference paths as P-lines and haplotype paths as W-lines. Then you can import the GFA with vg gbwt and build an XG with reference paths with vg convert:

vg gbwt -o graph.gbwt -g graph.gg -G graph.gfa
vg convert -x -b graph.gbwt graph.gg > graph.xg
# And then find snarls and build distance index from the XG

If you have both reference paths and haplotypes as P-lines, you must first ensure that all paths use the same naming scheme. For example, sample_contig_haplotype, where sample and contig are strings and haplotype is an integer. Reference paths should use ref as the sample name and have distinct contig names.

vg gbwt -o graph.gbwt -g graph.gg --path-regex "(.*)_(.*)_(.*)" --path-fields _SCH -G graph.gfa
vg convert -x -b graph.gbwt --ref-sample ref graph.gg > graph.xg

userzxyz commented 3 years ago

Thank you, @jltsiren! I tried using:

vg gbwt -o pangenome.gbwt -g pangenome.gg -G chr1.gfa
vg convert -x -b pangenome.gbwt pangenome.gg > graph.xg
vg snarls graph.xg > graph.snarls
vg index graph.xg -s graph.snarls -j graph.dist

However, the index job got killed without any error.

jltsiren commented 3 years ago

You should use option -T to include trivial snarls in the vg snarls command. I'm not sure if that will fix the particular crash you had, though.

userzxyz commented 3 years ago

Thank you! Using -T solved the index problem. Then I used:

vg mod -X 1000 smooth.gfa > graph.fixed.gfa 
vg gbwt -o sample.gbwt -g sample.gg -G graph.fixed.gfa 
vg convert -x -b sample.gbwt sample.gg > sample.xg 
vg snarls -T sample.xg > sample_1.snarls
vg index sample.xg -s sample_1.snarls -j sample_1.dist
vg giraffe -f sample_R1_paired.fastq.gz -f sample_R2_paired.fastq.gz -x sample.xg -H sample.gbwt > reads.gam

I got Segmentation fault. I found another issue with same problem and an updated version solved it. I am using vg/1.32.0 and I will try vg/1.33.0 and see how it goes.

userzxyz commented 3 years ago

By using vg/1.33.0:

vg mod -X 1000 smooth.gfa > graph.fixed.gfa 
vg gbwt -G graph.fixed.gfa -p -o sample.gbwt
vg convert -x -g graph.fixed.gfa > sample.xg 
vg snarls -T sample.xg > sample_1.snarls
vg index sample.xg -s sample_1.snarls -j sample_1.dist
vg giraffe -f sample_R1_paired.fastq.gz -f sample_R2_paired.fastq.gz -x sample.xg -H sample.gbwt > reads.gam

I still get Segmentation fault message. Can you please help with this!

jltsiren commented 3 years ago

This could be usability features leading to undesired behavior.

Giraffe uses four indexes for mapping: GBWT (-H graph.gbwt), GBWTGraph (-g graph.gg), distance index (-d graph.dist), and minimizer index (-m graph.min). If you do not provide all of them with command line arguments, Giraffe guesses the base name from the name of the graph (-x graph.xg or -g graph.gg) and assumes that files with the same base name and the correct extension contain the missing indexes. If that fails, Giraffe tries to build the missing indexes from the ones it can find.

In your case, any of these could be the problem:

There is an earlier minimizer index sample.min corresponding to a different graph. Delete it.
There is an earlier distance index sample.dist corresponding to a different graph. Delete it, and because you built a distance index with another base name, tell Giraffe about it with -d sample_1.dist.
There is an earlier GBWTGraph sample.gg corresponding to a different graph. Delete it, as it overrides the provided sample.xg.

After all this, please rerun Giraffe with option -p. If it still crashes, please post the progress information it prints.

userzxyz commented 3 years ago

Thank you @jltsiren! I deleted all those files with same base names and ran the giraffe alignment.

vg mod -X 1000 smooth.gfa > graph.fixed.gfa 
vg gbwt -G graph.fixed.gfa -p -o sample.gbwt
vg convert -x -g graph.fixed.gfa > sample.xg 
vg snarls -T sample.xg > sample.snarls
vg index sample.xg -s sample_1.snarls -j sample.dist
vg giraffe -f sample_R1_paired.fastq.gz -f sample_R2_paired.fastq.gz -x sample.xg -H sample.gbwt -d sample.dist > reads.gam

I got this:

Checking for availability of gg index: Needs to be built. Checking dependencies...
Checking for availability of vg index: Loadable from sample.xg
Checking for availability of gbwt index: Loadable from sample.gbwt
Checking for availability of gbwt index: Loadable from sample.gbwt
Checking for availability of min index: Needs to be built. Checking dependencies...
Checking for availability of gg index: Needs to be built. Checking dependencies...
Checking for availability of vg index: Loadable from sample.xg
Checking for availability of gbwt index: Loadable from sample.gbwt
Checking for availability of dist index: Loadable from sample.dist
Building min
Building gg
Loading vg from sample.xg
Loading gbwt from sample.gbwt
Loading dist from sample.dist
Initializing MinimizerMapper
Loading and initialization: 5.4119 seconds
Mapping reads to "-" (GAM)
--hit-cap 10
--hard-hit-cap 500
--score-fraction 0.9
--max-extensions 800
--max-alignments 8
--cluster-score 50
--pad-cluster-score 0
--cluster-coverage 0.3
--extension-score 1
--extension-set 20
--max-multimaps 1
--distance-limit 200
--paired-distance-limit 2
--rescue-subgraph-size 4
--rescue-attempts 15
--rescue-algorithm dozeu
warning[vg::giraffe]: Encountered 100000 ambiguously-paired reads before finding enough
                      unambiguously-paired reads to learn fragment length distribution. Are you sure
                      your reads are paired and your graph is not a hairball?
warning[vg::giraffe]: Finalizing fragment length distribution before reaching maximum sample size
                      mapped 11 reads single ended with 100000 pairs of reads left unmapped
                      mean: 0, stdev: 1
warning[vg::giraffe]: Cannot cluster reads with a fragment distance smaller than read distance
                      Fragment length distribution: mean=0, stdev=1
                      Fragment distance limit: 2, read distance limit: 200
warning[vg::giraffe]: Falling back on single-end mapping
Mapped 33850102 reads across 32 threads in 92.0515 seconds with 6.31461 additional single-threaded seconds.
Mapping speed: 11467 reads per second per thread
Used 1923.1 CPU-seconds (including output).
Achieved 17601.8 reads per CPU-second (including output)
Used 8419047710766 CPU instructions (not including output).
Mapping slowness: 0.248716 M instructions per read at 4377.84 M mapping instructions per inclusive CPU-second
Memory footprint: 1.00966 GB

Are the warning okay to proceed with? I want to mention that the .gfa graph file I used is only for one chromosome. Do I need to merge all gfa files for different chromosomes to use for giraffe? Thank you!

jltsiren commented 3 years ago

The graph should be at least as "large" as the read set. If you have whole-genome reads, you should map them to a whole-genome graph. Giraffe has not been designed for mapping whole-genome reads to a single-chromosome graph, and nobody really understands what will happen if you try it.

userzxyz commented 3 years ago

Thank you, @jltsiren. To merge the chromosome wise graphs, should I merge the .gfa output files or the index files. Can you suggest what would be a good way to merge chromosome-wise graph outputs to use for giraffe.

jltsiren commented 3 years ago

You should merge the GFA files, because some indexes cannot be merged.

Index construction for large GFA files may be slow and/or require a lot of memory, as the construction process has not been optimized for such inputs yet.

userzxyz commented 3 years ago

Thank you @jltsiren! I tried finding but could not find any method. Can you please suggest how to merge GFA files. Thank you!

AndreaGuarracino commented 3 years ago

@userzxyz, an alternative is odgi squeeze. You need to convert the GFAs to ODGI format (with odgi build), squeeze them in a single ODGI file, and then reconvert the squeezed ODGI file in GFA (with odgi view).

glennhickey commented 3 years ago

You can also use vg combine 1.gfa 2.gfa 3.gfa ... > combined.gfa

vgteam / vg

giraffe alignment #3316