Question: Best strategy starting from allelic chromosomes instead of reference plus VCF

alfonsosaera commented 4 years ago

Hi all,

I am trying to do genotyping (DNAseq samples) and allele specific expression (RNAseq samples) of a recently assembled plant (alfalfa (Medicago sativa L.)), nature paper here , this is a tetraploid species so everything gets a little more complicated.

Instead of having a reference genome and a VCF, the authors generated an allele-aware chromosome-level genome assembly for the cultivated alfalfa consisting of 32 allelic chromosomes (8 chr x 4 copies), chrom lengths:

>chr1.1
82459472
>chr1.2
86910131
>chr1.3
79881340
>chr1.4
88815615
>chr2.1
76462061
>chr2.2
74215936
>chr2.3
76375162
>chr2.4
76750018
>chr3.1
93149498
>chr3.2
93375939
>chr3.3
96157931
>chr3.4
100414524
>chr4.1
90245664
>chr4.2
93947428
>chr4.3
90228617
>chr4.4
90896203
>chr5.1
81211777
>chr5.2
84165483
>chr5.3
80712490
>chr5.4
78626892
>chr6.1
80303593
>chr6.2
89579199
>chr6.3
84649260
>chr6.4
64534737
>chr7.1
88407277
>chr7.2
93528358
>chr7.3
91580142
>chr7.4
94657719
>chr8.1
87242343
>chr8.2
84274390
>chr8.3
82440740
>chr8.4
81801543

I recently found vg tools and I thing it solves most of the problems I have but I am not sure how to make the analysis. Here is what I have now:

1.- Generate a graph for each chromosome set and then combine them.

Based on the github page I though about something like:

./vg msga -f chr2.fasta -t 50 -k 16 -D -A | ./vg mod -U 10 - -t 50 | ./vg mod -c - -t 50 > chr2.vg

But reading the wiki, a modification of this seems better:

vg msga -f MHC.fa \ # use MHC.fa as our input
    -b ref \        # the >ref sequence is our basis
    -B 128 \        # align 128-mer bands of the sequences
    -K 11 \         # use 16-mers as the basis for our indexing
    -X 2 \          # double twice in GCSA2, generating a 44-mer deBruijn graph
    -E 3 \          # remove edges from the graph that would allow us to cross 3 bifurcations in 11bp
    -G -S 0.95 \    # greedily accept alignments when they match at 95% of our maximum alignment score
    -H 5 \          # ignore MEMs which have more than 5 hits
    -D \            # basic debugging output, describing progression through the process
    >MHC.vg         # write the output to MHC.vg

vg mod -n MHC.vg | vg mod -X 32 - | vg ids -c - >MHC.norm.vg

How do I set parameters like -B, -K, -E ...?

Now I would have to fuse/concatenate the 8 _chromosomegraph.vg to get a _genomegraph.vg using vg concat

2.- Augment the graph with info from DNA-seq samples

However reading the github page I am not sure how to do it.

3.- Map RNA-seq samples to augmented graph

I am thinking in a modification of this:

# map reads against the graph to get a GAM
vg map -T x.sim.txt -x x.xg -g x.gcsa > aln.gam

# surject the alignments back into the reference space of sequence "x", yielding a BAM file
vg surject -x x.xg -b aln.gam > aln.bam

# or alternatively, surject them to BAM in the call to map
vg map -T x.sim.txt -x x.xg -g x.gcsa --surject-to bam > aln.bam

4.- Perform allele specific expression

My idea was to get a VCF from the augmented graph (aug.graph.vcf), but reading the Calling variants using read support section I don't really know how to do it. Next I would extract a linear reference (aug.graph.linear.fasta) from the graph to use GATK ASEReadCounter using the bam from step 3 (aln.bam) as below:

gatk ASEReadCounter -R aug.graph.linear.fasta -I aln.bam -V aug.graph.vcf -O output.table

Thanks for getting to here, I have 3 general questions:

1.- Do you find this approach suitable? would it be better to focus only in coding regions?

2.- I have a modest server, not a cluster or supercomputer, with ~ 200GB RAM and ~40 processors, will I be able to run vg tools for my samples in this machine?

3.- I found toil-vg which seems to make life easier reducing command calls and parallelize vg tools. Is this something I could use in my machine?

4.- At the end of the analysis I need to check gene expression so I would need to add gene annotations to the genome graph from a GFF of the assembled genome, is there a command for this? alternatively, is there a way to do something like a liftover between the published genome and the resulting graph?

Thank you very much!!!!

Alfonso

ekg commented 4 years ago

Great questions!

As part of the human pangenome project, many of us have been working on pipelines for building pangenomes.

I have developed one which I will post in the coming days.

It uses three components.

First, an mash-map and edit-distance based mapper is used to scaffold the pangenome. (edyeet)

The pangenome graph is induced from the alignments. (seqwish)

The graph is then sorted with a form of multi-dimensional scaling in 1D, groomed, and topologically ordered locally. The 1D order is then broken into fragments which are run through an MSA to normalize their mutual alignment and remove artifacts of the edit-distance based alignment. (smoothxg)

The resulting graph can then be manipulated with odgi for visualization and interrogation.

It can also be loaded into any of the GFA-based mapping tools, including vg map, mpmap, giraffe, and GraphAligner.

Alignments to the graph can be used to make variant calls (vg call) and coverage vectors over the pangenome, which can be useful for phylogeny and association analyses.

ekg commented 4 years ago

The process is designed to be completable on a server like the one you have.

Augmenting the graph is presumably possible. You may need to break it into chromosomes to be sure the augmentation tools in vg work (in terms of memory requirements). Breaking it into chromosomes is certainly possible.

I think with mpmap you should be able to do the RNA seq analysis you're describing. There is vg rna, which is actively being developed now.

alfonsosaera commented 4 years ago

Thanks for your answer @ekg , I wait your pipeline with enthusiasm

jonassibbesen commented 4 years ago

Hi Alfonso,

Yes as Erik mentions, you can use vg rna to add transcript annotations to your graph. You can also use this tool if you are interested in projecting the transcripts on to other haplotypes either embedded as paths in the graph or as gbwt.

I would also recommend using mpmap for mapping RNA-seq reads. You can find more information here https://github.com/vgteam/vg/wiki/Multipath-alignments-and-vg-mpmap

Re allele-specific expression. I am actually working on a tool for estimating haplotype-specific expression that you might find interesting. It is called rpvg and you can find it here: https://github.com/jonassibbesen/rpvg. It is basically a tool for estimating the most probably set of paths in a graph and optionally their associated abundances. It should work on polyploid species, however I have only tested it on diploid so I would be really interested in hearing about your experience if you decide to use it. In order to run it you need read alignments, a graph and a gbwt with haplotype-specific transcript paths. The latter you can get from vg rna. This approach will of course not work with variants predicted using vg call since you will not have any haplotypes covering those. However, it should work for estimating haplotype-specific expression across your 4 reference haplotypes if that is what you are interested in.

Best,

Jonas

ekg commented 4 years ago

@alfonsosaera I've posted the construction process that I was describing: https://github.com/pangenome/pggb

Please let me know if you have any questions by posting an issue there.

The goal is to make graphs that are usable by vg as well as other tools. Hopefully this will get you started.

I would suggest testing with a single chromosome or small region and then working your way up. The specific parameters defining the initial alignment can be difficult to choose, and must be done in a species and sequence-sensitive way. You'll need to iterate a few times to see what works.

alfonsosaera commented 4 years ago

Hi @ekg and @jonassibbesen

Thanks for your useful answers, I will review everything and come back if I have further doubts.

vgteam / vg

Question: Best strategy starting from allelic chromosomes instead of reference plus VCF #2973