vgteam / vg

tools for working with genome variation graphs
https://biostars.org/tag/vg/
Other
1.11k stars 194 forks source link

RNA-seq mapping with genome assembly and vcf files #3653

Closed lsoldini closed 2 years ago

lsoldini commented 2 years ago

Hello,

What I have done

1) Used ppgb to build a graph genome with the -v parameter to directly use vg deconstruct in the pipeline, to generate a .vcf file

2) Used vcfbub to avoid overlapping sites and unnecessary nested variantes (e.g., large inversions I have etc.).

What I am not sure how to do

While reading throug vg giraffe tutorial, I saw that the index was commonly done chromosome by chromosome. Since the full genome is quite small in my case, I did everything at once through pggb and vg deconstruct.

Do I need to split my vcf in files corresponding to each chromosome ? Or is there a way to avoid this ? If I have to do so, how would I deal with remaining contigs ?

Thank you !

Best, Luca

jltsiren commented 2 years ago

Building indexes by chromosome is a time and/or space optimization. The construction requires less memory if you can break it into independent subtasks, and you can also choose to run multiple subtasks in parallel. It's not necessary with smaller graphs (up to maybe a few hundred Mbp).

If you built the graph by aligning sequences using tools such as PGGB, you should use vg autoindex (see the wiki) for building the indexes from the graph in GFA format. Going through VCF is unnecessary in such cases, and I'm not even sure if anyone has ever tried doing that. It should be noted that PGGB sometimes creates highly collapsed graph regions that may make index construction for them very slow.

lsoldini commented 2 years ago

Thank you for replying.

I will give it a try without splitting by chromosomes/contigs then (I have a 200 Mbp genome).

To reply to the second part, I would have two types of concerns.

On the use of vg giraffe

It is interesting because your point is contrasting with other feedback I have received and information I have gathered. From my understanding, vg giraffe was not designed to take as input a .gfa as produced by pggb, or at least it has not been used that way and hence the results are quite unsure + more people are using the .fasta with .vcf way. On the other hand, with vg deconstruct, it is possible to create a .vcf from pggb output, and then removing nested variants representing large SV (I have Mbp inversions) with vcfbub -r 10000 -l 0 for instance (then we have non-overlapping bubbles, as required by vg giraffe). The .vcf can then be used with vg giraffe for reads mapping. Also, the vg autoindex on the wiki suggests using a .fasta and .vcf.

So, I am not really sure how I should proceed ? Overall, is it possible to use the .gfa produced by pggb directly for downstream analysis with vg giraffe ?

Similar question with vg mpmap

I am now trying to use vg mpmap since it is splice-aware and allows for haplotype-specific expression analysis. But the underlying problem would be the same: should I use a .vcf and .fasta to build the .pg graph or rather directly the .gfa produced by pggb ? When I tried a simple conversion form pggb .gfa to .pg it did not work (I mean, it worked, but the output was not recognized by downstream tools, maybe I should have done a post on that).

I was thinking about following the workflow of the scripts from the 2021 bioRxiv paper (Haplotype-aware pantranscriptome analysis using spliced pangenome graphs), which goes with (i) vg construct, (ii) vg convert, and (iii) vg rna to construct the the spliced pangenome graph.

Should I proceed differently since I have build the graph with pggb ? Similarly, would it be possible to make all steps at once with the scripts from the bioRxiv paper - i.e., was it done like that only for memory/time reasons, or are there differences in the way chromosomes and contigs should be handled ?

jltsiren commented 2 years ago

If you built the graph with PGGB and Giraffe does not work well when you build the indexes directly from the GFA file, you probably can't map short reads to that graph with any existing tool. All tools make their own assumptions what the graph should look like, and most basic tasks remain unsolved problems if the graph is too general.

If you give VG a FASTA reference and a VCF file, it will build a new graph using that information. If you got the VCF file from vg deconstruct, the new graph will probably be different from the original graph you deconstructed.

In addition to the graph, Giraffe needs a set of haplotypes in a GBWT index. If you build the GBWT from a phased VCF file, the haplotypes will be generated using a naive algorithm. The algorithm assumes that the variants in the VCF file are non-overlapping edits to the reference. Each alternate allele replaces a reference interval with another substring. If a part of that interval has already been replaced (if a haplotype has overlapping alternate alleles), the algorithm does something arbitrary that may or may not make sense. In particular, that rules out most nested variants.

Both vg map and vg mpmap use a GCSA index, which does not like complex graph regions. Such regions must be simplified first, which is a lossy process unless you have the original haplotypes available in a GBWT index. You can get the GBWT index directly from the GFA file, but I'm not sure if vg autoindex supports that workflow.

jeizenga commented 2 years ago

I'll just verify that we haven't built out an autoindex pipeline for making and indexing spliced pangenome graphs from GFAs. However, @jonassibbesen has successfully done it manually. Maybe the scripts for doing it are available somewhere?

lsoldini commented 2 years ago

Thank you both for your answers. I think I have mixed some aspects, and that there are some components I haven't said that may create some confusion. I will try to clarify this.

System

A diploid species in which there is a supergene existing in two well defined and distinct haplotypes, let say A and B. There are individuals AA, AB, and BB. There are individuals that are haploids, and those were used for sequencing and genome assembly. I have two assemblies (say A and B, corresponding to the supergene's genotype). resolved at the chromosome-level, haplotype-resolved, and each of 27 chromosomes + ~500 contigs. Each assembly has its own .gff3 as well as transcript annotation.

The supergene covers the majority of one chromosome, and it contains large inversions (each few Mbp), as well as thousands of SNPs, but also genes CNV, deletions, and insertions, as well as repetitive sequences. The rest of the genome is fairly similar between A and B assemblies, although there are SNPs as well as small inversions/translocations events (which have neither been confirmed nor investigated yet, so I cannot say whether those are due to bioinformatics / sequencing / true biology).

Scope

I want to do RNA-seq analysis of paired-end illumina reads (10-20 M / sample) to identify genes/alleles that are differentially expressed between tissues that are AA, AB, or BB.

Ideally, it would be nice to identify isoforms specific to one haplotype, as done in rpvg. This would be particularly interesting because some kind of genomic imprinting is expected to occur.

Workflow: done

Since the differences between the two assemblies are huge (i.e., specifically between the supergene), classical mapping on a linear genome seemed out of scope and I hence looked at pangenomics.

pggb seemed the best because of the large SVs, and it worked fine: I have a final .gfa as well as .og if necessary. Using vg deconstruct I also obtained a .vcf. I used vcfbub to remove nested/overlapping snarls (in particular for the large inversion). Also, since I have only two assemblies, it is not possible to have three different variants at one locus (I think).

Workflow: next

Ideally, it would be wonderful to be able to map my reads on a graph genome, but it does not have to be the one build with pggb. In particular, it is not important if it contains the large SVs etc.

I think trying to use vg giraffe for RNA-seq was not appropriate since it is to map DNA short reads (please correct me if I am wrong).

My conclusion is that vg mpmap would be the best. I've read through the bioRxiv paper and with the scripts it seemed do-able with the data I have (i.e., .vcf and .fasta, but if .gfa works, perfect).

Questions

Does this seems do-able with the vg toolkit ?

I will make a second comment with the things I have tried.

lsoldini commented 2 years ago

vg convert (kind of side note)

When I was trying with vg giraffe (which now seems wrong for RNA), I tried vg convert on my .gfa from pggb. But converting the .gfa seems actually wrong since it would contain many nested and overlapping snarls (still, the large inversions). In addition, this did not work and the .gfa converted was not recognized by vg giraffe as a .pg

vg construct/etc.

I am trying to do as in the bioRxiv paper, and I will try to edit this.

i first had problems with vg construct -t 5 -C -a -v variants.vcf.gz -r seq.fasta, I ended up with the following error:

vg: src/constructor.cpp:2230: void vg::Constructor::construct_graph(const std::vector<FastaReference*>&, const std::vector<vcflib::VariantCallFile*>&, const std::vector<FastaReference*>&, const std::function<void(vg::Graph&)>&): Assertion reference_for.count(fasta_contig)' failed. ERROR: Signal 6 occurred. VG has crashed.

The problem was coming from the fact that the .vcf as created by vg deconstruct had issues. For some unknown reasons, it contained variants from both assembly.

This was solved by re-doing the graph with pggb (in particular, changes in the headers).

vg convert Worked fine to convert .vg to .pg

vg rna Crashes with error signal 6 occurred (see bug report).

jonassibbesen commented 2 years ago

Hi, yes as far as I can see you should be able to perform the mentioned analyses using the mpmap-rpvg pipeline.

I just replied to your other issue (https://github.com/vgteam/vg/issues/3659) regarding the vg rna crash. To create a spliced pangenome graph and a pantranscriptome (set of haplotype-specific transcripts) you can use the following command:

vg rna -p -t <threads> -n annotation.[gtf|gff3] -r -u -b pantranscriptome.gbwt -i pantranscriptome.txt graph.pg > spliced_graph.pg

You can use rpvg to infer the expression of the haplotype-specific transcripts in the pantranscriptome. I can see in your other issue (https://github.com/vgteam/vg/issues/3659) that you used the -e option. This option is not needed in your specific case since you already have annotations for all assemblies. You would use -e if you wanted to project transcript annotations between haplotypes in the graph. You can find more information on vg rna and its options here: https://github.com/vgteam/vg/wiki/Transcriptomic-analyses

Next you will need to create a GCSA and distance index of the spliced pangenome graph which is needed by vg mpmap. You can find more information about how to create those here: https://github.com/vgteam/vg/wiki/Index-Construction (you likely need to prune the graph before constructing the GCSA index).

You can also find the scripts used in the mpmap-rpvg paper for indexing and mapping here: https://github.com/jonassibbesen/vgrna-project-paper/tree/main/scripts/bash/map_reads/vg

Regarding your earlier question about splitting by chromosomes/contigs. Yes, this was only done in the paper to reduce computational resources.

lsoldini commented 2 years ago

Hi, thanks a lot for the detailed answer, I'm glad to read that this should work!

One aspect that is still puzzling me is how this will work with the fact that I have two .gff (one for each assembly) ? I haven't read (but I may have missed it) that it was possible to give multiple files for the -n parameter, although the help says that it may repeat. Can I simply give the two .gff files with something like -n annotation1.gff, annotation2.gff ?

Best, Luca

jonassibbesen commented 2 years ago

You can give it two or more annotations like this: -n annotation1.gff -n annotation2.gff

lsoldini commented 2 years ago

Perfect, thank you ! I think I will not close this right away and update whether I managed to reach the final goal.

lsoldini commented 2 years ago

@jonassibbesen I am planning to use the below command, would you please have the time to quickly check whether this seems right ?

Get splice-pangenome and pantranscriptome (done)

vg convert -G -p graph.gfa > graph.pg
vg rna -p -n Fsel1.gff -n Fsel2.gff -s Parent -r -a -u -b pantranscriptome.gbwt -i pantranscriptome.txt graph.pg > splicepangenome.pg

For this, I am not sure I should have used the -a in vg rna.

Get .gcsa index (done)

vg mod -X 256 splicepangenome.pg > splicepangenome_mod.pg
vg prune splicepangenome_mod.pg > splicepangenome_mod_pruned.pg
vg index -p -g splicepangenome_mod_pruned.gcsa splicepangenome_mod_pruned.pg

Get distance and .xg (done)

vg index -x splicepangenome.xg splicepangenome_mod.pg
vg snarls -T splicepangenome.xg > splicepangenome.snarls
vg index -x splicepangenome.xg -s splicepangenome.snarls -j splicepangenome.dist

Update: this has worked properly.

jonassibbesen commented 2 years ago

Hi, the node chopping step should be moved to before the vg rna step, but I can see from a different issue that you are already aware of this. It is also possible to chop nodes using the -k option in vg rna.

The -a in vg rna is only relevant if you project transcripts between haplotypes, which is not needed in your case since you already have an annotation for each haplotype. The option will just be ignored so there is no need for you to rerun the pipeline without it.

lsoldini commented 2 years ago

Hi, yes this was older than the other issue, and I only realized later this was actually not working as intended. Still, thank you for your reply! (and I will change my script to remove -a and add -k to remove the vg mod step)

lsoldini commented 2 years ago

Short question: what would be the faster/better way to incorporate reads (.fastq) from different technical replicates (sequencing lanes) into one unique file ? Should I cat before vg mpmap or cat the .gamp output ?

jeizenga commented 2 years ago

I would expect both options to have a similar speed. If the fragment length distributions vary between replicates, you might get somewhat better mappings if you map them with separate calls to vg mpmap, because the fragment length distribution will be re-fit separately in each run.

lsoldini commented 2 years ago

Okay nice then, that's how I did before wondering about this. Thanks for the reply!

jeizenga commented 2 years ago

It looks to me like this issue is resolved, but let me know if you still have questions and I can re-open it.

lsoldini commented 2 years ago

While looking more carefully if anything went wrong, I have noticed that the pantranscritpome loses few genes (7 I think) from the .gff I am giving it. But I think I will re-open a new issue specific for this (in rpvg, issue 38).