vgteam / vg

tools for working with genome variation graphs
https://biostars.org/tag/vg/
Other
1.12k stars 194 forks source link

Error signal 6 vg rna #3659

Closed lsoldini closed 2 years ago

lsoldini commented 2 years ago

1. What were you trying to do?

Construct spliced pangenome graph for further RNA-seq mapping with vg mpmap.

2. What did you want to happen?

Get a .pg spliced pangenome graph.

3. What actually happened?

[vg rna] Parsing graph file ...
[vg rna] Graph parsed in 4.46009 seconds, 0.898975 GB
[vg rna] Adding transcript splice-junctions and exon boundaries to graph ...
vg: src/transcriptome.cpp:526: int32_t vg::Transcriptome::parse_transcripts(std::vector<vg::Transcript>*, st
d::istream*, const bdsg::PositionOverlay&, const gbwt::GBWT&, bool) const: Assertion `element == "exon"' fai
led.
ERROR: Signal 6 occurred. VG has crashed. Visit https://github.com/vgteam/vg/issues/new/choose to report a b
ug.

4. If you got a line like Stack trace path: /somewhere/on/your/computer/stacktrace.txt, please copy-paste the contents of that file here:

Crash report for vg v1.39.0 "Runzi"
Stack trace (most recent call last):
#10   Object "/users/lsoldini/.conda/envs/vg/bin/vg", at 0x5b858d, in _start
#9    Object "/users/lsoldini/.conda/envs/vg/bin/vg", at 0x1ccaeef, in __libc_start_main
#8    Object "/users/lsoldini/.conda/envs/vg/bin/vg", at 0x58c32e, in main
#7    Object "/users/lsoldini/.conda/envs/vg/bin/vg", at 0xc7e1db, in vg::subcommand::Subcommand::operator()
(int, char**) const
#6    Object "/users/lsoldini/.conda/envs/vg/bin/vg", at 0xc51e27, in main_rna(int, char**)
#5    Object "/users/lsoldini/.conda/envs/vg/bin/vg", at 0x1138dbf, in vg::Transcriptome::add_reference_tran
scripts(std::vector<std::istream*, std::allocator<std::istream*> >, std::unique_ptr<gbwt::GBWT, std::default
_delete<gbwt::GBWT> >&, bool, bool)
#4    Object "/users/lsoldini/.conda/envs/vg/bin/vg", at 0x1125efa, in vg::Transcriptome::parse_transcripts(
std::vector<vg::Transcript, std::allocator<vg::Transcript> >*, std::istream*, bdsg::PositionOverlay const&,
gbwt::GBWT const&, bool) const [clone .constprop.0]
#3    Object "/users/lsoldini/.conda/envs/vg/bin/vg", at 0x1cdb295, in __assert_fail
#2    Object "/users/lsoldini/.conda/envs/vg/bin/vg", at 0x58b727, in __assert_fail_base.cold
#1    Object "/users/lsoldini/.conda/envs/vg/bin/vg", at 0x58b857, in abort
#0    Object "/users/lsoldini/.conda/envs/vg/bin/vg", at 0x138404b, in raise

5. What data and command can the vg dev team use to make the problem happen?

vg rna -p -t 10 -e -n Fsel1#1.gff full_genome_2.pg > splice_pangenome.pg

The full_genome_2.pg was created using vg construct | vg convert from .fasta and .vcf file.

The gff looks like this:

##gff-version 3
Fsel1#1#ctg229_frg1_d   maker   gene    241     3010    .       +       .       ID=FSIM_CM229_00006E;Alias=maker-Fsel1#1#ctg229_frg1_d-snap-gene-0.16;Name=FSIM_CM229_00006E
Fsel1#1#ctg229_frg1_d   maker   mRNA    241     3010    .       +       .       ID=FSIM_CM229_00006E-RA;Parent=FSIM_CM229_00006E;Alias=maker-Fsel1#1#ctg229_frg1_d-snap-gene-0.16-mRNA-1;Name=FSIM_CM229_00006E-RA;_AED=0.73;_QI=0|0.33|0|0.5|0|0|4|0|264;_eAED=0.73

6. What does running vg version say?

vg version v1.39.0 "Runzi"
Compiled with g++ (Ubuntu 9.4.0-1ubuntu1~20.04) 9.4.0 on Linux
Linked against libstd++ 20210601
Built by stephen@lubuntu
jonassibbesen commented 2 years ago

Hi,

This error happens because vg rna expects that the ID attribute has the following syntax for exons: ID=exon:<transcript_id>:<exon_number> (e.g. ID=exon:ENST00000456328.2:1). It is a bug that it crashes if this is not the case since it is not a requirement of the gff3 format as far as I know. I will work on a fix for this. In the meantime you should be able to get it working by just removing the ID attribute from your annotation. vg rna only uses the ID attribute (if available) to ensure that the exons are sorted in correct order.

Best,

Jonas

lsoldini commented 2 years ago

Hi,

Thank you for your answer.

I tried with removing the ID attribute, so my .gff3 now looks like this:

##gff-version   3
Fsel1#1#ctg229_frg1_d   maker   gene    241     3010    .       +       .       Alias=maker-Fsel1#1#ctg229_frg1_d-snap-gene-0.16;Name=FSIM_CM229_00006E
Fsel1#1#ctg229_frg1_d   maker   mRNA    241     3010    .       +       .       Parent=FSIM_CM229_00006E;Alias=maker-Fsel1#1#ctg229_frg1_d-snap-gene-0.16-mRNA-1;Name=FSIM_CM229_00006E-RA;_AED=0.7
3;_QI=0|0.33|0|0.5|0|0|4|0|264;_eAED=0.73
Fsel1#1#ctg229_frg1_d   maker   exon    241     624     .       +       .       Parent=FSIM_CM229_00006E-RA
Fsel1#1#ctg229_frg1_d   maker   exon    712     892     .       +       .       Parent=FSIM_CM229_00006E-RA
Fsel1#1#ctg229_frg1_d   maker   exon    1332    1549    .       +       .       Parent=FSIM_CM229_00006E-RA
Fsel1#1#ctg229_frg1_d   maker   exon    2999    3010    .       +       .       Parent=FSIM_CM229_00006E-RA

And with the command line you suggested in my other issue #3653:

vg rna -p -n Fsel1#1.gff -r -u -b pantranscriptome.gbwt -i pantranscriptome.txt full_genome_2.pg > spliced_graph.pg

I got :

[vg rna] Parsing graph file ...
[vg rna] Graph parsed in 4.56569 seconds, 0.898983 GB
[vg rna] Adding transcript splice-junctions and exon boundaries to graph ...
[transcriptome] ERROR: Tag "transcript_id" not found in attributes (line 4).

Do you think I would need to format the ID field as wanted by vg rna ?

Luca

lsoldini commented 2 years ago

I've changed my .gff file so that exons looks as described:

##gff-version 3
Fsel1#1#ctg229_frg1_d   maker   gene    241     3010    .       +       .       ID=FSIM_CM229_00006E;Alias=maker-Fsel1#1#ctg229_frg1_d-snap-gene-0.16
;Name=FSIM_CM229_00006E
Fsel1#1#ctg229_frg1_d   maker   mRNA    241     3010    .       +       .       ID=FSIM_CM229_00006E-RA;Parent=FSIM_CM229_00006E;Alias=maker-Fsel1#1#
ctg229_frg1_d-snap-gene-0.16-mRNA-1;Name=FSIM_CM229_00006E-RA;_AED=0.73;_QI=0|0.33|0|0.5|0|0|4|0|264;_eAED=0.73
Fsel1#1#ctg229_frg1_d   maker   exon    241     624     .       +       .       ID=exon:FSIM_CM229_00006E-RA:1;Parent=FSIM_CM229_00006E-RA
Fsel1#1#ctg229_frg1_d   maker   exon    712     892     .       +       .       ID=exon:FSIM_CM229_00006E-RA:2;Parent=FSIM_CM229_00006E-RA
Fsel1#1#ctg229_frg1_d   maker   exon    1332    1549    .       +       .       ID=exon:FSIM_CM229_00006E-RA:3;Parent=FSIM_CM229_00006E-RA
Fsel1#1#ctg229_frg1_d   maker   exon    2999    3010    .       +       .       ID=exon:FSIM_CM229_00006E-RA:4;Parent=FSIM_CM229_00006E-RA
Fsel1#1#ctg229_frg1_d   maker   CDS     241     624     .       +       0       ID=FSIM_CM229_00006E-RA:cds;Parent=FSIM_CM229_00006E-RA
Fsel1#1#ctg229_frg1_d   maker   CDS     712     892     .       +       0       ID=FSIM_CM229_00006E-RA:cds;Parent=FSIM_CM229_00006E-RA
Fsel1#1#ctg229_frg1_d   maker   CDS     1332    1549    .       +       2       ID=FSIM_CM229_00006E-RA:cds;Parent=FSIM_CM229_00006E-RA
Fsel1#1#ctg229_frg1_d   maker   CDS     2999    3010    .       +       0       ID=FSIM_CM229_00006E-RA:cds;Parent=FSIM_CM229_00006E-RA

But I get the same error message :

[vg rna] Parsing graph file ...
[vg rna] Graph parsed in 4.28164 seconds, 0.898983 GB
[vg rna] Adding transcript splice-junctions and exon boundaries to graph ...
[transcriptome] ERROR: Tag "transcript_id" not found in attributes (line 3).
lsoldini commented 2 years ago

Since "By default only lines with the exon feature (column 4) will be parsed" (https://github.com/vgteam/vg/wiki/Transcriptomic-analyses), I tried with keeping only the exons in the .gff3.

This time, it did not throw any errors, but it neither parsed a transcript nor added a path on the graph:

[vg rna] Parsing graph file ...
[vg rna] Graph parsed in 4.54644 seconds, 0.898983 GB
[vg rna] Adding transcript splice-junctions and exon boundaries to graph ...
[vg rna] 0 transcripts parsed and graph augmented in 12.8608 seconds, 3.77022 GB
[vg rna] Topological sorting graph and compacting node ids ...
[vg rna] Graph sorted and compacted in 27.6161 seconds, 3.77022 GB
[vg rna] Adding reference transcripts as embedded paths in the graph ...
[vg rna] 0 paths added in 3.04803e-05 seconds, 3.77022 GB
[vg rna] Writing pantranscriptome (reference and haploype transcripts) to file(s) ...
[vg rna] Writing splicing graph to stdout ...
[vg rna] Graph and pantranscriptome written in 4.79321 seconds, 3.77022 GB
jonassibbesen commented 2 years ago

By default vg rna assumes that the transcript name (or id) is giving by the attribute transcript_id. You can change this using the option -s, --transcript-tag. Based on the annotation you shared it looks like the transcript name is giving by the Parent attribute (-s Parent).

lsoldini commented 2 years ago

Ok I've read that and I thought it would be -s ID ... sorry for the silly questions and thank you very much, I'll try this asap !

lsoldini commented 2 years ago

So, this worked ! I mean, my two .gff were not properly ordered (exon order), and I finally ended up removing the ID field, but this worked.

However

I still had an error:

(vg) [lsoldini@dna020 vg_rna]$ vg rna -p -n Fsel1#1.gff -n Fsel2#2.gff -s Parent -r -u -b pantranscriptome.gbwt -i pantranscriptome.txt full_genome_2.pg > spliced_graph.pg
[vg rna] Parsing graph file ...
[vg rna] Graph parsed in 4.1638 seconds, 0.898983 GB
[vg rna] Adding transcript splice-junctions and exon boundaries to graph ...
[transcriptome] ERROR: Chromomsome path "Fsel2#2#ctg855_d" not found in graph or haplotypes index (line 2).

Question

How am I supposed to add the non-reference transcript as embedded-paths ?

What I've tried

If I understood correctly, this is because I do not have the region field of the second .gff (second in the sens the one corresponding to variants provided as a .vcf) as embedded-paths in the graph. I've seen that with -r we are adding the reference-transcripts as embedded-paths in the graph. I think then I should use -a to add the haplotype transcripts as embedded paths. I try with adding this parameter but I had the same error.

Or, maybe, it is when I built the graph with vg construct that I should have done things differently. I have used the -a / --alt-paths parameter, but maybe I shouldn't have ? Or maybe I should have used -n / --rename V=F in addition ? This is what I used:

vg construct -C -a -v Fsel1.filt.vcf.gz -r Fsel1#1.fasta > full_genome.vg
jonassibbesen commented 2 years ago

vg rna gives this error when one of the chromosomes specified in the input annotation file (column 1) is not found as a path in the graph. vg rna needs the chromosome paths as coordinates otherwise it will not know where the transcripts are located in the graph. The paths can either be embedded in the graph (contained in the .pg graph file) or be given to vg rna as a GBWT file. When using vg construct only the paths in the reference fasta will be added to the graph.

One possible solution for this would be to use the graph from pggb (.gfa) instead, which I think should contain the paths for each assembly. Alternatively, if you have the genotypes for the second assembly in the vcf you can also try to first construct a GBWT file of the second assembly haplotypes and then add them to the graph:

vg gbwt -x full_genome_2.pg -v Fsel1.filt.vcf.gz -o Fsel1.filt.gbwt

vg paths -S <sample> -x full_genome_2.pg -g Fsel1.filt.gbwt -A | vg augment -F -B full_genome_2.pg - > full_genome_2_paths.pg

Here <sample> is the name of the sample in the vcf file that you want to add to the graph. You will likely need to rename the chromosomes in the annotation afterwards since they will probably not match the names of the new paths in the graph. You can see the names of the paths in a graph using:

vg paths -L -v full_genome_2_paths.pg | grep -v "_alt"

lsoldini commented 2 years ago

Thank you for your answer !

About using the .gfa, I thought that this would not be supported since pggb creates complex graph regions (also from feedback I received in #3653) ? But if you think this may work I will definitely give it a try.

For the alternative, I am not sure whether the .vcf I have is correctly giving the genotype in the FORMAT field and the "Fsel2" column (this is the vcf for alternates alleles of assembly Fsel1):

##fileformat=VCFv4.2
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##INFO=<ID=CONFLICT,Number=.,Type=String,Description="Sample names for which there are multiple paths in the graph with conflicting alleles">
##INFO=<ID=AC,Number=A,Type=Integer,Description="Total number of alternate alleles in called genotypes">
##INFO=<ID=AF,Number=A,Type=Float,Description="Estimated allele frequency in the range (0,1]">
##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of samples with data">
##INFO=<ID=AN,Number=1,Type=Integer,Description="Total number of alleles in called genotypes">
##INFO=<ID=LV,Number=1,Type=Integer,Description="Level in the snarl tree (0=top level)">
##INFO=<ID=PS,Number=1,Type=String,Description="ID of variant corresponding to parent snarl">
##INFO=<ID=AT,Number=R,Type=String,Description="Allele Traversal as path in graph">
##contig=<ID=Fsel1#1#scf7,length=12339748>
##contig=<ID=Fsel1#1#scf5,length=12127983>
##contig=<ID=Fsel1#1#scf2,length=11903705>
##contig=<ID=Fsel1#1#scf6,length=11417000>

... (list of all scaffolds and contigs) ...

#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  Fsel2
Fsel1#1#ctg1003_frg1_d  287     >1087507>1087510        C       T       60      .       AC=1;AF=1;AN=1;AT=>1087507>1087509>1087510,>1087507>1087508>1087510;NS=1;LV=0   GT      1
Fsel1#1#ctg1003_frg1_d  510     >1087510>1087512        CCA     C       60      .       AC=1;AF=1;AN=1;AT=>1087510>1087511>1087512,>1087510>1087512;NS=1;LV=0   GT      1
Fsel1#1#ctg1003_frg1_d  858     >1087512>1087514        A       ATCTCTC 60      .       AC=1;AF=1;AN=1;AT=>1087512>1087514,>1087512>1087513>1087514;NS=1;LV=0   GT      1
Fsel1#1#ctg1003_frg1_d  969     >1087514>1087516        T       TC      60      .       AC=1;AF=1;AN=1;AT=>1087514>1087516,>1087514>1087515>1087516;NS=1;LV=0   GT      1
Fsel1#1#ctg1003_frg1_d  974     >1087516>1087518        G       GC      60      .       AC=1;AF=1;AN=1;AT=>1087516>1087518,>1087516>1087517>1087518;NS=1;LV=0   GT      1

My .vcf files were done with pggb: I used the -V Fsel1:#,Fsel2:# parameter such that it runs:

vg deconstruct -P Fsel1 -H # -e -a -t 20 /full_genome_3/results/both.fasta.gz.e792acf.e34d4cd.b972caa.smooth.final.gfa > Fsel1.vcf 
vg deconstruct -P Fsel2 -H # -e -a -t 20 /full_genome_3/results/both.fasta.gz.e792acf.e34d4cd.b972caa.smooth.final.gfa > Fsel2.vcf

Also, I used the PanSN-spec to name the scaffolds and contigs of my two assemblies e.g. Fsel1#1#scfx and Fsel2#1#scfx for the scaffold x.

So, at the end, I have a Fsel1.fasta, Fsel1.vcf (the alternate alleles of Fsel1), and Fsel1.gff. As well as these three files for Fsel2.

I tried the command line : vg gbwt -x full_genome_2.pg -v Fsel1.filt.vcf.gz -o Fsel1.filt.gbwt but it throws an error because error: [vg index] expected 1 samples, got 0. I also tried by adding -S Fsel2 to vg gbwt but this did not help.

Finally, I will look closer to the vg gbwt and vg paths commands to see whether I manage to make this works. I also look what it gives to use the .gfa directly.

lsoldini commented 2 years ago

Since this was not working with the .vcf coming from pggb, I started with .gfa and used vg deconstruct "manually". I also skipped the vcfbub step I was previously doing. The commands you suggested have runned smoothly. (I will later try directly with the .gfa)

However, I am quite unsure about the final output. When I grep -v "alt" for the path in the resulting .pg graph, I can see the paths for the first assembly (e.g., Fsel1#1#ctg1552_d) and some for the second assembly (e.g., _thread_Fsel2_Fsel1#1#scf22_0_2). So, this is fine, I can just rename. However, it does not seems to make sense because I have many entries looking like this:

_thread_Fsel2_Fsel1#1#ctg1664_frg2_0_0
_thread_Fsel2_Fsel1#1#ctg1206_0_0
_thread_Fsel2_Fsel1#1#ctg1206_0_1
_thread_Fsel2_Fsel1#1#ctg1206_0_2
_thread_Fsel2_Fsel1#1#ctg1442_0_0
_thread_Fsel2_Fsel1#1#ctg1248_0_0
_thread_Fsel2_Fsel1#1#ctg1023_frg2_d_0_0
_thread_Fsel2_Fsel1#1#ctg1023_frg2_d_0_1
_thread_Fsel2_Fsel1#1#ctg1372_0_0

Those contig names are specific to the Fsel1 assembly, and then it does not seems to make sense to have those there ? How can I link the given name back to the names from the Fsel2 assembly?

PS: Maybe this is something I should have mentionned earlier (sorry for that), but I have contigs that are specific to the first assembly and other to the second.

lsoldini commented 2 years ago

Well, this worked smoothly and it was fully straightforward with the .gfa ... I feel a bit stupid to have tried so hard with the .vcf approach (but I hope having those nested snarls in my graph won't be a problem downstream with the index etc.). At least as far as I can tell, the .gfa from pggb seemed to have been fully supported by vg convert (also all paths were present in the .pg).

lsoldini commented 1 year ago

@jonassibbesen I am re-opening this because I had similar issue, but I am now working with pggb 0.5.2 and vg 1.44.0 (surprinsingly -at least for me-, I re-run the same scripts as before, but this throw me errors).

I have runned pggb to get a .gfa of my assemblies. However, when using vg toolkit, I am having issues with Chromosome path not being found - would you have any idea of why this happens?

vg convert -p -g genome.gfa > genome.pg
vg mod -t 10 -X 256 genome.pg > genome_mod.pg
vg rna -p -t 10 -n Fsel1_formatted.gff -n Fsel2_formatted.gff -s Parent -c all -u -b pantranscriptome.gbwt -i pantranscriptome.txt genome_mod.pg > splicepangenome.pg

The error is: ERROR: Chromomsome path "Fsel1#1#ctg229_frg1_d" not found in graph or haplotypes index (line 2).

My .gff should be correctly formated:

##gff-version 3
Fsel1#1#ctg229_frg1_d   maker   exon    241     624     .       +       .       Parent=FSIM_CM229_00006E-RA
Fsel1#1#ctg229_frg1_d   maker   exon    712     892     .       +       .       Parent=FSIM_CM229_00006E-RA
Fsel1#1#ctg229_frg1_d   maker   exon    1332    1549    .       +       .       Parent=FSIM_CM229_00006E-RA
Fsel1#1#ctg229_frg1_d   maker   exon    2999    3010    .       +       .       Parent=FSIM_CM229_00006E-RA

And the path "Fsel1#1#ctg229_frg1d" is seeminlgy found within the .gfa_ that I am using:

P       Fsel1#1#ctg229_frg1_d   1097187-,1097185-,1097183-,1097181-,1097180-,1097179-,1097178-,1097176-,1097174-,1097172-,1097171-,1097169-,1097167-,1097166-,1097165-,1332711+   *
P       Fsel1#1#ctg229_frg3_d   9246+,9247+,9249+,9250+,9252+,9253+,9255+,9256+,9257+,9259+,9260+,9262+,9263+,9265+,9267+,9268+,9270+,9271+,9273+,9274+,9276+,9277+,9279+,9280+,9282+,9283+,9285+,9286+,9288+,9289+,9291+,9292+,9294+,9295+,9297+,9298+,9300+,9301+,9303+,9304+,9305+,9306+,9308+,9309+,9311+,9312+,9314+,9315+,9317+,9318+,9320+,9321+,9322+,9323+,9324+,9325+,9327+,9328+,9330+,9332+,9333+,9335+,9336+,9338+,9339+,9341+,9342+,9344+,9345+,9347+,9348+,9350+,9351+,9353+,9354+,9356+,9357+,9359+,9361+,9362+,9364+,9365+,9366+,9367+,9369+,9371+,9372+,9374+,9375+,9376+,9377+,9378+,9379+,9381+,9382+,9384+,9385+,9387+,9388+,9390+,9391+,9393+,9394+,9396+,9397+,9399+,9400+,9402+,9404+,9406+,9407+,9408+,9409+,9411+,9412+,9413+,9415+,9416+,9419+,9421+,9423+,9424+,9426+,9428+,9429+,9431+,9432+,9434+,9435+,9437+,9438+,9440+,9441+,9443+,9444+,9446+,9447+,9449+,9450+,9451+,9452+,9453+,9455+,9456+,9459+,9461+,9462+,9464+,9465+,9467+,9469+,9472+,9473+,9474+,9475+,9476+,9477+,9479+,9480+,9482+,9483+,9485+,9487+,9488+,9489+,9491+,9492+,9493+,9494+,9495+,9496+ *
subwaystation commented 1 year ago

@lsoldini See https://github.com/vgteam/vg/issues/3807. Please try vg 1.40.0.

lsoldini commented 1 year ago

Well that is a problem, I need vg 1.44.0 for some new feature in vg rna (#3806).

I'll try using vg convert and vg mod of vg 1.40.0 and then vg 1.44.0. It worked smoothly, thanks!

lsoldini commented 1 year ago

Why is it that vg index does no longer have -s parameter in vg 1.44.0 ?

I wanted to build a distance index with vg index and -s file.snarls (originating from vg snarls). How should I proceed in vg 1.44.0 ? I looked for changes in vg index in the release but did not find what has changed.

jeizenga commented 1 year ago

I believe the snarls are now computed on-the-fly, so you don't need to provide them as input.