Closed lsoldini closed 2 years ago
Hi,
This error happens because vg rna
expects that the ID
attribute has the following syntax for exons: ID=exon:<transcript_id>:<exon_number>
(e.g. ID=exon:ENST00000456328.2:1
). It is a bug that it crashes if this is not the case since it is not a requirement of the gff3 format as far as I know. I will work on a fix for this. In the meantime you should be able to get it working by just removing the ID
attribute from your annotation. vg rna
only uses the ID
attribute (if available) to ensure that the exons are sorted in correct order.
Best,
Jonas
Hi,
Thank you for your answer.
I tried with removing the ID
attribute, so my .gff3 now looks like this:
##gff-version 3
Fsel1#1#ctg229_frg1_d maker gene 241 3010 . + . Alias=maker-Fsel1#1#ctg229_frg1_d-snap-gene-0.16;Name=FSIM_CM229_00006E
Fsel1#1#ctg229_frg1_d maker mRNA 241 3010 . + . Parent=FSIM_CM229_00006E;Alias=maker-Fsel1#1#ctg229_frg1_d-snap-gene-0.16-mRNA-1;Name=FSIM_CM229_00006E-RA;_AED=0.7
3;_QI=0|0.33|0|0.5|0|0|4|0|264;_eAED=0.73
Fsel1#1#ctg229_frg1_d maker exon 241 624 . + . Parent=FSIM_CM229_00006E-RA
Fsel1#1#ctg229_frg1_d maker exon 712 892 . + . Parent=FSIM_CM229_00006E-RA
Fsel1#1#ctg229_frg1_d maker exon 1332 1549 . + . Parent=FSIM_CM229_00006E-RA
Fsel1#1#ctg229_frg1_d maker exon 2999 3010 . + . Parent=FSIM_CM229_00006E-RA
And with the command line you suggested in my other issue #3653:
vg rna -p -n Fsel1#1.gff -r -u -b pantranscriptome.gbwt -i pantranscriptome.txt full_genome_2.pg > spliced_graph.pg
I got :
[vg rna] Parsing graph file ...
[vg rna] Graph parsed in 4.56569 seconds, 0.898983 GB
[vg rna] Adding transcript splice-junctions and exon boundaries to graph ...
[transcriptome] ERROR: Tag "transcript_id" not found in attributes (line 4).
Do you think I would need to format the ID field as wanted by vg rna
?
Luca
I've changed my .gff file so that exons looks as described:
##gff-version 3
Fsel1#1#ctg229_frg1_d maker gene 241 3010 . + . ID=FSIM_CM229_00006E;Alias=maker-Fsel1#1#ctg229_frg1_d-snap-gene-0.16
;Name=FSIM_CM229_00006E
Fsel1#1#ctg229_frg1_d maker mRNA 241 3010 . + . ID=FSIM_CM229_00006E-RA;Parent=FSIM_CM229_00006E;Alias=maker-Fsel1#1#
ctg229_frg1_d-snap-gene-0.16-mRNA-1;Name=FSIM_CM229_00006E-RA;_AED=0.73;_QI=0|0.33|0|0.5|0|0|4|0|264;_eAED=0.73
Fsel1#1#ctg229_frg1_d maker exon 241 624 . + . ID=exon:FSIM_CM229_00006E-RA:1;Parent=FSIM_CM229_00006E-RA
Fsel1#1#ctg229_frg1_d maker exon 712 892 . + . ID=exon:FSIM_CM229_00006E-RA:2;Parent=FSIM_CM229_00006E-RA
Fsel1#1#ctg229_frg1_d maker exon 1332 1549 . + . ID=exon:FSIM_CM229_00006E-RA:3;Parent=FSIM_CM229_00006E-RA
Fsel1#1#ctg229_frg1_d maker exon 2999 3010 . + . ID=exon:FSIM_CM229_00006E-RA:4;Parent=FSIM_CM229_00006E-RA
Fsel1#1#ctg229_frg1_d maker CDS 241 624 . + 0 ID=FSIM_CM229_00006E-RA:cds;Parent=FSIM_CM229_00006E-RA
Fsel1#1#ctg229_frg1_d maker CDS 712 892 . + 0 ID=FSIM_CM229_00006E-RA:cds;Parent=FSIM_CM229_00006E-RA
Fsel1#1#ctg229_frg1_d maker CDS 1332 1549 . + 2 ID=FSIM_CM229_00006E-RA:cds;Parent=FSIM_CM229_00006E-RA
Fsel1#1#ctg229_frg1_d maker CDS 2999 3010 . + 0 ID=FSIM_CM229_00006E-RA:cds;Parent=FSIM_CM229_00006E-RA
But I get the same error message :
[vg rna] Parsing graph file ...
[vg rna] Graph parsed in 4.28164 seconds, 0.898983 GB
[vg rna] Adding transcript splice-junctions and exon boundaries to graph ...
[transcriptome] ERROR: Tag "transcript_id" not found in attributes (line 3).
Since "By default only lines with the exon feature (column 4) will be parsed" (https://github.com/vgteam/vg/wiki/Transcriptomic-analyses), I tried with keeping only the exons in the .gff3.
This time, it did not throw any errors, but it neither parsed a transcript nor added a path on the graph:
[vg rna] Parsing graph file ...
[vg rna] Graph parsed in 4.54644 seconds, 0.898983 GB
[vg rna] Adding transcript splice-junctions and exon boundaries to graph ...
[vg rna] 0 transcripts parsed and graph augmented in 12.8608 seconds, 3.77022 GB
[vg rna] Topological sorting graph and compacting node ids ...
[vg rna] Graph sorted and compacted in 27.6161 seconds, 3.77022 GB
[vg rna] Adding reference transcripts as embedded paths in the graph ...
[vg rna] 0 paths added in 3.04803e-05 seconds, 3.77022 GB
[vg rna] Writing pantranscriptome (reference and haploype transcripts) to file(s) ...
[vg rna] Writing splicing graph to stdout ...
[vg rna] Graph and pantranscriptome written in 4.79321 seconds, 3.77022 GB
By default vg rna
assumes that the transcript name (or id) is giving by the attribute transcript_id. You can change this using the option -s, --transcript-tag
. Based on the annotation you shared it looks like the transcript name is giving by the Parent attribute (-s Parent
).
Ok I've read that and I thought it would be -s ID
... sorry for the silly questions and thank you very much, I'll try this asap !
So, this worked ! I mean, my two .gff were not properly ordered (exon order), and I finally ended up removing the ID field, but this worked.
However
I still had an error:
(vg) [lsoldini@dna020 vg_rna]$ vg rna -p -n Fsel1#1.gff -n Fsel2#2.gff -s Parent -r -u -b pantranscriptome.gbwt -i pantranscriptome.txt full_genome_2.pg > spliced_graph.pg
[vg rna] Parsing graph file ...
[vg rna] Graph parsed in 4.1638 seconds, 0.898983 GB
[vg rna] Adding transcript splice-junctions and exon boundaries to graph ...
[transcriptome] ERROR: Chromomsome path "Fsel2#2#ctg855_d" not found in graph or haplotypes index (line 2).
Question
How am I supposed to add the non-reference transcript as embedded-paths ?
What I've tried
If I understood correctly, this is because I do not have the region field of the second .gff (second in the sens the one corresponding to variants provided as a .vcf) as embedded-paths in the graph. I've seen that with -r
we are adding the reference-transcripts as embedded-paths in the graph. I think then I should use -a
to add the haplotype transcripts as embedded paths. I try with adding this parameter but I had the same error.
Or, maybe, it is when I built the graph with vg construct
that I should have done things differently. I have used the -a
/ --alt-paths
parameter, but maybe I shouldn't have ? Or maybe I should have used -n
/ --rename V=F
in addition ?
This is what I used:
vg construct -C -a -v Fsel1.filt.vcf.gz -r Fsel1#1.fasta > full_genome.vg
vg rna
gives this error when one of the chromosomes specified in the input annotation file (column 1) is not found as a path in the graph. vg rna
needs the chromosome paths as coordinates otherwise it will not know where the transcripts are located in the graph. The paths can either be embedded in the graph (contained in the .pg graph file) or be given to vg rna
as a GBWT file. When using vg construct
only the paths in the reference fasta will be added to the graph.
One possible solution for this would be to use the graph from pggb
(.gfa) instead, which I think should contain the paths for each assembly. Alternatively, if you have the genotypes for the second assembly in the vcf you can also try to first construct a GBWT file of the second assembly haplotypes and then add them to the graph:
vg gbwt -x full_genome_2.pg -v Fsel1.filt.vcf.gz -o Fsel1.filt.gbwt
vg paths -S <sample> -x full_genome_2.pg -g Fsel1.filt.gbwt -A | vg augment -F -B full_genome_2.pg - > full_genome_2_paths.pg
Here <sample>
is the name of the sample in the vcf file that you want to add to the graph. You will likely need to rename the chromosomes in the annotation afterwards since they will probably not match the names of the new paths in the graph. You can see the names of the paths in a graph using:
vg paths -L -v full_genome_2_paths.pg | grep -v "_alt"
Thank you for your answer !
About using the .gfa, I thought that this would not be supported since pggb
creates complex graph regions (also from feedback I received in #3653) ? But if you think this may work I will definitely give it a try.
For the alternative, I am not sure whether the .vcf I have is correctly giving the genotype in the FORMAT field and the "Fsel2" column (this is the vcf for alternates alleles of assembly Fsel1):
##fileformat=VCFv4.2
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##INFO=<ID=CONFLICT,Number=.,Type=String,Description="Sample names for which there are multiple paths in the graph with conflicting alleles">
##INFO=<ID=AC,Number=A,Type=Integer,Description="Total number of alternate alleles in called genotypes">
##INFO=<ID=AF,Number=A,Type=Float,Description="Estimated allele frequency in the range (0,1]">
##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of samples with data">
##INFO=<ID=AN,Number=1,Type=Integer,Description="Total number of alleles in called genotypes">
##INFO=<ID=LV,Number=1,Type=Integer,Description="Level in the snarl tree (0=top level)">
##INFO=<ID=PS,Number=1,Type=String,Description="ID of variant corresponding to parent snarl">
##INFO=<ID=AT,Number=R,Type=String,Description="Allele Traversal as path in graph">
##contig=<ID=Fsel1#1#scf7,length=12339748>
##contig=<ID=Fsel1#1#scf5,length=12127983>
##contig=<ID=Fsel1#1#scf2,length=11903705>
##contig=<ID=Fsel1#1#scf6,length=11417000>
... (list of all scaffolds and contigs) ...
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT Fsel2
Fsel1#1#ctg1003_frg1_d 287 >1087507>1087510 C T 60 . AC=1;AF=1;AN=1;AT=>1087507>1087509>1087510,>1087507>1087508>1087510;NS=1;LV=0 GT 1
Fsel1#1#ctg1003_frg1_d 510 >1087510>1087512 CCA C 60 . AC=1;AF=1;AN=1;AT=>1087510>1087511>1087512,>1087510>1087512;NS=1;LV=0 GT 1
Fsel1#1#ctg1003_frg1_d 858 >1087512>1087514 A ATCTCTC 60 . AC=1;AF=1;AN=1;AT=>1087512>1087514,>1087512>1087513>1087514;NS=1;LV=0 GT 1
Fsel1#1#ctg1003_frg1_d 969 >1087514>1087516 T TC 60 . AC=1;AF=1;AN=1;AT=>1087514>1087516,>1087514>1087515>1087516;NS=1;LV=0 GT 1
Fsel1#1#ctg1003_frg1_d 974 >1087516>1087518 G GC 60 . AC=1;AF=1;AN=1;AT=>1087516>1087518,>1087516>1087517>1087518;NS=1;LV=0 GT 1
My .vcf files were done with pggb
: I used the -V Fsel1:#,Fsel2:#
parameter such that it runs:
vg deconstruct -P Fsel1 -H # -e -a -t 20 /full_genome_3/results/both.fasta.gz.e792acf.e34d4cd.b972caa.smooth.final.gfa > Fsel1.vcf
vg deconstruct -P Fsel2 -H # -e -a -t 20 /full_genome_3/results/both.fasta.gz.e792acf.e34d4cd.b972caa.smooth.final.gfa > Fsel2.vcf
Also, I used the PanSN-spec to name the scaffolds and contigs of my two assemblies e.g. Fsel1#1#scfx and Fsel2#1#scfx for the scaffold x.
So, at the end, I have a Fsel1.fasta, Fsel1.vcf (the alternate alleles of Fsel1), and Fsel1.gff. As well as these three files for Fsel2.
I tried the command line :
vg gbwt -x full_genome_2.pg -v Fsel1.filt.vcf.gz -o Fsel1.filt.gbwt
but it throws an error because error: [vg index] expected 1 samples, got 0
. I also tried by adding -S Fsel2
to vg gbwt
but this did not help.
Finally, I will look closer to the vg gbwt
and vg paths
commands to see whether I manage to make this works. I also look what it gives to use the .gfa directly.
Since this was not working with the .vcf coming from pggb
, I started with .gfa and used vg deconstruct
"manually". I also skipped the vcfbub
step I was previously doing. The commands you suggested have runned smoothly. (I will later try directly with the .gfa)
However, I am quite unsure about the final output. When I grep -v "alt"
for the path in the resulting .pg graph, I can see the paths for the first assembly (e.g., Fsel1#1#ctg1552_d
) and some for the second assembly (e.g., _thread_Fsel2_Fsel1#1#scf22_0_2
). So, this is fine, I can just rename. However, it does not seems to make sense because I have many entries looking like this:
_thread_Fsel2_Fsel1#1#ctg1664_frg2_0_0
_thread_Fsel2_Fsel1#1#ctg1206_0_0
_thread_Fsel2_Fsel1#1#ctg1206_0_1
_thread_Fsel2_Fsel1#1#ctg1206_0_2
_thread_Fsel2_Fsel1#1#ctg1442_0_0
_thread_Fsel2_Fsel1#1#ctg1248_0_0
_thread_Fsel2_Fsel1#1#ctg1023_frg2_d_0_0
_thread_Fsel2_Fsel1#1#ctg1023_frg2_d_0_1
_thread_Fsel2_Fsel1#1#ctg1372_0_0
Those contig names are specific to the Fsel1 assembly, and then it does not seems to make sense to have those there ? How can I link the given name back to the names from the Fsel2 assembly?
PS: Maybe this is something I should have mentionned earlier (sorry for that), but I have contigs that are specific to the first assembly and other to the second.
Well, this worked smoothly and it was fully straightforward with the .gfa ... I feel a bit stupid to have tried so hard with the .vcf approach (but I hope having those nested snarls in my graph won't be a problem downstream with the index etc.). At least as far as I can tell, the .gfa from pggb
seemed to have been fully supported by vg convert
(also all paths were present in the .pg).
@jonassibbesen I am re-opening this because I had similar issue, but I am now working with pggb 0.5.2
and vg 1.44.0
(surprinsingly -at least for me-, I re-run the same scripts as before, but this throw me errors).
I have runned pggb to get a .gfa
of my assemblies. However, when using vg toolkit, I am having issues with Chromosome path not being found - would you have any idea of why this happens?
vg convert -p -g genome.gfa > genome.pg
vg mod -t 10 -X 256 genome.pg > genome_mod.pg
vg rna -p -t 10 -n Fsel1_formatted.gff -n Fsel2_formatted.gff -s Parent -c all -u -b pantranscriptome.gbwt -i pantranscriptome.txt genome_mod.pg > splicepangenome.pg
The error is:
ERROR: Chromomsome path "Fsel1#1#ctg229_frg1_d" not found in graph or haplotypes index (line 2).
My .gff should be correctly formated:
##gff-version 3
Fsel1#1#ctg229_frg1_d maker exon 241 624 . + . Parent=FSIM_CM229_00006E-RA
Fsel1#1#ctg229_frg1_d maker exon 712 892 . + . Parent=FSIM_CM229_00006E-RA
Fsel1#1#ctg229_frg1_d maker exon 1332 1549 . + . Parent=FSIM_CM229_00006E-RA
Fsel1#1#ctg229_frg1_d maker exon 2999 3010 . + . Parent=FSIM_CM229_00006E-RA
And the path "Fsel1#1#ctg229_frg1d" is seeminlgy found within the .gfa
_ that I am using:
P Fsel1#1#ctg229_frg1_d 1097187-,1097185-,1097183-,1097181-,1097180-,1097179-,1097178-,1097176-,1097174-,1097172-,1097171-,1097169-,1097167-,1097166-,1097165-,1332711+ *
P Fsel1#1#ctg229_frg3_d 9246+,9247+,9249+,9250+,9252+,9253+,9255+,9256+,9257+,9259+,9260+,9262+,9263+,9265+,9267+,9268+,9270+,9271+,9273+,9274+,9276+,9277+,9279+,9280+,9282+,9283+,9285+,9286+,9288+,9289+,9291+,9292+,9294+,9295+,9297+,9298+,9300+,9301+,9303+,9304+,9305+,9306+,9308+,9309+,9311+,9312+,9314+,9315+,9317+,9318+,9320+,9321+,9322+,9323+,9324+,9325+,9327+,9328+,9330+,9332+,9333+,9335+,9336+,9338+,9339+,9341+,9342+,9344+,9345+,9347+,9348+,9350+,9351+,9353+,9354+,9356+,9357+,9359+,9361+,9362+,9364+,9365+,9366+,9367+,9369+,9371+,9372+,9374+,9375+,9376+,9377+,9378+,9379+,9381+,9382+,9384+,9385+,9387+,9388+,9390+,9391+,9393+,9394+,9396+,9397+,9399+,9400+,9402+,9404+,9406+,9407+,9408+,9409+,9411+,9412+,9413+,9415+,9416+,9419+,9421+,9423+,9424+,9426+,9428+,9429+,9431+,9432+,9434+,9435+,9437+,9438+,9440+,9441+,9443+,9444+,9446+,9447+,9449+,9450+,9451+,9452+,9453+,9455+,9456+,9459+,9461+,9462+,9464+,9465+,9467+,9469+,9472+,9473+,9474+,9475+,9476+,9477+,9479+,9480+,9482+,9483+,9485+,9487+,9488+,9489+,9491+,9492+,9493+,9494+,9495+,9496+ *
@lsoldini See https://github.com/vgteam/vg/issues/3807. Please try vg 1.40.0
.
Well that is a problem, I need vg 1.44.0
for some new feature in vg rna
(#3806).
I'll try using vg convert
and vg mod
of vg 1.40.0
and then vg 1.44.0
. It worked smoothly, thanks!
Why is it that vg index
does no longer have -s
parameter in vg 1.44.0
?
I wanted to build a distance index with vg index
and -s file.snarls
(originating from vg snarls
). How should I proceed in vg 1.44.0
? I looked for changes in vg index
in the release but did not find what has changed.
I believe the snarls are now computed on-the-fly, so you don't need to provide them as input.
1. What were you trying to do?
Construct spliced pangenome graph for further RNA-seq mapping with
vg mpmap
.2. What did you want to happen?
Get a .pg spliced pangenome graph.
3. What actually happened?
4. If you got a line like
Stack trace path: /somewhere/on/your/computer/stacktrace.txt
, please copy-paste the contents of that file here:5. What data and command can the vg dev team use to make the problem happen?
The full_genome_2.pg was created using vg construct | vg convert from .fasta and .vcf file.
The gff looks like this:
6. What does running
vg version
say?