vgteam / vg

tools for working with genome variation graphs
https://biostars.org/tag/vg/
Other
1.13k stars 195 forks source link

Variant IDs when using VG RNA and VG CALL #3794

Open ashleethomson opened 2 years ago

ashleethomson commented 2 years ago

Hi, I'm currently creating an exon only spliced graph (NOT haplotype specific) using vg rna, and wanted to know if there was a way to make the IDs (contigs) the chromosome number instead of the transcript ID. I understand that when using vg rna there is an option to input a transcript file:

-n, --transcripts FILE transcript file(s) in gtf/gff format

(An example of my GTF is as follows)

1   pseudogene  gene    11869   14412   .   +   .   gene_source "ensembl_havana"; gene_biotype "pseudogene"; gene_id "ENSG00000223972"; gene_name "DDX11L1";
1   processed_transcript    transcript  11869   14409   .   +   .   transcript_source "havana"; gene_id "ENSG00000223972"; gene_source "ensembl_havana"; transcript_name "DDX11L1-002"; gene_biotype "pseudogene"; transcript_id "ENST00000456328"; gene_name "DDX11L1";

and the option to select which attribute tag to use as the ID:

-s, --transcript-tag NAME use this attribute tag in the gtf/gff file(s) as id [transcript_id]

but I want to use the chromosome number as the ID tag.

Currently when I use vg call on my graph, my VCFs look like this

##contig=<ID=ENST00000584536_R1,length=804>
##contig=<ID=ENST00000550993_R1,length=438>
##contig=<ID=ENST00000490417_R1,length=952>
ENST00000196061_R1  2489    >376099>117009861   T   C   15.5976 PASS    ...
ENST00000196061_R1  2640    >376103>118173663   T   C   12.7808 PASS    ...
ENST00000196061_R1  2662    >376104>105813788   A   T   12.5346 PASS    ...

But I want this format:

##contig=<ID=8,length=146364022>
##contig=<ID=9,length=141213431>
##contig=<ID=10,length=135534747>
1    10560   .       C       G       21.77   .       ...
1    13813   .       T       G       30.78   .       ...
1    14464   .       A       T       36.31   .       ...

Is there a way around this? Or is this specific to vg rna? My current code for vg rna is:

vg rna -p -d -o -r -n chr${i}.gtf chr${i}.vg > ${GRAPH_PREFIX}_${i}.vg

Any advice is welcome, and thanks in advance.

jonassibbesen commented 2 years ago

Hi, this is not possible when using an exon-only spliced graph. The reason for this is that when creating the exon-only graph, the graph is split into smaller disjoint subgraphs corresponding to genes. Because intergenic and intronic regions are removed as part of this process it is not possible to represent the full chromosome paths anymore in the graph and they are therefore removed.

One possible solution if you want calls in chromosome coordinates instead of transcript coordinates would be to use a full spliced graph:

vg rna -p -r -n chr${i}.gtf chr${i}.vg > ${GRAPH_PREFIX}_${i}.vg

In this case the chromosome paths will still be present in the spliced graph. You can then use -p in vg call to specify which reference path(s) to get the calls on.

Are you using RNA-seq data for vg call? If so, I just quickly want to mention that using vg call for variant calling or genotyping from RNA-seq data is not something we have evaluated. It has been developed and parameterized with genomic data in mind so I am not sure how well it will perform on transcriptomic data.

ashleethomson commented 2 years ago

Thank you so much for this information, it will help immensely! And also thank you for the advice regarding vg call. I will make note of this if any discrepancies arise in my results. Thank you again!