ncbi / vadr

Viral Annotation DefineR: classification and annotation of viral sequences based on RefSeq annotation

Feature request: GFF output #15

Closed taltman closed 1 month ago

taltman commented 4 years ago

When validating an annotation, I find it very convenient to load the annotation in GFF format into a genome browser like JBrowse, to see how the alignment works. Unfortunately I didn't find a file in the VADR output directory that could be used directly for this purpose. I hacked together a script (happy to share if it would be of interest) that generates a GFF file from the *.tbl files, and used it to visualize the VADR annotation of a novel CoV genome distantly related to any existing CoV genomes in RefSeq:

image

So my enhancement request is that VADR generates a GFF version 3.0 file which is ready for uploading into genome browsers. Thanks for your consideration!

nawrockie commented 4 years ago

@taltman : I plan to include this in the next version (next version after 1.1.1).

nawrockie commented 3 years ago

Hi @taltman : Sorry GFF output didn't make it into v1.1.1. Could you please share your script with me? I'm curious how you handled the Parent field in the attributes column. Thanks!
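For context on the Parent question, here is a minimal Python sketch of how a GFF3 attributes column (column 9) could be assembled. It is not taken from tbl2gff.awk or VADR. The function name `gff3_attributes`, the feature IDs, and the choice to make a mat_peptide a child of a CDS are all assumptions made for illustration.

```python
def gff3_attributes(pairs):
    """Join (tag, value) pairs into a GFF3 column-9 string.

    Hypothetical helper: percent-encodes characters that are reserved
    in the attributes column so values cannot break the tag=value;... layout.
    """
    def esc(text):
        text = text.replace("%", "%25")  # encode '%' first so later codes survive
        for ch, code in ((";", "%3B"), ("=", "%3D"), ("&", "%26"),
                         (",", "%2C"), ("\t", "%09")):
            text = text.replace(ch, code)
        return text

    return ";".join(f"{tag}={esc(value)}" for tag, value in pairs)


# Assumed hierarchy: the mature peptide points at its CDS through Parent.
cds_attrs = gff3_attributes([("ID", "cds-1"), ("Name", "polyprotein")])
pep_attrs = gff3_attributes([("ID", "ftr-8"), ("Name", "NS2a"), ("Parent", "cds-1")])
print(pep_attrs)
```

The Parent value has to match the ID of another feature in the same file. Which features should be linked this way is a modeling decision that this sketch does not settle.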

taltman commented 3 years ago

Hi @nawrockie , sorry for the delayed response.

I hacked together a quick script to do this conversion. I only "validated" it as much as the GFF annotations displayed sensibly in JBrowse. I didn't run it through any GFF validators to see whether it generates compliant GFF files. HTH!

https://bitbucket.org/tomeraltman/darth/src/master/src/tbl2gff.awk

In gearing up for submitting our novel CoV genomes to EBI, I will be verifying that the generated GFF is compliant, so hopefully I'll have an improved version soon.

nawrockie commented 3 years ago

Great, thanks @taltman ! If you do end up improving it, please let me know.

zhaoxvwahaha commented 3 years ago

Hi @nawrockie , can VADR output GFF3 or GBK files now?

nawrockie commented 3 years ago

@zhaoxvwahaha : Unfortunately, not yet. Development of other features has taken priority. Are you able to use the script kindly shared by Tomer Altman above?:

https://bitbucket.org/tomeraltman/darth/src/master/src/tbl2gff.awk

Zjianglin commented 2 years ago

Hi @nawrockie, it seems tbl2gff cannot correctly convert fuzzy positions when determining the ORF strand. For example,

<3437   4116    mat_peptide
                        product NS2a

lines will be converted to MW164737 vadr mat_peptide 4116 <3437 . - . ID=ftr-8;Name=NS2a; in the resulting GFF file.

I tried to modify the script to

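$1 && $2 {
        # Copy each coordinate and strip any '<' or '>' fuzzy marker from the copy,
        # so the comparison below works on plain numbers.
        if ( match($1, "[><]") != 0 ) {
                begx = $1
                # The gsub() function returns the number of substitutions made
                gsub("[><]", "", begx)
        } else {
                begx = $1
        }
        if ( match($2, "[><]") != 0 ) {
                endx = $2
                gsub("[><]", "", endx)
        } else {
                endx = $2
        }
        # Decide the strand from the stripped values.
        # Note: start and end keep the original fields, including any '<' or '>'.
        if ( int(begx) < int(endx) ) {
                start  = $1
                end    = $2
                strand = "+"
        } else {
                start  = $2
                end    = $1
                strand = "-"
        }
        ftr_key = $3
        ++ftr_id
        #print start, end, strand, ftr_key, ftr_id
}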

Is this modification correct? By the way, is there any Python library that can parse GFF3 files and handle fuzzy positions? BCBio (https://github.com/chapmanb/bcbb) failed to read the records with fuzzy positions.
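For comparison, the same strand logic can be sketched in Python. This is only an illustration and not part of tbl2gff.awk or VADR. The function name `parse_tbl_coords` and the layout of the returned tuple are made up for the example. Unlike the awk snippet, it returns the stripped integers, on the assumption that the GFF start and end columns should hold plain numbers with start <= end.

```python
def parse_tbl_coords(beg_field, end_field):
    """Return (start, end, strand, beg_partial, end_partial) for one .tbl coordinate pair.

    Hypothetical helper: a leading '<' or '>' marks a partial (fuzzy) boundary.
    beg_partial/end_partial refer to the first and second .tbl fields, in file order.
    """
    beg_partial = beg_field.startswith(("<", ">"))
    end_partial = end_field.startswith(("<", ">"))
    beg = int(beg_field.lstrip("<>"))
    end = int(end_field.lstrip("<>"))
    if beg <= end:
        return beg, end, "+", beg_partial, end_partial
    # Minus strand: swap so that start <= end in the output.
    return end, beg, "-", beg_partial, end_partial


print(parse_tbl_coords("<3437", "4116"))
```

With the NS2a example above, the comparison is made on 3437 and 4116, so the feature is reported on the plus strand. The partial flags are returned separately so a caller can decide how to record them.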

nawrockie commented 2 years ago

@Zjianglin : I'm not sure about your modification of tbl2gff, you might try asking Tomer Altman who wrote that code (https://bitbucket.org/tomeraltman).

I don't know of any python library that can handle the fuzzy positions.

My suggestion would be to either modify tbl2gff to not output the '>' and '<' characters, or write a simple script that strips them out as an extra step after you've created the GFF file. You could also try writing a script that strips them out of the .tbl file that VADR creates prior to running tbl2gff, or try parsing the output .ftr table that VADR creates (https://github.com/ncbi/vadr/blob/master/documentation/formats.md#ftr), but note that the .ftr table does not have coordinate positions 'trimmed' due to Ns, like the .tbl file does.
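One of these options, stripping the markers from the .tbl file before running tbl2gff, could look roughly like the Python sketch below. It assumes tab-separated coordinate lines as in the example earlier in this thread. The function name `strip_tbl_fuzzy` is hypothetical and the sketch has not been tested against real VADR output. Note that it discards the information that a feature is partial.

```python
import re

# A coordinate field: an optional '<' or '>' followed by digits only.
COORD = re.compile(r"[<>]?\d+")


def strip_tbl_fuzzy(line):
    """Remove '<' and '>' from the two coordinate fields of one .tbl line.

    Hypothetical helper: lines whose first two tab-separated fields are not
    both coordinates (the '>Feature' header, qualifier lines) are returned unchanged.
    """
    fields = line.rstrip("\n").split("\t")
    if len(fields) >= 2 and COORD.fullmatch(fields[0]) and COORD.fullmatch(fields[1]):
        fields[0] = fields[0].lstrip("<>")
        fields[1] = fields[1].lstrip("<>")
        newline = "\n" if line.endswith("\n") else ""
        return "\t".join(fields) + newline
    return line


print(strip_tbl_fuzzy("<3437\t4116\tmat_peptide\n"), end="")
```

Applying the function to every line of a .tbl file and writing the result to a new file would give tbl2gff an input without fuzzy markers. The '>Feature' header line is left alone because its first field is not a number.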

Zjianglin commented 2 years ago

Hi @nawrockie , thanks for your reply and suggestions. I will try to manually check the genomes with fuzzy positions and strip the markers out. Thank you again.

Zjianglin commented 2 years ago

Hi @taltman , could you please check my modification of [tbl2gff](https://bitbucket.org/tomeraltman/darth/src/master/src/tbl2gff.awk)? The original script does not seem to handle fuzzy positions.

kapsakcj commented 1 year ago

+1 for this request

I've had a couple of requests to visualize the outputs of VADR in IGV, specifically in a GFF3 file

cimendes commented 1 month ago

Is this feature still on the roadmap for VADR development?

nawrockie commented 1 month ago

@cimendes sorry for the long delay on this requested feature. I'm working on it now and will post another update by the end of next week.

cimendes commented 1 month ago

Thank you for the update! That is wonderful news!

nawrockie commented 1 month ago

@cimendes : I added miniscripts/annotate-tbl2gff.pl, a 'miniscript' that can be used to convert v-annotate.pl .tbl output files to GFF3 format. The script is currently in the develop branch and will be included in the next released version. For now, the version in the develop branch should work fine as a standalone conversion script.

This GFF format is not meant to be used for GenBank submissions. Use the .vadr.pass.tbl and .vadr.fail.tbl files for that.

Do

perl annotate-tbl2gff.pl -h

to see information on usage and options.

Please let me know if there are any problems with the script or feature requests.