Parsing --external-gene-calls can not handle non CDS gene calls

jbird9 commented 7 years ago

A Contig-Error is thrown every time a sequences cannot be translated even if that sequence is from a tRNA or rRNA. It would be very handy to be able to visualize these sequences on the genome plots downstream.

I wrote a ruby script to parse gff files from PROKKA annotations pipeline into the anvio tab delimited format. Perhaps, an additional tab should be required to tell the program whether it should throw an error for not being able to translate the protein or that it should ignore the error because the feature is not a protein.

Thanks,

Jordan Bird

meren commented 7 years ago

Hmm. This is an interesting problem. We only use amino acid sequences for protein clustering though :/ I am not sure how that could be addressed. Any opinions?

meren commented 7 years ago

Please ignore my previous message and questions :)

Setting the "partial" column to 1 in the EGC file for CDS gene calls should suppress those errors. Did you try that?

jbird9 commented 7 years ago

OK, I think I can implement that in my parser easily enough.

meren commented 7 years ago

Plug on behalf of others: It would have been lovely if you could share your parser with a simple example for future references :) We can talk more about it offline if you are willing.

For instance Mike Lee had written a small set of instructions for something he did, and was very useful for many (including us): http://merenlab.org/2016/05/21/archaeal-single-copy-genes/

jbird9 commented 7 years ago

Hi Meren,

Here is the script I wrote. It is a work in progress. Hopefully, my strategy for fixing the off by one error for the start or stop reported in the gff is correct (Otherwise all the proteins are wrong, haha).

#!/usr/bin/ruby

i = 1

puts "gene_callers_id\tcontig\tstart\tstop\tdirection\tpartial\tsource\tversion"    
File.readlines("#{ARGV[0]}").each do |line|
    next if line.match(/^#/)
    next if line.match(/""/)
    break if line.match(/^>/)
    direction = ""
    id = ""
    tabs = line.split("\t")
    if tabs[2].match(/CDS/) then
        part = 0
    else
        part = 1
    end 
    attributes = tabs[8].split(";")
    attributes.each do |txt|
        if txt.include? "ID"
            id = txt
        end
    end
    idparts = id.split("_")
    idnum = idparts[-1].to_i.to_s
    if tabs[6] == "-" then
        direction = "r"
    else
        direction = "f"
    end
    tab3min1 = tabs[3].to_i - 1
    linestr = "#{idnum}\t#{tabs[0]}\t#{tab3min1}\t#{tabs[4]}\t#{direction}\t#{part}\tPROKKA\t1.11"
    puts linestr
    i=i+1   
end

I then implemented the script in bash with

list=`ls *gff`

for i in $list; do ruby gff2anvio.rb $i > $i.anvio; done

list=`ls *anvio`

for i in $list; do sort -g -k 1,1 $i > $i.sorted; done

And used the sorted files as my --external-gene-calls input

meren commented 7 years ago

That is why I had asked for an option in Prokka so the user could provide gene sequences instead of contigs (so you could use anvi-get-aa-sequences-for-gene-calls or anvi-get-dna-sequences-for-gene-calls, and everything would've been absolutely consistent). Got no response from the developer. Such is life.

jbird9 commented 7 years ago

Gene sequences? GFF files are pretty standard and flexible annotation files so hopefully this script can be usefully to people that want to use their PROKKA calls.

meren commented 7 years ago

But that is not the point. You should be able to decouple gene calling program from the annotation program. In this workflow, you are limited to how genes are called by the annotation program. The solution to that is to be able to by-pass the gene calling step done in the annotation program by directly providing sequences for gene calls, which is not possible in this case.

jbird9 commented 7 years ago

Oh well that sequences for the genes in dna and aa are standard outputs of PROKKA.

On Oct 14, 2016 5:54 PM, "A. Murat Eren" notifications@github.com wrote:

But that is not the point. You should be able to decouple gene calling program from the annotation program. In this workflow, you are limited to how genes are called by the annotation program. The solution to that is to be able to by-pass the gene calling step done in the annotation program by directly providing sequences for gene calls, which is not possible in this case.

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/meren/anvio/issues/408#issuecomment-253928967, or mute the thread https://github.com/notifications/unsubscribe-auth/ADa5FNBmko2QPPDBgBInVhSKwvZmxD2mks5qz_oAgaJpZM4KXb9U .

meren commented 7 years ago

It is very nice of PROKKA, but they should be one of the standard 'inputs' so you can use whatever gene caller you want and still be able to annotate your sequences :)

jbird9 commented 7 years ago

I am still not quite sure if I am following, but for those people that would like both their PROKKA and INTERPROSCAN annotations imported into their contigs.db(s). They should use the following scripts.

You will want to use your .faa output from PROKKA as the input for INTERPROSCAN in order for this to work.

First, if you ran Interproscan with -pa like me you will have to use cut to grab just the first 11 tabs that don't include the pathway tabs with can be of variable length (this was discussed in another issue thread).

list=`ls *tsv`

for i in $list; do cut -f 1,2,3,4,5,6,7,8,9,10,11 $i > ${i::-3}noPA.tsv; done

Then you can use a subset of the script from the gff2anvio.rb script from above to rename your PROKKA IDs to match the IDs you already have in your contigs.db (which have to be intergers).

#!/usr/bin/ruby
File.readlines("#{ARGV[0]}").each do |line|
    tabs = line.split("\t")
    idparts = tabs[0].split("_")
    idnum = idparts[-1].to_i.to_s
    linestr = "#{idnum}\t#{tabs[1]}\t#{tabs[2]}\t#{tabs[3]}\t#{tabs[4]}\t#{tabs[5]}\t#{tabs[6]}\t#{tabs[7]}\t#{tabs[8]}\t#{tabs[9]}\t#{tabs[10]}"
    puts linestr
end

list=`ls *noPA*`

for i in $list; do ruby interproscanofprokka2anvio.rb $i > ${i::-3}ints.tsv; done

Then finally those files can be used to add annotations to contigs.db without any problems.

list=`ls CONTIGSDB/*contigs.db`

for i in $list; do anvi-import-functions -c $i -i INTERPROSCAN_FILES${i:9:-10}noPA.ints.tsv -p interproscan; done

Disclaimer: Your file organization and naming may not be the same as mine and therefore code should be altered to fit your setup.

meren commented 7 years ago

I am closing this issue, but I should come back to this to maybe write a little piece on the blog so it is not lost in the issues section.

merenlab / anvio

Parsing --external-gene-calls can not handle non CDS gene calls #408