Closed jbird9 closed 7 years ago
Hmm. This is an interesting problem. We only use amino acid sequences for protein clustering though :/ I am not sure how that could be addressed. Any opinions?
Please ignore my previous message and questions :)
Setting the "partial" column to 1 in the EGC file for CDS gene calls should suppress those errors. Did you try that?
OK, I think I can implement that in my parser easily enough.
Plug on behalf of others: It would have been lovely if you could share your parser with a simple example for future references :) We can talk more about it offline if you are willing.
For instance Mike Lee had written a small set of instructions for something he did, and was very useful for many (including us): http://merenlab.org/2016/05/21/archaeal-single-copy-genes/
Hi Meren,
Here is the script I wrote. It is a work in progress. Hopefully, my strategy for fixing the off by one error for the start or stop reported in the gff is correct (Otherwise all the proteins are wrong, haha).
#!/usr/bin/ruby
i = 1
puts "gene_callers_id\tcontig\tstart\tstop\tdirection\tpartial\tsource\tversion"
File.readlines("#{ARGV[0]}").each do |line|
next if line.match(/^#/)
next if line.match(/""/)
break if line.match(/^>/)
direction = ""
id = ""
tabs = line.split("\t")
if tabs[2].match(/CDS/) then
part = 0
else
part = 1
end
attributes = tabs[8].split(";")
attributes.each do |txt|
if txt.include? "ID"
id = txt
end
end
idparts = id.split("_")
idnum = idparts[-1].to_i.to_s
if tabs[6] == "-" then
direction = "r"
else
direction = "f"
end
tab3min1 = tabs[3].to_i - 1
linestr = "#{idnum}\t#{tabs[0]}\t#{tab3min1}\t#{tabs[4]}\t#{direction}\t#{part}\tPROKKA\t1.11"
puts linestr
i=i+1
end
I then implemented the script in bash with
list=`ls *gff`
for i in $list; do ruby gff2anvio.rb $i > $i.anvio; done
list=`ls *anvio`
for i in $list; do sort -g -k 1,1 $i > $i.sorted; done
And used the sorted files as my --external-gene-calls input
That is why I had asked for an option in Prokka so the user could provide gene sequences instead of contigs (so you could use anvi-get-aa-sequences-for-gene-calls
or anvi-get-dna-sequences-for-gene-calls
, and everything would've been absolutely consistent). Got no response from the developer. Such is life.
Gene sequences? GFF files are pretty standard and flexible annotation files so hopefully this script can be usefully to people that want to use their PROKKA calls.
But that is not the point. You should be able to decouple gene calling program from the annotation program. In this workflow, you are limited to how genes are called by the annotation program. The solution to that is to be able to by-pass the gene calling step done in the annotation program by directly providing sequences for gene calls, which is not possible in this case.
Oh well that sequences for the genes in dna and aa are standard outputs of PROKKA.
On Oct 14, 2016 5:54 PM, "A. Murat Eren" notifications@github.com wrote:
But that is not the point. You should be able to decouple gene calling program from the annotation program. In this workflow, you are limited to how genes are called by the annotation program. The solution to that is to be able to by-pass the gene calling step done in the annotation program by directly providing sequences for gene calls, which is not possible in this case.
— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/meren/anvio/issues/408#issuecomment-253928967, or mute the thread https://github.com/notifications/unsubscribe-auth/ADa5FNBmko2QPPDBgBInVhSKwvZmxD2mks5qz_oAgaJpZM4KXb9U .
It is very nice of PROKKA, but they should be one of the standard 'inputs' so you can use whatever gene caller you want and still be able to annotate your sequences :)
I am still not quite sure if I am following, but for those people that would like both their PROKKA and INTERPROSCAN annotations imported into their contigs.db(s). They should use the following scripts.
You will want to use your .faa output from PROKKA as the input for INTERPROSCAN in order for this to work.
First, if you ran Interproscan with -pa like me you will have to use cut to grab just the first 11 tabs that don't include the pathway tabs with can be of variable length (this was discussed in another issue thread).
list=`ls *tsv`
for i in $list; do cut -f 1,2,3,4,5,6,7,8,9,10,11 $i > ${i::-3}noPA.tsv; done
Then you can use a subset of the script from the gff2anvio.rb script from above to rename your PROKKA IDs to match the IDs you already have in your contigs.db (which have to be intergers).
#!/usr/bin/ruby
File.readlines("#{ARGV[0]}").each do |line|
tabs = line.split("\t")
idparts = tabs[0].split("_")
idnum = idparts[-1].to_i.to_s
linestr = "#{idnum}\t#{tabs[1]}\t#{tabs[2]}\t#{tabs[3]}\t#{tabs[4]}\t#{tabs[5]}\t#{tabs[6]}\t#{tabs[7]}\t#{tabs[8]}\t#{tabs[9]}\t#{tabs[10]}"
puts linestr
end
list=`ls *noPA*`
for i in $list; do ruby interproscanofprokka2anvio.rb $i > ${i::-3}ints.tsv; done
Then finally those files can be used to add annotations to contigs.db without any problems.
list=`ls CONTIGSDB/*contigs.db`
for i in $list; do anvi-import-functions -c $i -i INTERPROSCAN_FILES${i:9:-10}noPA.ints.tsv -p interproscan; done
Disclaimer: Your file organization and naming may not be the same as mine and therefore code should be altered to fit your setup.
I am closing this issue, but I should come back to this to maybe write a little piece on the blog so it is not lost in the issues section.
A Contig-Error is thrown every time a sequences cannot be translated even if that sequence is from a tRNA or rRNA. It would be very handy to be able to visualize these sequences on the genome plots downstream.
I wrote a ruby script to parse gff files from PROKKA annotations pipeline into the anvio tab delimited format. Perhaps, an additional tab should be required to tell the program whether it should throw an error for not being able to translate the protein or that it should ignore the error because the feature is not a protein.
Thanks,
Jordan Bird