salzberg-lab / Balrog

Bacterial Annotation by Learned Representation of Genes
MIT License

over-prediction? over-extended? #2

Open igortru opened 4 years ago

igortru commented 4 years ago

hundreds or thousands of “extra” genes per genome,
but a black box is a black box

1) For example, PGAP reports 434 genes on this plasmid (https://www.ncbi.nlm.nih.gov/protein?LinkName=nuccore_protein&from_uid=1887424984), but Balrog reports 499. Are they real?

2) Another issue is the protein start position. For this prediction:

NC_016612.1 Balrog CDS 488100 488564 . + 0 inference=ab initio prediction:Balrog;product=hypothetical protein

does it choose the longest possible ORF?

I prefer the shorter version, which corresponds to the conserved domain. Compare: ref|WP_014226776.1| ref|WP_016239709.1| ref|WP_016247216.1| ref|WP_021555225.1| ref|WP_023303013.1| ref|WP_032154610.1| ref|WP_060617254.1| ref|WP_109862812.1| ref|WP_135564269.1| ref|WP_171279038.1| ref|WP_172833565.1| ref|WP_172901084.1| ref|WP_172949498.1| ref|WP_181654360.1|

Markusjsommer commented 4 years ago

Hi Igortru,

Balrog tends to predict fewer genes than other gene finders with default parameters on complete bacterial genomes. We did not look specifically at plasmids, but it's an interesting place for comparison. I'm not super familiar with all the steps in PGAP, especially with plasmids, but Balrog would be intended as one piece in the larger pipeline, more akin to Prodigal/Glimmer/GeneMark rather than replacing a whole pipeline like Prokka or PGAP.

Looking at the "Klebsiella aerogenes strain RHBSTW-00938 plasmid pRHBSTW-00938_2, complete sequence" plasmid you mentioned, Balrog predicts 497 genes. I ran GeneMarkS-2 as a quick comparison on the same sequence and it predicted 544 genes, so it appears that here Balrog actually predicts fewer.

For the start sites it's a bit more complicated. Balrog takes into account each potential ORF's start codon and the sequence around it, as well as the length. All else being equal, Balrog will tend to choose longer genes, but good hits with the Translation Initiation Site (TIS) model, or incompatibilities with other high-scoring genes, can shift the start site of any individual gene to be shorter. The global maximal gene score is found, rather than the maximal score for any one gene. I would still trust start sites based on evolutionary conservation more than the predictions of a gene finder, so a good step after running Balrog may be a better start site predictor which takes into account more complex information like that.
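To illustrate the point about a global maximum, here is a minimal sketch. It is not Balrog's actual implementation. It assumes every candidate ORF (one per possible start site) already carries a single combined score, that chosen genes may not overlap at all, and that strand can be ignored; `best_gene_set` and the example scores are hypothetical. Under those assumptions the problem reduces to weighted interval scheduling, and a shorter start can win when the longest version collides with a high-scoring neighbor.

```python
from bisect import bisect_right


def best_gene_set(candidates):
    """Pick non-overlapping candidates with the highest summed score.

    candidates: list of (start, end, score), 1-based inclusive coordinates.
    Returns (total_score, chosen_candidates_in_genome_order).
    Simplified illustration only: no strand, no allowed overlap.
    """
    cands = sorted(candidates, key=lambda c: c[1])  # sort by end coordinate
    ends = [c[1] for c in cands]
    # best[i] holds the optimum using only the first i sorted candidates.
    best = [(0.0, [])]
    for i, (start, end, score) in enumerate(cands):
        # j = number of earlier candidates that end before this one starts.
        j = bisect_right(ends, start - 1, 0, i)
        take = (best[j][0] + score, best[j][1] + [(start, end, score)])
        skip = best[i]
        best.append(take if take[0] > skip[0] else skip)
    return best[-1]


# Hypothetical scores: the long version of a gene scores highest on its own,
# but it overlaps a good upstream neighbor, so the shorter start is chosen.
long_version = (100, 564, 5.0)
short_version = (190, 564, 4.5)
neighbor = (10, 150, 3.0)
total, chosen = best_gene_set([long_version, short_version, neighbor])
print(total, chosen)
```

Taken alone, the long version would be picked; taken together with the neighbor, the combination of neighbor plus short version scores higher, which is the "global rather than per-gene" behavior described above.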

igortru commented 4 years ago

> but Balrog would be intended as one piece in the larger pipeline

IMHO: after 30 years of genome annotation pipeline development, starting from scratch to invent a smarter ORF finder is like developing a new operating system.

It can be a very good exercise, but there are much more difficult and interesting problems for deep learning: for example, improving already existing annotations.

Just an idea: take all non-hypothetical protein names from RefSeq together with the corresponding protein sequences (100M unique sequences, many of them with good names thanks to Daniel Haft), and try to find distributed sequence motifs that would allow predicting protein names and fixing incorrect ones.

I am absolutely sure it is possible, and it is a really interesting problem.

P.S. We really need a command-line version of your tool; it would allow running it in batch mode and checking it more deeply. I want to compare it with PHANOTATE.

Markusjsommer commented 4 years ago

Ideally, Balrog would not be entirely replacing whole annotation pipelines. Rather, it's an attempt at using the vast amounts of data we have to train a more complex gene model than would have been possible 10 years ago. Hopefully we can integrate with and complement many of the other great tools out there (a faster command line version is definitely on my todo list).

Using a language model to predict/correct protein names based on sequence seems like an interesting idea, but a bit beyond the scope of this tool right now :)

Markusjsommer commented 4 years ago

As an aside, Balrog can process multiple genomes at once if you select multiple fasta files during the upload step, though Colab can be annoying about downloading more than 10 files at once

igortru commented 4 years ago

Ok.

I am developing a pipeline to cluster and annotate closely related phages: say, genomes from one genus, subfamily, or family.

At the end of each round I expect to have a set of really good alignments. Regular clustering is not always useful: very tight clusters are not interesting, and in very wide ones the multiple alignment is bad. The truth is somewhere in the middle: produce the widest clusters inside a taxonomic node that still have good alignments, with the T-Coffee TCS as the criterion.

As a first step I am using “orffinderplus”, a new NCBI program that combines the existing GenBank annotation with all top-level ORFs (it marks an ORF with the GenBank accession if the ORF was found in the annotation). It is probably not public yet, but I can share it with you if you are interested. It is useful: on one hand it shows which ORFs are new, and on the other hand it adds to processing some ORFs that a normal ORF finder just ignores.

Actually, this is a problem for your tool as well: missing pseudogenes. I see it on the Klebsiella plasmid.

orffinderplus produces too many ORFs, and I am looking for a way to replace it.

PGAP, Prokka, and PHANOTATE produce too few genes on phages, and that is a major problem: the results differ and I don't know which program I can trust. For now I prefer to have more gene models rather than fewer.

Again, I am talking only about phages/prophages. My current set is about 15K complete GenBank genomes.

P.S. I'll try ten genomes: Charlie + Redi + Butters viruses.

On Sep 10, 2020, at 12:43 AM, Markus Sommer notifications@github.com wrote:

 As an aside, Balrog can process multiple genomes at once if you select multiple fasta files during the upload step, though Colab can be annoying about downloading more than 10 files at once


igortru commented 4 years ago

It looks like the run time is proportional to the number of contigs, not to the total length of the sequences. I uploaded a 1.4 Mb FASTA containing 31 phages, and it has already taken more than an hour. Probably your server is already heavily loaded. Another reason to outsource your tool :)
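One way to test an observation like this is to record both numbers for every upload and compare them against the run time. A minimal sketch, assuming plain FASTA text in which every record starts with a `>` header line; `fasta_stats` is a hypothetical helper and not part of Balrog.

```python
def fasta_stats(lines):
    """Return (number_of_contigs, total_sequence_length) for FASTA lines.

    `lines` can be any iterable of strings, e.g. an open file handle.
    """
    n_contigs = 0
    total_length = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            n_contigs += 1  # each header line starts a new contig
        else:
            total_length += len(line)
    return n_contigs, total_length


# Tiny in-memory example; for a real file use:
#   with open("phages.fasta") as handle:   # hypothetical file name
#       print(fasta_stats(handle))
example = [">phage_1", "ATGAAATAG", "", ">phage_2", "ATGCCC", "TTTTAA"]
print(fasta_stats(example))
```

Running it on files with the same total length but different contig counts (and the reverse) would show which of the two quantities the run time actually tracks.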

Markusjsommer commented 4 years ago

It definitely shouldn't take that long regardless of the number of contigs; I'm thinking it might be because of the way I'm calling MMseqs2. If you send me the FASTA (zipped via GitHub comment, or you can email it to markusjsommer@gmail.com), I can try to find out why it's taking so long for you.

igortru commented 4 years ago

Another issue: circular genome support. For phages and plasmids it is a major issue; for bacterial genomes it could also be useful.
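A common workaround when a gene finder only handles linear input is sketched below. It is an assumption-laden illustration and is not taken from Balrog: append a copy of the first `wrap` bases to the end of the sequence, predict on the extended sequence, then map the coordinates back onto the circle. The helper names and the `wrap` parameter are hypothetical.

```python
def extend_circular(seq, wrap):
    """Append the first `wrap` bases so genes spanning the origin are intact."""
    return seq + seq[:min(wrap, len(seq))]


def map_to_circle(start, end, genome_len):
    """Map 1-based inclusive coordinates from the extended sequence back.

    Returns (start, end, spans_origin), or None when the feature lies
    entirely inside the appended copy and so duplicates a feature that
    is also reported near the start of the genome.
    """
    if start > genome_len:
        return None  # duplicate of a prediction near the origin
    if end > genome_len:
        return start, end - genome_len, True  # wraps around the origin
    return start, end, False


genome = "AAACCCGGGT"  # toy 10 bp "circular" sequence
extended = extend_circular(genome, 4)
print(extended)
print(map_to_circle(9, 12, len(genome)))
```

A feature reported as 9..12 on the extended toy sequence maps back to a gene that starts at position 9 and ends at position 2 after crossing the origin.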

igortru commented 4 years ago

Phage KC576783: 66 genes; Balrog is missing 12 genes in comparison with GenBank. Other genomes from the Butters + Charlie + Redi viruses have the same problem.

The method looks very interesting, but the training set does not look perfect. It would be very kind of you to make the training part open source as well; I just want to feed it all phage proteins.
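Counts like these are easier to audit when predictions are matched to the reference gene by gene. A minimal sketch, assuming both annotations are available as GFF-style tab-separated lines (the GenBank record would first need converting), and assuming two CDS features count as the same gene when they share sequence ID, strand, and stop-codon end coordinate, so that differing start sites do not count as misses. `cds_stop_keys` and `missing_genes` are hypothetical helper names.

```python
def cds_stop_keys(gff_lines):
    """Collect (seqid, strand, stop_coordinate) for every CDS line."""
    keys = set()
    for line in gff_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blanks and comment/pragma lines
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 8 or cols[2] != "CDS":
            continue
        seqid, start, end, strand = cols[0], int(cols[3]), int(cols[4]), cols[6]
        # On the minus strand the stop codon sits at the lower coordinate.
        stop = end if strand == "+" else start
        keys.add((seqid, strand, stop))
    return keys


def missing_genes(reference_lines, predicted_lines):
    """Reference CDS features with no prediction sharing the same stop."""
    return sorted(cds_stop_keys(reference_lines) - cds_stop_keys(predicted_lines))


# Toy data with made-up coordinates, for illustration only.
reference = [
    "c1\tGenBank\tCDS\t100\t400\t.\t+\t0\tID=a",
    "c1\tGenBank\tCDS\t500\t800\t.\t-\t0\tID=b",
    "c1\tGenBank\tCDS\t900\t1200\t.\t+\t0\tID=c",
]
predicted = [
    "c1\tBalrog\tCDS\t130\t400\t.\t+\t0\tID=x",  # same stop, shorter start
    "c1\tBalrog\tCDS\t500\t700\t.\t-\t0\tID=y",  # same stop on minus strand
]
print(missing_genes(reference, predicted))
```

Swapping the two arguments gives the opposite list: predictions with no counterpart in the reference, which is the "extra genes" question raised at the top of this thread.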

Markusjsommer commented 4 years ago

Balrog is not super optimized for very short sequences like phage. I'll add circular genome support in a future release, as that is not too hard, but it will likely only result in finding 1-2 more genes.

Releasing a new model trained on viruses would not be too difficult, though other parameters may need to be retuned as well to get good performance.