Open hyattpd opened 5 years ago
Spent some time looking at NCBI's PGAP. Very thorough and impressive pipeline, although this comes at a cost of speed (running HMMER on every ORF is pretty intense!). I think there's still a niche for more lightweight solutions that people can run on their laptops, though. Sketched out a way to do this that should be thousands of times faster (won't be as thorough as what PGAP is doing, though).
Yes, PGAP is nice, but it uses too much computation unnecessarily. I have an open issue to create a `--fast` mode for it.
Using RNA-Seq for evidence has been sort of done: https://www.ncbi.nlm.nih.gov/pubmed/30169674
YES please include an "annotation transfer" tool, or the ability to use existing annotations. This is important to get consistency in a pan-genome. RATT isn't used much anymore. This could be done at the DNA level using minimap2 (contigs to contigs) and paftools liftOver.
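To make the DNA-level transfer idea concrete, here is a minimal sketch of the coordinate arithmetic behind a lift-over. It assumes a single gapless alignment block, which is a simplification: a real tool would walk the CIGAR string to handle indels. `PafBlock` and `lift_interval` are hypothetical names for illustration, not part of minimap2 or paftools. Coordinates are 0-based and half-open, as in PAF and BED.

```python
from typing import NamedTuple, Optional, Tuple


class PafBlock(NamedTuple):
    """Subset of PAF columns for one alignment (hypothetical helper type)."""
    qstart: int  # start on the annotated (query) contig, 0-based
    qend: int    # end on the query contig, exclusive
    strand: str  # "+" or "-"
    tstart: int  # start on the unannotated (target) contig, 0-based
    tend: int    # end on the target contig, exclusive


def lift_interval(block: PafBlock, start: int, end: int) -> Optional[Tuple[int, int]]:
    """Map [start, end) from query to target, assuming NO indels in the block.

    Returns None when the feature is not fully covered by the alignment.
    """
    if start < block.qstart or end > block.qend:
        return None
    if block.strand == "+":
        offset = block.tstart - block.qstart
        return (start + offset, end + offset)
    # Reverse-strand alignment: the interval is mirrored around the block.
    return (block.tend - (end - block.qstart), block.tend - (start - block.qstart))


fwd = PafBlock(qstart=100, qend=1100, strand="+", tstart=5000, tend=6000)
rev = PafBlock(qstart=100, qend=1100, strand="-", tstart=5000, tend=6000)
lifted_fwd = lift_interval(fwd, 200, 500)
lifted_rev = lift_interval(rev, 200, 500)
not_covered = lift_interval(fwd, 50, 200)
```

Note that for a reverse-strand block the strand of the lifted gene would also need to be flipped; that bookkeeping is left out here.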
A short gene database would be useful. The DB should be a separate project, I think, or some derivative of UniProt?
Annotations should/could be TAXON AWARE. Some annotations are only applicable to certain phyla/genera etc., especially naming conventions.
I'll just "stay in my lane" for now and focus on protein-coding gene prediction, and evidence sources that aid with that.
Thought about this some more. I don't think it should be the new version's job to aid in getting evidence, just in using it. I'll probably support reading in of BED or GFF files where a field (maybe the score field) can be used to provide an integer that corresponds to a rule ("mask" in the case of tRNA/rRNA/etc., "definitely include this over whatever ab initio program says" in the case of 100% correct mapped evidence, or an option to "weight this evidence somewhat higher but still use the algo to decide").
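A rough sketch of how that could look when reading a GFF file, purely as an illustration. The integer codes (0, 1, 2) and the names `EvidenceRule`, `Evidence`, and `parse_gff_evidence` are assumptions made up for this example; nothing here is a decided format.

```python
from dataclasses import dataclass
from enum import IntEnum


class EvidenceRule(IntEnum):
    # Hypothetical codes; the actual values have not been defined anywhere.
    MASK = 0      # e.g. tRNA/rRNA regions: do not call genes here
    FORCE = 1     # trusted mapped evidence: keep it over the ab initio call
    UPWEIGHT = 2  # weight the evidence higher, but let the algorithm decide


@dataclass
class Evidence:
    seqid: str
    start: int  # 1-based, inclusive (GFF convention)
    end: int    # 1-based, inclusive
    strand: str
    rule: EvidenceRule


def parse_gff_evidence(lines):
    """Read GFF lines, interpreting the score column (6th) as a rule code."""
    records = []
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines, directives, and comments
        cols = line.split("\t")
        if len(cols) < 9:
            raise ValueError(f"expected 9 tab-separated columns: {line!r}")
        seqid, _source, _type, start, end, score, strand = cols[:7]
        # EvidenceRule(...) raises ValueError for an unknown code.
        records.append(
            Evidence(seqid, int(start), int(end), strand, EvidenceRule(int(score)))
        )
    return records


example = [
    "##gff-version 3",
    "contig1\ttRNAscan\ttRNA\t100\t175\t0\t+\t.\tID=trna1",
    "contig1\tmapped\tCDS\t300\t899\t1\t-\t0\tID=cds1",
    "contig1\thint\tCDS\t1200\t1500\t2\t+\t0\tID=cds2",
]
evidence = parse_gff_evidence(example)
```

Using the score column this way keeps the file a valid GFF, at the cost of overloading a field that other tools read as a confidence value.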
What about other ncRNA and small RNAs? They can overlap CDS (antisense) or promoters etc.
Will the next version be able to train a model using actual existing known good gene coordinates, rather than guessing from a FASTA?
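As a toy illustration of what "training from known coordinates" could mean, the sketch below tallies start codon usage from a list of trusted gene coordinates. This is only one small piece of a gene model and is not how Prodigal trains; `start_codon_frequencies` and the tiny example sequence are made up for this example.

```python
from collections import Counter

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def revcomp(seq):
    """Reverse-complement an uppercase DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def start_codon_frequencies(sequence, genes):
    """Return start codon frequencies for known genes.

    genes: iterable of (start, end, strand) with 1-based inclusive coordinates.
    """
    counts = Counter()
    for start, end, strand in genes:
        cds = sequence[start - 1:end]
        if strand == "-":
            cds = revcomp(cds)  # read the gene in its own 5'->3' direction
        counts[cds[:3]] += 1
    total = sum(counts.values())
    return {codon: n / total for codon, n in counts.items()}


# Toy contig: one forward gene (ATG start) and one reverse gene (GTG start).
contig = "ATGAAATAA" + "CC" + "TTATTTCAC"
known_genes = [(1, 9, "+"), (12, 20, "-")]
freqs = start_codon_frequencies(contig, known_genes)
```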
The small gene prediction (potentially using user-contributed databases) would be excellent to annotate biologically important genes that are currently "overlooked" by Prodigal. This could be an optional flag and users could use their own DB. The annotation transfer option mentioned by @tseemann would also be quite useful, since generating consistent annotations is a major problem when analyzing large datasets.
I would be interested in using pan-genome annotation, and also in deriving training sets from multiple genomes and plasmid sequences.
An example of small proteins is the S. aureus phenol-soluble modulins: there are 4 of them, usually in tandem, and they are 22 aa long.
Adding to @tseemann's comment, signalling peptides in Gram positives are good examples of small proteins found in diverse species. An example of these is the competence-stimulating peptide of Streptococcus pneumoniae (comC gene).
A quick Swiss-Prot search reveals currently ~470 small (<30 aa) proteins with evidence at protein level: https://www.uniprot.org/uniprot/?query=reviewed%3Ayes+taxonomy%3A%22Bacteria+%289BACT%29+%5B2%5D%22+existence%3A%22evidence+at+protein+level+%5B1%5D%22+length%3A%5B*+TO+30%5D&sort=score
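If such a set were downloaded as FASTA, building a small-protein database could start with a simple length filter. A minimal sketch, assuming plain FASTA input; `read_fasta` and `short_proteins` are hypothetical helper names and the sequences below are toy data, not real proteins.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def short_proteins(records, max_len=30):
    """Keep records whose sequence is at most max_len residues long."""
    return [(h, s) for h, s in records if len(s) <= max_len]


toy_fasta = [
    ">toy_short",
    "MKT" * 7,        # 21 residues
    ">toy_long",
    "M" + "A" * 40,   # 41 residues
]
short = short_proteins(read_fasta(toy_fasta))
```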
Is this still in the works?
Bacterial gene prediction doesn't usually rely heavily on external evidence to generate gene models. JGI uses GenePRIMP as a post-processor, whereas NCBI PGAP searches every single ORF in the genome against protein databases.
What would people like to see as far as using similarity to known genes to correct genes (find missing ones, correct wrong strand calls, correct start sites)?
Protein evidence: would you like the ability to search a database for protein hints as part of the main program? Or as a post-processor?
Support for RNA-seq and differential RNA-seq: This is already planned (eventually), as a way to predict transcripts and related features (operons, terminators, promoters).
Pangenome evidence: Should you be able to give the new version one or more closely related genomes (already annotated) and have it use those annotations as a guide to call genes in the target?
Short gene database?: Should we incorporate a short gene database somehow?
RNA Genes: If already doing all of the above, does it make sense to just be able to call RNA genes using various methods?