hyattpd / Prodigal

Prodigal Gene Prediction Software
GNU General Public License v3.0

New version discussion: feature requests #64

Open hyattpd opened 4 years ago

hyattpd commented 4 years ago

Any new feature requests can go here.

mhuntemann commented 4 years ago

Dealing with gaps in scaffolds. Have a predefined or user-specified minimum gap length, and treat all stretches of Ns or Xs longer than this threshold as gaps, meaning that genes longer than the minimum gene length can start or end in a gap.
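To illustrate the request, here is a minimal Python sketch of the gap-detection step. It is not Prodigal's implementation; the function name `find_gaps` and the default threshold of 10 are assumptions made for this example. It reports every run of N or X characters at least `min_gap_length` long as a 0-based, half-open interval.

```python
import re


def find_gaps(sequence, min_gap_length=10):
    """Return (start, end) intervals of N/X runs of at least min_gap_length.

    Intervals are 0-based and half-open. Both the function name and the
    default threshold are hypothetical, chosen only for illustration.
    """
    if min_gap_length < 1:
        raise ValueError("min_gap_length must be at least 1")
    # Match runs of N or X (either case) of the required minimum length.
    pattern = re.compile(r"[NnXx]{%d,}" % min_gap_length)
    return [(m.start(), m.end()) for m in pattern.finditer(sequence)]


if __name__ == "__main__":
    scaffold = "ACGT" + "N" * 12 + "ACG" + "NNN" + "TT"
    # Only the 12-base run meets the threshold; the 3-base run is ignored.
    print(find_gaps(scaffold, min_gap_length=10))
```

A gene caller could then allow partial genes to begin or end at the boundaries of these intervals, as the request describes.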

hyattpd commented 4 years ago

This is already implemented in the development version for 3.0, so no worries.

tseemann commented 4 years ago

  1. Annotation of RBS or promoter regions (if they are part of the model)
  2. Multithreading --threads support; important for large metagenomes
  3. Frame-shift awareness! Very important with Nanopore homopolymer errors.
  4. Pseudo-gene / partial gene support (already sort of exists)
  5. tRNA/rRNA awareness; maybe a mask file, or support lowercase masked FASTA? Often small ORFs and tRNAs overlap; one of them is a false positive?
  6. Expectation of many partial genes in metagenome assemblies

mhuntemann commented 4 years ago

+1 on the --threads

hyattpd commented 4 years ago

Yeah, will definitely do multithreading. Go makes that easy.

Masking tRNAs/rRNAs will also be in. I've toyed with the idea of predicting them myself, but feels a bit like reinventing the wheel and not a good use of time.

All the above is doable. #3 would require a bit of thought. It's really difficult the way my graph of starts/stops is organized, but I have some ideas for how to efficiently do this.

oschwengers commented 4 years ago

+1 on:

In addition (and I know this is tricky), detection of small/er ORFs if somehow possible. A preliminary check on the NCBI non-redundant proteins revealed ~32 600 proteins shorter than 30 aa, with the smallest down to only 5 aa. This includes important proteins, e.g. sporulation genes (https://www.ncbi.nlm.nih.gov/proteinclusters/5848899). I don't know how many of these are proper proteins and how much mispredicted junk is buried therein. But recalling more of them would certainly be beneficial.

hyattpd commented 4 years ago

Really hard to find short genes without a lot of false positives. I may make a "sensitive" flag if the user really wants to see all this.

tseemann commented 4 years ago

I think a database of known short genes would make sense? Whether this should go in prodigal or prokka/PGAP is unclear! We have a 6 aa delta toxin in S. aureus which I've manually added in the past with tblastn.

@oschwengers do you know if PGAP (NCBI) handles the small proteins? It's on the list for prokka 2.x

tseemann commented 4 years ago

Another feature:

oschwengers commented 4 years ago

@tseemann no, not for sure. All I know is that PGAP follows an approach combining ab initio "ORF region" prediction and homology searches (ORFFinder+blastp/hmmer) and in addition a tblastn/prosplign based detection of pseudogenes (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5753331/figure/F1/).

So by using ORFFinder they are able to fetch those small genes, which in turn is computationally rather expensive, if I recall correctly.

hyattpd commented 4 years ago

I made a separate issue for discussion of using evidence. I do feel it's important to have this capability to fix mistakes made by ab initio gene prediction.

richelbilderbeek commented 4 years ago

I suggest adding continuous integration to the repo as a whole. This signals the high quality of the project, making some people (e.g. me) more inclined to build their research upon Prodigal. See #77

Russel88 commented 4 years ago

Feature request:

igortru commented 3 years ago

Possibility to use a custom genetic code, for example to read through some stop codons. See https://www.biorxiv.org/content/10.1101/2020.07.20.212944v1

Russel88 commented 3 years ago

MdUmar-tech commented 3 years ago

Open Reading Frames (ORFs) were predicted using Prodigal [21] with default parameters, but the predicted ORFs were excluded if they spanned a sequencing gap region. Please help me: how do I exclude predicted ORFs that span a sequencing gap region?
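One possible approach, sketched here in Python as a post-processing step rather than as a Prodigal feature: locate runs of N in each sequence, then drop any predicted gene whose coordinates overlap such a run. The sketch assumes the predictions are available as GFF-style tab-separated lines (sequence ID in column 1, 1-based inclusive start and end in columns 4 and 5). The helper names `find_gaps`, `overlaps_gap`, and `filter_gff_lines`, and the gap threshold, are hypothetical and chosen for this example.

```python
import re


def find_gaps(sequence, min_gap_length=1):
    """Return 0-based, half-open (start, end) intervals of N runs."""
    pattern = re.compile(r"[Nn]{%d,}" % min_gap_length)
    return [(m.start(), m.end()) for m in pattern.finditer(sequence)]


def overlaps_gap(start, end, gaps):
    """True if a 1-based inclusive feature [start, end] touches any gap."""
    # Convert the feature to 0-based half-open: [start - 1, end).
    return any(start - 1 < g_end and end > g_start for g_start, g_end in gaps)


def filter_gff_lines(gff_lines, gaps_by_seq):
    """Keep comment lines and features that do not overlap a gap.

    gaps_by_seq maps a sequence ID to its list of gap intervals.
    """
    kept = []
    for line in gff_lines:
        if line.startswith("#") or not line.strip():
            kept.append(line)  # headers and blank lines pass through
            continue
        fields = line.rstrip("\n").split("\t")
        seqid, start, end = fields[0], int(fields[3]), int(fields[4])
        if not overlaps_gap(start, end, gaps_by_seq.get(seqid, [])):
            kept.append(line)
    return kept


if __name__ == "__main__":
    seq = "ATG" + "A" * 10 + "N" * 5 + "C" * 10
    gaps = {"contig1": find_gaps(seq, min_gap_length=5)}
    lines = [
        "##gff-version 3",
        "contig1\texample\tCDS\t1\t12\t.\t+\t0\tID=1",
        "contig1\texample\tCDS\t10\t20\t.\t+\t0\tID=2",
    ]
    for kept_line in filter_gff_lines(lines, gaps):
        print(kept_line)
```

Here "spanning a gap" is interpreted as any overlap between the gene and a gap interval; a stricter reading (the gene must fully contain the gap) would need a different comparison in `overlaps_gap`.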