Open hyattpd opened 4 years ago
Dealing with gaps in scaffolds: have a predefined or user-specified minimum gap length, and treat all stretches of Ns or Xs longer than this threshold as gaps. This means that genes longer than the minimum gene length can start or end in a gap.
This is already implemented in the development version for 3.0, so no worries.
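For anyone who wants to see what the gap rule above amounts to, here is a minimal sketch in Python. It is not Prodigal's implementation; the function name `find_gaps`, the parameter `min_gap_len` and its default of 10 are all assumptions made for illustration.

```python
import re

def find_gaps(seq, min_gap_len=10):
    """Return 0-based, half-open (start, end) intervals for every run of
    N/X characters that is strictly longer than min_gap_len.

    min_gap_len and its default are hypothetical, not a Prodigal setting.
    """
    # {k,} matches runs of at least k characters, so use min_gap_len + 1
    # to get "longer than the threshold".
    pattern = re.compile(r"[NXnx]{%d,}" % (min_gap_len + 1))
    return [(m.start(), m.end()) for m in pattern.finditer(seq)]

# Toy scaffold: a 12-base gap and a 3-base run of Ns.
scaffold = "ATGAAACCC" + "N" * 12 + "GGGTTTTAA" + "NNN" + "ATG"
print(find_gaps(scaffold, min_gap_len=10))  # [(9, 21)]
```

With a threshold of 10 only the 12-base run counts as a gap; the 3-base run is treated as ordinary ambiguous sequence.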
- RBS or promoter regions (if they are part of the model)
- --threads support; important for large metagenomes
- tRNA/rRNA awareness; maybe a mask file, or support lowercase masked FASTA? Often small ORFs and tRNAs overlap; one of them is a false positive?

+1 on the --threads
Yeah, will definitely do multithreading. Go makes that easy.
Masking tRNAs/rRNAs will also be in. I've toyed with the idea of predicting them myself, but feels a bit like reinventing the wheel and not a good use of time.
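To make the "lowercase masked FASTA" idea concrete, here is a rough Python sketch of how soft-masked regions could be read back out of a sequence and checked against a candidate gene. This only illustrates the convention and says nothing about how Prodigal will handle masks; `masked_intervals` and `overlaps_mask` are hypothetical helper names.

```python
import re

def masked_intervals(seq):
    """Return 0-based, half-open (start, end) intervals of lowercase
    (soft-masked) runs in a sequence string."""
    return [(m.start(), m.end()) for m in re.finditer(r"[a-z]+", seq)]

def overlaps_mask(start, end, intervals):
    """True if the half-open interval [start, end) overlaps any masked interval."""
    return any(start < m_end and m_start < end for m_start, m_end in intervals)

# Toy example: the lowercase stretch stands in for a masked tRNA.
seq = "ATGCCC" + "ggcttagctcag" + "TTTAAA"
mask = masked_intervals(seq)
print(mask)                        # [(6, 18)]
print(overlaps_mask(0, 6, mask))   # False
print(overlaps_mask(3, 9, mask))   # True
```

A separate mask file would carry the same intervals explicitly, for example one line per region with sequence ID, start and end.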
All the above is doable. #3 would require a bit of thought. It's really difficult the way my graph of starts/stops is organized, but I have some ideas for how to efficiently do this.
+1 on: RBS and promoter (if possible)
In addition (and I know this is tricky), detection of small/er ORFs, if somehow possible. A preliminary check on the NCBI non-redundant proteins revealed ~32 600 proteins shorter than 30 aa, with the smallest down to only 5 aa. This includes important proteins, e.g. sporulation genes (https://www.ncbi.nlm.nih.gov/proteinclusters/5848899). I don't know how many of these are proper proteins and how much mispredicted junk is buried therein. But recalling more of them would certainly be beneficial.
Really hard to find short genes without a lot of false positives. I may make a "sensitive" flag if the user really wants to see all this.
I think a database of known short genes would make sense? Whether this should go in prodigal or prokka/PGAP is unclear!
We have a 6aa delta toxin in S.aureus which I've manually added in the past with tblastn
@oschwengers do you know if PGAP (NCBI) handles the small proteins?
It's on the list for prokka 2.x
Another feature: /translation_table used (in meta mode)
@tseemann no, not for sure. All I know is that PGAP follows an approach combining ab initio "ORF region" prediction and homology searches (ORFFinder + blastp/hmmer) and, in addition, a tblastn/prosplign-based detection of pseudogenes (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5753331/figure/F1/).
So by using ORFFinder they are able to fetch those small genes, which in turn is computationally rather expensive, if I recall correctly.
I made a separate issue for discussion of using evidence. I do feel it's important to have this capability to fix mistakes made by ab initio gene prediction.
I suggest adding continuous integration to the repo as a whole. This signals the high quality of the project, making some people (e.g. me) more inclined to build their research upon Prodigal. See #77
Feature request: the possibility to use a custom genetic code, for example to read through some stop codons. See https://www.biorxiv.org/content/10.1101/2020.07.20.212944v1
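As an illustration of what a custom genetic code means in practice, here is a small Python sketch that builds the standard codon table and lets the caller reassign individual codons. It is not tied to Prodigal in any way; `make_table`, `translate` and the `reassign` argument are hypothetical names, and the TGA-to-tryptophan reassignment is just one example of a stop codon being read through.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
STANDARD_AA = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

def make_table(reassign=None):
    """Build a codon -> amino acid dict, optionally overriding some codons."""
    codons = ("".join(c) for c in product(BASES, repeat=3))
    table = dict(zip(codons, STANDARD_AA))
    table.update(reassign or {})
    return table

def translate(seq, table):
    """Translate in frame 0 until the first stop codon ('*').
    Codons containing ambiguous bases become 'X'."""
    seq = seq.upper()
    protein = []
    for i in range(0, len(seq) - 2, 3):
        aa = table.get(seq[i:i + 3], "X")
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)

seq = "ATGAAATGAGGGTAA"
standard = make_table()
custom = make_table({"TGA": "W"})  # treat TGA as tryptophan, not stop
print(translate(seq, standard))  # MK
print(translate(seq, custom))    # MKWG
```

The same ORF is four residues long under the custom code but only two under the standard one, which is why the choice of stop codons changes gene calls.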
"Open Reading Frames (ORFs) were predicted using Prodigal [21] with default parameters but the predicted ORFs were excluded if they spanned a sequencing gap region."
Could you please help me with how to do this, i.e. exclude predicted ORFs if they span a sequencing gap region?
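One possible way to do this as a post-processing step, sketched in Python: take the assembly FASTA plus gene predictions in GFF format, find the runs of Ns in each sequence, and drop every feature whose coordinates overlap such a run. This is only a sketch and not necessarily what the quoted paper did; the function names, the `min_gap_len` parameter and the toy input are made up for illustration.

```python
import re

def read_fasta(lines):
    """Parse FASTA lines into {sequence_id: sequence}."""
    seqs, name = {}, None
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            name = line[1:].split()[0]  # ID is the first word of the header
            seqs[name] = []
        elif name is not None and line:
            seqs[name].append(line)
    return {k: "".join(v) for k, v in seqs.items()}

def gap_intervals(seq, min_gap_len=1):
    """1-based inclusive (start, end) of N/X runs of at least min_gap_len."""
    pattern = re.compile(r"[NXnx]{%d,}" % min_gap_len)
    return [(m.start() + 1, m.end()) for m in pattern.finditer(seq)]

def drop_gap_spanning(gff_lines, seqs, min_gap_len=1):
    """Keep comment lines and features that do not overlap any gap."""
    gaps = {name: gap_intervals(s, min_gap_len) for name, s in seqs.items()}
    kept = []
    for line in gff_lines:
        fields = line.rstrip("\n").split("\t")
        if line.startswith("#") or len(fields) < 9:
            kept.append(line)
            continue
        seqid, start, end = fields[0], int(fields[3]), int(fields[4])
        if any(start <= g_end and g_start <= end
               for g_start, g_end in gaps.get(seqid, [])):
            continue  # feature overlaps a gap, so exclude it
        kept.append(line)
    return kept

# Toy input: the gap covers positions 10-14 of scaf1.
fasta = [">scaf1 example", "ATGAAACCCNNNNNGGGTTTTAA"]
gff = [
    "##gff-version 3",
    "scaf1\tProdigal\tCDS\t1\t9\t.\t+\t0\tID=1_1",
    "scaf1\tProdigal\tCDS\t7\t23\t.\t+\t0\tID=1_2",
]
kept = drop_gap_spanning(gff, read_fasta(fasta))
print("\n".join(kept))
```

For real data, pass open file handles for the FASTA and the GFF file instead of the toy lists. Raising `min_gap_len` keeps genes that only contain a few isolated ambiguous bases.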
Any new feature requests can go here.