hyattpd / Prodigal

Prodigal Gene Prediction Software
GNU General Public License v3.0
432 stars 85 forks source link

Up to 2x faster prodigal #95

Open jaebeom-kim opened 2 years ago

jaebeom-kim commented 2 years ago

Hi, I'm Jaebeom Kim and developing a metagenomic classifier, Metabuli. I'm using prodigal to predict genes of thousands of prokaryotic genomes, so I'm working on making prodigal faster.

Here are some modifications. Test environment: MacBook Pro, 1.4 GHz quad-core Intel Core i5, 16GB 2133 MHz LPDDR3 1) Rapid mode. In the function named 'dprog', I found a suspicious part. I think lines 52 and 53 are making the program slow. When i==999 and MAX_NODE_DIST==500, 'min' becomes 0 in line line 54, which leads to calling 'score_connection' 999 times. I'm not sure it is intended or not. So I just make an 'Rapid mode' to jump the lines. When I tested with E.coli genome (GCF_000008865.2), Rapid mode decreased the running time (training + prediction) from ~9.9 sec to ~6.5 sec, while producing the same results.

2) MAX_NODE_DIST While reviewing the source code, I found that reducing MAX_NODE_DIST in dprog.h can decrease running time. So, I decreased it from 500 to 300 and tested it using E.coli genome. The running time (training + predicting) decreased from ~9.9 sec to ~7.2 sec, while producing the same result. But still, it may lead to prediction of lower quality in other cases, so I just made it as an option for users who want to get results faster.

When rapid mode is used with MAX_NODE_DIST 300, the same predictions were produced in ~5.5 seconds, which is about 2X acceleration.

I tried to follow the code style :) If you like the changes, please accept this PR and update the conda package as well.

hyattpd commented 1 year ago

I need to think about this one. Dynamic programming by default just considers every node (n compared to remaining n-1). This would be extremely slow, so a fix was put in to only look back 500 nodes. This changes the running speed to nx500. Unfortunately, you can get some incredibly long genes so the code needed a fix to handle that edge case and allow a connection bigger than that if it was connecting a start and a stop node in the same gene. Changing that 500 number will of course give linear speedup (nx500 to nx250 will halve the run time of this function), but if the number becomes too low, legitimate connections between nodes could be disallowed. It would take some testing to determine what's a reasonable value (it also varies by GC content, since you can get more/less nodes per 1000bp).