Fast Step 3 Implementation

RocesV commented 11 months ago

Dear @josuebarrera @LotharukpongJS @HajkD,

First of all, sorry for the delay, and thank you so much for your patience! 😄

SUMMARY:

I have added a new argument, -F (${FAST_STEP3}), and a script, FASTSTEP3R, to address the long run time of Step 3 in GenEra. The trade-off is higher RAM usage (~210 GB for a 180 GB STEP1 Diamond output) and more tmp files (as many as there are query genes). I have set the default for -F to false so that users can first check whether they have enough resources before running with -F true.
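As a rough way to check resources before enabling -F true, here is a minimal sketch that scales the single data point above (~210 GB of RAM for a 180 GB STEP1 output) linearly to another output size. The helper names `estimate_ram_gb` and `fits_in_ram` are hypothetical and not part of GenEra, and linear scaling is an assumption, not something that has been measured.

```shell
# Hypothetical helper (not part of GenEra): estimate the RAM in GB that
# -F true might need for a STEP1 output of the given size in GB.
# ASSUMPTION: RAM scales linearly from the one reported data point
# (~210 GB of RAM for a 180 GB Diamond output).
estimate_ram_gb() {
    awk -v size="$1" 'BEGIN { printf "%.0f\n", size * 210 / 180 }'
}

# Hypothetical helper: succeeds if the estimate fits into the given RAM (GB).
fits_in_ram() {
    needed=$(estimate_ram_gb "$1")
    [ "$needed" -le "$2" ]
}

# Example: a 30 GB STEP1 output on a machine with 256 GB of RAM.
if fits_in_ram 30 256; then
    echo "-F true looks feasible (estimated ~$(estimate_ram_gb 30) GB)"
else
    echo "consider -F false"
fi
```

The estimate is only a first filter; actual peak memory may differ, so it is worth monitoring a real run as well.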

TOY EXAMPLE:

To assess the benefit of the new -F (${FAST_STEP3}) option, I used the best-case scenario for the current implementation (no -F, or -F false): a small STEP1 output, so that the grep command takes less time in each ${GENE} iteration. As the STEP1 output grows (e.g. with more query genes), the gap between the current implementation (-F false) and the -F (${FAST_STEP3}) implementation (-F true) will widen.
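To illustrate why the per-gene grep gets slower as the STEP1 output grows, here is a toy sketch contrasting one full scan per gene with a single pass that writes one tmp file per gene. It is not the FASTSTEP3R code: the file names, the tab-separated layout with the query gene in column 1, and the awk one-pass split are all assumptions made for illustration.

```shell
# Toy data (assumed layout: tab-separated, query gene in column 1).
workdir=$(mktemp -d)
printf 'geneA\thit1\ngeneB\thit2\ngeneA\thit3\ngeneC\thit4\n' > "$workdir/hits.tsv"
printf 'geneA\ngeneB\ngeneC\n' > "$workdir/genes.txt"
mkdir "$workdir/slow" "$workdir/fast"
tab=$(printf '\t')

# Repeated-scan pattern: the whole hits file is read once per gene,
# so the cost grows with (number of genes) x (size of the hits file).
while read -r GENE; do
    grep "^${GENE}${tab}" "$workdir/hits.tsv" > "$workdir/slow/${GENE}.tsv"
done < "$workdir/genes.txt"

# Single-pass pattern: the hits file is read once and each line is
# appended to the tmp file of its gene (one tmp file per query gene).
# close() keeps the number of simultaneously open files small.
awk -F '\t' -v dir="$workdir/fast" '{
    out = dir "/" $1 ".tsv"
    print >> out
    close(out)
}' "$workdir/hits.tsv"

# Both approaches should yield the same per-gene files.
diff -r "$workdir/slow" "$workdir/fast" && echo "per-gene files are identical"
```

With the repeated scan, doubling the number of query genes roughly doubles both the number of passes and the size of the file being scanned, which is consistent with the gap widening for larger inputs.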

RAM: 256 GB
CPUs: 50
INPUT: 10,000 sequences Arabidopsis (~ 30 GB Diamond STEP1 output)
STEP1 skipped by supplying final diamond output
STEP2 skipped by supplying arranged ncbi lineages
STEP4 not invoked (no -s file)

COMMANDS

Current implementation (-F false)

genEra -q $INPUT/Athaliana/Athaliana_seqs.fasta -t 3702 -p $INPUT/3702_Diamond_results.bout -c $INPUT/3702_ncbi_lineages.csv -n 46 -F false -o $OUTPUT/Athaliana_CASUAL_OUTPUT/ -x $TMP/Athaliana_CASUAL_TMP/

-F (${FAST_STEP3}) implementation (-F true)

genEra -q $INPUT/Athaliana/Athaliana_seqs.fasta -t 3702 -p $INPUT/3702_Diamond_results.bout -c $INPUT/3702_ncbi_lineages.csv -n 46 -F true -o $OUTPUT/Athaliana_FASTSTEP3R_OUTPUT/ -x $TMP/Athaliana_FASTSTEP3R_TMP/

RESULTS

Even for a small STEP1 output (~30 GB, 10,000 query genes), the -F (${FAST_STEP3}) implementation (-F true) is ~5.46 times faster than the current implementation (-F false).

[Figure: Efficiency]

P.S.: once you confirm that this could be useful, the number of lines used to split the STEP1 output can be tested further to improve RAM usage.

If you find any issues, let me know; I will be fully available! 👨‍💻

Cheers,

Víctor

josuebarrera commented 11 months ago

Thank you very much for implementing the fast mode of GenEra! I'm sure this will make all our users happy! I think it might be better to make the fast implementation (-F TRUE) the default, and simply let people know that genEra can be run with fewer resources by using -F FALSE, at the cost of longer running times. What do you think?

RocesV commented 11 months ago

I totally agree, Josue! 😄

josuebarrera commented 11 months ago

Perfect, I already made the changes! Thanks again for this great contribution!

josuebarrera / GenEra

Fast Step 3 Implementation #16