josuebarrera / GenEra

genEra is a fast and easy-to-use command-line tool that estimates the age of the last common ancestor of protein-coding gene families.
GNU General Public License v3.0

split results #19

Open Proginski opened 1 year ago

Proginski commented 1 year ago

Hi,

Is your feature request related to a problem? Please describe. The latest release of the human genome, with its ~145k CDS, produces ~630 GB of results, and since the help of v1.4.0 says that around 200 GB of RAM are needed for 180 GB of results, scaling linearly (630 × 200/180) one would need ~700 GB of RAM to complete the analysis with the -F option.

Describe the solution you'd like Once step 1 (+/- 2) is completed, would it be possible to manually split the input FASTA and the Diamond results, so that each chunk can be processed separately with better performance? (I'm not saying it won't also require a lot of RAM ;) )

Describe alternatives you've considered I just tried something like

faSplit sequence cds_from_genomic.faa 10 cds_from_genomic
grep ">" cds_from_genomic04.fa | sed -E "s/>(.*)/^\1\t/" > cds_from_genomic04.txt
grep -f cds_from_genomic04.txt tmp_9606_18134/9606_Diamond_results.bout > cds_from_genomic04.bout # This step of course is "expensive"
genEra \
-t 9606 \
-q cds_from_genomic04.fa \
-n 40 \
-p cds_from_genomic04.bout \
-c 9606_ncbi_lineages.csv \
-r ncbi_lineages_2023-07-12.csv

The chunk has 87 CDS and, of course, it went turbo-fast. The ages assigned to the CDS were the same as when the entire original FASTA was used. So would it be possible to work this way, and could it be of any interest?
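For reference, generalizing this manual workaround to every chunk could look roughly like the following sketch (it reuses the commands and file names from above and assumes the chunk files produced by faSplit match cds_from_genomic*.fa):

for FA in cds_from_genomic*.fa; do
    BASE=${FA%.fa}
    # Build a per-chunk pattern file from the FASTA headers and pull out the matching Diamond hits.
    grep ">" "$FA" | sed -E "s/>(.*)/^\1\t/" > "${BASE}.txt"
    grep -f "${BASE}.txt" tmp_9606_18134/9606_Diamond_results.bout > "${BASE}.bout"
    # Run genEra on the chunk with its pre-extracted Diamond results.
    genEra -t 9606 -q "$FA" -n 40 -p "${BASE}.bout" \
        -c 9606_ncbi_lineages.csv -r ncbi_lineages_2023-07-12.csv
done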

Paul

Proginski commented 1 year ago

OK, so I just dug into the code for the first time, and as far as I understand, the program generates a subset of the big '${TAXID}_Diamond_results.bout' for each query. FASTSTEP3R does precisely that job faster, but requires a lot of RAM.

I am thinking of a way to do it with 2 GB of RAM: awk -F"\t" '{ print $0 >> "tmp_"$1".bout" }' ${TAXID}_Diamond_results.bout
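Spelled out with comments (the only assumption here is that column 1 of the tabular Diamond output is the query ID):

# Stream the results line by line and append each hit to a per-query file keyed
# on column 1 (the query ID). Since awk never holds more than one line in memory,
# the RAM footprint stays tiny no matter how large the .bout file is.
awk -F"\t" '{ print $0 >> ("tmp_" $1 ".bout") }' ${TAXID}_Diamond_results.bout

(One caveat worth testing: with ~145k distinct queries this opens a very large number of output files; gawk usually copes by multiplexing file descriptors, but other awk implementations may need explicit close() calls.)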

Would that be compatible with the general functioning of genEra? It might take a while, but certainly not as long as the multiple 'grep' commands from the former versions.

josuebarrera commented 1 year ago

Dear Paul,

Thanks again for your useful comments! The first enhancement you propose seems quite interesting, but I suspect it would not be compatible when applying additional options to the pipeline (e.g., adding proteins for organisms that are absent from the NR database). I may be wrong, though! The second solution seems very straightforward and easy to implement. Give me a few days so I can give both of the enhancements a try.

Best, Josué.

Proginski commented 1 year ago

About the first way: I thought the additional genomes were blasted separately and then merged into the big results '.bout'. In that case, I suppose you could just concatenate the two files before the parsing.
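For instance, something along these lines (the name of the additional results file is hypothetical):

# Concatenate the extra proteomes' Diamond results with the main ones, then split per query as before.
cat ${TAXID}_Diamond_results.bout additional_proteomes.bout > ${TAXID}_Diamond_results_merged.bout
awk -F"\t" '{ print $0 >> ("tmp_" $1 ".bout") }' ${TAXID}_Diamond_results_merged.bout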

Anyway, if the second suggestion works, the first one is not necessary, but for now I can just try the first one without having to change the code. (Still in progress, but it took 1h36min to parse a 262 GB ${TAXID}_Diamond_results.bout, and it should be faster or nearly the same with the second suggestion.)

Also, if the latter suggestion is fine, we could try to deal with parallel writing conflicts and use multiple CPUs to speed it up.

RocesV commented 1 year ago

Dear @josuebarrera,

the proposal of @Proginski with awk -F"\t" '{ print $0 >> "tmp_"$1".bout" }' ${TAXID}_Diamond_results.bout is very interesting.

Combining the split-first, per-gene-results approach of FASTSTEP3R with this command should avoid the large RAM usage incurred by data.table when importing ${TAXID}_Diamond_results.bout into R.

So far I have been doing some tests (with ~63 GB and ~170 GB ${TAXID}_Diamond_results.bout files), and FASTSTEP3R is still faster (vs. the single-fashion awk command; the parallel fashion was not checked), but the runtime difference between FASTSTEP3R and the awk command is in the range of a few hours (the maximum difference did not reach 24 hours for ${TAXID}_Diamond_results.bout files of these sizes).

I think that this difference in runtime, at least for outputs of these sizes, would be acceptable for most users and would therefore make the pipeline widely available to users without that level of resources.

Additionally, I have compared the runtimes of the parallel-fashion (Erassignment-like) awk command, using a split ${TAXID}_Diamond_results.bout as in FASTSTEP3R, against the single-fashion awk command.

Single-fashion awk command

awk -F"\t" '{ print $0 >> "tmp_"$1".bout" }' ${TAXID}_Diamond_results.bout

Parallel-fashion awk command (Erassignment-like) | ${NTHREADS} = 10

split -n l/${NTHREADS} -d ${TAXID}_Diamond_results.bout ${TMP_PATH}/Diamond_F3R_ # Split $DIAMONDOUT into N chunks (one per thread) without breaking lines
for SUBOUT in ${TMP_PATH}/Diamond_F3R_*; do
awk -F"\t" '{ print $0 >> "tmp_"$1".bout" }' ${SUBOUT} & # parallel Erassignment-like awk command for each split of $DIAMONDOUT
done
wait

I have checked that both approaches seem to produce tmp_${GENE}.bout files containing the same results, so parallel writing conflicts may not be a big issue. The parallel-fashion awk command (Erassignment-like) is ~1.4 times faster than the single-fashion awk command for a ~63 GB ${TAXID}_Diamond_results.bout.
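One quick way to make that comparison (a sketch only; it assumes the two runs wrote their per-gene files into separate directories, here called single/ and parallel/):

# Parallel appends can change the line order inside a file, so sort before diffing.
for F in single/tmp_*.bout; do
    G="parallel/$(basename "$F")"
    diff <(sort "$F") <(sort "$G") > /dev/null || echo "MISMATCH: $(basename "$F")"
done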

Of course, these are very raw tests with plenty of room for improvement, and everything needs to be further checked, but it seems very promising! If everything works fine, the -F argument could be turned off by default, given the huge amount of resources it uses, and the default run could use the parallel-fashion awk commands. If you finally decide to implement it and need some help, I will be completely available to code, so let me know! 👨‍💻

Cheers,

Víctor

Proginski commented 1 year ago

Hi,

In order to deal with potential writing conflicts, we could process ${TMP_PATH}/tmp_${NCBITAX}_self_matches.bout apart from the main ${TMP_PATH}/${NCBITAX}_Diamond_prefiltered_results.bout (the same also holds for custom proteomes with '-a'). Then we know that each query is grouped and appears in one continuous block of lines. That way, we only need to handle the cases where the split occurs in the middle of a "query block" (which is probably the most frequent case). To do so, each chunk should "ignore" the first query it encounters, or more precisely, put it aside: the following queries cannot appear in another chunk except as that chunk's own first query. At the end, we just deal with the first queries, which represent a very small number of lines.

With results=${TMP_PATH}/${NCBITAX}_Diamond_prefiltered_results.bout and n=${NTHREADS}:


mkdir tmps/ firsts/

time split -n l/${n} $results ${results}_chunk_ # Split the results file into as many chunks as the number of CPUs.

test_n_send() {
    awk -F"\t" '

    # Strip the directory from FILENAME so the per-chunk "first query" file lands inside firsts/
    FNR==1 { first_query = $1; fname = FILENAME; sub(/.*\//, "", fname) }

    {
            if ($1 == first_query) {
                    print $0 >> "firsts/first_" fname
            } else  {
                    print $0 >> "tmps/tmp_" $1 ".bout"
            }
    }

    ' $1 ;
}
export -f test_n_send

time parallel -j $n test_n_send {} ::: ${results}_chunk_* # In parallel, for each chunk send each query to an individual tmp file

cat firsts/* | awk -F"\t" '{ print $0 >> "tmps/tmp_"$1".bout" }'
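A cheap sanity check after the merge (hypothetical, simply comparing line counts) could be:

# No hit should be lost: the per-query files must add up to the original results file.
orig=$(wc -l < "$results")
split_total=$(find tmps/ -name 'tmp_*.bout' -exec cat {} + | wc -l)
echo "original: $orig ; split: $split_total"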


Now I need to have a look at how to integrate that into the current program. I haven't done big tests so far, but on small ~10 GB results files with 10 CPUs, it processes 1 GB in around 15-20 seconds.

If it is easy for you to integrate that code and it works as expected, just let me know; otherwise, I'll try it myself ;) (help! It does not seem easy)

Paul

josuebarrera commented 1 year ago

Dear @Proginski, I apologize for the late reply. I've been dealing with a lot of paperwork and applications these last few weeks. Unfortunately, I haven't been able to implement your suggestions into the code. I would encourage you to give it a go and let me know if the implementation worked. I'm sure that @RocesV could give you a helping hand to make a pull request. Otherwise, you'll have to wait until November when things (hopefully) slow down for me. All the best, Josué.

Proginski commented 1 year ago

Just forget about my last proposal: because of the many possible option combinations, especially those where the user skips some steps, I'm going with a big split, parallel processing, and then individual joins. Way simpler and safer to plug into the current code ;)
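Purely as an illustration of that split / parallel-process / join flow (reusing $results and $n from the previous comment; the directory names are hypothetical and this is not genEra's actual code):

# Split the big results file into one chunk per CPU.
split -n l/${n} "$results" "${results}_chunk_"

# Process each chunk in parallel; every worker writes only into its own directory,
# so there are no write conflicts between processes.
for CHUNK in ${results}_chunk_*; do
    C=$(basename "$CHUNK")
    mkdir -p "per_chunk/$C"
    awk -F"\t" -v d="per_chunk/$C" '{ print $0 >> (d "/tmp_" $1 ".bout") }' "$CHUNK" &
done
wait

# Join: concatenate each query's pieces; the glob expands in chunk order, so the
# original within-query order of hits is preserved (2>/dev/null silences chunks
# that do not contain that query).
mkdir -p tmps/
find per_chunk/ -name 'tmp_*.bout' -printf '%f\n' | sort -u | while read -r F; do
    cat per_chunk/*/"$F" 2>/dev/null > "tmps/$F"
done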