muellan / metacache

memory efficient, fast & precise taxonomic classification system for metagenomic read mapping
GNU General Public License v3.0

Merging results from querying partitioned database #33

Closed jaimeortiz-david closed 2 years ago

jaimeortiz-david commented 2 years ago

Dear Andre,

I am working with a partitioned database (44 chunks in total, each 16 GB). Querying the database works well (I believe) and is reasonably fast. However, once I have the 44 txt files and merge them to get abundance per species, the computation takes a very long time (it has now been running for three days). I am unsure whether I am doing something wrong or whether this is normal behavior (each txt file is around 100-300 GB, and the paired-end fastq files contain 100 species with 10,000 reads each, 1 million reads in total). Below is an example of the line I used to merge the text files:

metacache merge results_mix05 -taxonomy ncbi_taxonomy -lowest species -abundances 05_abund_extraction.txt -abundance-per species

Also, I am using the 64-bit version of metacache to generate k-mers of size 32.

Do you think that partitioning the database into even smaller chunks would make merging the final results faster?

I would appreciate any guidance you could provide.

Best wishes,

Jaime

Funatiq commented 2 years ago

Hi! The merging should scale linearly with the number of files. I would recommend using fewer database parts if possible.

I am a little surprised that the txt files are several hundred GB each for only 1 million reads. Which options did you choose for querying?

Lastly, did you check the memory consumption while merging? If you run out of memory and the system starts swapping, this can lead to extremely bad runtimes.
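One way to check, if the job's accounting data is not enough (this uses GNU time's -v report, which is independent of MetaCache; the merge options below are just copied from your command):

/usr/bin/time -v metacache merge results_mix05 -taxonomy ncbi_taxonomy -lowest species -abundances 05_abund_extraction.txt -abundance-per species 2> merge_resources.log
# GNU time -v prints the peak memory as "Maximum resident set size (kbytes)"
grep "Maximum resident set size" merge_resources.log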

jaimeortiz-david commented 2 years ago

Hi Robin,

Thank you very much for your response.

I am attaching a copy of my script. I am working on a cluster using SLURM. As you can see, I used job arrays (one per sub-database) to accelerate the query process.
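In outline, the array part looks roughly like this (a simplified sketch for illustration only; the database names, read paths and the -pairfiles/-out query options are placeholders rather than the exact ones in the attached script):

#!/bin/bash
#SBATCH --array=1-44                  # one array task per sub-database
i=${SLURM_ARRAY_TASK_ID}
# each task queries its own database partition and writes a separate result file
metacache query fish_db_part_${i} fish_db_R1.fastq fish_db_R2.fastq -pairfiles -out res_mix05_${i}.txt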

The memory consumption seems to be OK; the merging job has not reached the maximum memory requested for this process (300 GB).

I hope my script gives you additional information to help understand the text file sizes and the time it takes to merge the files to calculate relative abundances.

Kind regards,

Jaime


Funatiq commented 2 years ago

I don't think attachments work with email replies. Could you upload the script on GitHub or use something like pastebin?

jaimeortiz-david commented 2 years ago

Hi Robin,

I do apologize for that. I am attaching the script here: slurm_merge_10Mix_mod copy.txt

Funatiq commented 2 years ago

If I understand correctly, you first copy multiple TB of result files to the scratch directory. Is this really necessary? Can't you access these files directly with MetaCache? Have you timed how long this copy takes? I think you can put "srun" in front of your commands (e.g. srun $c1) to get separate timings in sacct.
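For example (assuming your script stores each command in a variable like $c1; the sacct columns are standard SLURM accounting fields):

srun $c1        # the copy step, if that is what $c1 holds
srun $c2        # the metacache call, if it is stored in a second variable
# each srun becomes its own job step, so runtimes and peak memory appear separately:
sacct -j <jobid> --format=JobID,JobName,Elapsed,MaxRSS,State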

I am still worried about how your result files got so large. Can you maybe upload the first few lines (head) of a result file?

jaimeortiz-david commented 2 years ago

Hi,

Here are the first lines of one of the result files:

Reporting per-read mappings (non-mapping lines start with '# ').

Only the lowest matching rank will be reported.

Classification will be constrained to ranks from 'species' to 'domain'.

Classification hit threshold is 5 per query

At maximum 2 classification candidates will be considered per query.

File based paired-end mode:

Reads from two consecutive files will be interleaved.

Max insert size considered 0.

Using 128 threads

TABLE_LAYOUT: query_id | query_header | top_hits | rank:taxname

/workdir/jdo53/Mix_05sp/fish_db_R1.fastq + /workdir/jdo53/Mix_05sp/fish_db_R2.fastq

11265 | CM001427.1-28530/1 | 473354:6,280673:4 | subclass:Neopterygii
11266 | CM001430.1-32512/1 | 7897:2,371672:2 | --
11267 | CM001410.1-281764/1 | 181435:4,240162:4 | --
11268 | CM001409.1-376322/1 | 7897:4,371672:4 | --
11269 | CM001411.1-6520/1 | 161453:5,7897:4 | subphylum:Craniata
11270 | CM001408.1-243296/1 | 179366:4,56726:4 | --
11271 | CM001418.1-68805/1 | 56726:4,240162:4 | --
11272 | CM001416.1-115266/1 | 280673:4,72055:4 | --
11273 | CM001406.1-281222/1 | 47706:4,7897:2 | --

Funatiq commented 2 years ago

OK, this looks fine. But it should not add up to 100 GB for 1 million reads; it should be on the order of 100 MB per file.

Nevertheless, I have no idea what's going wrong with the merging. Does the merge call produce any output? It should show a progress bar to indicate how many files have been merged. Does it write anything to the output files? Have you tried merging just 2 files at once?
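For example, a minimal two-file test could look like this (the result file names are guesses based on your paths; copying them into a fresh directory sidesteps the question of whether merge accepts individual files):

mkdir merge_test
cp results_mix05/res_mix05_1.txt results_mix05/res_mix05_2.txt merge_test/
metacache merge merge_test -taxonomy ncbi_taxonomy -lowest species -abundances test_abund.txt -abundance-per species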

jaimeortiz-david commented 2 years ago

Hi,

The merging process stopped because it was taking too long to finish. I think there is an issue somewhere and the program gets stuck in an endless loop. The error file is a massive 450 GB text file; here are the first 50 lines and the last 50 lines, which are identical! It does not even get past 10% of the query.

Applied taxonomy to database.
Merging result files.
max canddidates: 2
number of files: 37
[=======> ] 10%
Query 1: Could not read taxid.
Query 1: taxid 0 not found. Skipping hit.
Query 1: Could not read taxid.
Query 1: taxid 0 not found. Skipping hit.
Query 1: Could not read taxid.
Query 1: taxid 0 not found. Skipping hit.
Query 1: Could not read taxid.
Query 1: taxid 0 not found. Skipping hit.
[... the same two lines repeat for the remainder of the excerpt]

-bash-4.2$ tail -n 50 slurm_10Mix_merge_4431989_4294967294.err
Query 1: taxid 0 not found. Skipping hit.
Query 1: Could not read taxid.
Query 1: taxid 0 not found. Skipping hit.
Query 1: Could not read taxid.
Query 1: taxid 0 not found. Skipping hit.
Query 1: Could not read taxid.
Query 1: taxid 0 not found. Skipping hit.
[... identical pairs of lines continue to the end of the file]

Funatiq commented 2 years ago

Well, it seems like one of your result files is corrupt. Can you please check if all queries finished properly? I will write a fix for the infinite loop. It should be possible to skip corrupt results.

Funatiq commented 2 years ago

The infinite loop is fixed in v2.2.3.

jaimeortiz-david commented 2 years ago

Hi,

Thank you very much for fixing this. Additionally, I have found out why the files are corrupt: some file names make no sense. See the lines below from the query result text files. Maybe this gives an additional perspective and helps to understand why these strange names appear during the process (a bug in my script?), corrupting the files and sending the process into an infinite loop.

491519 | CAKAKQ010001760.1-23/1 | 166741:1 | --
491520 | CAKAKQ010025327.1-63/1 | 166741:3 | --

/workdir/jdo53/4423340/Mix_10sp/results_mix10/res_mix10_10.txt + /workdir/jdo53/4423340/Mix_10sp/results_mix10/res_mix10_11.txt

/workdir/jdo53/4423340/Mix_10sp/results_mix10/res_mix10_13.txt + /workdir/jdo53/4423340/Mix_10sp/results_mix10/res_mix10_5.txt

/workdir/jdo53/4423340/Mix_10sp/results_mix10/res_mix10_6.txt + /workdir/jdo53/4423340/Mix_10sp/results_mix10/res_mix10_7.txt

/workdir/jdo53/4423340/Mix_10sp/results_mix10/res_mix10_8.txt + ??t??t??t"1???H??M1??fDH?71??f?H?H??????t??t??t"1???H???M1??fDH?71??f??o?????t??t??t"1???H??M1??fDH?71??f?H?H?????

queries: 1000000

time: 64370 ms

speed: 932111 queries/min

unclassified: 94.5548% (472774)

/workdir/jdo53/4423315/Mix_10sp/results_mix10/res_mix10_10.txt + /workdir/jdo53/4423315/Mix_10sp/results_mix10/res_mix10_5.txt

/workdir/jdo53/4423315/Mix_10sp/results_mix10/res_mix10_6.txt + /workdir/jdo53/4423315/Mix_10sp/results_mix10/res_mix10_7.txt

/workdir/jdo53/4423315/Mix_10sp/results_mix10/res_mix10_8.txt + ????W&??

queries: 1000000

time: 29285 ms

speed: 2.04883e+06 queries/min

unclassified: 45.5042% (227521)

Funatiq commented 2 years ago

Sorry, I don't really understand where those lines with the filenames appear. What's the content of the results_mix10 directory? Are there any other files besides the 44 result files? Could you maybe upload your query script, too?

In your merge script you save the merged result in the results directory. It would then be included as an input file if you run the script another time.
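For example, something like this keeps all merge output out of the input directory (merged_mix10 is a placeholder, and the redirection only matters if your script captures the merged result from stdout):

mkdir -p merged_mix10
metacache merge results_mix10 -taxonomy ncbi_taxonomy -lowest species -abundances merged_mix10/10_abund_extraction.txt -abundance-per species > merged_mix10/merged_output.txt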

jaimeortiz-david commented 2 years ago

Dear Robin,

Thank you for your response.

I thought I had attached the query script. I am attaching it here.

The results_mix10 directory contains only the 44 result files. My initial thought was that something else in there might have created the weird lines.

slurm_mix10_query_mod copy.txt

Funatiq commented 2 years ago

OK, so the problem is what I suspected, but in the parent directory Mix_10sp. If you give a directory to MetaCache, it will query all files in that directory and its sub-directories. If you query Mix_10sp, it will also query all files in Mix_10sp/results_mix10, which would include results from other jobs in the array. I would suggest that you store the read files and the result files in separate directories.
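For example, a layout like this keeps the queried directory clean (the names are just placeholders based on the paths in this thread):

Mix_10sp/                   # pass only this directory (or the FASTQ files) to metacache query
    fish_db_R1.fastq
    fish_db_R2.fastq
results_mix10/              # per-partition query results live outside the queried directory
    res_mix10_1.txt
    res_mix10_2.txt
    ...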

jaimeortiz-david commented 2 years ago

Hi Robin,

Thank you so much for all your help and for releasing the new version that improves the merging process. The merging now seems to be working fine. I also noticed that the merging issue occurred with result files that did not assign any taxa; in other words, it was not only happening with corrupted files but also with files that had no results. I hope this makes sense.

Many thanks again!

Best wishes,

Jaime

Funatiq commented 2 years ago

You're welcome. I'm glad it works now.