metagentools / MetaCoAG

🚦🧬 Binning Metagenomic Contigs via Composition, Coverage and Assembly Graphs
https://metacoag.readthedocs.io/en/stable/
GNU General Public License v3.0

Extremely long file write times #14

Closed: GabeAl closed this issue 2 years ago

GabeAl commented 2 years ago

I ran MetaCoAG on a 600 MB assembly of long-read data from an unsupported assembler, so take this with a grain of salt. I used the "megahit" mode because I have no idea what the "paths" file is, and this assembler produced no such file. It seems to work well, except that it slows down at the end when writing the output files. The GFA file has the sequences embedded in it (in fact, I used the GFA to generate the FASTA that I aligned to for coverage).

For context:

2021-10-27 15:41:58,302 - INFO - Welcome to MetaCoAG:
...
2021-10-27 15:57:20,595 - INFO - Elapsed time: 922.2919390201569 seconds
2021-10-27 15:57:21,327 - INFO - Writing the Final Binning result to file
2021-10-27 16:42:55,475 - INFO - Producing 185 bins...
2021-10-27 16:42:55,476 - INFO - Final binning results can be found in outs/bins/
2021-10-27 16:42:55,476 - INFO - Thank you for using MetaCoAG!

So the entire computation finishes in about 15 minutes, but writing the files (with awk...) then takes roughly another 45 minutes to output the MAGs. Is this typical? Is there a trick to speed this up?

Vini2 commented 2 years ago

Hello @GabeAl,

Thank you for raising this issue. I have fixed it by adding a more efficient method for writing the final binning results to individual bin files. Please pull the latest version from the repo and give it a try.
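The actual change is not shown in this thread. As a rough illustration of why per-bin extraction can be slow and how a single pass avoids it, here is a minimal sketch, assuming a plain FASTA contig file and a dictionary that maps contig names to bin IDs. The function name `write_bins`, the `bin_<id>.fasta` naming, and the arguments are hypothetical and are not taken from the MetaCoAG code base. It reads the contig file once and routes each record to its bin file, instead of rescanning the whole file for every bin.

```python
from collections import defaultdict
from pathlib import Path


def write_bins(fasta_path, contig_to_bin, out_dir):
    """Write binned contigs to <out_dir>/bin_<id>.fasta in one pass.

    Hypothetical helper for illustration only. Contigs whose name is
    not a key of contig_to_bin are skipped. Returns {bin_id: n_contigs}.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    handles = {}               # bin_id -> open output file
    counts = defaultdict(int)  # bin_id -> number of contigs written
    current = None             # output file for the record being read

    try:
        with open(fasta_path) as fasta:
            for line in fasta:
                if line.startswith(">"):
                    # The contig name is the first word of the header.
                    parts = line[1:].split()
                    name = parts[0] if parts else ""
                    bin_id = contig_to_bin.get(name)
                    if bin_id is None:
                        current = None  # unbinned contig, skip its lines
                    else:
                        if bin_id not in handles:
                            # One handle per bin stays open; a few hundred
                            # bins is normally within the OS file limit.
                            handles[bin_id] = open(
                                out_dir / f"bin_{bin_id}.fasta", "w"
                            )
                        current = handles[bin_id]
                        counts[bin_id] += 1
                if current is not None:
                    current.write(line)
    finally:
        for handle in handles.values():
            handle.close()

    return dict(counts)
```

With this layout the cost is proportional to the size of the contig file rather than to the file size multiplied by the number of bins, which matters when there are 185 bins and a 600 MB assembly.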

If you don't mind, may I ask which assembler you used? MetaCoAG currently supports the edge sequences from the GFA file of Flye, and I'm working on adding support for the original Flye contigs. As long as the sequences in the contig file match those in the GFA file, MetaCoAG should work fine. The .paths file is a special file produced only by the SPAdes assembler and hence is not required by MetaCoAG in other modes (please check Input format).
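One way to check the "sequences match" condition before running a binner is sketched below. It is not part of MetaCoAG: the helper names are hypothetical, and it assumes a GFA 1 file whose `S` (segment) lines carry the sequence in the third tab-separated field, and a FASTA file whose record names equal the GFA segment names. Everything is held in memory, so it suits assemblies that fit in RAM.

```python
def read_gfa_segments(gfa_path):
    """Return {segment_name: sequence} from the S lines of a GFA 1 file."""
    segments = {}
    with open(gfa_path) as gfa:
        for line in gfa:
            fields = line.rstrip("\n").split("\t")
            # "*" means the sequence is not stored in the GFA file.
            if fields[0] == "S" and len(fields) >= 3 and fields[2] != "*":
                segments[fields[1]] = fields[2].upper()
    return segments


def read_fasta(fasta_path):
    """Return {record_name: sequence} from a FASTA file."""
    seqs = {}
    name = None
    chunks = []
    with open(fasta_path) as fasta:
        for line in fasta:
            line = line.strip()
            if line.startswith(">"):
                if name is not None:
                    seqs[name] = "".join(chunks).upper()
                parts = line[1:].split()
                name = parts[0] if parts else ""
                chunks = []
            elif name is not None and line:
                chunks.append(line)
    if name is not None:
        seqs[name] = "".join(chunks).upper()
    return seqs


def find_mismatches(gfa_path, fasta_path):
    """Compare FASTA records against GFA segments of the same name.

    Returns (missing, different): FASTA names absent from the GFA, and
    names present in both whose sequences differ (case-insensitive).
    """
    segments = read_gfa_segments(gfa_path)
    contigs = read_fasta(fasta_path)
    missing = sorted(set(contigs) - set(segments))
    different = sorted(
        name for name in contigs
        if name in segments and contigs[name] != segments[name]
    )
    return missing, different
```

If both returned lists are empty, every FASTA record has an identically named GFA segment with the same sequence, which is the situation described above for a FASTA generated directly from the GFA.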

Also, if possible, please let me know how good the results are and whether you have any suggestions for improving MetaCoAG. It would really help me learn and improve the tool.

Let me know how things go.

Best regards, Vijini