metagentools / MetaCoAG

🚦🧬 Binning Metagenomic Contigs via Composition, Coverage and Assembly Graphs
https://metacoag.readthedocs.io/en/stable/
GNU General Public License v3.0

Is process hanging or will it just take a while? #31

Closed nmb85 closed 2 months ago

nmb85 commented 1 year ago

I love the concept for MetaCoAG; what a great idea! I'm trying to use your awesome tool to bin contigs from a 30 Gbp MEGAHIT metagenomic assembly with 35 million contigs. For a little more than 48 hours it has been paused (or working?) at the step that follows the initial assignment of contigs with marker genes to bins. Memory and CPU usage have not visibly changed in that time, and no messages have been printed to the log file or to stdout/stderr. The only files in the output directory are the tetranucleotide frequency pickle file and the log file. The log file is attached below.

Is my process hanging, does it require more memory (current usage is steady at 65% of max: 175 GB/250 GB), or is it working? If it is in fact working, what would you expect the time to complete this step to be and is there a flag that I missed for parallelizing this step?

Thanks for any help! Would love to see how MetaCoAG performs!

Attachment: metacoag.log

Vini2 commented 1 year ago

Hello @nmb85,

Thank you very much for your interest in MetaCoAG!

I haven't tested MetaCoAG on datasets having more than a couple of hundred thousand contigs. I don't know how long it will take to complete (maybe a couple of days?).

If possible, could you share the data you are testing on with me? I would like to give it a try and see. 35 million contigs sounds very interesting!

Thank you!

nmb85 commented 1 year ago

Thank you, @Vini2! I will reach out via the contact form on your professional website to share the data. The data is proprietary and runs to tens of GB, so I cannot post it at a public link. In the meantime, this was my attempt at parallelizing the get_non_isolated function in metacoag_utils/graph_utils.py:

I imported multiprocessing as mp and broke up the get_non_isolated function into two functions, abstracting away the outermost for loop as a function to run in parallel with mp.

import multiprocessing as mp

def get_connected_components(i, assembly_graph, binned_contigs):
    # Return the connected component containing node i,
    # or an empty list if i is not a binned contig.
    if i not in binned_contigs:
        return []
    # Expand outwards from i; a set keeps membership tests O(1).
    component = {i}
    frontier = [i]
    while frontier:
        next_frontier = []
        for j in frontier:
            for neighbour in assembly_graph.neighbors(j, mode="ALL"):
                if neighbour not in component:
                    component.add(neighbour)
                    next_frontier.append(neighbour)
        frontier = next_frontier
    # i itself is binned, so every node in its component is non-isolated.
    return list(component)

def get_non_isolated(node_count, assembly_graph, binned_contigs, nthreads):
    args = [(i, assembly_graph, binned_contigs) for i in range(node_count)]
    with mp.Pool(processes=nthreads) as pool:
        results = pool.starmap(get_connected_components, args)
    # starmap returns one list per node; flatten and deduplicate them
    # so the caller gets a single list of node ids, as before.
    non_isolated = set()
    for component in results:
        non_isolated.update(component)
    return sorted(non_isolated)

Then in metacoag, I changed the get_non_isolated function call on lines 653-657 to pass the nthreads variable to the new parallelized get_non_isolated function:

non_isolated = graph_utils.get_non_isolated(
    node_count=node_count,
    assembly_graph=assembly_graph,
    binned_contigs=binned_contigs,
    nthreads=nthreads,
)

The changes ran to completion without error messages on a toy dataset, but I couldn't observe any multiprocessing in htop.
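
Besides parallelizing, it may be worth reducing the work itself. The posted code looks up membership in Python lists and walks a component again from every binned contig in it, so the cost can grow roughly quadratically with component size. Below is a minimal single-process sketch of the same logic (keep every node whose connected component contains a binned contig) that visits each node once. It is not MetaCoAG's implementation: the function name non_isolated_nodes and the plain dict adjacency are hypothetical stand-ins for the igraph assembly graph, used so the example runs with the standard library only.

```python
from collections import deque


def non_isolated_nodes(adjacency, binned_contigs):
    """Return the sorted nodes whose connected component contains a binned contig.

    adjacency: dict mapping each node id to an iterable of neighbouring node ids
               (hypothetical stand-in for the assembly graph).
    binned_contigs: iterable of node ids that already have a bin.
    """
    visited = set()
    for start in binned_contigs:
        if start in visited:
            # This component was already explored from another binned contig.
            continue
        visited.add(start)
        queue = deque([start])
        while queue:
            node = queue.popleft()
            for neighbour in adjacency.get(node, ()):
                if neighbour not in visited:
                    visited.add(neighbour)
                    queue.append(neighbour)
    return sorted(visited)


# Toy graph: 0-1-2 form one component, 3-4 another, 5 has no edges.
toy_graph = {0: [1], 1: [0, 2], 2: [1], 3: [4], 4: [3], 5: []}
result = non_isolated_nodes(toy_graph, binned_contigs=[0, 2])
print(result)
```

Because the visited set is shared across all starting points, the total work is proportional to the number of nodes plus edges reached, which may matter more than the number of processes at this scale. As in the posted code, a binned contig with no edges still counts as non-isolated.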

nmb85 commented 1 year ago

Brief update: I allowed metacoag to run for 6 days, but it was still stuck at running the get_non_isolated function. No errors, just very busy running that function. I believe parallelizing this step and perhaps downstream steps would help here.

nmb85 commented 1 year ago

Hi @Vini2, you're probably swamped - should we move ahead with trying to parallelize this on our end? If so, any hints or suggestions based on our attempt above?

Vini2 commented 1 year ago

Hi @nmb85,

I'm so sorry I couldn't get back to you regarding this issue. Is everything sorted? Were you able to parallelise the step? I tested your suggested method and it works fine.

Vini2 commented 2 months ago

Closing issue due to inactivity. Please re-open if needed.