micans / mcl

MCL, the Markov Cluster algorithm, also known as Markov Clustering, is a method and program for clustering weighted or simple networks, a.k.a. graphs.
https://micans.org/mcl

Introducing new node to existing network? #1

Closed ml3958 closed 2 years ago

ml3958 commented 2 years ago

Hi,

Thanks so much for developing this amazing method. I have used mcl on my very large protein networks (~10 million proteins) and it works pretty well. I have two questions:

  1. I want to expand to a larger set of proteins; however, the .abc file (~2T) has exceeded my system RAM and seems to kill the mcxload process.

  2. Is there a way to add new proteins into the existing network? If so, I can solve the first problem and potentially skip the bulk of mcl, which took days to run.

Any input would be greatly appreciated. Best, Menghan

micans commented 2 years ago

Hi Menghan,

Thank you for your kind words. It is possible to speed up/optimise the loading of the matrix, but I have a question first: is it the case that you want to add proteins into the existing network and then re-cluster (A), or do you want to add the proteins to the existing clustering (B) (as you write “potentially skip the bulk of mcl”)? If it is (A), then I can probably make a recipe that helps.

ml3958 commented 2 years ago

Thanks for your prompt reply, Stijn.

I was hoping to do (B) initially, because adding new proteins would exceed my system RAM and thus cause problems for mcxload and mcl. Is that a reasonable concern? I would love to hear your solution to (A) as well!

Thanks, Menghan

micans commented 2 years ago

Hi Menghan,

For (A), I'm pretty sure the mcxload step can be improved. Did the abc file (2T) exceed your disk space, rather than your RAM? It is possible to create an mcl matrix in incremental steps, without needing to feed it all the edges at the same time. I think it should also be possible to update the matrix for an existing network by adding another matrix that encodes only the new edges.
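To make the incremental idea concrete, here is a rough, untested sketch; the mcxi stack operations and the requirement that all chunk matrices share one tab file are written from memory and should be checked against the mcxload and mcxi manual pages:

```
# Untested sketch: split the abc file, load each chunk as its own matrix,
# then sum the chunk matrices. For the addition to be meaningful the chunks
# must be loaded against one shared node domain (a common tab file); the
# mcxload manual documents options for restricting/reusing a tab file.
split -n l/8 alignment_evalue_orig.abc chunk_
for f in chunk_*; do
  mcxload -abc "$f" -o "$f.mci" --stream-neg-log10 -stream-tf 'ceil(400)' --write-binary
done
# mcxi is the stack interpreter shipped with mcl: lm loads a matrix,
# add sums the top two matrices, wm writes the result.
mcxi /chunk_aa.mci lm /chunk_ab.mci lm add /partial.mci wm
```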

Also, you could give mcl different parameters so that it still runs and does not use as much memory. However, each clustering run will still be a very big compute job.
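For example (illustrative values only; a lower -scheme keeps fewer matrix entries per column during expansion, which should reduce peak memory at some cost in accuracy):

```
# Illustrative only: -scheme selects a preset resource/pruning level
# (lower = less memory, less accurate), -te sets the number of threads.
mcl alignment_evalue.mci -I 2.0 -scheme 4 -te 12 -o out.alignment_evalue.I20
```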

For (B), I have a memory of looking into this at some point; I'll get back to you about that later.

ml3958 commented 2 years ago

Thanks so much, Stijn. I am looking forward to hearing about (B). If it works, it will give me so much flexibility to easily add new genomes to an existing pipeline/analysis. :-)

For (A), the 2TB abc file is smaller than my disk. My machine has 14T of disk space left and 384GB of RAM. I went back to my log and found the error message (below); it seems that my script was killed at the mcxload step. Perhaps I should set a limit on RAM for mcxload?

```
.................................................. 31829M .................................................. 31830M
.................................................. 31831M .................................................. 31832M
.................................................. 31833M .................................................. 31834M
.................................................. 31835M .................................................. 31836M
.................................................. 31837M .................................................. 31838M
..............................................
/mnt/data1/menghanliu/gephe_jgi/scripts/python/module//run_02_mcxload.sh: line 27: 10397 Killed mcxload -abc $dir_orthogroup/alignment_evalue_orig.abc -o $dir_orthogroup/alignment_evalue.mci -write-tab $dir_orthogroup/seq_dic.tab --stream-neg-log10 -stream-tf 'ceil(400)' --write-binary -ri max
```

Thanks so much, Menghan

micans commented 2 years ago

Setting a limit will not help. Perhaps a limit (RAM or time) is in place, and exceeding that limit caused the mcxload process to be killed (or a bug). However, you've run mcl on 10M nodes, and that should be much more memory intensive than mcxload, suggesting it cannot be a limit. It is difficult to know what to do without more information. Do you run this on a node of a compute cluster?

Regardless of all these considerations, for constructing the matrix I think it would be beneficial to subdivide reading the matrix into different jobs, each constructing a sub-network, then adding the networks together. How many lines are in the big abc file? At least 31,838M I gather. How many nodes would there be in the network? If this is based on something like blast scores, are you already using a cut-off for the weights (disregard edges below a certain log-score)?
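For a quick look at the input (assuming the abc layout from your mcxload command, with the raw e-value in the third column), something like:

```
# How many edges (lines) does the abc file contain?
wc -l alignment_evalue_orig.abc

# Pre-filter weak edges before loading; the 1e-20 threshold is only an example.
awk '$3 <= 1e-20' alignment_evalue_orig.abc > alignment_evalue_strict.abc
```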

Adding nodes to an existing clustering is a bit of a stopgap. By the way, it is a problem that has different solutions (e.g. something like a weighted majority vote), and this is not really to do with mcl. You could conceivably do it yourself; what I once implemented was a simple weighted-vote-like method (I will check if the mcl software provides this).
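To sketch what a weighted-vote assignment could look like outside of mcl itself (file names, formats, and the tab separator below are assumptions for illustration, not something the mcl software provides):

```
# Hypothetical inputs:
#   clusters.tsv  : old_protein <TAB> cluster_id
#   new_edges.abc : new_protein <TAB> old_protein <TAB> weight (e.g. -log10 e-value)
# For each new protein, sum the edge weights per cluster and keep the heaviest cluster.
awk -F'\t' '
  NR == FNR { cl[$1] = $2; next }              # first file: cluster membership
  ($2 in cl) { w[$1 SUBSEP cl[$2]] += $3 }     # second file: accumulate votes
  END {
    for (k in w) {
      split(k, a, SUBSEP)
      if (w[k] > best[a[1]]) { best[a[1]] = w[k]; vote[a[1]] = a[2] }
    }
    for (p in vote) print p "\t" vote[p]
  }
' clusters.tsv new_edges.abc > new_assignments.tsv
```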

The more nodes you add to an existing clustering, the more of a stopgap it is. What is the total number of nodes you'd like to have a clustering for; or could it forever grow as new data becomes available?

Sorry, many questions, but with this size of data a good strategy and understanding of the context and requirements really helps, as the compute is so expensive both in time, cpu and RAM.

ml3958 commented 2 years ago

Thanks so much for the reply. I will try to answer point to point

> Setting a limit will not help. Perhaps a limit (RAM or time) is in place, and exceeding that limit caused the mcxload process to be killed (or a bug). However, you've run mcl on 10M nodes, and that should be much more memory intensive than mcxload, suggesting it cannot be a limit. It is difficult to know what to do without more information. Do you run this on a node of a compute cluster? Regardless of all these considerations, for constructing the matrix I think it would be beneficial to subdivide reading the matrix into different jobs, each constructing a sub-network, then adding the networks together. How many lines are in the big abc file? At least 31,838M I gather. How many nodes would there be in the network? If this is based on something like blast scores, are you already using a cut-off for the weights (disregard edges below a certain log-score)?

I ran this on my own workstation (12 cores), so time should not be a limiting factor the way it would be on a compute cluster that limits job running time. So I am guessing the problem is more with the RAM (384GB). With this current failed job, I have ~19 million nodes and 31,838,930,734 lines in the abc file, based on blast scores (e^-10 cutoff). The previous successful run was with half of the data size. Please let me know if I can provide more info.

> Adding nodes to an existing clustering is a bit of a stopgap. By the way, it is a problem that has different solutions (e.g. something like a weighted majority vote), and this is not really to do with mcl. You could conceivably do it yourself; what I once implemented was a simple weighted-vote-like method (I will check if the mcl software provides this).

I was thinking something similar! And I completely agree it is a bit of a stopgap, so I was also thinking of doing a pilot analysis to show that such a suboptimal method (weighted majority vote) is reasonably consistent with redoing the mcl clustering. What would be your intuition/experience with the comparison outcome?

> The more nodes you add to an existing clustering, the more of a stopgap it is. What is the total number of nodes you'd like to have a clustering for; or could it forever grow as new data becomes available?

My current thought is to start with a diverse protein collection (as diverse as possible), and then it would grow forever as new data are being considered.

> Sorry, many questions, but with this size of data a good strategy and understanding of the context and requirements really helps, as the compute is so expensive both in time, cpu and RAM.

Thanks so much for your help. I would love to provide more details. This has been such a critical step in my research so I appreciate the existence of mcl and your help :)

micans commented 2 years ago

> I ran this on my own workstation (12 cores), so time should not be a limiting factor the way it would be on a compute cluster that limits job running time. So I am guessing the problem is more with the RAM (384GB). With this current failed job, I have ~19 million nodes and 31,838,930,734 lines in the abc file, based on blast scores (e^-10 cutoff). The previous successful run was with half of the data size. Please let me know if I can provide more info.

Aha, so mcxload died after it had parsed all edges. Going from 10M nodes to 19M nodes is quite a big step up; it is safe to assume that 384GB of RAM is the limiting factor. It seems likely that you would need one of a) bigger hardware, b) a different clustering algorithm, or c) a different implementation of mcl (e.g. hipMCL, but that will require access to cluster hardware as well, and I don't know how easy it is to set up).

I checked the node assignment by the way, clm residue (shipped with mcl) does that, but requires reading in the full network for all nodes, so that's not useful. Also with a jump from 10M nodes to 19M nodes this approach is not suitable. Have you looked at cluster algorithms like CD-HIT, DBSCAN, OPTICS, or Louvain or Leiden? They probably have lower memory requirements than mcl.

I haven't kept up with the literature and the plethora of existing clustering algorithms. Ensembl families used to use mcl on networks approaching 10M nodes but now they use the TreeFam HMM library to define clusters.

In your case it might be interesting to do some pre-redundancy pruning to set a certain set of nodes aside and avoid very similar nodes in the network. This of course would be defining your own hybrid clustering algorithm (and this idea is similar to CD-HIT I think).

> I was thinking something similar! And I completely agree it is a bit of a stopgap, so I was also thinking of doing a pilot analysis to show that such a suboptimal method (weighted majority vote) is reasonably consistent with redoing the mcl clustering. What would be your intuition/experience with the comparison outcome?

My hunch is that this works a bit as long as you're only adding 1 or 2% extra nodes, but certainly not if you double the number of nodes.

> My current thought is to start with a diverse protein collection (as diverse as possible), and then it would grow forever as new data are being considered.

A bit related to the remarks about pre-redundancy pruning and CD-HIT above. Have you tried other clustering algorithms? How quickly does your data grow? After 19M nodes, would the next step be more gradual, e.g. 5% more nodes, or would there again be a jump up?

ml3958 commented 2 years ago

> Aha, so mcxload died after it had parsed all edges. Going from 10M nodes to 19M nodes is quite a big step up; it is safe to assume that 384GB of RAM is the limiting factor. It seems likely that you would need one of a) bigger hardware, b) a different clustering algorithm, or c) a different implementation of mcl (e.g. hipMCL, but that will require access to cluster hardware as well, and I don't know how easy it is to set up).

> I checked the node assignment by the way, clm residue (shipped with mcl) does that, but requires reading in the full network for all nodes, so that's not useful. Also with a jump from 10M nodes to 19M nodes this approach is not suitable. Have you looked at cluster algorithms like CD-HIT, DBSCAN, OPTICS, or Louvain or Leiden? They probably have lower memory requirements than mcl.

I was considering a hybrid approach! I clustered those ~19 million proteins using CD-HIT at 90% identity down to ~13 million in a day, so CD-HIT is definitely able to reduce the data to a scale that mcl can handle. A few things I was considering, and I wonder what you think:

  1. How many representative sequences to pick for downstream mcl clustering, and how to pick them: the longest ones, or the most diverse (at the sequence level)?
  2. In general, I always wonder what the rule-of-thumb practice is when using an algorithm like CD-HIT that largely considers identity only, not statistics (i.e., e-value). Since I care about grouping sequences into functionally orthologous families, my limited experience with CD-HIT suggests it produces many small clusters (even at a low identity of 60% and a high coverage threshold), and orthologous groups tend to be broken into small subgroups. I wonder what your take on this is, as this evaluation is rarely mentioned in the literature. Not to be negative at all, but my conclusion was that CD-HIT is not ideal for producing orthologous families but rather for reducing data redundancy. Therefore, a hybrid of CD-HIT and mcl might actually be a great solution for big data, with both speed and accuracy. Sorry, I think I've combined answers to multiple of your questions here.

> I haven't kept up with the literature and the plethora of existing clustering algorithms. Ensembl families used to use mcl on networks approaching 10M nodes but now they use the TreeFam HMM library to define clusters.

I didn't know about Ensembl; I will check. I was considering using the InterPro (UniProt) database. My thought with those databases is that they're not totally agnostic, but usually rely on some external curated information (which might not be a bad thing :) ).

> In your case it might be interesting to do some pre-redundancy pruning to set a certain set of nodes aside and avoid very similar nodes in the network. This of course would be defining your own hybrid clustering algorithm (and this idea is similar to CD-HIT I think).

> I was thinking something similar! And I completely agree it is a bit of a stopgap, so I was also thinking of doing a pilot analysis to show that such a suboptimal method (weighted majority vote) is reasonably consistent with redoing the mcl clustering. What would be your intuition/experience with the comparison outcome?

> My hunch is that this works a bit as long as you're only adding 1 or 2% extra nodes, but certainly not if you double the number of nodes.

> My current thought is to start with a diverse protein collection (as diverse as possible), and then it would grow forever as new data are being considered.

> A bit related to the remarks about pre-redundancy pruning and CD-HIT above. Have you tried other clustering algorithms? How quickly does your data grow? After 19M nodes, would the next step be more gradual, e.g. 5% more nodes, or would there again be a jump up?

After the initial clustering, the next step would be gradual, perhaps 0.01% at a time. So one could even think of a three-step approach: CD-HIT --> mcl --> clm residue? I have to check how to use clm residue. Is there an example somewhere?

Thank you so much

micans commented 2 years ago

Hi, sorry for the slight delay in responding.

> I was considering a hybrid approach! I clustered those ~19 million proteins using CD-HIT at 90% identity down to ~13 million in a day, so CD-HIT is definitely able to reduce the data to a scale that mcl can handle. A few things I was considering, and I wonder what you think:

> 1. How many representative sequences to pick for downstream mcl clustering, and how to pick them: the longest ones, or the most diverse (at the sequence level)?

I wonder if a slightly different hybrid approach might work: find a way to break up the network into different parts, for example using single-linkage clustering, or perhaps by using CD-HIT to create a network that filters out weak edges, then applying single-linkage clustering to that, and then running MCL on the individual components.
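In outline it might look like the following; the option names for mcx alter and clm close are written from memory and should be verified against their manual pages, and the gq(...) threshold is only a placeholder:

```
# Sketch only; verify option names against the mcx alter / clm close manuals.
# 1. Keep only strong edges (here: -log10 e-value >= 50, placeholder value).
mcx alter -imx alignment_evalue.mci -tf 'gq(50)' -o strong.mci
# 2. Connected components of the thresholded graph act as single-linkage clusters.
clm close -imx strong.mci --write-cc -o components.cls
# 3. Run mcl separately on each component (extracted e.g. with mcxsubs), so that
#    each individual job stays well below the full 19M-node problem.
```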

There is a technical aspect to this. Apart from mcl, the underlying library and sibling programs have been designed to deal efficiently with large network manipulations, providing some ways of querying and transforming networks. Hence there are scripted approaches that would make some of this easy.

The main challenge is to be able to work with the initial big data. I wonder if the mcxload step could be improved; however, with your data the raw result matrix will take 16 bytes per edge (8 bytes per arc). If you specified each edge in both directions in the abc file, this would be (31838930734 * 8) / 384000000000 ≈ 66% of your RAM (and twice that if each edge was specified only once).

> 2. In general, I always wonder what the _rule of thumb_ practice is when using an algorithm like CD-HIT that largely considers identity only, not statistics (i.e., e-value). Since I care about grouping sequences into functionally orthologous families, my _limited_ experience with CD-HIT suggests it produces many small clusters (even at a low identity of 60% and a high coverage threshold), and orthologous groups tend to be broken into small subgroups. I wonder what your take on this is, as this evaluation is rarely mentioned in the literature. Not to be negative at all, but my conclusion was that CD-HIT is not ideal for producing orthologous families but rather for reducing data redundancy. Therefore, a hybrid of `CD-HIT` and `mcl` might actually be a great solution for big data, with both speed and accuracy. Sorry, I think I've combined answers to multiple of your questions here.

I don't really have experience with CD-HIT, so I cannot comment on that. Since you mention orthologous families, have you considered OrthoMCL? I assume you have species annotation?

> A bit related to the remarks about pre-redundancy pruning and CD-HIT above. Have you tried other clustering algorithms? How quickly does your data grow? After 19M nodes, would the next step be more gradual, e.g. 5% more nodes, or would there again be a jump up?

> After the initial clustering, the next step would be gradual, perhaps 0.01% at a time. So one could even think of a three-step approach: CD-HIT --> mcl --> clm residue? I have to check how to use clm residue. Is there an example somewhere?

There is a manual page (https://micans.org/mcl/src/mcl-latest/doc/clmresidue.html). I can give a recipe (as it would involve some manipulation of tab files and networks), but it is perhaps best to do this once the problem of clustering 19M nodes (using linkage + components, or size reduction of some kind) is solved.

micans commented 2 years ago

Tied to the previous comment: how about trying one or two approaches to cut the network into smaller components (each could still be, say, up to 5 million nodes) with the purpose of clustering each component separately? Initially it would be best to see if dividing the network into components is feasible on a test case, say with up to 1M nodes.

micans commented 2 years ago

Hi Menghan, do you have any further thoughts on this? I could conceivably help (e.g. with code) if you know what sort of direction you want to try (e.g. single-linkage, CD-HIT, OrthoMCL, or something else). A hybrid approach of single-linkage clustering + mcl is described, for example, in this paper (on much smaller data): https://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-5-45 . Otherwise I'll close this issue in a few days' time; in that case, if you comment on it again I'll re-open it.