12T data to process - Githubissues

zhengqibonan commented 1 month ago

Many thanks for your job! I have almost 12T data that needs to be clustered by the Markov Cluster algorithm, do you have any suggestion?

[mclIO full] reading ....................................... [mclIO] read native binary 14393317x14393317 matrix with 1788352905 entries [mcl] pid 4019741 ite ------------------- chaos time hom(avg,lo,hi) m-ie m-ex i-ex fmv 1 ___> Vector with idx [115], maxval [0.000133] and [1185220] entries -> initially reduced to [2] entries with combined mass [0.000250]. -> Consider increasing the -P value and increasing the -S value. -> (before rescaling) Finished with [1400] entries and [0.044286] mass. ls

___> Vector with idx [635], maxval [0.000139] and [1215076] entries -> initially reduced to [2] entries with combined mass [0.000270]. -> Consider increasing the -P value and increasing the -S value. -> (before rescaling) Finished with [1400] entries and [0.067914] mass. killed

micans commented 1 month ago

Hello, could you perhaps read the exchange of all comments in #8; in that case a graph of size 31965032x31965032 with 107744198225 nodes was clustered.

Your command was being killed almost certainly because the machine you ran it on did not have sufficient memory. This is not the only concern however. The average node degree in your network is 124, but there must be nodes in it that have a much higher degree. The error message above tells us that node with idx=115 is able to see 1185220 nodes in the network in two steps; this would be the case for example if that node had approximately 1088 neighbours and each of those had again approximately 1088 neighbours, none of which were shared. It is likely to be much more skewed than that; there must be nodes with very high degree (number of neighbours).

What sort of data is this? Protein sequence data? Metagenomics? From the little information I see above I would suggest to preprocess the network and make sure there are fewer edges in it, especially focusing on ensuring there are fewer nodes with high degree.

zhengqibonan commented 4 weeks ago

Thanks for your reply, what I need to work with is metagenomic data and I need to cluster the genomes based on the AAI.How to preprocess the network? Could you give me some templates?

micans commented 4 weeks ago

Hi, I don't have experience with metagenomic data. I think mcl may have been used with it, I remember seeing publications that used it, but this may have been on sequences that had some kind of pre-filter. Here are two survey papers that may be helpful:

https://academic.oup.com/bib/article/21/1/1/5098604 https://www.nature.com/articles/s41467-018-04964-5

Popular algorithms are (I think) CD-HIT, uclust and linclust. Are you aware of these? Is there a reason they are not suitable? I see that AAI is Average Amino-acid Identity. I would seek to generate some statistics about how your networks (e.g. the histogram of node degrees) change as you vary e.g. the cut-off threshold. I would also do this on a smaller test dataset, anywhere between 10,000 and 100,000 nodes or at least some smaller input data that is interpretable as a test. I don't have templates, and suggest looking into standard approaches in the field. I'm happy to help with any mcl-specific problems.

zhengqibonan commented 4 weeks ago

Thanks for your reply! You are very nice!!! After read the comments in #8 ,I got much information needed and also think mcl can work on 12T data. I'm trying do this and I'll ask you questions when I get some problems. I also know that CD-HIT and uclust can be used in sequence clustering, but these tools are different in algorithm Compared with Markov Cluster algorithm. According to reference, the inflation parameter I used is 1.2, and I think other tools do not contain Markov Cluster algorithm.

micans commented 4 weeks ago

Thank you for your kind words. I am a little curious why you have 12T of data. The network in #8 has more nodes and edges and was reported as 1.5T, and 700G when converted to 'mcx' format with mcx convert. Is the 12T of data in one of the mcl matrix formats, or is it a list of pairwise sequence matches?

Inflation 1.2 is very low. This would only be useful, in my experience so far, for a network that already shows a good degree of separation. A practical issue that arises with low inflation values is that mcl will also take much more time and use more memory, generally. For datasets of the size you have this just becomes very painful. If you test with large-scale data it is better to start testing with a high inflation value such as 6 (and I would also run tests on smaller data sets and scale up from there).

Most importantly, the quality of your clustering will be determined to a large extent by how you construct the network. If the network has nodes with very high degree or if the distribution of node degrees is very skewed (such as in scale-free networks) then in my experience mcl tends not to work well. A simplistic starting point for a tunable approach is something like a threshold where only 95% or 98% or even higher identity matches are allowed (and edges below it are removed). I am not sure whether the following is relevant, but it might even make sense to check that that are no identical sequences in your input. Understanding the characteristics of your similarity measure in your datasets (a simple start is to look at histograms of edge weights and of node degrees) could give you ideas for appropriate filtering - as of course, any approaches that are standard in the field of metagenomics.

zhengqibonan commented 4 weeks ago

Hi, Thank you for your prompt response! I am trying to convert abc format to file mci format or binary format (mcx format). I think data is 12 T due to the abc format and I estimate the generated matrix file in mci format has dimensions 16000000 x 16000000. mcxload -abc data --stream-mirror --write-binary -o data.mcx mcxload -abc data --stream-mirror -write-tab data.tab -o data.mci But I don't kown if it can work. Thanks for your explanations for the parameters inflation 1.2, I just followed the workflow of the reference and it tell me to use 1.2. By the way, do you have any good advice on the classification of DNA viruses?

micans commented 4 weeks ago

I suggest to only use binary format, so the first version. To test you can simply do

head -n 1000000 data | mcxload -abc - --stream-mirror --write-binary -o data.mcx

this will only give the first 1M lines of data to the mcx command.

Is the workflow of the reference also in metagenomics? Is this a reference you can share? It is either a different type of data, or it is the same type of data but heavily filtered in some manner, and it would be good to know which is the case.

I have no advice on the classification of DNA viruses. It sounds interesting, and I have a faint recollection mcl was used in this sort of thing. There must be good literature on this - have you searched / found?

zhengqibonan commented 4 weeks ago

Hi, Thank you for your patient explanation! I tried to convert abc format to binary format (mcx format), and when mcxload 15000M lines, the memory occupied of the sever is 260G. But my data is 105138M lines and I used a big memory node with 1T memory. If it's proportional, I don't have enough memory on my server and I read that "Memory usage for mcxload can be lowered by replacing the option --stream-mirror with -ri max." on https://micans.org/mcl/man/clmprotocols.html#large. I wonder if that's work in this way? This process is very slow for my data, it is possible to give mcxload more threads to convert? mcxload -abc data -ri max --write-binary -o data1.mcx

Here is a paper about the classification of DNA viruses: https://www.nature.com/articles/s41564-021-00928-6 Here is the code in the above paper: https://github.com/snayfach/MGV The Readme in the folder named aai_cluster show that perform MCL-based clustering.

micans commented 4 weeks ago

I see two aspects here. First the paper states:

To identify genus- and family-level vOTUs, we clustered viral genomes using a combination of gene sharing and AAI. For computational efficiency, only the longest genome per species-level vOTU was included. Blastp from the DIAMOND package v.0.9.25.126 was used with options ‘-e-value 1 × 10−5–max-target-seqs 10,000’ to align all viral proteins. For each pair of genomes, we identified shared genes (e-value <1 × 10−5), computed their AAI, and computed the percentage of genes shared.

Have you done the same or a similar thing, with same/similar parameters?

Second, in #8 there are many more edges; the submitter stated that loading the matrix took about 0.7T of memory for loading 107744198225 edges. You have 1788352905 edges, which is a lot (60 times) less. Hence it is a bit puzzling that you would have issues loading the matrix. One aspect is that it is quite possibly not proportional due to pre-allocation strategy and front-loaded costs when data gets streamed in (later data leading to fewer new allocations). Anyway, yes,

mcxload -abc data -ri max --write-binary -o data1.mcx

should work off the top of my head.

zhengqibonan commented 4 weeks ago

Thanks. I heve done the the similar thing with same parameters, following the pipeline. Thank you very much for the explanation of the memory allocation. But the process is too slow, there is any idea to speed up the process?

micans commented 4 weeks ago

Am I correct in that you already loaded this data before and saved it in data.mci? In that case you can just do

mcx convert data.mci data.mcx

After that I fear we will get back to where you started:

___> Vector with idx [115], maxval [0.000133] and [1185220] entries

indicates something weird going on with node degrees. It will be useful to run one of the following (on a large memory machine):

mcx query -imx data.mcx  -tab data.tab -o data.info    # once you have the file data.mcx
mcx query -imx data.mci  -tab data.tab -o data.info     # use this if you only have data.mci

The file data.info will contain information about node degree et cetera. I had forgotten to write -write-tab data.tab in earlier mcxload command lines; it is quite important to keep it in. As for your question about threads, mcxload is not threaded. It is possible to parallelise mcxload at the (linux) process level and add the results back, but this is more involved. If data.mci is as you want it then running mcx convert is simplest, and hopefully you have data.tab lying around as well.

If you have the file data.tab it may be worth checking it is as you expect it to be, e.g.

head data.tab

zhengqibonan commented 3 weeks ago

Thank you very much for the reminder about mci format and mcx command. I am trying to convert abc format to binary format (mcx format), but it's taking a long time. Is there only one way to convert abc format to mci or mcx? mcxload -abc data --stream-mirror -write-tab data.tab -o data.mci mcxload -abc xxx.tsv --stream-mirror --write-binary -o data.mcx

micans commented 3 weeks ago

mcxload is the only way, correct. Loading billions of edges will always take a lot of time.
there are ways to use multiple machines but this requires a lot of unix scripting and there is no standard recipe to follow. Unfortunately this is not something I can support at the moment.
you will nearly always want to have the option -write-tab somefilename for mcxload (missing in your second command line above).

How did you make the file data.mci in the first message in this issue? Did you try to convert it to mcx format (mcx convert data.mci data.mcx) or is there a reason you don't want to use that file?

zhengqibonan commented 3 weeks ago

Thank you for your patient explanation! The file data.mci in the first message in this issue is just a test, part of my data.

micans commented 3 weeks ago

How big is your actual data?
I suggest to follow some of my suggestions for a subset of your data, for example the mcx query command.
I would test on different sizes: start with 100k nodes, go up to 500k, then 2M. Testing on 100k nodes should be very fast, certainly compared to your current experiences.

WIth this size of data it is paramount that you can be confident in the approach. To be confident, one needs to understand the different stages, and in this case especially, the characteristics of the network that is created. It is impossible to have an iterative cycle of testing/inspecting/improving on the full-scale data. The best piece of advice that I can give is to run much smaller-scale tests first that allow you to compute QC. As said earlier, I would very much focus on the degree of the nodes in the network in the first instance, but there are other things as well, e.g. consider a histogram of edge weights.

I expect there to be nodes of very high degree in your network, and it would be useful to know what they are and then to do something about it. Perhaps some region of sequence space is very highly represented in your data (many sequences that are highly similar). In a case such as this it seems natural to use something such as CD-hit/uclust/linclust (edit: another tool diamond) to reduce this redundancy, possibly in a hybrid approach that also involves mcl, but this is not straightforward - unless someone has already done something like this.

A minor thing that might help is that it is possible to create a tab file yourself; it is simply a two-column file, with integers 0 1 2 ... in the left column and the labels you are using in the right column, separated by a <TAB> character. You can then do e.g.

mcxload -abc data.tsv -strict-tab mytab.txt --write-binary -ri max -o data.mcx

It is important that your tab file contains all the labels that are seen in the input file. Both pre-defining a tab file and using -ri max should help a bit with increasing the speed.

micans commented 3 weeks ago

Hello, I'm closing this issue. Feel free to continue the conversation, in which case I will convert it from an issue to a discussion.

zhengqibonan commented 2 weeks ago

Hi, I had to bother you again. In the past weeks, I had tried to convert my data from abc format binary format (mcx format) by the command : mcxload -abc xxx.abc -ri max --write-binary -o xxx.mcx But it failed. The file size of my original abc was 9T, separated by a character. When I tried to convert my data from abc format binary format, the RAM uesd was more than 1.5T, and the program was killed automatically. I notice that in #8, they uesd orthofinder to do the same things I did, and after I read the orthofinder program, I wonder if the way of making matrix in the class WaterfallMethod of the orthofinder program is same with your mcl, like the mcx result. mcxload -abc xxx.abc -ri max --write-binary -o xxx.mcx

micans commented 2 weeks ago

Hello. What do the first ten lines of xxx.abc look like? Can you show the output of

head -n 10 xxx.abc

please?

zhengqibonan commented 2 weeks ago

13822300 3586609 0.04214285714285714 24933647 18214837 0.0675 6239657 6406184 0.13725 8540325 473957 0.19399999999999998 21883751 1100004 0.17125 9635863 781312 0.1705 24891007 1582743 0.10149999999999999 8587746 3920950 0.14666666666666667 24225933 12416091 0.23766666666666666 2262009 2749579 0.05471428571428571

The table has three columns, the first column is for protein A,the second column is for protein B, and the third column is the similarity between the two proteins.

micans commented 2 weeks ago

The main thing still is to understand the data. The orthofinder code does some substantial filtering, by the looks of it, and the code is aware of the species being compared when it does so. Is your data already filtered in some (species-comparison-aware) manner? If not, I suggest as a simple starting point to create a histogram of either your scores, or the log-transform of your scores. If it has different modalities it may be possible to filter out the lowest modality entirely. I would look at the histogram of the 100k first scores whether it needs transforming or not. How are the scores in the third column derived; what do they represent - E-values?

zhengqibonan commented 2 weeks ago

The scores in the third column represent amino acid identity (AAI). My data was already filtered as mentioned before by filter_aai program in the Readme in the folder named aai_cluster of MGV. Here is the code in the above paper: https://github.com/snayfach/MGV.

micans commented 2 weeks ago

At the MGV page they give this:

Filter edges and prepare MCL input
python filter_aai.py --in_aai aai.tsv --min_percent_shared 20 --min_num_shared 16 --min_aai 40 --out_tsv genus_edges.tsv
python filter_aai.py --in_aai aai.tsv --min_percent_shared 10 --min_num_shared 8 --min_aai 20 --out_tsv family_edges.tsv

you could vary those parameters to become more stringent.
I see --min_aai 20 but in your data the scores are below 1. Is there some normalisation step somewhere?

However, it is not the right engineering approach to only focus on your full-scale data. You seem to be stuck banging your head against a giant dataset. In your situation, I would at the very start have tried to run through the entire processing pipeline using only 10% of the data. Or 5%. Or whatever is feasible. It is the best advice I can give, I've given it a few times now. This processing pipeline should not be treated like a black box that you just aim to squeeze your data through. To understand the data, you need to have some awareness of how it is sensitive to parameters and how the data is distributed. In my experience, this is best achieved by gradually scaling up the size of the data that you process. Smaller data sizes (that can still be large) allow you to create understanding of the data, and importantly, it allows you to do so within reasonable time and with reasonable compute demands.

I don't have the means to make mcxload suddenly work on the 12T dataset, and without knowing anything about the data I cannot infer why it takes so much memory.

zhengqibonan commented 2 weeks ago

Thanks for your patience! I also tried smaller data sizes to convert my data from abc format to binary format based on your suggestions, but full-scale data did't work due to memory. In the #8, it has more dimensions than me and it is about 1.5T in the matrix file of mci format. I now try to convert my data from abc to the matrix file of mci format using orthofinder, I wonder if the way of making matrix in the class WaterfallMethod of the orthofinder program is same with your mcxload command?

micans commented 2 weeks ago

You're also very patient :-) I had a quick look at the WaterFallMethod (but stopped after that). I expect a lot of differences in how things are computed between mcxload and the orthofinder program. It is likely that orthofinder is able to write the matrix to file without having to load it into memory first (as it can output a node x when it knows no more neighbours of x will be seen; this is not the case for mcxload). In that case it will need less memory.

Loading abc data into mcxload to create a matrix M will take less memory than running mcl on M. So I don't believe the problem that needs solving is creating M. The problem that needs solving is why M is so difficult to create; additionally if I remember the beginning of our conversation, there is something of a surprise/mystery that this is so difficult / memory-intensive if we compare your data dimensions (nodes, edges) to those in #8.

What is the biggest test data you have loaded? I would need to see a histogram of node degrees (can be obtained from output of mcx query) for that test data, and I think you would benefit greatly from also generating a histogram of edge weights. I would initially do this for, say, one data set with 1M nodes (and all their edges), and another with 2M nodes.

zhengqibonan commented 1 week ago

Hi! I think the reason why orthofinder needs less memory is that it analyzes the results of only the DIAMOND of each FAA file at once to get the analysis matrix. Finally it connects all the matrices.

Some of the test data is not representative of all, and sooner or later I'll have to calculate all the data all the time. The big difference between me and #8 is that he uses matrix and I use abc format. He was able to go on and on, but I didn't succeed in going down with the abc format, so I wanted to convert to a matrix format.

Do you have any ideas for connecting matrix?

micans commented 1 week ago

(a) If mcxload does not work for this data (memory issues), then mcl will also not work (even bigger memory issues)

(b) The very first data you showed implied a network with some very high node degrees, for which I bet that mcl will not work. Is there any reason to assume that your actual (bigger) data is different in this respect?

(c) You say that a smaller dataset is not representative. There are many ways to look at this.

(c1) Surely this is a matter of degree: it may not be representative, but it will be representative to some extent. (c2) A smaller dataset need not be representative for your entire scientific question. The idea is that it will give you insights into (c2a) the computational bottlenecks (c2b) characteristics of the networks you are building. I'm sure that (c2b) will be sufficiently representative that it will be useful. (c3) This is a minor point, but if it is not sufficiently representative, try harder? It should be possible to have smaller datasets that are still within the same family of problem that you are trying to solve. (c4) To my mind, understanding large datasets requires having QC (node degrees and edge weights for a start), and an understanding of how the data behaves as it grows. With 12T of data, it is crucially important. This is the way.

As an example, from a smaller dataset you might learn that there are many nodes with low weight and high degree. By inspecting those nodes you might learn that they tend to be shorter sequences. This would then suggest to treat those sequences differently or filter them (this is one reason why cd-hit etc start with longer sequences as the seeds of clusters).

(d) It is too simple to look at #8 and say that the only difference is mcxload and that the problem is thus mcxload. Because of the observations in (a) and (b) the problem is more likely to be in your data somehow.

To make this conversation productive, I would appreciate it if you could comment on all three of (a), (b), and (c).

PS one thing to check is the output of mcl -z, on the off chance that you have an mcl that is compiled to use more expensive primite types. If it says cell size 8 it's fine. If it says cell size 16 mcl will use twice the memory compared to default setting. However, even if this were the case, all the above, especially (b), still apply.

PS2 by smart partitioning of the data it is quite likely possible to load it with mcxload, but I cannot spend time on this as long as (a) and (b) are not addressed. The network to be constructed needs to be free of nodes with very high degree.

micans / mcl

12T data to process #31