soedinglab / MMseqs2

MMseqs2: ultra fast and sensitive search and clustering suite
https://mmseqs.com
GNU General Public License v3.0

updating a near complete mmseq clustered db #348

Open intikhab opened 3 years ago

intikhab commented 3 years ago

Hi mmseq team,

I need some help updating a previous, nearly complete MMseqs2-clustered database (A) with additional sequences (B).

My MMseqs2 run for A, using easy-cluster, timed out while moving the result files and deleting the temporary files. A log and a list of the resulting files are attached: mmseq.earthbiome.fna.2tb.log.gz, resultfiles.txt

I now have an additional set of sequences (B) that I want to cluster against set A, updating the final database.

The MMseqs2 GitHub documentation describes updating a database as follows:

mmseqs createdb DB_trimmed.fasta DB_trimmed
mmseqs cluster DB_trimmed DB_trimmed_clu tmp

To update the clustering DB_trimmed_clu with the new version of your database DB_new:

mmseqs createdb DB.fasta DB_new
mmseqs clusterupdate DB_trimmed DB_new DB_trimmed_clu DB_new_updated DB_update_clu tmp
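The documented update workflow quoted above can be sketched as a small shell function. This is only an illustration: the `run` helper and its `DRY_RUN` switch are hypothetical additions (not part of MMseqs2) that print each command instead of executing it, and the database names are the placeholders from the documentation.

```shell
#!/bin/sh
# Hypothetical helper (not part of MMseqs2): prints the command when
# DRY_RUN is 1 (the default here) and executes it otherwise.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Cluster the old database, then update that clustering with a new release.
# $1 = old FASTA, $2 = new FASTA, $3 = temporary directory
update_clustering() {
    old_fasta=$1
    new_fasta=$2
    tmp_dir=$3

    # Initial clustering of the old database
    run mmseqs createdb "$old_fasta" DB_trimmed || return 1
    run mmseqs cluster DB_trimmed DB_trimmed_clu "$tmp_dir" || return 1

    # Update the old clustering with the new database version
    run mmseqs createdb "$new_fasta" DB_new || return 1
    run mmseqs clusterupdate DB_trimmed DB_new DB_trimmed_clu \
        DB_new_updated DB_update_clu "$tmp_dir"
}
```

Calling `update_clustering DB_trimmed.fasta DB.fasta tmp` prints the four commands; prefixing the call with `DRY_RUN=0` would execute them, which requires `mmseqs` on the `PATH`.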

My worry is that re-clustering the bigger dataset A would take many days: the last job was terminated after 7 days on a machine with 3 TB of memory, without completing the final step of moving the results and deleting the temporary files.

In summary, I need help with (1) saving or moving the important result files and safely removing the temporary files, and (2) the correct way of clustering my dataset B against database A and updating it.
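For the first question, one cautious pattern is to copy the result files out of the temporary directory, verify the copies, and only then delete the temporary directory. The sketch below is generic shell, not an MMseqs2 feature: `salvage_results` is a hypothetical helper, and the file names in the usage line are assumptions about what an interrupted run leaves behind, which may differ between versions. A non-empty file is also no proof that an interrupted run finished writing it.

```shell
#!/bin/sh
# Copy the named result files out of a temporary directory, verify each copy,
# and only then delete the temporary directory.
# $1 = temporary directory, $2 = destination directory, rest = file names
# NOTE: which files matter depends on the MMseqs2 version and workflow; the
# names used in the usage example are assumptions, not taken from the log.
salvage_results() {
    tmp_dir=$1
    dest_dir=$2
    shift 2

    mkdir -p "$dest_dir" || return 1
    for name in "$@"; do
        src="$tmp_dir/$name"
        # Refuse to continue if an expected file is missing or empty
        if [ ! -s "$src" ]; then
            echo "missing or empty: $src" >&2
            return 1
        fi
        cp "$src" "$dest_dir/$name" || return 1
        # Byte-for-byte comparison of the source and its copy
        cmp -s "$src" "$dest_dir/$name" || return 1
    done

    # Every file was copied and verified, so the temporary directory can go
    rm -rf "$tmp_dir"
}
```

A call might look like `salvage_results tmp results cluster.tsv rep_seq.fasta all_seqs.fasta`. For files of this size, moving them with `mv` within the same file system would avoid the cost of copying and comparing; the copy-then-verify variant is shown because it never deletes anything before the check has passed.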

Many Thanks,

IA

milot-mirdita commented 3 years ago

You are running a very old version of MMseqs2. Please update to the latest version. Especially cluster updating had multiple severe issues before the latest release.

It was also spending 35h in the very simple module result2repseq. I think I fixed the performance issue in the latest commit cc7d7da30ec779d6a2e886438f8295f59e2192f1. You'll find statically compiled binaries here in about one hour: https://mmseqs.com/latest

Cluster updating also doesn't interact very nicely with the easy- workflows yet. I'd recommend sticking to the basic commands as shown in the user guide.
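As a rough sketch of what the basic commands look like in place of `easy-cluster`, the sequence below follows the clustering steps described in the user guide and keeps the sequence and cluster databases that `clusterupdate` takes as input. The `run` helper, its `DRY_RUN` switch, and the function name `cluster_basic` are hypothetical and only there so the commands can be printed without MMseqs2 installed.

```shell
#!/bin/sh
# Hypothetical helper (not part of MMseqs2): prints the command when
# DRY_RUN is 1 (the default here) and executes it otherwise.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# $1 = input FASTA, $2 = database name, $3 = temporary directory
cluster_basic() {
    fasta=$1
    db=$2
    tmp_dir=$3

    # Build the sequence database and cluster it
    run mmseqs createdb "$fasta" "$db" || return 1
    run mmseqs cluster "$db" "${db}_clu" "$tmp_dir" || return 1

    # Export the cluster memberships as a TSV file
    run mmseqs createtsv "$db" "$db" "${db}_clu" "${db}_clu.tsv" || return 1

    # Extract the representative sequences as FASTA
    run mmseqs createsubdb "${db}_clu" "$db" "${db}_clu_rep" || return 1
    run mmseqs convert2fasta "${db}_clu_rep" "${db}_clu_rep.fasta"
}
```

`cluster_basic DB.fasta DB tmp` prints the five commands. Unlike the easy- workflow, the intermediate databases `DB` and `DB_clu` remain on disk afterwards.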

intikhab commented 3 years ago

Dear Milot,

Thank you for recommending the updated version and for fixing the performance issue.

There were 3 billion sequences, which clustered into about 1 billion using the easy-cluster approach. I do not want to re-cluster the data I processed previously, since I have already annotated these ~1 billion sequences and used them in different projects.

It seems it may not be a good idea to use the db files from the easy-cluster output. Do you think it would be useful to create a new db (EBdb) from the rep_seq.fasta output of the previous easy-cluster run, use EBdb as a template against which to compare the newer sequences I want to cluster, and finally update EBdb to EBdb_new?

Intikhab

--

Intikhab Alam, PhD

Research Scientist, Computational Bioscience Research Centre (CBRC), Building #3, Office #4328, 4700 King Abdullah University of Science and Technology (KAUST), Thuwal 23955-6900, KSA. W: http://www.kaust.edu.sa T: +966 (0) 2 808-2423 F: +966 (2) 802 0127


milot-mirdita commented 3 years ago

I think it's probably salvageable. I'll need to look over your output in detail.

Another question: this was clustered using easy-linclust, not easy-cluster, right? Cluster update currently tries to use the normal clustering for sequences that cannot be assigned to an existing cluster. That will likely be very slow (probably slower than clustering anew with linclust). We will have to build support for updating with linclust.

intikhab commented 3 years ago

Dear Milot,

I provided all result files from the previous easy-linclust run on 3 billion sequences.

As I understand it, if you use a new database based on rep_seq, the update will not place the newer sequences into any existing clusters. Is this right?

The only option seems to be to add the newer redundant sequences to the previous version of the redundant sequences, create a database, and start a fresh linclust session. You mentioned that new versions of MMseqs2 are now much faster, so I will give it a go. But do you think there is a way to extract the cluster db from my previous run, where I used easy-linclust?
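The fresh linclust session described here could look roughly like the sketch below. It relies on `createdb` accepting several FASTA files at once, so the old and new sequences do not need to be concatenated first. The `run` helper, its `DRY_RUN` switch, the function name, and all file and database names are hypothetical, and no clustering thresholds are set, so linclust would use its defaults.

```shell
#!/bin/sh
# Hypothetical helper (not part of MMseqs2): prints the command when
# DRY_RUN is 1 (the default here) and executes it otherwise.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# $1 = name of the combined database, $2 = temporary directory,
# remaining arguments = FASTA files to combine (old and new sequences)
recluster_linclust() {
    db=$1
    tmp_dir=$2
    shift 2

    # One sequence database built from all input FASTA files
    run mmseqs createdb "$@" "$db" || return 1

    # Linear-time clustering of the combined database
    run mmseqs linclust "$db" "${db}_clu" "$tmp_dir"
}
```

`recluster_linclust combinedDB tmp old_sequences.fasta new_sequences.fasta` prints the two commands. Running `linclust` directly, rather than through `easy-linclust`, leaves the cluster database `combinedDB_clu` on disk for later steps.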

Please provide some useful advice as these are computationally heavy tasks.

Best,

IA


intikhab commented 3 years ago

Dear Milot,

I restarted the clustering to make a new DB, EarthMicrobiomeDB20200910, that includes previous representative DNA gene sequences (~800 million) and recently obtained redundant sequences (around 12 million), using the following command:

mmseqs cluster EarthICEmetagenomesDB EarthICEmetagenomesDB_clu EarthICEmetagenomesDB.tmp --min-seq-id 0.95 --cov-mode 2 -c 0.8 --max-seq-len 132768 --threads 32 >EarthICEmetagenomesDB_clu.log 2>&1 &

It has been running since September 10. The log is attached. How long do you think it will take to finish?
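For readers following along, the command above can be written out with one option per line. The option values are the ones from the post; the comments give the usual meaning of each option, and the `run` helper, its `DRY_RUN` switch, and the function name are hypothetical. The sketch says nothing about how long such a run takes.

```shell
#!/bin/sh
# Hypothetical helper (not part of MMseqs2): prints the command when
# DRY_RUN is 1 (the default here) and executes it otherwise.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# $1 = name of an existing sequence database
start_clustering() {
    db=$1

    # --min-seq-id 0.95     minimum sequence identity for joining a cluster
    # --cov-mode 2          measure coverage on the query sequence
    # -c 0.8                minimum covered fraction under that coverage mode
    # --max-seq-len 132768  maximum sequence length
    # --threads 32          number of CPU threads
    set -- \
        --min-seq-id 0.95 \
        --cov-mode 2 \
        -c 0.8 \
        --max-seq-len 132768 \
        --threads 32

    run mmseqs cluster "$db" "${db}_clu" "${db}.tmp" "$@"
}
```

To start the job in the background with a log, as in the post, the call would be `DRY_RUN=0 start_clustering EarthICEmetagenomesDB > EarthICEmetagenomesDB_clu.log 2>&1 &`, and `tail -f EarthICEmetagenomesDB_clu.log` shows which step is currently running.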

Many Thanks,

IA


milot-mirdita commented 3 years ago

Ah, sorry, I had forgotten about your previous message; too many different projects are going on!

Could you check the log-file upload again? I don't see a log.

intikhab commented 3 years ago

Dear Milot,

I have attached the log file to this email. Please check whether you can access it.

Many Thanks,

IA


milot-mirdita commented 3 years ago

I think you have to send it either directly to my email or upload it via GitHub. I think GitHub strips attachments from emails sent to the issue tracker.