soedinglab / MMseqs2

MMseqs2: ultra fast and sensitive search and clustering suite
https://mmseqs.com
GNU General Public License v3.0

Change in pseudocount behavior #547

Open apcamargo opened 2 years ago

apcamargo commented 2 years ago

I've been evaluating how adding pseudocounts changes the sensitivity of profile searches.

mmseqs msa2profile msa_db/msa_db profile_db/profile_db --match-mode 1 --match-ratio 0.5 --threads 64
mmseqs msa2profile msa_db/msa_db profile_db_pseudo/profile_db --match-mode 1 --match-ratio 0.5 --threads 64 --pca 0.3

I noticed, however, that the search results differ depending on the version of MMseqs2. If I use the latest GitHub/Conda release (13-45111), the search against profile_db_pseudo returns more results (as expected, given that the alignments are not very diverse). If I use a newer build (commit 92deb92fb46583b4c68932111303d12dfa121364), the search against the database with pseudocounts returns fewer hits.

mmseqs easy-search --threads 64 fragment_sequences.faa profile_db_pseudo/profile_db mmseqs2_results_pseudo tmp
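A minimal sketch of how the hit counts from the two searches could be compared. It assumes the default tab-separated easy-search output (one line per hit) and that the search without pseudocounts was written to a file named mmseqs2_results; that file name and the count_hits helper are hypothetical and do not appear in the commands above.

```shell
#!/bin/sh
# Hypothetical helper: count hits in a tab-separated result file,
# assuming one hit per line.
count_hits() {
    awk 'END { print NR }' "$1"
}

# Print the hit count for each result file that exists.
# "mmseqs2_results" is an assumed name for the search without pseudocounts.
for results in mmseqs2_results mmseqs2_results_pseudo; do
    if [ -f "$results" ]; then
        printf '%s\t%s\n' "$results" "$(count_hits "$results")"
    fi
done
```

Running this once per MMseqs2 version gives a quick side-by-side view of how the pseudocount database affects the number of hits in each build.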

Were there any changes in MMseqs2's behavior regarding pseudocounts? Also, are there any recommendations on how to use the --pca parameter?

milot-mirdita commented 2 years ago

Yes, this code has changed a lot in preparation for profile-profile search. Martin has fitted new values, and I'd recommend using the new defaults. There are now also two different pseudocount modes; the new one is similar to the HHblits pseudocounts and much slower.

apcamargo commented 2 years ago

Thanks, Milot!

The new mode is --pseudo-cnt-mode 1 (context-specific)? And what are the new --pca and --pcb default values? They are not showing up in the help dialogue.

 --pca                        Pseudo count admixture strength []
 --pcb                        Pseudo counts: Neff at half of maximum admixture (range 0.0-inf) []

My limitation is that this is part of a package that will be distributed through Conda, so I need to stay compatible with the MMseqs2 version available there. Profile databases created with the latest version will fail if I try to search them with 13-45111. But I could try to use the new default --pca and --pcb values when creating the profile database with 13-45111.

Do you guys have plans to push a new GitHub/Conda release in the near future?

milot-mirdita commented 2 years ago

Ah, that looks like a bug; it should print the default value.

The new values are:

    pca = MultiParam<PseudoCounts>(PseudoCounts(1.1, 1.4));
    pcb = MultiParam<PseudoCounts>(PseudoCounts(4.1, 5.8));

In each pair, the first value applies to --pseudo-cnt-mode 0 and the second to --pseudo-cnt-mode 1.
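As a rough sketch, the pairing above could be written as a small lookup that returns the --pca/--pcb values quoted in this thread for a given mode. The pseudocount_args helper is hypothetical, and the script only prints an msa2profile command rather than running it; whether a given MMseqs2 build accepts these flags in this form is an assumption to verify against its help output.

```shell
#!/bin/sh
# Hypothetical helper: map a pseudocount mode to the default values
# quoted above (first value = mode 0, second value = mode 1).
pseudocount_args() {
    case "$1" in
        0) echo "--pca 1.1 --pcb 4.1" ;;
        1) echo "--pca 1.4 --pcb 5.8" ;;
        *) echo "unknown pseudo-count mode: $1" >&2; return 1 ;;
    esac
}

# Print (do not run) the msa2profile command for the chosen mode.
# The command substitution is left unquoted on purpose so the flags
# are split into separate words.
mode=0
echo mmseqs msa2profile msa_db/msa_db profile_db_pseudo/profile_db \
    --match-mode 1 --match-ratio 0.5 --threads 64 $(pseudocount_args "$mode")
```

Keeping the values in one place makes it easier to switch modes or to update them if the fitted defaults change in a later release.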

Profile databases created with the newer commits won't work with release 13 or earlier.

Yes, we are planning to make a new release, but there is a lot going on :/ Hopefully soon.

apcamargo commented 2 years ago

Thanks!

So, if I create a profile database in 13-45111 with a command like this:

mmseqs msa2profile msa_db/msa_db profile_db_pseudo/profile_db --match-mode 1 --match-ratio 0.5 --threads 64 --pca 1.1 --pcb 4.1

It should give me a database with the same pseudocounts as the default parameters of the newer releases? I know that there were other changes in the way profile databases work, but I wanted to improve sensitivity and stay compatible with the Conda release.