soedinglab / metaeuk

MetaEuk - sensitive, high-throughput gene discovery and annotation for large-scale eukaryotic metagenomics
GNU General Public License v3.0

[Question] Can MetaEuk utilize the MMseqs2 clustering results? #78

Open jolespin opened 1 year ago

jolespin commented 1 year ago

I have a pretty large database: https://zenodo.org/record/7485114#.ZF7RdOzML0o that I use in the backend of VEBA and I'm trying to decrease the resource needs.

I'm wondering if MetaEuk can handle the clustering results of MMseqs2 easy-cluster or easy-linclust? If not, let's say one used the clustered representatives as the database for finding exons. If you were to do this, what minimum coverage and percent identity would you use in MMseqs2 to capture (most of) the exons?

elileka commented 1 year ago

Hi,

Neat project!

On the reference side, MetaEuk can use protein profiles, so you could cluster the proteins (using linclust) and compute a profile for each cluster (using result2profile). You could, of course, also use the cluster representatives, as you suggest.
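A minimal sketch of that cluster-then-profile route, written as a Python helper that only assembles the command lines. The database names, the `profile_db_commands` function, and the exact argument order of `createsubdb` and `result2profile` are assumptions based on the usual MMseqs2 conventions; please check them against `mmseqs <module> -h` for your version before running anything.

```python
import shlex


def profile_db_commands(fasta, prefix, tmp="tmp", min_cov=0.8, cov_mode=1):
    """Build (but do not run) MMseqs2 commands: FASTA -> clusters -> profiles.

    All database names are hypothetical and derived from `prefix`.
    """
    seq_db = f"{prefix}_seqDB"
    clu_db = f"{prefix}_cluDB"
    rep_db = f"{prefix}_repDB"
    prof_db = f"{prefix}_profileDB"
    return [
        # Convert the protein FASTA into an MMseqs2 sequence database.
        ["mmseqs", "createdb", fasta, seq_db],
        # Cluster in linear time with the chosen coverage settings.
        ["mmseqs", "linclust", seq_db, clu_db, tmp,
         "--cov-mode", str(cov_mode), "-c", str(min_cov)],
        # Extract the cluster representatives as their own database.
        ["mmseqs", "createsubdb", clu_db, seq_db, rep_db],
        # Compute one profile per cluster from its member sequences.
        ["mmseqs", "result2profile", rep_db, seq_db, clu_db, prof_db],
    ]


if __name__ == "__main__":
    for cmd in profile_db_commands("proteins.faa", "ref"):
        print(shlex.join(cmd))
```

The resulting profile database (or the representative database) would then be given to MetaEuk as the reference, e.g. as the second argument of `metaeuk easy-predict`.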

How to choose the clustering parameters is a good question. I would start by setting --cov-mode to either 1 or 3 and -c to, say, 0.8. See here. It is probably worth testing this on a subsample of your DB.
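To make the two suggested modes concrete, here is a small sketch of what the coverage check amounts to, assuming the documented MMseqs2 meaning of `--cov-mode 1` (the alignment must cover at least `-c` of the target) and `--cov-mode 3` (the target length must be at least `-c` of the query length). The `passes_coverage` function and its argument names are hypothetical, and MMseqs2's internal computation may differ in details.

```python
def passes_coverage(aligned_target_residues, query_len, target_len,
                    min_cov=0.8, cov_mode=1):
    """Illustrative coverage filter for --cov-mode 1 and 3 only."""
    if cov_mode == 1:
        # Fraction of the target sequence covered by the alignment.
        return aligned_target_residues / target_len >= min_cov
    if cov_mode == 3:
        # Pure length criterion: target must not be much shorter than query.
        return target_len / query_len >= min_cov
    raise ValueError("this sketch only covers cov_mode 1 and 3")


if __name__ == "__main__":
    # A 200-residue protein aligned over 180 residues to a 400-residue one.
    print(passes_coverage(180, 400, 200, cov_mode=1))  # True  (180/200 = 0.9)
    print(passes_coverage(180, 400, 200, cov_mode=3))  # False (200/400 = 0.5)
```

In this example the short protein would join the long protein's cluster under mode 1 but not under mode 3, which is the kind of difference worth checking on a subsample.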

@milot-mirdita, any wise words about clustering and profiles using MMseqs2?

jolespin commented 1 year ago

Thanks! I'm loving the MMseqs2 and MetaEuk ecosystem. I'll look into result2profile in a bit.

What I was thinking may or may not be possible, but it would be cool if MetaEuk could take in a clustered database that has the cluster mappings and the full sequence set. Once it identifies a hit to a cluster representative, it could search for more exons among the proteins within that cluster. Essentially, it would run MetaEuk twice: once at a coarse level on the representatives, and then on a smaller subset of proteins at finer granularity.
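The bookkeeping for such a two-pass approach could be sketched outside MetaEuk roughly as follows. This is not an existing MetaEuk feature as far as this thread indicates; it assumes a two-column cluster TSV (representative, tab, member) such as the one written by easy-cluster / easy-linclust, and the function names and IDs are hypothetical.

```python
from collections import defaultdict


def load_clusters(tsv_lines):
    """Map each representative ID to the list of its member IDs."""
    clusters = defaultdict(list)
    for line in tsv_lines:
        line = line.strip()
        if not line:
            continue
        rep, member = line.split("\t")[:2]
        clusters[rep].append(member)
    return dict(clusters)


def second_pass_targets(rep_hits, clusters):
    """Expand first-pass representative hits to all cluster members.

    Returns member IDs in first-seen order without duplicates; a hit that
    is missing from the cluster table is kept as-is.
    """
    seen = set()
    targets = []
    for rep in rep_hits:
        for member in clusters.get(rep, [rep]):
            if member not in seen:
                seen.add(member)
                targets.append(member)
    return targets


if __name__ == "__main__":
    tsv = ["repA\trepA", "repA\tprot2", "repB\trepB", "repB\tprot5"]
    clusters = load_clusters(tsv)
    # Suppose the coarse MetaEuk run only hit repA.
    print(second_pass_targets(["repA"], clusters))  # ['repA', 'prot2']
```

The returned IDs could then be used to subset the full protein set into a smaller reference for the second, finer-grained MetaEuk run.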