steineggerlab / foldseek

Foldseek enables fast and sensitive comparisons of large structure sets.
https://foldseek.com
GNU General Public License v3.0

The best probability and LDDT score to filter in easy-search #243

Open Jigyasa3 opened 7 months ago

Jigyasa3 commented 7 months ago

Hi @martin-steinegger ,

Thank you again for a great resource! I am using the foldseek easy-search command to annotate some proteins of interest, and for each protein I select the annotation with the highest prob and LDDT score. Is there a filter I can use to say with confidence what the putative annotation is for a protein of interest? For example, several hits have a prob of >0.7 but an LDDT score of <0.3, while most of the proteins have a prob of >0.7 and an LDDT score of >0.5. What is the "best" cutoff for annotating proteins using Foldseek?

At the same time, where can I find the target protein description? If my target protein is MGYP001275795760, where can I find its full name?

Any suggestions?

milot-mirdita commented 7 months ago

The safest cut-off is neither prob nor LDDT/TM-score (in our opinion), since neither has a multiple testing correction built in. When searching against potentially hundreds of millions of entries, the E-value will likely be the most reliable, if not the only reliable, indicator of homology for annotation. In your range, it's probably not possible to say for certain that any of the hits are reliable annotations. All of them probably have high E-values? With high E-values and uncertain LDDT/TM-score/prob, we can only establish that there is some structural similarity to be found; stronger statements require additional evidence.

The MGYP proteins come from MGnify. You can find the source assembly from the metadata on the MGnify download server: http://ftp.ebi.ac.uk/pub/databases/metagenomics/peptide_database/current_release/

Specifically the [mgy_assemblies.tsv.gz](http://ftp.ebi.ac.uk/pub/databases/metagenomics/peptide_database/current_release/mgy_assemblies.tsv.gz) file. I don't think that the EBI offers a service yet to map MGYP accessions to their source.
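As a rough illustration of looking up an accession in that metadata file once it has been downloaded, the sketch below streams a gzipped TSV and returns every row in which the accession appears as a field. The function name `find_accession_rows` is hypothetical, and the column layout of `mgy_assemblies.tsv.gz` is not assumed here (the code matches against any field), so check the actual file before relying on it.

```python
import gzip


def find_accession_rows(path, accession):
    """Return every row of a gzipped TSV that contains `accession` as a field.

    Hypothetical helper: it makes no assumption about which column holds
    the accession, and reads line by line so large files are not loaded
    into memory at once.
    """
    matches = []
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            fields = line.rstrip("\n").split("\t")
            if accession in fields:
                matches.append(fields)
    return matches
```

For example, `find_accession_rows("mgy_assemblies.tsv.gz", "MGYP001275795760")` would return the matching rows, if any, as lists of strings.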

Jigyasa3 commented 7 months ago

Hi @milot-mirdita ,

Thank you for replying! I wanted to confirm one more thing: while the E-values of the results are high, the alignment length of the matches varies a lot. Some proteins have an alignment length of less than 50 amino acids (but high probability, LDDT score, and E-value). Can these proteins be considered remote homologs? Or would you suggest a more stringent filtering criterion for defining remote homologs?

Regards, Jigyasa

milot-mirdita commented 7 months ago

Just to clarify and make sure that there are no miscommunications or typos: a high E-value is bad. E-values should be as low and as close to 0 as possible. Hits with E-values of < 10^-3 are normally very certain homologs. For higher values you'd need other evidence to establish homology.
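A minimal sketch of applying that threshold to a results file is shown below. It assumes the tab-separated output uses Foldseek's default BLAST-style column order (`query, target, fident, alnlen, mismatch, gapopen, qstart, qend, tstart, tend, evalue, bits`); if a custom `--format-output` was used, pass the matching column list instead. The function name `best_confident_hits` is hypothetical, and the 1e-3 default simply mirrors the value mentioned above.

```python
import csv
import io

# Assumed column order: Foldseek's default BLAST-tab-style output.
# Pass a different list if the search used a custom --format-output.
DEFAULT_COLUMNS = [
    "query", "target", "fident", "alnlen", "mismatch", "gapopen",
    "qstart", "qend", "tstart", "tend", "evalue", "bits",
]


def best_confident_hits(handle, columns=DEFAULT_COLUMNS, max_evalue=1e-3):
    """Return {query: row} holding the lowest-E-value hit per query.

    Rows with an E-value at or above `max_evalue` are discarded, so a
    query with no confident hit is absent from the result.
    """
    best = {}
    for fields in csv.reader(handle, delimiter="\t"):
        if not fields:
            continue  # skip blank lines
        row = dict(zip(columns, fields))
        evalue = float(row["evalue"])
        if evalue >= max_evalue:
            continue  # lower is better; drop weak hits
        current = best.get(row["query"])
        if current is None or evalue < float(current["evalue"]):
            best[row["query"]] = row
    return best


# Usage with made-up rows: only q1 has a hit below the threshold.
example = (
    "q1\tt1\t0.5\t120\t0\t0\t1\t120\t1\t120\t1e-10\t300\n"
    "q1\tt2\t0.4\t50\t0\t0\t1\t50\t1\t50\t0.5\t40\n"
    "q2\tt3\t0.3\t45\t0\t0\t1\t45\t1\t45\t2.0\t20\n"
)
hits = best_confident_hits(io.StringIO(example))
print(sorted(hits))
```

Queries that drop out of the result are not necessarily unrelated to their targets; as noted above, they only lack sufficient evidence from the E-value alone.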

Jigyasa3 commented 7 months ago

Hi @milot-mirdita , I am comparing the output from Foldseek with hh-suite to find remote homologs, and I observe that none of the hits have E-values less than 1e-3. Link to the open issue. Is there a way to examine false negatives?