soedinglab / MMseqs2

MMseqs2: ultra fast and sensitive search and clustering suite
https://mmseqs.com
MIT License

Inaccurate "nident" values #349

Closed: h836472 closed this issue 4 years ago

h836472 commented 4 years ago

Expected Behavior

Running mmseqs search followed by mmseqs convertalis --format-output "query,target,pident,nident" should export the number of identical matches between the query and target sequences.

Current Behavior

MMseqs2 always reports a "nident" (number of identical residues) value of 0.

Steps to Reproduce (for bugs)

Please run the bash script below to reproduce the error:

#!/bin/bash

# download protein sequences from Pyrococcus furiosus
wget -c https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/008/245/085/GCF_008245085.1_ASM824508v1/GCF_008245085.1_ASM824508v1_protein.faa.gz

# uncompress protein sequences
gunzip GCF_008245085.1_ASM824508v1_protein.faa.gz

# create MMseqs2 database
mmseqs createdb GCF_008245085.1_ASM824508v1_protein.faa GCF_008245085.1 >createdb.log

# perform all-vs-all search on the proteins of the genome
mmseqs search GCF_008245085.1 GCF_008245085.1 GCF_008245085.1.selfDB /tmp >search.log

# export results to a custom text file (Q H pident nident)
mmseqs convertalis GCF_008245085.1 GCF_008245085.1 GCF_008245085.1.selfDB GCF_008245085.1.self.txt --format-output "query,target,pident,nident" >convertalis.log

# check output file
head GCF_008245085.1.self.txt

MMseqs Output (for bugs)

MMseqs2 log files are available upon request.

milot-mirdita commented 4 years ago

MMseqs2 approximates the sequence identity by default (https://github.com/soedinglab/MMseqs2/wiki#how-does-mmseqs2-compute-the-sequence-identity). You'll have to pass -a or --alignment-mode 3 to search so that it computes the full alignments instead of only the alignment scores, which are faster to compute.
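As a rough sketch of what that change could look like for the reproduction script above, the search step gains the -a flag and convertalis is run on the new result. The function name search_with_backtrace and the argument layout are assumptions made for illustration; only the mmseqs subcommands, the -a flag, and the --format-output string come from this thread.

```shell
#!/bin/bash
# Hypothetical helper (not part of MMseqs2): rerun an all-vs-all search with -a
# so that full alignments are computed, then export the custom columns.
# Arguments: database, result database, output text file, optional tmp folder.
search_with_backtrace() {
    local db="$1" result="$2" out="$3" tmp="${4:-/tmp}"

    # -a is the flag suggested in the comment above; stop if the search fails
    mmseqs search "$db" "$db" "$result" "$tmp" -a || return 1

    # same export as in the original script
    mmseqs convertalis "$db" "$db" "$result" "$out" \
        --format-output "query,target,pident,nident"
}
```

It could be called as `search_with_backtrace GCF_008245085.1 GCF_008245085.1.selfDB_a GCF_008245085.1.self.txt`, where the result name GCF_008245085.1.selfDB_a is a placeholder chosen so that it does not collide with the earlier result database.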

h836472 commented 4 years ago

Thank you for the prompt answer!

Indeed, adding the -a and --alignment-mode 3 switches resolves the issue.
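One way to check the exported file, sketched here under the assumption that convertalis writes tab-separated columns in the order given to --format-output (so nident is the fourth field), is to count the rows where nident is still 0. The function name count_zero_nident is made up for this example.

```shell
#!/bin/bash
# Hypothetical helper: print how many rows of a convertalis output file
# have a nident value of 0 in the fourth tab-separated column.
count_zero_nident() {
    # n + 0 forces a numeric 0 to be printed when no row matches
    awk -F'\t' 'NF >= 4 && $4 == 0 { n++ } END { print n + 0 }' "$1"
}
```

For the file from the script above, this would be run as `count_zero_nident GCF_008245085.1.self.txt`. Before the fix, the count would be expected to equal the number of rows, since every nident was reported as 0.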

Thank you. Balazs