steineggerlab / foldseek

Foldseek enables fast and sensitive comparisons of large structure sets.
https://foldseek.com
GNU General Public License v3.0

.m8 file - ? #59

Closed twaksman001 closed 1 year ago

twaksman001 commented 2 years ago

I cannot find much information on how to view or process .m8 files. Please advise?

gwirn commented 1 year ago

If you only want to view the content, you can load it with pandas:

import pandas as pd

# .m8 files are tab-separated and have no header row
data = pd.read_csv("PATH/TO/.m8", delimiter="\t", header=None)
# view data
print(data)

Then you can process it as you normally would with pandas.

martin-steinegger commented 1 year ago

@gwirn thank you for the code.

@twaksman001 in the README we explain the fields a bit. The alignment output is a tab-separated file of the alignments (.m8). The fields are query,target,fident,alnlen,mismatch,gapopen,qstart,qend,tstart,tend,evalue,bits.
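To make the field list above concrete, here is a minimal sketch of loading a default .m8 file with named columns. The helper name `read_m8` is hypothetical, and the two rows are made up for illustration (only the query name d1asha_ comes from the README example). In practice you would pass the path to your own aln.m8.

```python
import io

import pandas as pd

# Default columns of the .m8 output, as listed in the comment above.
DEFAULT_COLUMNS = [
    "query", "target", "fident", "alnlen", "mismatch", "gapopen",
    "qstart", "qend", "tstart", "tend", "evalue", "bits",
]


def read_m8(path_or_buffer, columns=DEFAULT_COLUMNS):
    """Read a headerless, tab-separated .m8 file into a DataFrame with named columns."""
    return pd.read_csv(path_or_buffer, sep="\t", header=None, names=columns)


# Two made-up rows standing in for a real aln.m8 file (values are NOT real results).
example = io.StringIO(
    "d1asha_\td1asha_\t1.000\t147\t0\t0\t1\t147\t1\t147\t1.2e-25\t1030\n"
    "d1asha_\td2w72b_\t0.250\t140\t105\t2\t3\t142\t4\t141\t3.4e-10\t420\n"
)
hits = read_m8(example)

# Keep hits below an E-value cutoff and sort by bit score, best first.
significant = hits[hits["evalue"] < 1e-3].sort_values("bits", ascending=False)
print(significant[["query", "target", "fident", "evalue", "bits"]])
```

Naming the columns up front means later filtering and sorting can refer to fields such as evalue and bits by name rather than by position.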

foldseek easy-search example/d1asha_ example/ aln.m8 tmpFolder

Output: Customize fields of tab-separated output

The output can be customized with the --format-output option e.g. --format-output "query,target,qaln,taln" returns the query and target accession and the pairwise alignments in tab separated format. You can choose many different output columns.
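As a rough illustration of the --format-output option, the sketch below assembles the easy-search command from a Python list of fields. The function name `easy_search_command` is hypothetical. The paths are the ones from the README example, and the snippet only prints the command rather than running it.

```python
import shlex


def easy_search_command(query, target_db, out_file, tmp_dir, fields):
    """Build the argument list for `foldseek easy-search` with custom output columns."""
    return [
        "foldseek", "easy-search", query, target_db, out_file, tmp_dir,
        "--format-output", ",".join(fields),
    ]


fields = ["query", "target", "qaln", "taln"]
cmd = easy_search_command("example/d1asha_", "example/", "aln.m8", "tmpFolder", fields)

# Print a shell-quoted version of the command for inspection.
print(shlex.join(cmd))

# To actually run it (assumes foldseek is installed and on PATH):
#   import subprocess
#   subprocess.run(cmd, check=True)
```

Keeping the field names in one list is convenient because the same list can later be passed as names= to pandas.read_csv, so the column labels always match what was requested.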

query       Query sequence identifier 
target      Target sequence identifier
evalue      E-value
gapopen     Number of gap open events (note: this is NOT the number of gap characters)
pident      Percentage of identical matches
fident      Fraction of identical matches
nident      Number of identical matches
qstart      1-indexed alignment start position in query sequence
qend        1-indexed alignment end position in query sequence
qlen        Query sequence length
tstart      1-indexed alignment start position in target sequence
tend        1-indexed alignment end position in target sequence
tlen        Target sequence length
alnlen      Alignment length (number of aligned columns)
raw         Raw alignment score
bits        Bit score
cigar       Alignment as string. Each position contains either M (match), D (deletion, gap in query), or I (Insertion, gap in target)
qseq        Query sequence 
tseq        Target sequence
qaln        Aligned query sequence with gaps
taln        Aligned target sequence with gaps
qheader     Header of Query sequence
theader     Header of Target sequence
mismatch    Number of mismatches
qcov        Fraction of query sequence covered by alignment
tcov        Fraction of target sequence covered by alignment
empty       Dash column '-'
taxid       Taxonomical identifier (needs mmseqs tax db)
taxname     Taxon Name (needs mmseqs tax db)
taxlineage  Taxonomical lineage (needs mmseqs tax db)
qset        Query filename of FASTA/Q (useful if multiple files were passed to createdb)
qsetid      Numeric identifier for query filename
tset        Target filename of FASTA/Q (useful if multiple files were passed to createdb)
tsetid      Numeric identifier for target filename
qca         Calpha coordinates of the query
tca         Calpha coordinates of the target
alntmscore  TM-score of the alignment
u           Rotation matrix (computed by TM-score)
t           Translation vector (computed by TM-score)
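The note on gapopen is easier to see with a small example. The sketch below recomputes a few of the fields from a pair of aligned strings such as qaln and taln. It assumes gaps are written as '-' and divides by the alignment length to get a fraction of identical matches. How Foldseek itself computes fident is not stated here, so treat that denominator as an assumption. The function name `alignment_stats` and the two sequences are made up.

```python
import re


def alignment_stats(qaln, taln):
    """Derive alignment length, identities, gap characters and gap opens from aligned strings."""
    if len(qaln) != len(taln):
        raise ValueError("aligned strings must have equal length")
    alnlen = len(qaln)
    # Count columns where both sides carry the same residue (gaps never count).
    nident = sum(q == t and q != "-" for q, t in zip(qaln, taln))
    # Every '-' is one gap character.
    gap_chars = qaln.count("-") + taln.count("-")
    # Every maximal run of '-' is one gap open event.
    gap_opens = len(re.findall(r"-+", qaln)) + len(re.findall(r"-+", taln))
    return {
        "alnlen": alnlen,
        "nident": nident,
        "fident": nident / alnlen if alnlen else 0.0,  # assumed denominator: alnlen
        "gap_chars": gap_chars,
        "gapopen": gap_opens,
    }


# Made-up aligned pair: one 2-character gap in the query, one 1-character gap in the target.
stats = alignment_stats("MKV--LAGT", "MRVQELA-T")
print(stats)
```

For this pair there are three gap characters but only two gap open events, which is the distinction the gapopen description points out.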

ronboger commented 1 year ago

@martin-steinegger I'm a bit confused about the fields. In particular, when you download results from your server and load them into a pandas dataframe, there appear to be 21 fields. On the other hand, query,target,fident,alnlen,mismatch,gapopen,qstart,qend,tstart,tend,evalue,bits comprises only 12. Could you please advise?

ronboger commented 1 year ago

Nvm, found here: https://github.com/steineggerlab/foldseek/issues/25

For anyone loading these into pandas: col_names = ["query","target","fident","alnlen","mismatch","gapopen","qstart","qend","tstart","tend","evalue","bits", "qlen","tlen","qaln","taln","tca","tseq","taxid","taxname"]

ronboger commented 1 year ago

It seems like there's something still missing actually - this only gives 20 columns instead of 21...

milot-mirdita commented 1 year ago

We recently added the Foldseek match probability (prob).

Also, we return pident, not fident, and, since recently, theader instead of target for some of the databases, to get the full header.

ronboger commented 1 year ago

@milot-mirdita it might be a good idea to add prob into the documentation - using the CLI doesn't give it as an option for --format-output, nor is it included in the readme

johnnytam100 commented 1 year ago

Hi @milot-mirdita, would you mind confirming the current correct list of col_names = ["query","target","pident","alnlen","mismatch","gapopen","qstart","qend","tstart","tend","prob","evalue","bits", "qlen","tlen","qaln","taln"]?

mvankem commented 1 year ago

The output columns of the command line app are by default (foldseek easy-search --help): query,target,fident,alnlen,mismatch,gapopen,qstart,qend,tstart,tend,evalue,bits

The webserver outputs: query,(target/theader),pident,alnlen,mismatch,gapopen,qstart,qend,tstart,tend,prob,evalue,bits,qlen,tlen,qaln,taln,tca,tseq,taxid,taxname
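Putting the two layouts above together, here is a minimal sketch that picks the column names from the number of columns found in the file. The function name `read_foldseek_m8` is hypothetical, and the test rows are made up. The webserver list is taken from this thread and may change again, so the function raises an error on any other column count instead of guessing.

```python
import io

import pandas as pd

# Command line default (12 columns), per `foldseek easy-search --help` as quoted above.
CLI_COLUMNS = [
    "query", "target", "fident", "alnlen", "mismatch", "gapopen",
    "qstart", "qend", "tstart", "tend", "evalue", "bits",
]

# Webserver layout (21 columns) as listed in this thread. For some databases the
# second column holds theader rather than target; it is named "target" here.
WEB_COLUMNS = [
    "query", "target", "pident", "alnlen", "mismatch", "gapopen",
    "qstart", "qend", "tstart", "tend", "prob", "evalue", "bits",
    "qlen", "tlen", "qaln", "taln", "tca", "tseq", "taxid", "taxname",
]


def read_foldseek_m8(path_or_buffer):
    """Read an .m8 file and name its columns based on how many columns it has."""
    df = pd.read_csv(path_or_buffer, sep="\t", header=None)
    layouts = {len(CLI_COLUMNS): CLI_COLUMNS, len(WEB_COLUMNS): WEB_COLUMNS}
    names = layouts.get(df.shape[1])
    if names is None:
        raise ValueError(f"unexpected number of columns: {df.shape[1]}")
    df.columns = names
    # The webserver reports a percentage; add a fraction so both layouts share "fident".
    if "pident" in df.columns:
        df["fident"] = df["pident"] / 100.0
    return df


# Made-up single-row examples for each layout (values are NOT real results).
cli_row = "\t".join(["q1", "t1", "0.85", "100", "15", "0", "1", "100", "1", "100", "1e-20", "500"])
web_row = "\t".join([
    "q1", "t1", "85.0", "100", "15", "0", "1", "100", "1", "100", "0.99", "1e-20", "500",
    "100", "100", "AAA", "AAA", "1.0,2.0,3.0", "AAA", "9606", "Homo sapiens",
])
cli_df = read_foldseek_m8(io.StringIO(cli_row + "\n"))
web_df = read_foldseek_m8(io.StringIO(web_row + "\n"))
print(cli_df.columns.tolist())
print(web_df.columns.tolist())
```

Failing loudly on an unknown column count is deliberate: as this thread shows, a new field such as prob can shift every label after it if names are assigned blindly.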