Closed twaksman001 closed 1 year ago
If you only want to view the content, you can use pandas:
import pandas as pd
data = pd.read_csv("PATH/TO/.m8", delimiter="\t", header=None)
# view data
print(data)
Then you can process it as you normally would with pandas.
@gwirn thank you for the code.
@twaksman001 in the README we explain the fields a bit. The alignment output is a tab-separated file of the alignments (.m8). The fields are query,target,fident,alnlen,mismatch,gapopen,qstart,qend,tstart,tend,evalue,bits.
foldseek easy-search example/d1asha_ example/ aln.m8 tmpFolder
Output: Customize fields of tab-separated output
The output can be customized with the --format-output option, e.g. --format-output "query,target,qaln,taln" returns the query and target accession and the pairwise alignments in tab-separated format. You can choose from many different output columns:
query Query sequence identifier
target Target sequence identifier
evalue E-value
gapopen Number of gap open events (note: this is NOT the number of gap characters)
pident Percentage of identical matches
fident Fraction of identical matches
nident Number of identical matches
qstart 1-indexed alignment start position in query sequence
qend 1-indexed alignment end position in query sequence
qlen Query sequence length
tstart 1-indexed alignment start position in target sequence
tend 1-indexed alignment end position in target sequence
tlen Target sequence length
alnlen Alignment length (number of aligned columns)
raw Raw alignment score
bits Bit score
cigar Alignment as string. Each position contains either M (match), D (deletion, gap in query), or I (Insertion, gap in target)
qseq Query sequence
tseq Target sequence
qaln Aligned query sequence with gaps
taln Aligned target sequence with gaps
qheader Header of Query sequence
theader Header of Target sequence
mismatch Number of mismatches
qcov Fraction of query sequence covered by alignment
tcov Fraction of target sequence covered by alignment
empty Dash column '-'
taxid Taxonomical identifier (needs mmseqs tax db)
taxname Taxon Name (needs mmseqs tax db)
taxlineage Taxonomical lineage (needs mmseqs tax db)
qset Query filename of FASTA/Q (useful if multiple files were passed to createdb)
qsetid Numeric identifier for query filename
tset Target filename of FASTA/Q (useful if multiple files were passed to createdb)
tsetid Numeric identifier for target filename
qca Calpha coordinates of the query
tca Calpha coordinates of the target
alntmscore TM-score of the alignment
u Rotation matrix (computed by TM-score)
t Translation vector (computed by TM-score)
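For anyone using a custom --format-output, here is a minimal sketch that keeps the field list in one place, so the command line and the pandas column names cannot drift apart. The helper names (build_easy_search_command, read_custom_m8) are made up for illustration and are not part of Foldseek. The command layout just mirrors the easy-search example above.

```python
import pandas as pd


def build_easy_search_command(query, target, out_path, tmp_dir, fields):
    """Return the argument list for a foldseek easy-search call.

    Hypothetical helper: it only assembles the arguments shown in the
    README example plus --format-output; it does not run anything.
    """
    return [
        "foldseek", "easy-search", query, target, out_path, tmp_dir,
        "--format-output", ",".join(fields),
    ]


def read_custom_m8(source, fields):
    """Load a tab-separated result file written with the same field list."""
    df = pd.read_csv(source, sep="\t", header=None)
    # Fail loudly if the file was produced with a different field list.
    if df.shape[1] != len(fields):
        raise ValueError(
            f"expected {len(fields)} columns, found {df.shape[1]}"
        )
    df.columns = list(fields)
    return df
```

Usage would look roughly like: fields = ["query", "target", "qaln", "taln"], pass build_easy_search_command("example/d1asha_", "example/", "aln.m8", "tmpFolder", fields) to subprocess.run, then call read_custom_m8("aln.m8", fields).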
@martin-steinegger I'm a bit confused about the fields. In particular, when you download results from your server and load them into a pandas dataframe, there appear to be 21 fields. On the other hand, query,target,fident,alnlen,mismatch,gapopen,qstart,qend,tstart,tend,evalue,bits comprises only 12. Could you please advise?
Nvm, found here: https://github.com/steineggerlab/foldseek/issues/25
For anyone loading these into pandas:
col_names = ["query","target","fident","alnlen","mismatch","gapopen","qstart","qend","tstart","tend","evalue","bits", "qlen","tlen","qaln","taln","tca","tseq","taxid","taxname"]
It seems like there's something still missing actually - this only gives 20 columns instead of 21...
We recently added the Foldseek match probability (prob). Also, we return pident, not fident, and since recently, for some of the databases, theader instead of target to get the full header.
@milot-mirdita it might be a good idea to add prob to the documentation. The CLI doesn't list it as an option for --format-output, nor is it included in the README.
Hi @milot-mirdita, would you mind confirming that this is the current correct list? col_names = ["query","target","pident","alnlen","mismatch","gapopen","qstart","qend","tstart","tend","prob","evalue","bits", "qlen","tlen","qaln","taln"]
The output columns of the command line app are by default (foldseek easy-search --help):
query,target,fident,alnlen,mismatch,gapopen,qstart,qend,tstart,tend,evalue,bits
The webserver outputs:
query,(target/theader),pident,alnlen,mismatch,gapopen,qstart,qend,tstart,tend,prob,evalue,bits,qlen,tlen,qaln,taln,tca,tseq,taxid,taxname
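Putting the two layouts above together, a small loader can pick the column names from the number of columns in the file. This is only a sketch based on the lists in this thread: read_m8 is a made-up helper name, the second webserver column is labelled "target" even though it may hold theader for some databases, and the layouts may change in later releases.

```python
import pandas as pd

# Default command-line layout (12 columns), as listed above.
CLI_COLUMNS = (
    "query,target,fident,alnlen,mismatch,gapopen,"
    "qstart,qend,tstart,tend,evalue,bits"
).split(",")

# Webserver layout (21 columns), as listed above. The second column may
# contain theader instead of target depending on the database.
WEB_COLUMNS = (
    "query,target,pident,alnlen,mismatch,gapopen,qstart,qend,tstart,tend,"
    "prob,evalue,bits,qlen,tlen,qaln,taln,tca,tseq,taxid,taxname"
).split(",")


def read_m8(source):
    """Read a headerless .m8 file and name its columns by column count."""
    df = pd.read_csv(source, sep="\t", header=None)
    layouts = {len(CLI_COLUMNS): CLI_COLUMNS, len(WEB_COLUMNS): WEB_COLUMNS}
    names = layouts.get(df.shape[1])
    if names is None:
        # Unknown layout, e.g. a custom --format-output was used.
        raise ValueError(f"unrecognised .m8 layout with {df.shape[1]} columns")
    df.columns = names
    return df
```

Note that the command-line default reports fident (a fraction) while the webserver reports pident (a percentage), so the identity values from the two sources are on different scales.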
I cannot find much information on how to view or process .m8 files. Please advise?