soedinglab / MMseqs2

MMseqs2: ultra fast and sensitive search and clustering suite
https://mmseqs.com
GNU General Public License v3.0
1.37k stars 190 forks source link

SAM file formatting #845

Open Naturalist1986 opened 4 months ago

Naturalist1986 commented 4 months ago

Expected Behavior

Hi,

I'm trying to convert a SAM file created from a search result but I get this error:

(mmseqs) moshea@shannon:/mnt/scratch$ samtools view -Sb SRR5234505.sam > SRR5234505.bam [E::sam_parse1] CIGAR and query sequence are of different length [W::sam_read1_sam] Parse error at line 363193 samtools view: error reading file "SRR5234505.sam"

This is the part of the file where the error refers to:

(mmseqs) moshea@shannon:/mnt/scratch$ sed -n '363193p' SRR5234505.sam

SRR5234505.241048 16 NZ_SMKV01000006.1_67 26 254 35M 0 0 GGTCTTACCTGCGCTCCACAGGCTGTCACCGCCAGCCCGGCGGCCTTCGCGACGTTGACCGCCGCGTTGATGTCCCGGTCCAGGTACGTCCCGCACGACCCGCAC AS:i:98 NM:i:12

Maybe the sam file is not formatted properly?

I used:

mmseqs convertalis SRR5234505_DB curated_protein_catalogue SRR5234505_Result SRR5234505 --format-mode 1

Thanks for your help!

milot-mirdita commented 4 months ago

If I understand this correctly, this is a nucleotide-vs-protein search (translated search), right?

I think we have a bug for the SAM backtraces for this mode, where either the back trace should be converted to the nucleotide match count (not codon match count), or the nucleotide sequence should be translated to a protein sequence.

milot-mirdita commented 4 months ago

Do you expect this search to return a nucleotide sequence or a protein sequence?

genomewalker commented 4 months ago

@milot-mirdita, we are also interested in the SAM formatted output for translated searches. In our use case, we would like to get a nucleotide output. Many thanks!

Naturalist1986 commented 4 months ago

I'm searching a metagenome (nucleotides) against a protein catalogue. I need to know how many reads mapped to each protein from the catalogue for functional analysis. I guess I expect the search to return the matches for each read?

milot-mirdita commented 4 months ago

I pushed a fix that should correct some of the weirdness with SAM. Please check if this is what you expected as we don't use much sam internally.

Naturalist1986 commented 4 months ago

Thanks! Do I need to download and recompile mmseqs? or can I just replace with the fixed file?

milot-mirdita commented 4 months ago

You can download precompiled binaries from: https://mmseqs.com/latest/

you don’t need to compile them yourself