soedinglab / MMseqs2

MMseqs2: ultra fast and sensitive search and clustering suite
https://mmseqs.com
MIT License

'uc' removed from deflines #557

Closed chasemc closed 2 years ago

chasemc commented 2 years ago

Expected Behavior

The two FASTA files shown below are identical except for their deflines:

pass.fasta

>zzsomething
MPELRRVLANGVELNVALCGSGPAVLLLHGFPHTWELWTDVMADLSGRYRVIAPDLRGFGASGRAASGYDAGTLAEDAAALLAALGVSSATVVGIDAGTAPAFLLALRHPGLVRRLVVMESLLGRLPGAEDFLAEGPPWWFGFHSAAPSLAETVLEGHEAAYVDWFLSAGTLGDGVRPALRDAFVRAYTGRQALSCAFSYYRALPKSAVQIEQAVATARLTVPTMALGARPVGAALERQLRPVTDDLTGHVIDDCGHIIPLHRPHALLALLHPFLAGEDAKAA
>zzsomethingelse
MPELRRVLANGVELNVALCGSGPAVLLLHGFPHTWELWTDVMADLSGRYRVIAPDLRGFGASGRAASGYDAGTLAEDAAALLAALGVSSATVVGIDAGTAPAFLLALRHPGLVRRLVVMESLLGRLPGAEDFLAEGPPWWFGFHSAAPSLAETVLEGHEAAYVDWFLSAGTLGDGVRPALRDAFVRAYTGRQALSCAFSYYRALPKSAVQIEQAVATARLTVPTMALGARPVGAALERQLRPVTDDLTGHVIDDCGHIIPLHRPHALLALLHPFLAGEDAKAA

fail.fasta

>ucsomething
MPELRRVLANGVELNVALCGSGPAVLLLHGFPHTWELWTDVMADLSGRYRVIAPDLRGFGASGRAASGYDAGTLAEDAAALLAALGVSSATVVGIDAGTAPAFLLALRHPGLVRRLVVMESLLGRLPGAEDFLAEGPPWWFGFHSAAPSLAETVLEGHEAAYVDWFLSAGTLGDGVRPALRDAFVRAYTGRQALSCAFSYYRALPKSAVQIEQAVATARLTVPTMALGARPVGAALERQLRPVTDDLTGHVIDDCGHIIPLHRPHALLALLHPFLAGEDAKAA
>ucsomethingelse
MPELRRVLANGVELNVALCGSGPAVLLLHGFPHTWELWTDVMADLSGRYRVIAPDLRGFGASGRAASGYDAGTLAEDAAALLAALGVSSATVVGIDAGTAPAFLLALRHPGLVRRLVVMESLLGRLPGAEDFLAEGPPWWFGFHSAAPSLAETVLEGHEAAYVDWFLSAGTLGDGVRPALRDAFVRAYTGRQALSCAFSYYRALPKSAVQIEQAVATARLTVPTMALGARPVGAALERQLRPVTDDLTGHVIDDCGHIIPLHRPHALLALLHPFLAGEDAKAA

Current Behavior / Steps to Reproduce (for bugs)

Running easy-cluster on these two files:

# rm was run, but is commented out here for posting on GitHub
# rm -rf ./tmp

mmseqs \
    easy-cluster \
    fail.fasta \
    'mmseqs2_fail' \
    ./tmp \
    --threads 24

# rm was run, but is commented out here for posting on GitHub
# rm -rf ./tmp

mmseqs \
    easy-cluster \
    pass.fasta  \
    'mmseqs2_pass' \
    ./tmp \
    --threads 24

results in the correct output for mmseqs2_pass_cluster.tsv:

zzsomethingelse zzsomethingelse
zzsomethingelse zzsomething

but removes the 'uc' prefix from the deflines in mmseqs2_fail_cluster.tsv:

somethingelse   somethingelse
somethingelse   something

This seems to be the case for any defline that starts with 'uc'.
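
To check up front whether an input file is affected, something like the following could list the identifiers that start with 'uc'. This is only a sketch: `find_uc_deflines` is a hypothetical helper, not part of MMseqs2, and it assumes the identifier is the first whitespace-delimited word of each defline.

```python
def find_uc_deflines(fasta_text, prefix="uc"):
    """Return identifiers (first word of each defline) that start with `prefix`.

    Hypothetical helper for illustration; not part of MMseqs2.
    """
    hits = []
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            fields = line[1:].split()
            # Skip empty deflines (a bare ">") instead of failing on them.
            if fields and fields[0].startswith(prefix):
                hits.append(fields[0])
    return hits


# Shortened sequences; only the deflines matter here.
example = ">ucsomething\nMPELRR\n>zzsomething\nMPELRR\n>ucsomethingelse\nMPELRR\n"
print(find_uc_deflines(example))
```

Any identifier reported here would presumably lose its 'uc' prefix in the cluster TSV, based on the behavior shown above.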


The output FASTA files (*_all_seqs.fasta) also contain duplicate defline entries, where one of the duplicates has no sequence:

mmseqs2_fail_all_seqs.fasta

>somethingelse
>ucsomethingelse
MPELRRVLANGVELNVALCGSGPAVLLLHGFPHTWELWTDVMADLSGRYRVIAPDLRGFGASGRAASGYDAGTLAEDAAALLAALGVSSATVVGIDAGTAPAFLLALRHPGLVRRLVVMESLLGRLPGAEDFLAEGPPWWFGFHSAAPSLAETVLEGHEAAYVDWFLSAGTLGDGVRPALRDAFVRAYTGRQALSCAFSYYRALPKSAVQIEQAVATARLTVPTMALGARPVGAALERQLRPVTDDLTGHVIDDCGHIIPLHRPHALLALLHPFLAGEDAKAA
>ucsomething
MPELRRVLANGVELNVALCGSGPAVLLLHGFPHTWELWTDVMADLSGRYRVIAPDLRGFGASGRAASGYDAGTLAEDAAALLAALGVSSATVVGIDAGTAPAFLLALRHPGLVRRLVVMESLLGRLPGAEDFLAEGPPWWFGFHSAAPSLAETVLEGHEAAYVDWFLSAGTLGDGVRPALRDAFVRAYTGRQALSCAFSYYRALPKSAVQIEQAVATARLTVPTMALGARPVGAALERQLRPVTDDLTGHVIDDCGHIIPLHRPHALLALLHPFLAGEDAKAA

mmseqs2_pass_all_seqs.fasta

>zzsomethingelse
>zzsomethingelse
MPELRRVLANGVELNVALCGSGPAVLLLHGFPHTWELWTDVMADLSGRYRVIAPDLRGFGASGRAASGYDAGTLAEDAAALLAALGVSSATVVGIDAGTAPAFLLALRHPGLVRRLVVMESLLGRLPGAEDFLAEGPPWWFGFHSAAPSLAETVLEGHEAAYVDWFLSAGTLGDGVRPALRDAFVRAYTGRQALSCAFSYYRALPKSAVQIEQAVATARLTVPTMALGARPVGAALERQLRPVTDDLTGHVIDDCGHIIPLHRPHALLALLHPFLAGEDAKAA
>zzsomething
MPELRRVLANGVELNVALCGSGPAVLLLHGFPHTWELWTDVMADLSGRYRVIAPDLRGFGASGRAASGYDAGTLAEDAAALLAALGVSSATVVGIDAGTAPAFLLALRHPGLVRRLVVMESLLGRLPGAEDFLAEGPPWWFGFHSAAPSLAETVLEGHEAAYVDWFLSAGTLGDGVRPALRDAFVRAYTGRQALSCAFSYYRALPKSAVQIEQAVATARLTVPTMALGARPVGAALERQLRPVTDDLTGHVIDDCGHIIPLHRPHALLALLHPFLAGEDAKAA
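
If the sequence-less header is read as a marker for the start of a cluster (an assumption here, though it is consistent with both listings above), the files can be grouped into clusters with a small parser. This is a sketch only: `parse_all_seqs` is a hypothetical helper, and it assumes each sequence sits on a single line, as in the output shown.

```python
def parse_all_seqs(text):
    """Group an *_all_seqs.fasta listing into clusters.

    Assumption: a header line that is directly followed by another header
    (or by the end of the file) opens a new cluster; every other header is
    a member entry followed by exactly one sequence line.
    Returns a list of (representative, [(header, sequence), ...]).
    """
    lines = [l for l in text.splitlines() if l.strip()]
    clusters = []
    i = 0
    while i < len(lines):
        line = lines[i]
        if not line.startswith(">"):
            raise ValueError(f"expected a header line, got: {line[:20]!r}")
        nxt = lines[i + 1] if i + 1 < len(lines) else None
        if nxt is None or nxt.startswith(">"):
            # Header without a sequence: treat it as a cluster marker.
            clusters.append((line[1:], []))
            i += 1
        else:
            if not clusters:
                raise ValueError("member entry found before any cluster marker")
            clusters[-1][1].append((line[1:], nxt))
            i += 2
    return clusters


# Shortened version of mmseqs2_fail_all_seqs.fasta from above.
fail_output = ">somethingelse\n>ucsomethingelse\nMPELRR\n>ucsomething\nMPELRR\n"
for representative, members in parse_all_seqs(fail_output):
    print(representative, [header for header, _ in members])
```

Read this way, the marker line in the failing output carries the stripped name (somethingelse) while the member entries keep the full defline (ucsomethingelse), which is the same mismatch seen in the cluster TSV.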

MMseqs Output (for bugs)

https://gist.github.com/chasemc/c0cccd804ac0ff78291e43ae10837c42
https://gist.github.com/chasemc/d8157a581c833406f15442e8b9ee4e81

Your Environment

Installed via Conda
MMseqs2 Version: 13.45111
Happy to give system info if needed.

chasemc commented 2 years ago

This is beyond me, but it seems it might stem from here: https://github.com/soedinglab/MMseqs2/blob/f65187996c3a73b5a9f3f32d08f5de2313ca719b/src/commons/Util.cpp#L131

Is there an option to skip checking/removing these identifiers?
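
In the meantime, one possible workaround outside of MMseqs2 would be to swap the deflines for placeholder identifiers before clustering and map them back afterwards. The sketch below is an assumption-laden illustration, not an MMseqs2 feature: `rename_deflines` and `restore_cluster_tsv` are hypothetical helpers, the placeholders use a 'zz' prefix only because zz-prefixed identifiers came through unchanged in pass.fasta, and the cluster TSV is assumed to be tab-separated with two columns.

```python
def rename_deflines(fasta_text, placeholder="zz"):
    """Replace every defline with a placeholder ID and return (text, mapping).

    Hypothetical helper; `mapping` maps placeholder ID -> original defline.
    """
    mapping = {}
    out = []
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            new_id = f"{placeholder}{len(mapping)}"
            mapping[new_id] = line[1:]
            out.append(">" + new_id)
        else:
            out.append(line)
    return "\n".join(out) + "\n", mapping


def restore_cluster_tsv(tsv_text, mapping):
    """Map placeholder IDs in a two-column, tab-separated cluster TSV back."""
    restored = []
    for row in tsv_text.splitlines():
        if not row.strip():
            continue
        representative, member = row.split("\t")
        restored.append(f"{mapping[representative]}\t{mapping[member]}")
    return "\n".join(restored) + "\n"


# Shortened sequences; only the deflines matter here.
fasta = ">ucsomething\nMPELRR\n>ucsomethingelse\nMPELRR\n"
renamed, mapping = rename_deflines(fasta)

# Stand-in for the TSV that clustering the renamed file might produce.
tsv = "zz1\tzz1\nzz1\tzz0\n"
print(restore_cluster_tsv(tsv, mapping), end="")
```

The renamed text would be written to a file and passed to easy-cluster in place of fail.fasta; the mapping has to be kept alongside it so the original deflines can be restored in the results.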

milot-mirdita commented 2 years ago

I'll update #565 when we have a solution.