Closed chasemc closed 2 years ago
This is beyond me, but it seems it might stem from here: https://github.com/soedinglab/MMseqs2/blob/f65187996c3a73b5a9f3f32d08f5de2313ca719b/src/commons/Util.cpp#L131
Is there an option to skip checking/removing these identifiers?
I'll update #565 when we have a solution.
Expected Behavior
The two fasta files depicted below are identical except for the deflines:
pass.fasta
fail.fasta
Current Behavior / Steps to Reproduce (for bugs)
Running easy-cluster on these two files:
results in the correct output for
mmseqs2_pass_cluster.tsv
but removes the 'uc' from the defline in
mmseqs2_fail_cluster.tsv
This seems to be the case for any defline that starts with 'uc'.
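To see which entries in an input file might be affected, something like the following Python sketch could list the deflines whose identifier starts with 'uc'. This is only an illustration: the function name `find_uc_deflines`, the example IDs, and the assumption that the identifier is the first whitespace-delimited token after `>` are hypothetical and not taken from MMseqs2.

```python
def find_uc_deflines(fasta_lines, prefix="uc"):
    """Return identifiers of deflines whose ID starts with `prefix`.

    Assumption: the identifier is the first whitespace-delimited
    token after '>' (this is not MMseqs2's own parsing logic).
    """
    hits = []
    for line in fasta_lines:
        if not line.startswith(">"):
            continue  # sequence line, skip
        tokens = line[1:].split()
        if tokens and tokens[0].startswith(prefix):
            hits.append(tokens[0])
    return hits


# Hypothetical example records, not the contents of fail.fasta
example = [">uc_seq1 some description", "MKV", ">seq2", "MKL"]
print(find_uc_deflines(example))
```

For a real file, the lines could be passed in with `find_uc_deflines(open("fail.fasta"))`.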
The output FASTA files also contain duplicate defline entries, where one of the duplicates has no sequence:
mmseqs2_fail_all_seqs.fasta
mmseqs2_pass_all_seqs.fasta
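The duplicate deflines without a sequence can be located with a short Python sketch like the one below. It is a rough check under the assumption that a defline "has no sequence" when it is followed directly by another defline or by the end of the file; the function name `deflines_without_sequence` and the example records are hypothetical.

```python
def deflines_without_sequence(fasta_lines):
    """Return deflines (without '>') that have no sequence lines after them."""
    empty = []
    current = None   # defline currently being read
    has_seq = False  # whether any sequence line followed it
    for raw in fasta_lines:
        line = raw.strip()
        if line.startswith(">"):
            if current is not None and not has_seq:
                empty.append(current)
            current = line[1:]
            has_seq = False
        elif line and current is not None:
            has_seq = True
    # Handle the final record in the file
    if current is not None and not has_seq:
        empty.append(current)
    return empty


# Hypothetical example: the first 'a' entry has no sequence
records = [">a", ">a", "MKV", ">b", "MKL"]
print(deflines_without_sequence(records))
```

Running this over `mmseqs2_pass_all_seqs.fasta` and `mmseqs2_fail_all_seqs.fasta` would show whether the empty entries appear in both outputs.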
MMseqs Output (for bugs)
https://gist.github.com/chasemc/c0cccd804ac0ff78291e43ae10837c42
https://gist.github.com/chasemc/d8157a581c833406f15442e8b9ee4e81
Your Environment
Installed via Conda
MMseqs2 Version: 13.45111
Happy to give system info if needed.