Open Paupiera opened 6 months ago
This is possible but a bit tricky:
Please make one FASTA file containing all sequence entries. Then call createdb
mmseqs createdb YOUR_FASTA_FILE.fasta db
Then take a look at the db.lookup
file it generated.
You will find entries in the format:
numeric_key accession set_id
...
The first two columns are important, you can ignore the last. You will need to make a new TSV file, of the keys of the two matching accessions.
In your example, you should see something like the following in the db.lookup
.
0 S10CNODE_1.fasta
1 S10CNODE_2.fasta
2 S10CNODE_3.fasta
3 S10CNODE_4.fasta
4 S10CNODE_5.fasta
5 S1CNODE_25.fasta
6 S2CNODE_8.fasta
7 S5CNODE_11.fasta
8 S3CNODE_7.fasta
9 S6CNODE_10.fasta
The new tsv file you need to create should look like this:
0 5
1 6
2 7
3 8
4 9
...
Next sort this file according to the first column:
sort -k1,1n NEW_TSV_FILE.tsv > NEW_TSV_FILE_sort.tsv
Now you can pass this file to tsv2db
and compute the alignments and generate a normal MMseqs2 m8 output:
# 6 is a clustering database, which only requires a target key and nothing else
mmseqs tsv2db NEW_TSV_FILE_sort.tsv clu_db --output-dbtype 6
mmseqs align db db clu_db aln_db -a
mmseqs convertalis db db aln_db aln.m8
Oh, also this will only work for protein sequences. Nucleotide sequences need a diagonal seed point to compute the alignment, which would be much more difficult to hack.
Thank you so much for your fast reply. Unfortunately, my sequences are nucleotide. Would be of any help if I provide the seed points?
You can try.
The tsv you need to create has the same format of, but with two more columns score (can be 0) and match diagonal (position i-j):
query_key target_key score(0) diagonal
Then call:
# 7 is a prefiltering database
mmseqs tsv2db NEW_TSV_FILE_sort.tsv pref_db --output-dbtype 7
mmseqs align db db pref_db aln_db -a
mmseqs convertalis db db aln_db aln.m8
In either case, you can't mix nucleotides and protein-pairs in one run, needs to be either or.
What criteria should I follow to define the score?
Is the format of diagonal i-j
being i: seed start position in query and j: start position in target? Also, for each row, is it possible to set more than one seed/diagonal position?
You can ignore the score, its not used further in the align
module. Just set it to 0.
You can pass only one diagonal per query-target pair. You should be able to create multiple lines per query target pair with different diagonals though.
The seed point refers to the seed start positions i and j in query and target, respectively.
Just out of curiosity: Can I ask what your use-case for this is?
Of course. We are developing a method that, given a set of dna sequences from different environmental samples, generates pairs of candidates across samples, suggesting they could be generated from the same source. In this case, we would like to curate the candidates through alignment.
Dear mmseqs2 developvers,
I have a list of thousands of subject - query fasta pairs, and I would like to run mmseqs to align each of these pairs in a way that each sequence is only aligned to its pair. This is how my list looks:
Is there an efficient way to perform these 1:1 alignments? Could I create a database that contains all sequences and then align a database subentry?
I am trying to avoid aligning all against all.