soedinglab / MMseqs2

MMseqs2: ultra fast and sensitive search and clustering suite
https://mmseqs.com
GNU General Public License v3.0
1.36k stars 190 forks source link

1:1 alignment for thousands of sequence pairs, not pairwise. #813

Open Paupiera opened 6 months ago

Paupiera commented 6 months ago

Dear mmseqs2 developvers,

I have a list of thousands of subject - query fasta pairs, and I would like to run mmseqs to align each of these pairs in a way that each sequence is only aligned to its pair. This is how my list looks:

S10CNODE_1.fasta       S1CNODE_25.fasta
S10CNODE_2.fasta       S2CNODE_8.fasta 
S10CNODE_3.fasta       S5CNODE_11.fasta   
S10CNODE_4.fasta       S3CNODE_7.fasta
S10CNODE_5.fasta       S6CNODE_10.fasta

Is there an efficient way to perform these 1:1 alignments? Could I create a database that contains all sequences and then align a database subentry?

I am trying to avoid aligning all against all.

milot-mirdita commented 6 months ago

This is possible but a bit tricky:

Please make one FASTA file containing all sequence entries. Then call createdb

mmseqs createdb YOUR_FASTA_FILE.fasta db

Then take a look at the db.lookup file it generated. You will find entries in the format:

numeric_key accession set_id
...

The first two columns are important, you can ignore the last. You will need to make a new TSV file, of the keys of the two matching accessions.

In your example, you should see something like the following in the db.lookup.

0 S10CNODE_1.fasta
1 S10CNODE_2.fasta
2 S10CNODE_3.fasta
3 S10CNODE_4.fasta
4 S10CNODE_5.fasta
5 S1CNODE_25.fasta
6 S2CNODE_8.fasta 
7 S5CNODE_11.fasta   
8 S3CNODE_7.fasta
9 S6CNODE_10.fasta

The new tsv file you need to create should look like this:

0 5
1 6
2 7
3 8
4 9
...

Next sort this file according to the first column:

sort -k1,1n NEW_TSV_FILE.tsv  > NEW_TSV_FILE_sort.tsv 

Now you can pass this file to tsv2db and compute the alignments and generate a normal MMseqs2 m8 output:

# 6 is a clustering database, which only requires a target key and nothing else
mmseqs tsv2db NEW_TSV_FILE_sort.tsv clu_db --output-dbtype 6
mmseqs align db db clu_db aln_db -a
mmseqs convertalis db db aln_db aln.m8
milot-mirdita commented 6 months ago

Oh, also this will only work for protein sequences. Nucleotide sequences need a diagonal seed point to compute the alignment, which would be much more difficult to hack.

Paupiera commented 6 months ago

Thank you so much for your fast reply. Unfortunately, my sequences are nucleotide. Would be of any help if I provide the seed points?

milot-mirdita commented 6 months ago

You can try.

The tsv you need to create has the same format of, but with two more columns score (can be 0) and match diagonal (position i-j):

query_key target_key score(0) diagonal

Then call:

# 7 is a prefiltering database
mmseqs tsv2db NEW_TSV_FILE_sort.tsv pref_db --output-dbtype 7
mmseqs align db db pref_db aln_db -a
mmseqs convertalis db db aln_db aln.m8

In either case, you can't mix nucleotides and protein-pairs in one run, needs to be either or.

Paupiera commented 6 months ago

What criteria should I follow to define the score? Is the format of diagonal i-j being i: seed start position in query and j: start position in target? Also, for each row, is it possible to set more than one seed/diagonal position?

milot-mirdita commented 6 months ago

You can ignore the score, its not used further in the align module. Just set it to 0.

You can pass only one diagonal per query-target pair. You should be able to create multiple lines per query target pair with different diagonals though.

The seed point refers to the seed start positions i and j in query and target, respectively.

Just out of curiosity: Can I ask what your use-case for this is?

Paupiera commented 6 months ago

Of course. We are developing a method that, given a set of dna sequences from different environmental samples, generates pairs of candidates across samples, suggesting they could be generated from the same source. In this case, we would like to curate the candidates through alignment.