soedinglab / MMseqs2

MMseqs2: ultra fast and sensitive search and clustering suite
https://mmseqs.com
MIT License
1.47k stars 200 forks source link

Alignment of clusters #60

Closed adimil closed 6 years ago

adimil commented 6 years ago

After clustering I want to see the alignments of all sequences within a cluster, but using the align command I only get the alignment of each member to the representative. Is there a way to see the alignment of all members of the cluster with each other?

Thanks

martin-steinegger commented 6 years ago

We do not have a tool to do an all against all alignment of each cluster. This could be quite computationally expensive if the clusters are big. I also do not have an efficient solution for this problem yet. I will think a bit about it.

tomblaze commented 6 years ago

I think this a somewhat basic question but the original question is exactly what I want, how would one get the alignment between clustered sequences and the representative of the cluster?

Starting with... mmseqs linclust input_DB output_DB tmp

I'd just want to get the alignments across all genes in input_DB with the representatives of clusters in output_DB. Sorry this is basic but I'm not sure if it's a task for align/convertalis/result2flat and haven't been able to figure it out.

I think you could just extract the representatives and then run align again but it seems like there should be a more efficient way.

martin-steinegger commented 6 years ago

@tomblaze You can just call the alignment module

 mmseqs linclust input_DB output_DB tmp
 mmseqs align input_DB input_DB output_DB output_aln_DB  --max-seqs 1000000 -e 10000 -a

If you want the result as BLAST tab format call:

 mmseqs convertalis input_DB input_DB output_aln_DB output_aln_DB.m8
genomewalker commented 6 years ago

Hi sorry for hijacking the issue but this is also interesting for us... we are using PARASAIL (https://github.com/jeffdaily/parasail/) to get an all-vs-all in each cluster. In our case we are interested in semi-global alignments. @martin-steinegger do you think we can obtain similar results with the align module playing with the coverage values? We would like to stay within the MMSEQS2 ecosystem.

martin-steinegger commented 6 years ago

@genomewalker yes I belive that you can get a similar result using the coverage parameters.

@adimil @genomewalker I am intending to build a all-against-all search as module in MMseqs2 soon. I will update you if it is ready.

kad-ecoli commented 6 years ago

Sorry for hijacking the issue again. kClust, the predecessor of MMseqs2, has a utility called kClust_mkAln (https://github.com/soedinglab/kClust). kClust_mkAln calls either clustalo or kalign to perform multiple sequence alignment (i.e. all-against-all, not all-against-representative) within each cluster. Is it possible to do similar thing in MMseqs2?

milot-mirdita commented 6 years ago

If a center-star alignment is good enough, you can use the result2msa module.

You can use the apply module to execute any external program.

We do this step for building the Uniclust databases. Though slightly more involved to handle some edge cases of the alignments tools and with the equivalent tool to apply from the ffindex package.

See: https://github.com/soedinglab/uniclust-pipeline/blob/develop/hhdatabase/make_hhdatabase.sh#L2 https://github.com/soedinglab/uniclust-pipeline/blob/develop/hhdatabase/make_a3m.sh#L13 https://github.com/soedinglab/uniclust-pipeline/blob/develop/hhdatabase/fasta_to_msa_a3m.sh#L1

martin-steinegger commented 6 years ago

@adimil @genomewalker
I have added an all against all alignment module called alignall. You can call it with any kind of result database

mmseqs alignall queryDB targetDB resultDB alignallDB

The output format (in the case of clustering queryID1 = clusterID1):

queryID1 targetId1 targetId2 alnScore  seqIdentity  eVal  qStart  qEnd  qLen  tStart  tEnd  tLen  [alnCigar]

To convert the result back to a flat file please use converttsv.

mmseqs convertsv queryDB targetDB alignallDB alignallDB.tsv

However the conversion to a flat file format is not yet working properly. The targetId2 column is not converted back to its alpha numerical representation. This information can be added by using awk and the targetDB.lookup e.g.

OFS="\t" awk 'FNR==NR{a[$1]=$2;next}{$3=a[$3]; print}' targetDB.lookup alignallDB.tsv > alignallDB_newid.tsv
genomewalker commented 6 years ago

@martin-steinegger great! We will test it and let you know if we encounter any problems.

Many thanks!

adimil commented 6 years ago

Thanks @martin-steinegger ! Hope to try it out soon.