Closed adimil closed 6 years ago
We do not have a tool to do an all against all alignment of each cluster. This could be quite computationally expensive if the clusters are big. I also do not have an efficient solution for this problem yet. I will think a bit about it.
I think this a somewhat basic question but the original question is exactly what I want, how would one get the alignment between clustered sequences and the representative of the cluster?
Starting with...
mmseqs linclust input_DB output_DB tmp
I'd just want to get the alignments across all genes in input_DB with the representatives of clusters in output_DB. Sorry this is basic but I'm not sure if it's a task for align/convertalis/result2flat and haven't been able to figure it out.
I think you could just extract the representatives and then run align again but it seems like there should be a more efficient way.
@tomblaze You can just call the alignment module
mmseqs linclust input_DB output_DB tmp
mmseqs align input_DB input_DB output_DB output_aln_DB --max-seqs 1000000 -e 10000 -a
If you want the result as BLAST tab format call:
mmseqs convertalis input_DB input_DB output_aln_DB output_aln_DB.m8
Hi sorry for hijacking the issue but this is also interesting for us... we are using PARASAIL (https://github.com/jeffdaily/parasail/) to get an all-vs-all in each cluster. In our case we are interested in semi-global alignments. @martin-steinegger do you think we can obtain similar results with the align module playing with the coverage values? We would like to stay within the MMSEQS2 ecosystem.
@genomewalker yes I belive that you can get a similar result using the coverage parameters.
@adimil @genomewalker I am intending to build a all-against-all search as module in MMseqs2 soon. I will update you if it is ready.
Sorry for hijacking the issue again. kClust, the predecessor of MMseqs2, has a utility called kClust_mkAln (https://github.com/soedinglab/kClust). kClust_mkAln calls either clustalo or kalign to perform multiple sequence alignment (i.e. all-against-all, not all-against-representative) within each cluster. Is it possible to do similar thing in MMseqs2?
If a center-star alignment is good enough, you can use the result2msa
module.
You can use the apply
module to execute any external program.
We do this step for building the Uniclust databases. Though slightly more involved to handle some edge cases of the alignments tools and with the equivalent tool to apply
from the ffindex
package.
See: https://github.com/soedinglab/uniclust-pipeline/blob/develop/hhdatabase/make_hhdatabase.sh#L2 https://github.com/soedinglab/uniclust-pipeline/blob/develop/hhdatabase/make_a3m.sh#L13 https://github.com/soedinglab/uniclust-pipeline/blob/develop/hhdatabase/fasta_to_msa_a3m.sh#L1
@adimil @genomewalker
I have added an all against all alignment module called alignall
.
You can call it with any kind of result database
mmseqs alignall queryDB targetDB resultDB alignallDB
The output format (in the case of clustering queryID1 = clusterID1):
queryID1 targetId1 targetId2 alnScore seqIdentity eVal qStart qEnd qLen tStart tEnd tLen [alnCigar]
To convert the result back to a flat file please use converttsv
.
mmseqs convertsv queryDB targetDB alignallDB alignallDB.tsv
However the conversion to a flat file format is not yet working properly. The targetId2 column is not converted back to its alpha numerical representation. This information can be added by using awk and the targetDB.lookup
e.g.
OFS="\t" awk 'FNR==NR{a[$1]=$2;next}{$3=a[$3]; print}' targetDB.lookup alignallDB.tsv > alignallDB_newid.tsv
@martin-steinegger great! We will test it and let you know if we encounter any problems.
Many thanks!
Thanks @martin-steinegger ! Hope to try it out soon.
After clustering I want to see the alignments of all sequences within a cluster, but using the align command I only get the alignment of each member to the representative. Is there a way to see the alignment of all members of the cluster with each other?
Thanks