Open acpguedes opened 5 years ago
extractalignedregion
is doing something unexpected/wrong when it encounters more than one hit per query.
This issue didn't occur in our taxonomy module since we only give it at most one alignment. You can get around the issue with something like this:
mmseqs filterdb pbps.neighbors.rep30.ALN pbps.neighbors.rep30.ALN.top1 --extract-lines 1
mseqs extractalignedregion id_0.3cov_0.8evalue_0.001/pbps.neighbors.db id_0.3cov_0.8evalue_0.001/pbps.neighbors.db pbps.neighbors.rep30.ALN.top1 pbps.neighbors.rep30.ALN.SEQ --threads 40
...
However I am not sure this is what you want.
Hi @milot-mirdita
I was needing more than one hit per query. Actually, I was trying to find domains/regions with a high score but since I have a lot of multidomain proteins I need all high score local alignments. So I remove redundancy at 30% of identity, searched all against all, I have removed self-matches and choose hits with very low e-value. I preferred to do this externally to MMSeqs.
Now I'm clustering these regions.
Anyway, I did test those commands anyway.
mmseqs filterdb pbps.neighbors.rep30.ALN pbps.neighbors.rep30.ALN.top1 --extract-lines 1
mmseqs extractalignedregion id_0.3cov_0.8evalue_0.001/pbps.neighbors.db id_0.3cov_0.8evalue_0.001/pbps.neighbors.db pbps.neighbors.rep30.ALN.top1 pbps.neighbors.rep30.ALN.SEQ --threads 40
mmseqs cluster pbps.neighbors.rep30.ALN.SEQ pbps.neighbors.rep30.ALN.SEQ.CLU -c 0 -e 1e-10 --threads 40 --max-seqs 300 --kmer-per-seq 80 -s 7.5 -a
mmseqs createtsv pbps.neighbors.rep30.ALN.SEQ pbps.neighbors.rep30.ALN.SEQ.CLU pbps.neighbors.rep30.ALN.SEQ.CLU.tsv
Also, I did use coordinates of matches in target and extract with any other sequence manipulation tool.
mmseqs createdb pbps.neighbors.regions.fa pbps.neighbors.regions.DB
mmseqs cluster pbps.neighbors.regions.DB pbps.neighbors.regions.CLU /tmp/ --threads 40 -c 0.8 --min-seq-id 0.8
mmseqs createtsv pbps.neighbors.regions.DB pbps.neighbors.regions.CLU pbps.neighbors.regions.CLU.tsv --threads 40
mmseqs result2repseq pbps.neighbors.regions.DB pbps.neighbors.regions.CLU pbps.neighbors.regions.CLU.REP
ln -s pbps.neighbors.regions.DB_h pbps.neighbors.regions.CLU.REP_h
ln -s pbps.neighbors.regions.DB_h.index pbps.neighbors.regions.CLU.REP_h.index
mmseqs cluster pbps.neighbors.regions.CLU.REP pbps.neighbors.regions.CLU.REP.CLU tmp -c 0 -e 1e-10 --threads 40 --max-seqs 300 --kmer-per-seq 80 -s 7.5 -a
mmseqs createtsv pbps.neighbors.regions.CLU.REP pbps.neighbors.regions.CLU.REP.CLU pbps.neighbors.regions.CLU.REP.CLU.tsv --threads 40
mmseqs createseqfiledb pbps.neighbors.regions.CLU.REP pbps.neighbors.regions.CLU.REP.CLU pbps.neighbors.regions.CLU.REP.CLU.FADB --threads 40
mmseqs apply pbps.neighbors.regions.CLU.REP.CLU.FADB pbps.neighbors.regions.CLU.REP.CLU.FADB.MSA --threads 40 -- clustalo -i - --threads=1 --iter=3
mmseqs createseqfiledb pbps.neighbors.regions.CLU.REP pbps.neighbors.regions.CLU.REP.CLU pbps.neighbors.regions.CLU.REP.CLU.FADB --threads 40 --min-sequences 10
There are other strategies using mmseqs to find a conserved region (domains) in a set of proteins?
Your initial idea is sound. The only problem is that extractalignedregion
does not really support it since it was built for one very specific purpose and should be generalized. I started working on that though I don't quite have time to finish it up.
I have an idea for a hack though:
mmseqs prefixid aln aln.tsv --tsv
awk '{ $1 = NR++; print $0; }' aln.tsv > aln_new.tsv
mmseqs tsv2db aln_new.tsv aln_new --output-dbtype 5
mmseqs extractalignedregion query target aln_new target_new --threads 1
awk '{ print $2; }' aln.tsv > target_ids.tsv
mmseqs createsubdb target_ids.tsv target_h target_new_h
sort -k2,2 target_new_h.index | awk '{ $1 = NR++; print $0; }' > target_new_h.index_new
mv -f target_new_h.index_new target_new_h.index
mmseqs cluster target_new clu tmp
I wrote this down without testing, but this should create a valid domain sequence database from the alignment result.
If you are interested in actually fixing the extractalignedregion
module in the MMseqs2 codebase then I could also try guiding you through that.
@milot-mirdita this week I'm quite busy finishing up my writing. But sure I'm interested in fixing the extractalignedregion
module. Actually, I'm glad about this idea.
I corrected a few mistakes in the hack above to generate a valid new sequence database that works with extractalignedregion
.
Hi, I think there is some inconsistency in
extractalignedregion
module to generate database with regions.I'm trying to cluster aligned regions belonging to a db.
But it fails in the last step with this message
I think it is happening because of the number of files generated by the
extractalignedregion
module. If it isn't, how do I get the aligned regions to use in a search? I'm trying to get aligned regions to cluster them to find possible domains in a set of protein.Thanks in advance.