Open billzt opened 4 days ago
Perhaps the --orient
command may be useful to orient all sequences in the same direction before aligning them with the --allpairs_global
command?
If all the sequences are fairly similar, I don't think you'll need many sequences in the database for the orient
command to work fine.
Thank you. I will try --orient
. But I still hope --allpairs_global
could support --strand both
in the future, the same as that in --cluster_fast
.
We'll consider adding --strand both
in the future.
hello @billzt
here is a possible workaround, using a tiny dataset for demonstration purpose:
>s1
AAAA
>s2
AAAT
>s3
TTTT
s1
has 75% similarity with s2
,s1
has 100% similarity with s3
(if s3
is reverse-complemented)s1 AAAA
|||
s2 AAAT
s1 AAAA
||||
s3 AAAA (reverse-complement)
--fastx_revcomp
to reverse-complement the dataset,--allpairs_global
to find matching pairs,FASTA_FILE=$(mktemp)
printf ">s1\nAAAA\n>s2\nAAAT\n>s3\nTTTT\n" > "${FASTA_FILE}"
(
cat "${FASTA_FILE}"
vsearch \
--fastx_revcomp "${FASTA_FILE}" \
--quiet \
--label_suffix "_rv" \
--fastaout -
) | \
vsearch \
--allpairs_global - \
--id 0.75 \
--iddef 1 \
--quiet \
--blast6out -
rm "${FASTA_FILE}"
We obtain the expected results, equivalent to a search on both strands:
s1 s3_rv 100.0 4 0 0 1 4 1 4 -1 0
s1 s2 75.0 4 1 0 1 4 1 4 -1 0
s3 s1_rv 100.0 4 0 0 1 4 1 4 -1 0
Warning: doubling the size of a dataset quadruples computation time.
I have a set of sequences that are definitely homologous. However some of them are in negative strand and I don't know which of them are. Therefore I cannot use
--allpairs_global
directly since sequences are aligned on their plus strand only. Any other good suggestions?