torognes / vsearch

Versatile open-source tool for microbiome analysis
Other
658 stars 123 forks source link

Why Pairwise alignment (--allpairs_global) only support positive strand? #576

Open billzt opened 4 days ago

billzt commented 4 days ago

I have a set of sequences that are definitely homologous. However some of them are in negative strand and I don't know which of them are. Therefore I cannot use --allpairs_global directly since sequences are aligned on their plus strand only. Any other good suggestions?

torognes commented 4 days ago

Perhaps the --orient command may be useful to orient all sequences in the same direction before aligning them with the --allpairs_global command?

If all the sequences are fairly similar, I don't think you'll need many sequences in the database for the orient command to work fine.

billzt commented 4 days ago

Thank you. I will try --orient. But I still hope --allpairs_global could support --strand both in the future, the same as that in --cluster_fast.

torognes commented 4 days ago

We'll consider adding --strand both in the future.

frederic-mahe commented 1 day ago

hello @billzt

here is a possible workaround, using a tiny dataset for demonstration purpose:

>s1
AAAA
>s2
AAAT
>s3
TTTT
s1 AAAA
   |||
s2 AAAT

s1 AAAA
   ||||
s3 AAAA (reverse-complement)
FASTA_FILE=$(mktemp)
printf ">s1\nAAAA\n>s2\nAAAT\n>s3\nTTTT\n" > "${FASTA_FILE}"

(
    cat "${FASTA_FILE}"
    vsearch \
        --fastx_revcomp "${FASTA_FILE}" \
        --quiet \
        --label_suffix "_rv" \
        --fastaout -
) | \
    vsearch \
        --allpairs_global - \
        --id 0.75 \
        --iddef 1 \
        --quiet \
        --blast6out -

rm "${FASTA_FILE}"

We obtain the expected results, equivalent to a search on both strands:

s1  s3_rv   100.0   4   0   0   1   4   1   4   -1  0
s1  s2  75.0    4   1   0   1   4   1   4   -1  0
s3  s1_rv   100.0   4   0   0   1   4   1   4   -1  0

Warning: doubling the size of a dataset quadruples computation time.