torognes / vsearch

Versatile open-source tool for microbiome analysis

Some questions about "Extraction options" #488

Closed Statistic-Qin closed 1 year ago

Statistic-Qin commented 2 years ago

After a series of processing steps, I have a final FASTA file whose headers contain "size=x" (picture 1), and I also have the OTU table file. I used Linux commands to extract the OTU ID names and saved them in a text file (picture 2). I want to use the --fastx_getseqs command, but the entire header must match the ID names, so I get zero sequences. When I add the --label_substr_match option, I get lots of sequences... Is there any way to solve this problem?

torognes commented 2 years ago

There are a couple of ways to handle this.

You could remove the ;size= annotation by adding the --xsize option to the command that writes the original file. You should then drop the --label_substr_match option.

Alternatively, you could use the --label_words option with --fastx_getseqs, like this:

vsearch --fastx_getseqs input.fasta --fastaout output.fasta --label_words otus.txt

The otus.txt file should then contain the OTU labels (e.g. OTU_2), one per line.
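As a minimal sketch of how such an otus.txt file could be produced from an OTU table: the example below assumes a tab-separated table with a single header line and the OTU labels in the first column. The file name otutab.txt and its contents are made up for illustration.

```shell
# Hypothetical OTU table: tab-separated, one header line, OTU labels in column 1.
printf '#OTU ID\tsample1\tsample2\nOTU_1\t5\t0\nOTU_2\t3\t7\n' > otutab.txt

# Skip the header line and keep only the first column (the OTU labels).
tail -n +2 otutab.txt | cut -f 1 > otus.txt

# Show the result: one OTU label per line.
cat otus.txt
```

The resulting otus.txt can then be passed to --label_words as in the command above.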

I hope this helps.

frederic-mahe commented 2 years ago

Assuming a fasta input:

>s1;size=2;
AAAAA
>s2;size=1;
AAAAT

vsearch can read stdin and write to stdout, so it is possible to chain vsearch operations as such:

printf ">s1;size=2;\nAAAAA\n>s2;size=1;\nAAAAT\n" | \
    vsearch \
        --fastx_filter - \
        --quiet \
        --xsize \
        --fastaout - | \
    vsearch \
        --fastx_getseqs - \
        --quiet \
        --fastaout - \
        --label_word "s1"
>s1
AAAAA

Statistic-Qin commented 2 years ago

Thanks! I went with the first approach: I removed the --sizeout option, so the ";size=" string is no longer written.

frederic-mahe commented 2 years ago

Perhaps you could alternatively use the label_word option

As suggested by @torognes, the --label_word option is the best way to match labels (headers without annotations):

--label_word string Specify a word to match in the sequence header. Words are defined as strings delimited by either the start or end of the header or by any symbol that is not a letter (A-Z, a-z) or digit (0-9). The comparison is case-sensitive.
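To make the word rule concrete, here is a small illustration that emulates the quoted definition with grep. It is only a sketch of the rule as described, not vsearch's implementation; the helper name word_match and the test headers are made up. It also shows why plain substring matching returns too many sequences: "s1" is a substring of "s10", but not a word in it.

```shell
# Illustration only: emulate the quoted word rule with grep, not vsearch itself.
# A match must be bounded on each side by the start/end of the header
# or by a character that is not a letter or digit.
word_match() {
    # $1 = word, $2 = header; the word is assumed to contain no regex metacharacters
    printf '%s\n' "$2" | grep -Eq "(^|[^A-Za-z0-9])$1([^A-Za-z0-9]|\$)"
}

for header in "s1;size=2;" "s10;size=1;" "xs1;size=4;"; do
    if word_match "s1" "$header"; then
        echo "$header: match"
    else
        echo "$header: no match"
    fi
done
```

Under this rule only the first header matches, because in the other two the string "s1" is directly followed or preceded by a letter or digit.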

I've added regression tests to the vsearch test suite: https://github.com/frederic-mahe/vsearch-tests/commit/6a52f32f54f8a985bb9d63170757db0e008e9a13

@Statistic-Qin please close the issue if your problem has been solved.