torognes / vsearch

Versatile open-source tool for microbiome analysis
Other
643 stars 123 forks source link

Recover info in fasta header when using sintax #481

Closed blex-max closed 2 years ago

blex-max commented 2 years ago

Hello,

I'm using the vsearch package available from the Bioconda channel in a pipeline to get taxonomic identifications from fungal ITS sequences. To this end, I'm using SINTAX as implemented in vsearch. I'm using the UNITE USEARCH release of reference sequences as my database. This database gives SH (species hypothesis) numbers for each entry, which is what I am trying to recover. Unfortunately, there seems to be a bit of an oversight in the way the database is structured:

When a reference sequence has been identified to sequence level, the fasta header is as follows:

>UDB018521|SH1140878.08FU;tax=d:Fungi,p:Basidiomycota,c:Agaricomycetes,o:Thelephorales,f:Thelephoraceae,g:Tomentella,s:Tomentella_badia_SH1140878.08FU;

So if sintax identifies matches something to tomentella badia with sufficient support, the output taxonomy will read:

d:Fungi,p:Basidiomycota,c:Agaricomycetes,o:Thelephorales,f:Thelephoraceae,g:Tomentella,s:Tomentella_badia_SH1140878.08FU;

However, where reference sequences belong to unique, but undescribed, species, the header is instead like this:

>UDB026255|SH1140865.08FU;tax=d:Fungi,p:Basidiomycota,c:Agaricomycetes,o:Thelephorales,f:Thelephoraceae;

This is supposed to indicate that this is Thelephoraceae sp. 1, but since the SH number is not at the end, even with 1.00 support the output result misleadingly implies that sintax could only identify the query to family level.

Is there a way to get vsearch sintax to output the full header of the matching fasta in these cases? I suppose alternately I will have to try reformating the database to include SH numbers at the last given taxonomic level.

Thanks for maintaining vsearch!

frederic-mahe commented 2 years ago

hello @blex-max

The valid options for the sintax command are:

--bzip2_decompress --db --dbmask 
--fastq_ascii --fastq_qmax --fastq_qmin 
--gzip_decompress --log --no_progress 
--notrunclabels --quiet --randseed --sintax_cutoff
--strand --tabbedout --threads --wordlength

There is no option that can give you the complete headers of matched database sequences.

Here is a quick solution to add SH accessions at the end of each reference sequence (if it's missing):

HEADER1="UDB018521|SH1140878.08FU;tax=d:Fungi,p:Basidiomycota,c:Agaricomycetes,o:Thelephorales,f:Thelephoraceae,g:Tomentella,s:Tomentella_badia_SH1140878.08FU;"
HEADER2="UDB026255|SH1140865.08FU;tax=d:Fungi,p:Basidiomycota,c:Agaricomycetes,o:Thelephorales,f:Thelephoraceae;"
SEQ1="GTCGCTCCATCCGAGTGTGCTAAAAATGAGGTATGGTCAGTCTGGTCGTATCGAATTTCTAGTATGCGAGGGGGGAGAAGTCGTAACAAGGTAGCC"
SEQ2="$(rev <<< "${SEQ1}")"

printf ">%s\n%s\n>%s\n%s\n" "${HEADER1}" "${SEQ1}" "${HEADER2}" "${SEQ2}" | \
    awk 'BEGIN {FS = "[|;]"}
         {if (/^>/) {
              if (! index($3, $2)) {$3 = $3"_"$2}
              print $1"|"$2";"$3";"
              }
          else {print $0}
        }'
>UDB018521|SH1140878.08FU;tax=d:Fungi,p:Basidiomycota,c:Agaricomycetes,o:Thelephorales,f:Thelephoraceae,g:Tomentella,s:Tomentella_badia_SH1140878.08FU;
GTCGCTCCATCCGAGTGTGCTAAAAATGAGGTATGGTCAGTCTGGTCGTATCGAATTTCTAGTATGCGAGGGGGGAGAAGTCGTAACAAGGTAGCC
>UDB026255|SH1140865.08FU;tax=d:Fungi,p:Basidiomycota,c:Agaricomycetes,o:Thelephorales,f:Thelephoraceae_SH1140865.08FU;
CCGATGGAACAATGCTGAAGAGGGGGGAGCGTATGATCTTTAAGCTATGCTGGTCTGACTGGTATGGAGTAAAAATCGTGTGAGCCTACCTCGCTG
blex-max commented 2 years ago

Thanks so much for the code, very kind of you. That will do for my purposes.