torognes / vsearch

Versatile open-source tool for microbiome analysis

Convert Qiime2 database (2 files) into fasta database (1 file) for taxonomic assignment in vsearch #553

Closed timz0605 closed 2 months ago

timz0605 commented 4 months ago

Hello!

I am working on a COI metabarcoding project for animals. I currently have two database files: a fasta file containing all sequences and a txt file containing the taxonomic information for those sequences. These files contain local barcodes created by lab mates in previous projects. They used Qiime2 for their analyses, which requires two files as the database for taxonomic assignment. vsearch, however, only requires one fasta file for taxonomic assignment. Are there any commands or programs that could help convert the two files into one fasta file?

Thank you!

frederic-mahe commented 4 months ago

Hello @timz0605, there is no vsearch command to merge separate sequences and taxonomic assignments into a single fasta file.

Without knowing the exact layout of your input files, it is difficult to give you a more precise answer. When faced with a similar task, I usually combine paste, sort, join and sed to produce a fasta file.

frederic-mahe commented 2 months ago

Here is an example using the commands listed above. Assume the following layout for the taxonomic assignments and the fasta file:

Taxonomic assignments:

s2  kingdom;genus;species2
s1  kingdom;genus;species1

Fasta file:

>s1
ACGT
>s2
TGCA

join -j 1 \
    <(printf "s2\tkingdom;genus;species2\ns1\tkingdom;genus;species1\n" | sort -k1,1) \
    <(printf ">s1\nACGT\n>s2\nTGCA\n" | paste - - | tr -d ">" | sort -k1,1) | \
    sed 's/^/>/ ; s/ /\n/2'

Sequences and taxonomic assignments are now merged:

>s1 kingdom;genus;species1
ACGT
>s2 kingdom;genus;species2
TGCA
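
To see why the final `sed` step works, it can help to look at what `join` produces before `sed` is applied. The sketch below reruns the same toy data through the pipeline but stops before `sed`, writing the sorted tables to temporary files instead of using process substitution. The `workdir` variable and the file names inside it are only illustrative, and the fasta input is assumed to have exactly two lines per entry (no wrapped sequences).

```shell
# Illustrative scratch directory (hypothetical name, not part of the original commands)
workdir=$(mktemp -d)

# Taxonomy table: sort by identifier, since join expects sorted inputs
printf "s2\tkingdom;genus;species2\ns1\tkingdom;genus;species1\n" | \
    sort -k1,1 > "${workdir}/taxonomy.sorted"

# Fasta: put header and sequence on one line, drop ">", sort by identifier
printf ">s1\nACGT\n>s2\nTGCA\n" | \
    paste - - | tr -d ">" | sort -k1,1 > "${workdir}/fasta.sorted"

# Join on the first field; join writes the output fields separated by single spaces
joined=$(join -j 1 "${workdir}/taxonomy.sorted" "${workdir}/fasta.sorted")
printf "%s\n" "${joined}"

rm -r "${workdir}"
```

Each joined line has the form `identifier taxonomy sequence`, which is why `sed` prepends `>` and turns the second space into a newline. This also means the approach relies on the taxonomy field containing no spaces.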

In the code above, I use printf to generate input data. Most likely, you have input files:

join -j 1 \
    <(sort -k1,1 input.taxonomy) \
    <(paste - - < input.fasta | tr -d ">" | sort -k1,1) | \
    sed 's/^/>/ ; s/ /\n/2'
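
The pipeline above assumes that each fasta entry spans exactly two lines and that the taxonomy strings contain no spaces. If your files do not meet those assumptions (wrapped sequences, species names with spaces), a possible alternative is a small `awk` lookup. This is only a sketch: the function name `merge_taxonomy` is made up for illustration, the taxonomy file is assumed to be tab-separated with the identifier in the first column, and the taxonomy file is assumed to be non-empty.

```shell
# Hypothetical helper: merge a taxonomy table into fasta headers.
#   $1: tab-separated taxonomy table (identifier<TAB>taxonomy)
#   $2: fasta file (sequences may span several lines)
merge_taxonomy() {
    awk -F '\t' '
        # First file: store each taxonomy string under its identifier
        NR == FNR { taxonomy[$1] = $2 ; next }

        # Fasta header: keep only the identifier (text up to the first blank)
        /^>/ {
            id = substr($0, 2)
            sub(/[ \t].*/, "", id)
            keep = (id in taxonomy)
            if (keep) {
                print ">" id " " taxonomy[id]
            } else {
                # Entries without taxonomy are dropped, with a warning
                print "no taxonomy for " id > "/dev/stderr"
            }
            next
        }

        # Sequence lines: print them only for entries that were kept
        keep { print }
    ' "$1" "$2"
}
```

It could then be called as `merge_taxonomy input.taxonomy input.fasta > merged.fasta`, where `merged.fasta` is an arbitrary output name. Unlike the `join` version, it needs no sorting and keeps the entries in their original fasta order. Like the `join` version, it drops sequences that have no taxonomic assignment.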