knights-lab / BURST

An ultrafast optimal aligner for mapping large NGS data to large genome databases.
GNU Affero General Public License v3.0

using burst with NCBI nt database? #34

Open FabianRoger opened 3 years ago

FabianRoger commented 3 years ago

I just came across the preprint and got curious to try out burst.

I need to assign taxonomy to Illumina short reads (MiSeq, up to ~450bp) from an amplicon sequencing run (COI & 16S). Are there instructions on how to format the EMBL/NCBI nt database for burst to make a lowest-common-ancestor assignment? Or is this not the intended use case?

Thanks!

Fabian

GabeAl commented 3 years ago

I'm glad you asked! The easiest way to do this is to download all the 16S sequences accumulated by the targeted loci project (TLP), plus a comparable database for COI, such as the https://ftp.ncbi.nlm.nih.gov/refseq/release/mitochondrion/ database, which contains this gene (cox1) as well as all other mitochondrial genes.

The 16S TLP from NCBI is found here: ftp://ftp.ncbi.nlm.nih.gov/refseq/TargetedLoci/Bacteria/bacteria.16SrRNA.fna.gz (also available for archaea in ftp://ftp.ncbi.nlm.nih.gov/refseq/TargetedLoci/Archaea/; getting that one too is highly recommended).

Then run the sequences through linfasta to linearize them (it's available in the burst tools directory), and then run the taxonomizer programs to get the Greengenes-like taxonomy.
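For anyone unsure what "linearize" means here: as I understand it, the goal is one header line followed by one unbroken sequence line per record. The Python sketch below only illustrates that transformation on a toy in-memory FASTA. It is not linfasta's implementation, and the function name `linearize_fasta` and the example records are made up for illustration.

```python
import io

def linearize_fasta(lines):
    """Yield (header, sequence) pairs, joining wrapped sequence lines into one string."""
    header, chunks = None, []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)  # emit the previous record
            header, chunks = line, []
        elif header is not None:
            chunks.append(line)  # sequence line belonging to the current record
    if header is not None:
        yield header, "".join(chunks)  # emit the final record

# Toy input (hypothetical records, not taken from any real database).
example = io.StringIO(">seq1 demo\nACGT\nTTGA\n>seq2\nGG\nCC\n")
records = list(linearize_fasta(example))
for header, seq in records:
    print(header)
    print(seq)
```

For real database files, the linfasta tool mentioned above is the one to use; this is just to show the expected shape of the result.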

Then when you run BURST in capitalist mode, you will get the LCA for each read in column 13 and the "capitalist-picked" single match in columns 1 and 2. (Yes, "capitalist" mode does both LCA AND the capitalist disambiguation).
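A minimal Python sketch for pulling the LCA out of an alignment line, assuming the output is tab-delimited and that columns are counted from 1 (so column 13 is index 12). The helper name `column` and the example line are placeholders I made up; the line is not real BURST output.

```python
def column(line, number, sep="\t"):
    """Return the 1-based column `number` from a delimited line."""
    fields = line.rstrip("\n").split(sep)
    if not 1 <= number <= len(fields):
        raise IndexError(f"line has {len(fields)} columns, asked for {number}")
    return fields[number - 1]

# Placeholder line: 12 dummy fields plus a Greengenes-like taxonomy string in column 13.
placeholder = "\t".join([f"col{i}" for i in range(1, 13)] + ["k__Bacteria;p__Firmicutes"])

lca = column(placeholder, 13)   # LCA taxonomy, per the description above
ranks = lca.split(";")          # split the taxonomy string into individual ranks
print(lca)
print(ranks)
```

Splitting on ";" assumes the Greengenes-style rank separator; adjust if your taxonomy strings are formatted differently.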

A guide is available here for full genomes (just replace the content with the linearized individual targeted loci or mitochondrial genes from above). https://github.com/knights-lab/BURST/blob/master/embalmlets/bin/Readme_utils.txt

Be sure to build the burst database with sufficiently large regions (i.e. -d DNA 500 -s 1700) to allow the full stitched query to map.
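A quick sanity check before building can be sketched in Python. It assumes the 500 in `-d DNA 500` acts as an upper bound on stitched query length, which is my reading of the comment above rather than something it states outright. The helper names and the toy reads are hypothetical.

```python
def longest_sequence(records):
    """Return the length of the longest sequence in (header, sequence) pairs."""
    return max((len(seq) for _, seq in records), default=0)

def fits(records, limit):
    """True if every sequence is no longer than `limit`."""
    return longest_sequence(records) <= limit

# Toy stitched reads (made up): the longest is 450 bp, as in the question.
queries = [(">read1", "A" * 450), (">read2", "C" * 300)]

QUERY_LIMIT = 500  # value taken from the -d DNA 500 example above
print(longest_sequence(queries))
print(fits(queries, QUERY_LIMIT))
```

If the check fails for your own reads, that suggests choosing a larger value when building the database.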

Cheerio, Gabe


FabianRoger commented 3 years ago

Thanks for these helpful instructions!

Quick question: I didn't make it clear, but the 16S was for invertebrates, too, so I don't think the TargetedLoci database will cover it? And do you know if the mitochondrion database also contains partial COI genes (such as all the Folmer regions from BOLD), or only sequences from partial / full genomes?

Either way, I guess I can start with a custom reference database generated with ecoPCR from OBITools and format that for BURST. Thanks again!

Fabian