Kraken2: Host-decontamination

Kraken2

Kraken2 uses a lowest common ancestor (LCA) strategy to determine the taxonomy of a read. The database genomes are split into k-mers, and each k-mer is mapped to the LCA of all genomes that contain it. The k-mers of a read are then looked up in the database, and the read is assigned to the taxon whose lineage is supported by the most k-mer hits.

Building the kraken2 database

We will use a human reference genome and the masked T. conura reference genome to build a Kraken2 database. We will also remove contigs shorter than 50 kb.
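The length filter mentioned above can be sketched with plain awk. This is a minimal sketch, not the workflow's actual filtering step: the function name `filter_fasta_by_length` is hypothetical, the demo FASTA is made up, and "50 kb" is assumed to mean 50,000 bp.

```bash
#!/usr/bin/env bash
set -eu

# Hypothetical helper: print only the FASTA records whose sequence is at
# least $1 bases long. $2 is the input FASTA. Wrapped (multi-line)
# sequences are joined, so each kept record is written on two lines.
filter_fasta_by_length() {
    awk -v min="$1" '
        function emit_record() {
            if (header != "" && length(seq) >= min + 0) {
                print header
                print seq
            }
        }
        /^>/ { emit_record(); header = $0; seq = ""; next }
        { seq = seq $0 }
        END { emit_record() }
    ' "$2"
}

# Toy demonstration with a threshold of 10 bases (made-up data).
demo_fasta="$(mktemp)"
printf '>contig_long\nACGTACGT\nACGTACGT\n>contig_short\nACGT\n' > "$demo_fasta"
filtered="$(filter_fasta_by_length 10 "$demo_fasta")"
echo "$filtered"
rm -f "$demo_fasta"
```

On a real reference the call would look like `filter_fasta_by_length 50000 reference.fa > reference.min50kb.fa`, where both file names are placeholders.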

To build it we will use:

    # Specify database folder
    DBNAME="path/to/database/folder"

    # Download NCBI taxonomy
    kraken2-build --download-taxonomy --db $DBNAME

    # Download human reference
    kraken2-build --download-library human --db $DBNAME

Now, to generate the custom database using the T. conura reference, kraken2 requires:

- Sequences must be in FASTA format (multi-FASTA is OK).
- Sequence headers must contain either an NCBI accession number or kraken:taxid|xxx, where xxx is replaced with the taxid.
  - Example: >sequence16|kraken:taxid|32630 Adapter sequence
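One way to meet the header requirement is to append the `kraken:taxid` tag to every sequence ID with awk. The sketch below is an illustration under assumptions: the function name `add_kraken_taxid` is hypothetical, the demo record is made up, and the taxid 32630 is reused from the example above purely as a stand-in, so the real T. conura taxid must be substituted.

```bash
#!/usr/bin/env bash
set -eu

# Hypothetical helper: append "|kraken:taxid|<taxid>" to the sequence ID of
# every FASTA header, keeping any description text after the ID.
# $1 = NCBI taxid, $2 = input FASTA.
add_kraken_taxid() {
    awk -v taxid="$1" '
        /^>/ {
            id = $1                             # ">" plus the sequence ID
            rest = substr($0, length($1) + 1)   # description, if any
            print id "|kraken:taxid|" taxid rest
            next
        }
        { print }                               # sequence lines pass through
    ' "$2"
}

# Toy demonstration (made-up record; 32630 is only a stand-in taxid).
demo_fasta="$(mktemp)"
printf '>contig_1 demo scaffold\nACGT\n' > "$demo_fasta"
tagged="$(add_kraken_taxid 32630 "$demo_fasta")"
echo "$tagged"
rm -f "$demo_fasta"
```

The tagged FASTA would then typically be added with `kraken2-build --add-to-library <tagged.fa> --db $DBNAME`, and the database finished with `kraken2-build --build --db $DBNAME`; the file name here is a placeholder.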

