nf-core / test-datasets

Test data to be used for automated testing with the nf-core pipelines
https://nf-co.re
MIT License

Add SARS-CoV2 KrakenUniq database files #1192

Closed jfy133 closed 6 months ago

jfy133 commented 7 months ago

And apparently includes some prettier formatting...?

Built with the following steps:

  1. Have the following ready (downloaded from `sarscov2` and `prokryote/metagenome`):
    1. FASTA files you want to include
    2. A taxonomy .map file that contains `<FASTA_ACCESSION_ID>\t<TAX_ID>`
    3. nodes.dmp taxonomy file
    4. names.dmp taxonomy file
  2. Make a directory with the name you want the database to have
  3. Within this directory, make two further directories
    • library/
    • taxonomy/
  4. Within these two directories place the following
    • library/: all FASTA files and the sequence .map file
    • taxonomy/: the two *.dmp files
  5. Run krakenuniq-build --db <db_name>
    • You may need to include --jellyfish-bin $(type -P -a jellyfish) if you get an unbound variable warning
  6. If successful, <db_name> should contain a bunch of files such as .kdb, .jdb, .tsv, library-files.txt, taxDB, etc.
  7. Run a test: krakenuniq --db customdb_krakenuniq_mini/ --threads 4 --quick test_1.fastq.gz
  8. If all is good, run cleanup from within <db_name>: rm -r *log *jdb *counts *tsv *txt *map library/ taxonomy/
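
Steps 1 to 4 above could be scripted roughly as follows. This is only a sketch, not what was actually run for this PR: the function names `make_seqid_map` and `stage_krakenuniq_db` are hypothetical, it assumes every sequence in a FASTA file belongs to a single taxon ID, and it assumes the inputs sit in one directory with `.fasta` and `.map` extensions.

```shell
# Hypothetical helper: write "<FASTA_ACCESSION_ID>\t<TAX_ID>" lines for every
# header in a FASTA file, assuming all sequences share one taxon ID.
make_seqid_map() {
    local fasta="$1" taxid="$2"
    # The accession is taken to be the first whitespace-delimited token
    # of each header line, with the leading ">" removed.
    awk -v taxid="$taxid" '/^>/ { sub(/^>/, "", $1); print $1 "\t" taxid }' "$fasta"
}

# Hypothetical helper: create the layout from steps 2-4 and copy the inputs in.
# Assumes <input_dir> holds *.fasta, *.map, nodes.dmp and names.dmp.
stage_krakenuniq_db() {
    local db_name="$1" input_dir="$2"
    mkdir -p "${db_name}/library" "${db_name}/taxonomy"
    # FASTA files and the sequence map go in library/
    cp "${input_dir}"/*.fasta "${input_dir}"/*.map "${db_name}/library/"
    # The two taxonomy dumps go in taxonomy/
    cp "${input_dir}/nodes.dmp" "${input_dir}/names.dmp" "${db_name}/taxonomy/"
}
```

For example, `make_seqid_map inputs/genome.fasta 2697049 > inputs/genome.map` followed by `stage_krakenuniq_db customdb_krakenuniq_mini inputs` should leave the directory ready for the `krakenuniq-build` call in step 5 (`inputs/` and `genome.*` are placeholder names).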

Tested by running on both FASTA and FASTQ files:

$ krakenuniq --db krakenuniq/ --threads 4 --quick contigs.fasta
/home/james/bin/miniconda3/envs/krakenuniq/share/krakenuniq-1.0.4-1/libexec/classify -d krakenuniq//database.kdb -i krakenuniq//database.idx -t 4 -q -a krakenuniq//taxDB -p 12
 Database krakenuniq//database.kdb
Loaded database with 29799 keys with k of 31 [val_len 4, key_len 8].
Reading taxonomy index from krakenuniq//taxDB. Done.
C   NODE_1_length_20973_cov_191.628754  2697049 20973   Q:1
C   NODE_2_length_8774_cov_178.827802   2697049 8774    Q:1
U   NODE_3_length_5473_cov_3.280771 0   5473    Q:0
U   NODE_4_length_458_cov_1.051360  0   458 Q:0
C   NODE_5_length_446_cov_1.078370  2697049 446 Q:1
5 sequences (0.04 Mbp) processed in 0.016s (18.7 Kseq/m, 135.41 Mbp/m).
  3 sequences classified (60.00%)
  2 sequences unclassified (40.00%)
$ krakenuniq --db krakenuniq/ --threads 4 --quick test_1.fastq.gz
<...>
C   ERR5069949.1476386  2697049 151 Q:1
C   ERR5069949.2415814  2697049 150 Q:1
100 sequences (0.01 Mbp) processed in 0.003s (1772.5 Kseq/m, 246.33 Mbp/m).
  100 sequences classified (100.00%)
  0 sequences unclassified (0.00%)
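
To double-check the classified/unclassified counts, the per-sequence lines can be tallied independently. A minimal sketch, assuming the column layout shown above (column 1 is C or U, column 3 is the assigned taxon ID); `summarise_classification` is a hypothetical helper name, not part of KrakenUniq.

```shell
# Hypothetical helper: summarise per-sequence KrakenUniq output read from stdin.
# Lines whose first column is neither C nor U (banners, summaries) are skipped.
summarise_classification() {
    awk '
        $1 == "C" { classified++; taxa[$3]++ }   # count hits per taxon ID
        $1 == "U" { unclassified++ }
        END {
            total = classified + unclassified
            printf "classified\t%d\nunclassified\t%d\n", classified, unclassified
            if (total > 0)
                printf "percent_classified\t%.2f\n", 100 * classified / total
            for (t in taxa) printf "taxid_%s\t%d\n", t, taxa[t]
        }
    '
}
```

Assuming the per-sequence lines go to standard output as in the runs above, this could be used as `krakenuniq --db krakenuniq/ --threads 4 --quick contigs.fasta | summarise_classification`, and the numbers should agree with the summary KrakenUniq prints itself.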