leylabmpi / Struo2

Scalable creation/updating of metagenome profiling databases
MIT License

Updating Database: Error with kraken2_add_taxID #43

Open PeterCx opened 1 year ago

PeterCx commented 1 year ago

Hi Nick,

I am still having trouble updating my database with custom MAGs. I run snakemake using the following command and the attached config file (config-update.yaml.txt):

snakemake --use-conda --cores 30 --configfile config-update.yaml

I have attached the snakemake log. The issue is around "kraken2_add_taxID". snakemake_log.txt

Firstly, it reports missing output files (see the snippet from the log below), but this file is actually present. The rule simply unzips the file and places it in the genome directory.

[Sat Jan 28 16:22:48 2023]
rule kraken2_add_taxID:
    input: /workspace/pot/peterc/Equine/Struo2/MAGs/ERR6929713_bin.678.fna.gz
    output: tmp/db_update_tmp/peterc/Struo2_112566273/db_update/genomes/MAG_1203_Nanosyncoccus.fna
    log: /workspace/pot/peterc/Equine/Struo2/Output/logs/db_update/kraken2_add_taxID/MAG_1203_Nanosyncoccus.log
    jobid: 459
    benchmark: /workspace/pot/peterc/Equine/Struo2/Output/benchmarks/db_update/kraken2_add_taxID/MAG_1203_Nanosyncoccus.txt
    reason: Missing output files: tmp/db_update_tmp/peterc/Struo2_112566273/db_update/genomes/MAG_1203_Nanosyncoccus.fna
    wildcards: sample=MAG_1203_Nanosyncoccus
    resources: tmpdir=/tmp, time=59, mem_gb_pt=6
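For reference, the output path that snakemake reports as missing can be checked directly (path copied from the log above; run from the snakemake working directory):

ls -l tmp/db_update_tmp/peterc/Struo2_112566273/db_update/genomes/MAG_1203_Nanosyncoccus.fna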

As a result, it is later unable to produce the done files, so the run is doomed from the start. I have no idea how to solve it, so help is appreciated. I have attached all other relevant files, which may help diagnose the problem.

names.dmp.txt Sample_Table.txt

Many thanks

Kind regards,

P

nick-youngblut commented 1 year ago

What is in the logs for the failed jobs (not the snakemake log)? You are right that the Python script for that job just uncompresses the genome and adds the taxID to the sequence header(s). Is the Python script failing (kraken2_rename_genome.py)?
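Something like the following should show whether any of the per-job logs are non-empty and print their tails (log directory taken from your snakemake log):

find /workspace/pot/peterc/Equine/Struo2/Output/logs/db_update/kraken2_add_taxID \
    -name '*.log' -size +0c -print -exec tail -n 20 {} \;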

PeterCx commented 1 year ago

The logs in the directory Struo2/Output/logs/db_update/kraken2_add_taxID are all completely empty. There is one log for each job/MAG that I wanted to add to the database.

However, the file seqid2taxid.map is being generated correctly and has Kraken taxIDs for each FASTA header in each file. The taxo.k2d.tmp file is also generated.
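For example, a quick look at the mapping (run from the database build directory) shows one sequence ID and taxID per line:

head seqid2taxid.map   # each line: sequence ID <tab> taxID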

PeterCx commented 1 year ago

I re-ran the code and this time it worked. I am not sure how, as I did not do anything differently; at least, I can't remember making a change. Thanks for all the help and for making this tool.

PeterCx commented 1 year ago

Hi Nick,

Sorry to bother you again, but I am encountering further problems with this. As per my last comment, the database seemed to update successfully. However, I am unable to classify my reads using the database. See the output from the build below; everything seems to have been successful.

Creating sequence ID to taxonomy ID map (step 1)...
Sequence ID to taxonomy ID map complete. [2.070s]
Estimating required capacity (step 2)...
Estimated hash table requirement: 312273650832 bytes
Capacity estimation complete. [11m56.324s]
Building database files (step 3)...
Taxonomy parsed and converted.
CHT created with 18 bits reserved for taxid.
Completed processing of 7746034 sequences, 202613488062 bp
Writing data to disk... complete.
Database files completed. [12h3m44.879s]
Database construction complete. [Total: 12h15m47.164s]
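(For scale, the estimated hash table requirement works out to 312273650832 bytes / 1024^3 ≈ 291 GiB.)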

I have made several attempts to classify the reads, some of which have returned the error:

Loading database information..... Killed

Other attempts have managed to load the database successfully and classify 50-90% of the reads, but they fail before finishing with a similar "Killed" error.
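As far as I understand, "Killed" usually means the kernel's OOM killer terminated the process; a check and workaround along these lines might apply (the database path and read files below are placeholders):

dmesg | grep -i 'killed process'   # confirm whether the OOM killer fired
free -g                            # available RAM vs. the ~291 GiB hash table

kraken2 --db /path/to/updated_db --memory-mapping --threads 30 \
    --paired reads_1.fq.gz reads_2.fq.gz \
    --output out.kraken --report out.kreport
# --memory-mapping reads hash.k2d from disk instead of loading it all into
# RAM (slower per read, but memory use stays bounded)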

It seems to be related to these issues, but neither has a solution: https://github.com/DerrickWood/kraken2/issues/184 and https://github.com/DerrickWood/kraken2/issues/84. There must be a problem with the database. Note that I am able to classify without problems using the original database as downloaded from http://ftp.tue.mpg.de/ebio/projects/struo2/GTDB_release207.

It also cannot be a RAM issue: the updated database is only slightly bigger than the original.
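For comparison, the on-disk hash table sizes can be checked directly (both paths are placeholders):

du -sh /path/to/original_db/hash.k2d /path/to/updated_db/hash.k2d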

Also, without making any changes, I tried to re-build the database, but I immediately encountered the same errors as in the first comment of this thread.

I am not sure how to proceed. Many thanks

P