leoisl / bacterial_assembly_pipeline


Kraken2 experiments #5

Closed leoisl closed 4 months ago

leoisl commented 8 months ago

Running Kraken2 on 2 million samples with a 256-GB DB on our infrastructure is more challenging than I previously thought. Here I describe some (incomplete) preliminary tests and benchmarks.

Kraken2 preliminary test

Kraken2 sample command:

kraken2 --db /nfs/research/zi/db/k2_gtdb_r214/gtdb_r214/256gb --threads 1 --report DRR122446.kreport2 --output DRR122446.kraken2 --paired DRR122446_1.fastq.gz DRR122446_2.fastq.gz

Some preliminary benchmarking (in short: the 256-GB DB takes ~18 minutes to load, and classifying one sample then takes ~3 minutes; see the discussion below).

Bracken preliminary test

Bracken sample command:

bracken -d /nfs/research/zi/db/k2_gtdb_r214/gtdb_r214/256gb/ -i SAMD00108761.kreport2 -o SAMD00108761.bracken -r 150

Bracken is very quick and light on RAM (it takes ~3 seconds and ~100 MB of RAM). It also does not need the reads, just the Kraken report, so I am not running it as part of this heavy pipeline.

Disk consumption

As we are working with 2M samples and limited disk space, it is important to benchmark the disk consumption of the output files. As we can see below, all outputs produced by Kraken/Bracken are small (a couple of MB), except for the Kraken output, which takes 1.2 GB:

-rw-rw-r-- 1 leandro iqbal 2.3M Oct 24 13:02 SAMD00108761.kreport2
-rw-rw-r-- 1 leandro iqbal 1.2G Oct 24 13:02 SAMD00108761.kraken2
-rw-rw-r-- 1 leandro iqbal 1.4M Oct 24 13:13 SAMD00108761.bracken
-rw-rw-r-- 1 leandro iqbal 2.2M Oct 24 13:13 SAMD00108761_bracken_species.kreport2

We should however compress these outputs:

-rw-rw-r-- 1 leandro iqbal 506K Oct 24 13:02 SAMD00108761.kreport2.gz
-rw-rw-r-- 1 leandro iqbal 208M Oct 24 13:02 SAMD00108761.kraken2.gz
-rw-rw-r-- 1 leandro iqbal 365K Oct 24 13:13 SAMD00108761.bracken.gz
-rw-rw-r-- 1 leandro iqbal 498K Oct 24 13:13 SAMD00108761_bracken_species.kreport2.gz

So all files take <0.5 MB except for the kraken2 output, which takes ~200 MB compressed. This is a single sample. If we are optimistic and extrapolate ~100 MB of output per sample, we get a disk usage of ~2,000,000 × 100 MB ≈ 190 TB for the Kraken2 output. It could be more or less than this, but it will certainly be on the order of hundreds of TB. We might get a better idea with the upcoming small benchmark on 100 samples. This heavy kraken2 output contains the classification of each read, e.g.:

C       DRR122446.1     20190   151|151 A:3 0:5 3682:1 20190:1 3682:2 A:105 |:| 0:12 A:43 0:6 3:2 0:15 A:39 
U       DRR122446.2     0       151|151 A:3 0:1 3:1 0:7 A:105 |:| 0:12 A:43 0:23 A:39 
C       DRR122446.3     20190   151|151 A:3 20190:11 3:6 0:3 3:3 0:4 3682:8 0:5 3682:2 3:1 0:4 3682:2 0:13 3682:5 A:47 |:| 0:12 A:34 0:57 A:14 
C       DRR122446.4     3       151|151 0:3 3:4 0:5 3:10 0:1 3:5 0:4 3:11 0:4 3:2 0:5 3:15 0:6 3:5 0:5 A:32 |:| 0:12 A:34 0:57 A:14 
U       DRR122446.5     0       151|151 A:3 0:9 A:105 |:| A:55 0:23 A:39 
C       DRR122446.6     3       151|151 3:7 0:2 3:5 0:4 3:2 0:1 3:6 0:1 3:1 0:6 3:7 0:5 3:11 0:4 3:12 0:5 3:9 A:29 |:| 3:6 0:3 3:8 0:10 3:3 0:87
C       DRR122446.7     3       151|151 3:1 0:17 3:1 0:8 3:2 0:1 3:5 0:6 3:6 0:6 3:3 0:1 3:1 0:10 3:2 0:5 3:7 0:6 A:29 |:| 0:2 3:11 0:2 3:5 0:13 3:1 0:5 3:5 0:7 3:1 0:65
C       DRR122446.8     156     151|151 3:8 0:5 3:2 0:2 3:5 0:8 3:1 0:4 3:16 0:3 3:1 0:6 3:5 0:5 3:5 0:12 A:29 |:| 0:23 156:4 0:8 156:2 0:1 156:5 0:74

Bracken does not use this file at all; it uses just the report. So I am thinking of saving hundreds of TB of disk by simply not storing this heavy Kraken2 output and keeping only the report.
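If we go that way, one option (if I remember the Kraken2 manual correctly, passing - to --output suppresses the standard per-read output) would be something like the command below; alternatively we could just delete the file in the same job once it is no longer needed:

kraken2 --db /nfs/research/zi/db/k2_gtdb_r214/gtdb_r214/256gb --threads 1 --report SAMD00108761.kreport2 --output - --paired SAMD00108761_1.fastq.gz SAMD00108761_2.fastq.gz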

The issue: how to scale these runs?

The 256-GB DB takes 18 minutes to load, but we then need only 3 minutes to process a sample, so most of the time is spent loading data from disk and very little actually processing samples. We also need to ask for high-RAM jobs (e.g. ~275 GB), so I think we will be able to have only a few jobs (~200) running simultaneously. While Kraken2 can process several read files in one run (as many as you want), it won't create a correct report in that case - it assumes all reads come from a single sample - so we do need one Kraken2 execution per sample. There are some possible solutions:

  1. Keep the index in RAM on the worker node (/dev/shm) and use the kraken2 --memory-mapping parameter.
    • While this works and would be the fastest solution, it won't work on all nodes. On nodes with 376 GB of RAM, we have only 189 GB available for /dev/shm:
      $ df -h /dev/shm/
      Filesystem      Size  Used Avail Use% Mounted on
      tmpfs           189G   31G  158G  17% /dev/shm

For big-mem nodes, we have 1 TB, so this could be a solution. The idea is then to copy the DB once from /nfs to /dev/shm and process as many samples as possible (see the sketch after this list). This is what is done here: https://github.com/DerrickWood/kraken2/issues/287#issuecomment-1380667069

  2. Keep the index on a fast SSD (/lscratch).

    • Same idea as 1., but using a fast SSD (/lscratch). Many nodes have ~500 GB of /lscratch, so we are not restricted to big-mem nodes only. We copy the 256-GB index once to the /lscratch directory and launch several Kraken2 runs there. We can either load the whole index from the SSD into RAM (in which case we will need to ask for >300 GB RAM nodes) or simply use --memory-mapping (slower, but the RAM requirement is basically zero; I am benchmarking how much slower it is);
  3. Create a fork of Kraken2 and modify it so that it can process several samples in a single run, instead of just one. This is not hard and has been done before, see https://github.com/DerrickWood/kraken2/issues/233#issuecomment-629775384 (note: the solution in the last comment of that issue does not work for us).
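A minimal sketch of the staging idea behind options 1 and 2 (the paths, sample names and loop are illustrative; for option 2 the destination would be under /lscratch instead of /dev/shm):

# stage the 256-GB index once per worker node
DB_SRC=/nfs/research/zi/db/k2_gtdb_r214/gtdb_r214/256gb
DB_LOCAL=/dev/shm/k2_gtdb_256gb   # or e.g. /lscratch/$USER/k2_gtdb_256gb
mkdir -p "$DB_LOCAL"
cp "$DB_SRC"/*.k2d "$DB_LOCAL"/

# then run as many samples as possible against the local copy;
# --memory-mapping avoids re-loading the index into process memory for every sample
for sample in DRR122446 SAMD00108761; do
    kraken2 --db "$DB_LOCAL" --memory-mapping --threads 1 \
        --report "${sample}.kreport2" --output "${sample}.kraken2" \
        --paired "${sample}_1.fastq.gz" "${sample}_2.fastq.gz"
done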

leoisl commented 8 months ago

It seems memory mapping is too slow, even with the DB on the local SSD. With the DB loaded into RAM, Kraken2 can process ~2M reads per minute; using --memory-mapping, it seems I am getting ~10k reads per hour, which is way too slow, but I will leave the process running...

mbhall88 commented 8 months ago

In terms of the Kraken read-classification file that uses hundreds of MB, we would only need to keep it if we were planning on removing "contaminant" reads. Even then, if we did want to do that, we could easily do it in the same job and clean up that file afterwards. The report will give us the overall stats on what percentage of reads are from species X, etc.

Great work testing out the scaling of these databases.

I wonder if the best first thing to do is to benchmark each one on some toy data and see what the precision/recall trade-off is for the different-sized DBs? Because if we get equally good precision/recall for the 64-GB DB, for example, that should simplify all of this a little, right?

leoisl commented 8 months ago

In terms of the Kraken read-classification file that uses hundreds of MB, we would only need to keep it if we were planning on removing "contaminant" reads. Even then, if we did want to do that, we could easily do it in the same job and clean up that file afterwards.

Yeah, it would be interesting to re-assemble all datasets without contaminant reads... but I think we won't do that...

Great work testing out the scaling of these databases.

I wonder if the best first thing to do is to benchmark each one on some toy data and see what the precision/recall trade-off is for the different-sized DBs? Because if we get equally good precision/recall for the 64-GB DB, for example, that should simplify all of this a little, right?

Yep, it would! I have not got to this phase yet; I will do it soon. I am just trying to figure out whether I am missing anything obvious about how to run 2M Kraken2 executions with a 256-GB DB on the EBI infrastructure in a reasonable amount of time...

Also, we need to define what a reasonable time is... In LSF, I can get 170 big-mem cores. If we assume every Kraken2 run takes 3 minutes, it would take ~24 days to complete them all, which seems reasonable? In practice it will take more, as I am not accounting for the time to download reads, some runs failing, etc., but even 1.5 months seems reasonable...
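For reference, the arithmetic behind the ~24 days (assuming 170 concurrent runs, each handling one sample at a time):

2,000,000 samples × 3 min / 170 concurrent runs ≈ 35,300 min ≈ 588 h ≈ 24.5 days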

But you are right, if we can use even the 128-GB DB, things get much easier to run...

leoisl commented 8 months ago

It seems memory mapping is too slow, even with the DB on the local SSD. With the DB loaded into RAM, Kraken2 can process ~2M reads per minute; using --memory-mapping, it seems I am getting ~10k reads per hour, which is way too slow, but I will leave the process running...

I am stopping this: after 10 h, only 110k reads were classified. That is 11k reads/hour, which is way too slow, even backed by an SSD...

mbhall88 commented 8 months ago

Also, we need to define what a reasonable time is... In LSF, I can get 170 big-mem cores. If we assume every Kraken2 run takes 3 minutes, it would take ~24 days to complete them all, which seems reasonable? In practice it will take more, as I am not accounting for the time to download reads, some runs failing, etc., but even 1.5 months seems reasonable...

A couple of months seems very acceptable to me.

iqbal-lab commented 8 months ago

Yes, I agree about 2 months. I also agree it is worth checking the 64-GB index. The fiddly bit is setting up the precision/recall test. There are three separate things:

  1. Limit of detection
  2. Correct assignment of species
  3. Correct abundance estimates

This becomes quite involved. Since our goal is primarily
    • is there one dominant species at abundance above x%?
    • name that species and identify contamination
could we simplify the test?

leoisl commented 8 months ago

Ok, so how well can we measure these things? I will just list some points here, but I'd need your help to know exactly how we should evaluate.

  1. Get a reasonably sized, unbiased set of samples;
    • I was thinking of getting 10k samples spread across the phylogeny, classified by Grace;
  2. Make plots showing the correlation of the most abundant species and of the abundance estimates between the different Kraken2 DB versions (8, 16, 32, 64, 128 and 256 GB);

This is a mock-up of such a plot:

[mock scatter plot omitted: abundance of the most abundant species, smaller DB (Y-axis) vs the 256-GB DB (X-axis)]

On the Y-axis we would have the 5 smaller DBs (8, 16, 32, 64, 128 GB), so we would actually have 5 such plots, and the X-axis would always be the 256-GB DB. I think this plot could answer how well we can identify the dominant species and its abundance using the smaller DBs with respect to the largest one, and it is actually essential to justify our choice of DB;

Would you propose any improvements to the previous points or any additional plots/experiments?

Note: We could also add Grace's classification and abundance estimates as ground truth, although I don't think we should. She used the Kraken 2 microbial database (2018, 30 GB), but it seems that GTDB is more comprehensive? In that case it would not make sense to check whether the GTDB classification is correct using the Kraken 2 microbial database;

iqbal-lab commented 8 months ago

100% agree - that's a great plot, and I agree we shouldn't treat Grace's as ground truth as it is a different database. You probably don't need as many as 10k; 1k would be enough

mbhall88 commented 8 months ago

Ground truth is going to be hard from real datasets - unless you use a mock community I guess?

I think a good way to set it up might be to simulate a sample and add in contaminants; that way we know the abundances.

An easier method might just be precision/recall of read classification of a simulated metagenomic sample that simulates from across all bacteria (like I created in https://www.biorxiv.org/content/10.1101/2023.09.18.558339v3 - though we can revisit whether this was sampled from enough of the tree; it would be easy to expand it). Abundance is ultimately a "crude" approximation of read classification anyway, so if we select based on the best precision/recall at the read-classification level, this should also select the best DB for abundance, right? (This would also make it easier to compare across sizes, as you have a simple numerical value.)
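(For concreteness, at a given rank - say species - this is roughly: precision = reads assigned to the correct taxon / all reads the tool classified, and recall = reads assigned to the correct taxon / all reads; so each DB size boils down to two numbers.)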

leoisl commented 8 months ago

An easier method might just be precision/recall of read classification of a simulated metagenomic sample that simulates from across all bacteria (like I created in https://www.biorxiv.org/content/10.1101/2023.09.18.558339v3 - though we can revisit whether this was sampled from enough of the tree; it would be easy to expand it).

Yes, I agree. At most we would need minor tweaks on the sampling.

I could run either evaluation, I don't mind, but I think we should settle on one, as doing both would be redundant. The simulated-dataset one seems a bit easier to do, as most of the code is already written and we just need to run Kraken once per DB size. Another good point is having the ground truth; the "negative" point is that it is not real data (unsure if this is indeed a negative point). Please let me know which one to go for.

leoisl commented 8 months ago

I am actually heavily in favour of the simulated approach, as it will allow us to reuse what Michael has done, and we might be able to solve this issue this week and start classifying the assemblies next week already

iqbal-lab commented 8 months ago

Yes, simulation is the way forward, agreed

leoisl commented 8 months ago

Ok, so I've read Michael's paper - very interesting and super useful! I have several questions and points for both of you, listed below, on how best to use the code and data we currently have to do this evaluation. I think reusing Michael's pipeline is essential to get this done correctly and quickly!

1. Human reads classification/filtering

We were thinking of using minimap2 with the T2T human reference to classify/quantify human reads. In your paper, it seems you did just that, and tested several other methods! This benchmarking shows that Kraken with the HPRC DB has a good enough F-score while being an order of magnitude less costly than the other tools, so my vote goes to using it. I think its advantage over the other tools is significant enough to choose it when running on 2M samples. Anyway, we should choose one method based on this table; there is no need for us to re-evaluate human-read classification. So what would be your choice?

[table from the paper benchmarking human-read classification methods omitted]

2. Bacterial reads classification

This is where we might have a few changes...

  1. How to simulate reads?

    • I can see that you simulate reads from genomes in the Kraken2 bacteria library, which are genomes from RefSeq, I guess? I am OK with keeping the same simulation as you do, as the RefSeq taxonomy can be used as ground truth. However, I guess RefSeq is biased, so maybe it would be interesting to reduce max_asm to a small number, e.g. 100 or 50, so as not to bias the precision/recall towards oversampled bacteria (e.g. E. coli, TB, etc.)?
    • We could use genomes from other DBs (e.g. GTDB, 661k, etc.), but they will all still be biased. The way around the bias is to do post-filtering of the DB genomes. And RefSeq gives us a taxonomy we can trust, so I think there is no need to explore other DBs...
    • For the rest of the read-simulation pipeline, do I just reuse your rules, with the change that I don't treat TB as a special bacterium? i.e. I don't exclude it when filtering bacterial assemblies, and I don't run (or I remove) the TB-specific rules?
  2. Classify simulated reads

    • Is it as simple as running the kraken_classify rule, but changing CLASSIFY_DBS to the various subsampled GTDB DBs we have? Could you confirm that this rule already removes human reads using Kraken2 with the HPRC DB, which is what I was proposing in point 1? (I am trying to understand this pipeline just by reading it; I am running it as well, but it takes some time to set up the DBs...)
  3. Evaluate classification

    • Does the classify_summary_statistics rule already assess the read classification at the genus and species level in a general way? It also assesses the MTB and MTBC levels, but for this use case we could just ignore those classifications?

3. Software engineering considerations

Questions on how we proceed with the development:

  1. Should you or I work on this? I already took the initiative of reading your paper and your pipeline, and your code is impressive with respect to readability and maintainability. I feel confident I can easily do the required changes, but if you prefer to do the changes yourself and have the bandwidth to do it now/soon, please go for it!
  2. Improve the classification_benchmark pipeline, or fork and specialise it? As explained before, we need several small changes and to turn some parts of the pipeline on and off. We could do this either by improving the current pipeline, making it more configurable, or by forking it and tailoring it to this use case. I don't know what you'd prefer; I have no preference.

iqbal-lab commented 8 months ago

Quick replies.

Human read counting: yes, agree on Kraken + HPRC, following Michael.

Bacterial read classification: how's this - just take the HPRC taxonomy, and take 100 random species, making sure you include the top 10 species we think are most abundant in our data. HPRC gives you a genome for each. Then for each of them simulate 30x of 100-bp reads. Then for these do the following:

  1. Take a random 2nd species, and simulate reads such that they constitute [10%, 5%, 1%, 0.1%] of the total (rough read counts sketched after this list)
  2. As above, but take random reads from the 100 species mentioned above - i.e. noise.
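A rough worked example of the numbers involved (assuming a ~4-Mb main genome): 30x coverage with 100-bp reads gives N ≈ 30 × 4,000,000 / 100 = 1,200,000 reads, and a contaminant making up a fraction p of the total needs N × p / (1 - p) extra reads - roughly 133k at 10%, 63k at 5%, 12k at 1% and 1.2k at 0.1%.
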
mbhall88 commented 8 months ago

@iqbal-lab I think you might be confused about what HPRC (the Human Pangenome Reference Consortium) is, based on your previous comment?

@leoisl Use the HPRC db with kraken if runtime is important, use Hostile otherwise.

A better way to simulate the bacteria compared to what I did could be by using https://github.com/pirovc/genome_updater to download representative GTDB genomes from across all bacteria (if you do a dry run it will tell you how many genomes it will download). Depending on how large that is, we could reduce this set using assembly dereplicator or something similar. Essentially, simulating Illumina reads with ART will sample from each "contig" in the reference you pass, so you won't have biased sampling from the simulator. I think ART loads the whole reference into memory, but I can't be sure off the top of my head. If so, that would be the only limitation to having a big representative reference to simulate from.
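For reference, the ART invocation would look roughly like this (the parameters and the file name gtdb_reps.fna are illustrative, not what I actually used):

art_illumina -ss HS25 -i gtdb_reps.fna -p -l 150 -f 30 -m 400 -s 50 -o simulated_reads_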

The only thing that will be a bit circular here is that the GTDB representative genomes will be in the Kraken DBs we're testing. But I guess that doesn't matter, considering we just want to see how much we lose with smaller DBs.

I mean, we could just use the Illumina reads I simulated, as I also have the truth species for each read in a TSV on the Zenodo upload for the reads. Given we're only interested in the trade-off with different DB sizes, this might be sufficient?

For the rest of the read-simulation pipeline, do I just reuse your rules, with the change that I don't treat TB as a special bacterium? i.e. I don't exclude it when filtering bacterial assemblies, and I don't run (or I remove) the TB-specific rules?

correct

Does the classify_summary_statistics rule already assess the read classification at the genus and species level in a general way? It also assesses the MTB and MTBC levels, but for this use case we could just ignore those classifications?

The species level should work, but you might want to double-check it doesn't have some weird TB-related assumptions.

Should you or I work on this? I already took the initiative of reading your paper and your pipeline, and your code is impressive with respect to readability and maintainability. I feel confident I can easily do the required changes, but if you prefer to do the changes yourself and have the bandwidth to do it now/soon, please go for it!

If you could, that would be great, as I am in the middle of writing a big fellowship application

Improve the classification_benchmark pipeline, or fork and specialise it? As explained before, we need several small changes and to turn some parts of the pipeline on and off. We could do this either by improving the current pipeline, making it more configurable, or by forking it and tailoring it to this use case. I don't know what you'd prefer; I have no preference.

Maybe just a new repo/pipeline (or even in the AllBacteria repo?). There's a lot of stuff we won't want/need in my repo/pipeline so just copy and paste what we need

iqbal-lab commented 8 months ago

I just meant use kraken with the hprc database?

iqbal-lab commented 8 months ago

I agree with this, re circularity

"But I guess that doesn't matter considering we just want to see how much we lose with smaller DBs"

leoisl commented 8 months ago

A better way to simulate the bacteria compared to what I did could be by using https://github.com/pirovc/genome_updater to download representative GTDB genomes from across all bacteria (if you do a dry run it will tell you how many genomes it will download).

Thanks, this is super useful! A simple

genome_updater.sh -d "refseq,genbank" -g "archaea,bacteria" -f "genomic.fna.gz" -o "GTDB_5_per_species" -M "gtdb" -A "species:5" -t 12 -m -k

selects 5 genomes for every species in GTDB! Unfortunately, these 5 do not seem to be random :( (the selection order is based on: RefSeq category, assembly level, relation to type material, date). If they were random, we would basically be done, but we can download a few hundred per species and randomly downsample, I guess (see the sketch below).
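A possible (hypothetical) sketch of that downsampling, assuming genome_updater keeps an NCBI-style assembly_summary.txt in its output version directory (organism name in column 8) - the exact file layout and column would need checking:

genome_updater.sh -d "refseq,genbank" -g "archaea,bacteria" -f "genomic.fna.gz" -o "GTDB_200_per_species" -M "gtdb" -A "species:200" -t 12 -m -k

# keep 5 random assemblies per species
awk -F'\t' '!/^#/ {print $8"\t"$1}' GTDB_200_per_species/*/assembly_summary.txt | shuf | awk -F'\t' 'count[$1]++ < 5 {print $2}' > accessions_to_keep.txt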

I think ART loads the whole reference into memory, but I can't be sure off the top of my head. If so, that would be the only limitation to having a big representative reference to simulate from.

This is ok, running a single big-mem job, if needed, is not an issue

I mean, we could just use the Illumina reads I simulated, as I also have the truth species for each read in a TSV on the Zenodo upload for the reads. Given we're only interested in the trade-off with different DB sizes, this might be sufficient?

I agree with this, and sorry that I missed this resource... I went straight to reading your rules. These are the reads Michael refers to: https://zenodo.org/records/8339791 . I think this is sufficient as well. The only issue is that it is biased towards TB reads (the proportions are 46% each for bacteria and human, 6% Mycobacterium tuberculosis complex (MTBC), and 1% each for viruses and non-tuberculous mycobacteria (NTM)), but if this is not a problem (it is not for me), let's simply use these reads.

I just meant use kraken with the hprc database?

Yes, Kraken+HPRC will be used, but just to remove human reads from the already simulated reads

iqbal-lab commented 8 months ago

All sounds good. I do think, since we are really testing the effect of reducing database size, that it is not necessary to go huge in terms of the number of genomes we simulate. I would be happier if I understood what Kraken does to reduce DB size - I think it is removing k-mer/taxon pairs from its database, so the effect should be on the limit of detection (how much depth you need to detect something) rather than on whether it properly identifies species or Bracken estimates abundance

mbhall88 commented 8 months ago

@iqbal-lab It downsamples the minimizers in the db

@leoisl you could just pull out the bacteria reads based on the truth tsv which would simplify the sampling proportions?

Could you confirm that this rule already removes human reads using Kraken2 with the HPRC DB, which is what I was proposing in point 1? (I am trying to understand this pipeline just by reading it; I am running it as well, but it takes some time to set up the DBs...)

There is a rule at the top which removes the human reads first. And no need to rebuild the DBs - they're all available on Zenodo (see the README).

leoisl commented 8 months ago

@leoisl you could just pull out the bacteria reads based on the truth tsv which would simplify the sampling proportions?

yes! will do that!
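Something along these lines, probably (a rough sketch; the truth-TSV column layout and file names are guesses):

# hypothetical truth-TSV layout: read_id <tab> taxon/category
awk -F'\t' 'tolower($2) !~ /human|virus/ {print $1}' truth.tsv > bacteria_read_ids.txt
seqkit grep -f bacteria_read_ids.txt simulated_reads_1.fastq.gz -o bacteria_1.fastq.gz
seqkit grep -f bacteria_read_ids.txt simulated_reads_2.fastq.gz -o bacteria_2.fastq.gz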

There is a rule at the top which removes the human reads first. And no need to rebuild the DBs - they're all available on Zenodo (see the README).

yep, thanks!

leoisl commented 4 months ago

This was done, but not by me