pirovc / ganon

ganon2 classifies genomic sequences against large sets of references efficiently, with integrated download and update of databases (refseq/genbank), taxonomic profiling (ncbi/gtdb), binning and hierarchical classification, customized reporting and more
https://pirovc.github.io/ganon/
MIT License

Estimating and reducing memory usage during build-custom #289

Open · peterk87 opened this issue 2 months ago

peterk87 commented 2 months ago

Hello,

I've been trying, unsuccessfully, to build a custom Ganon (v2.1.0) database: I run out of memory during the raptor prepare stage on a machine with 512 GB of RAM. I tried building a database from NCBI nt viruses using the very useful instructions in the docs, and also tried to reduce the oversampling of certain viruses using the info in AllNuclMetadata.csv.gz, but I'm not sure how far I need to go with downsampling. I can quickly and easily build a database from RefSeq Viruses, but I would really like to have more viral diversity represented in the database for viral metagenomics.

Would a more appropriate strategy given my memory constraints be to start with a RefSeq Viruses database and incrementally add the viral species I'm interested in using the update functionality? Or should the database be split into multiple databases based on rank?

Thanks! Peter

pirovc commented 2 months ago

Hi Peter, you can try --filter-type ibf and maybe change the --mode to get a smaller database and use less RAM. However, the default HIBF is recommended for its querying speed and controlled false-positive rates.

I believe raptor prepare is having issues due to the very large number of sequences. One option is to merge them by taxid (as the blast command in the documentation does) and use that as input instead of many small sequence files. Building several databases split by rank is also a good idea; you can still use all filters at once in ganon classify. Just keep in mind that classification will take a little longer that way.

The update functionality in ganon2 will just re-build the database, so I don't think it would make any difference here.

Would you mind sharing the command line you use to build the database?

peterk87 commented 2 months ago

Build command:

ganon build-custom \
  --input-file viruses.tsv \
  --taxonomy ncbi \
  --db-prefix viruses \
  --level sequence \
  --threads 16

where viruses.tsv contains:

viruses.fa.gz    NC_086348.1     426786
viruses.fa.gz    NC_086346.1     426786
viruses.fa.gz    NC_086347.1     426786
viruses.fa.gz    NC_083851.1     2283315
...many more such lines...
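To decide how far to go with downsampling, it can help to first see which taxids dominate the input. A minimal sketch (hypothetical helper names; column order as in the TSV above):

```python
# Count how many sequences each taxid contributes in a ganon build-custom
# input TSV (columns: file, accession, taxid), to spot oversampled taxa.
from collections import Counter

def taxid_counts(tsv_lines):
    """Return a Counter mapping taxid -> number of sequences."""
    counts = Counter()
    for line in tsv_lines:
        line = line.strip()
        if not line:
            continue
        _file, _accession, taxid = line.split()[:3]
        counts[taxid] += 1
    return counts

# Demo with the excerpt shown above.
demo = [
    "viruses.fa.gz\tNC_086348.1\t426786",
    "viruses.fa.gz\tNC_086346.1\t426786",
    "viruses.fa.gz\tNC_086347.1\t426786",
    "viruses.fa.gz\tNC_083851.1\t2283315",
]
for taxid, n in taxid_counts(demo).most_common():
    print(taxid, n)
```

For the real file, pass `open("viruses.tsv")` instead of the demo list; taxids with thousands of near-identical sequences are the natural downsampling targets.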

> One option is to merge them by taxid (like the blast command does in the documentation) and use it as input instead of many small sequence files.

Right now I have all sequences in a single file. It looks like the build step creates about 200k symlinks to the same file in viruses_files/build. So would it be a good idea to write those sequences out to taxid-specific files instead of keeping them all in a single file?

> Building several databases split by rank is also a good idea, you can still use all filters at once in ganon classify.

That's great! I'll try it out as well as IBF instead of HIBF to see if that makes a difference.

As a follow-up, how much memory would be required to build a Ganon DB from NCBI nt and what kind of filtering and optimizations would you recommend?

Thanks for all the suggestions!

pirovc commented 2 months ago

> Right now I have all sequences in a single file. It looks like the build step creates about 200k symlinks to the same file in viruses_files/build. So it would be a good idea to write those sequences out to taxid specific files instead of having them all in a single file?

There are two bottlenecks in your usage: first, using a single file adds overhead, since it has to be split internally into smaller files for ganon-build. Second, you are building at the most fragmented level possible (sequence), which requires the most resources, since k-mers have to be stored independently for many targets.

My suggestion is to try the blast example from the documentation with db="nt_viruses", where the files are pre-split for you and the database is built at the taxonomic-leaves level.

> As a follow-up, how much memory would be required to build a Ganon DB from NCBI nt and what kind of filtering and optimizations would you recommend?

I don't have a precise estimate with the current data, but it's probably in the hundreds of GB. I'd suggest using --level species to reduce the number of targets and --max-fp 0.01 to reduce memory and disk usage at the cost of more false positives.
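For intuition on why --max-fp trades memory for false positives, the classic single-Bloom-filter sizing formula gives a rough lower bound (a back-of-envelope sketch only, not ganon's exact HIBF layout; the k-mer count is a made-up example):

```python
# Rough Bloom/IBF sizing: memory needed to store n elements (k-mers/minimizers)
# at a target false-positive rate p, using the optimal-filter formula
#   bits/element = -ln(p) / (ln 2)^2
import math

def bloom_bits_per_element(p):
    """Optimal bits per stored element for false-positive rate p."""
    return -math.log(p) / (math.log(2) ** 2)

def estimate_gib(n_elements, max_fp):
    """Rough memory estimate in GiB for n_elements at max_fp."""
    bits = n_elements * bloom_bits_per_element(max_fp)
    return bits / 8 / 1024**3

# Illustrative only: compare a stricter vs a looser false-positive target
# for 10 billion distinct minimizers.
for fp in (0.001, 0.01):
    print(f"max-fp={fp}: ~{estimate_gib(10e9, fp):.1f} GiB")
```

Relaxing p from 0.001 to 0.01 drops the cost from about 14.4 to about 9.6 bits per element, which is why raising --max-fp shrinks the filter.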

peterk87 commented 2 months ago

Thanks for the suggestions! I did try setting --level to species and leaves, but I was running out of memory either way.

I'll try splitting the sequences into per-species-taxid files, setting the level to species, and increasing the max FP to see if that makes a difference.
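That per-taxid split can be sketched as follows, assuming a FASTA plus the three-column TSV mapping shown earlier (function and file names here are illustrative, not part of ganon):

```python
# Write one FASTA per taxid so ganon build-custom receives pre-merged files
# instead of ~200k symlinks into a single big file.
import os

def split_by_taxid(fasta_path, tsv_path, outdir):
    """Write <outdir>/<taxid>.fa files, grouping records by taxid."""
    # Accession -> taxid mapping from the TSV (file, accession, taxid).
    acc2taxid = {}
    with open(tsv_path) as tsv:
        for line in tsv:
            if line.strip():
                _file, acc, taxid = line.split()[:3]
                acc2taxid[acc] = taxid
    os.makedirs(outdir, exist_ok=True)
    handles = {}  # taxid -> open output file
    try:
        with open(fasta_path) as fasta:
            out = None
            for line in fasta:
                if line.startswith(">"):
                    acc = line[1:].split()[0]
                    taxid = acc2taxid.get(acc, "unassigned")
                    if taxid not in handles:
                        handles[taxid] = open(
                            os.path.join(outdir, taxid + ".fa"), "w")
                    out = handles[taxid]
                out.write(line)
    finally:
        for h in handles.values():
            h.close()
```

For a gzipped input like viruses.fa.gz, swap `open(fasta_path)` for `gzip.open(fasta_path, "rt")`. The resulting per-taxid files can then be listed in the build-custom input TSV one file per taxid.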

I had tried a slightly modified version of the blast example from the docs using the viruses subset of an NCBI BLAST nt DB (get_species_taxids.sh -t 10239 > viruses.taxidlist && blastdbcmd -taxidlist viruses.taxidlist -db nt -outfmt "%a %T %s" | awk ...), but that failed to produce the HIBF file after a couple of days, though minimizer files were created for each fasta in the build directory. I'll try again and see if I can provide more useful info.