ncbi / SKESA

SKESA assembler

Memory crash? #17

Closed: Irrussional closed this issue 4 years ago

Irrussional commented 5 years ago

Hello! I am trying to co-assemble a low-coverage genome from 3 deeply sequenced libraries, total size a bit less than 25 GB. At first I tried a simple run, and it returned the error "Memory provided is insufficient".

Then I increased the available memory to 600 GB, and that run failed as well, this time with a std::bad_alloc error.

Then I realized you're giving an example. So, if I take the "16 GB per 20x coverage of a 5 Mbp genome" formula (100 Mbp of sequence), I'd want 16x250 = 4 TB of memory. Is this logic correct? That's quite some memory! But I feel like I'm just misunderstanding (I am still very new to bioinformatics).

Anyway, please let me know what you think about this and possibly how to fix it.
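For readers following the arithmetic, here is a minimal sketch of the extrapolation described in this post. It assumes memory scales linearly with total sequenced bases and that roughly 25 GB of read files corresponds to about 25,000 Mbp of sequence; both are the poster's working assumptions, not documented SKESA behaviour, and the function name is hypothetical. The maintainer's reply that follows says this much memory is not actually needed.

```python
# Back-of-envelope reproduction of the estimate in the post above.
# Assumption: memory grows linearly with total sequenced bases, and
# ~25 GB of read files is treated as ~25,000 Mbp of sequence.

REFERENCE_MEMORY_GB = 16    # memory figure quoted in the post
REFERENCE_GENOME_MBP = 5    # genome size in the quoted example
REFERENCE_COVERAGE = 20     # coverage in the quoted example


def naive_memory_gb(total_sequence_mbp: float) -> float:
    """Scale the reference memory figure linearly by total sequence (hypothetical helper)."""
    reference_sequence_mbp = REFERENCE_GENOME_MBP * REFERENCE_COVERAGE  # 100 Mbp
    return REFERENCE_MEMORY_GB * total_sequence_mbp / reference_sequence_mbp


if __name__ == "__main__":
    reads_mbp = 25_000  # assumed conversion of ~25 GB of reads
    print(f"{naive_memory_gb(reads_mbp) / 1000:.1f} TB")
```

The linear scaling is the weak point of this estimate: a k-mer based assembler's memory depends more on the number of distinct k-mers than on the raw volume of reads.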

Cheers! Artur

souvorov commented 5 years ago

No, you definitely don't need 4TB. Try the option --hash_count without any other memory requirements and let me know what happens. If your genome is estimated at 250 Mbp, also use --estimated_kmers 2500.

How much physical memory is installed on your computer?

Irrussional commented 5 years ago

Okay, so it happened at kmer 59 again: std::bad_alloc. I am operating a 2TB node, but I'm not sure if I get all of it allocated to me.

souvorov commented 5 years ago

If you are on Linux, put /usr/bin/time -f "%U user %S system %E elapsed %M memory" before the skesa command and post what it reports and the command you used.

Are the reads publicly available?

Irrussional commented 5 years ago

Hi, I am not sure whether I should copy this part /usr/bin/time -f "%U user %S system %E elapsed %M memory" fully, or do you expect me to fill in some values? If so, do I need to put linux or debian9 for system? What do I put in elapsed? Do I put the amount of memory I allocate to skesa later in the command in the memory field? Sorry, I am really not the most experienced user yet : )

souvorov commented 5 years ago

No worries :) On Linux, put this line as it is before the skesa command, like this: /usr/bin/time -f "%U user %S system %E elapsed %M memory" skesa ...(the options you use)

It will print out the actual memory and time used by your process.

Irrussional commented 5 years ago

Okay, rerunning! One thing I noticed, and it keeps happening, is that it reports "Bloom filter false positive rate is too high - increasing the bloom filter size and recalculating"

I am not entirely sure if this is worth noting, but I thought I'd rather let you know.

souvorov commented 5 years ago

This message means that the genome size is higher than the estimate the program started with. It is not a problem by itself - the program will eventually figure it out.
1) Did you use --estimated_kmers as I suggested?
2) What kind of genome are we talking about? What is the estimated size of the genome you are assembling?
3) Find the first line after "Bloom filter false positive rate.." which looks like the line below and post the numbers. The second number will tell us what the program thinks about the size of the genome.
Initial kmers: 52438982 Kmers above threshold: 50726454 Total kmers: 1346997926 Hash table size: 78206256(2033.3MB)
4) If I can get the reads I will debug it for you.
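To pull those numbers out of a long log without hunting by eye, a small parser along these lines might help. It is only a sketch: the field labels are copied from the sample line quoted in this comment, while the regular expression and the function name `parse_kmer_line` are illustrative assumptions, not part of SKESA.

```python
import re

# Pattern for the summary line quoted above. The labels come from that
# sample line; the regex itself is an illustrative assumption.
KMER_LINE = re.compile(
    r"Initial kmers:\s*(?P<initial>\d+)\s+"
    r"Kmers above threshold:\s*(?P<above_threshold>\d+)\s+"
    r"Total kmers:\s*(?P<total>\d+)\s+"
    r"Hash table size:\s*(?P<hash_entries>\d+)\s*\((?P<hash_mb>[\d.]+)MB\)"
)


def parse_kmer_line(line: str) -> dict:
    """Return the kmer counts and hash table size found in one log line."""
    match = KMER_LINE.search(line)
    if match is None:
        raise ValueError("not a kmer summary line")
    fields = match.groupdict()
    hash_mb = float(fields.pop("hash_mb"))       # size in MB, as printed
    parsed = {name: int(value) for name, value in fields.items()}
    parsed["hash_mb"] = hash_mb
    return parsed


if __name__ == "__main__":
    sample = ("Initial kmers: 52438982 Kmers above threshold: 50726454 "
              "Total kmers: 1346997926 Hash table size: 78206256(2033.3MB)")
    print(parse_kmer_line(sample))
```

The second field, `above_threshold`, is the number the comment refers to as the program's view of the genome size.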

tseemann commented 5 years ago

SKESA is a de-novo sequence read assembler for microbial genomes based on DeBruijn graphs

Just to confirm you are trying to extract/assemble a 5 Mbp genome (bacteria I assume) from a metagenomic data set? And that genome is low coverage? How low do you think it is? You will need at least 25x depth across it (@souvorov to confirm). What is in the metagenome - more bacteria? or can you remove human or host DNA first then attempt assembly? Do you have a related reference genome you could use to bait the reads of interest, and try assembling those?

Irrussional commented 5 years ago

@souvorov 1-2. I missed that, probably because my genome is in fact 2.5 Mbp at most.

  3. Initial kmers: 332539687 Kmers above threshold: 329726936 Total kmers: 32277201047 Hash table size: 495469192 (8918.4MB)
  4. Unfortunately it isn't publicly available.

Irrussional commented 5 years ago

@tseemann: a low-coverage ~2.5 Mbp genome from a low-diversity metagenomic dataset (not more than a dozen species). I'm not sure if I can achieve 25x, but we will have to see: I'm currently combining 3 libraries to co-assemble and co-bin, so fingers crossed. The idea of removing the host reads sounds logical; I'll check with my supervisor why we never discussed that :D Maybe they are sure we can co-assemble it anyway. The closest reference genome would probably be at about 50% ANI, so I don't think it's useful. Correct me if I am wrong.

tseemann commented 5 years ago

You have a very difficult problem recovering a MAG when you don't have much coverage or a good reference. I would use megahit or minia3 before SKESA in your situation. Good luck!

Irrussional commented 5 years ago

@tseemann I am actually running multiple assemblers to get a feeling for how different tools perform in different situations, thanks for the minia3 suggestion, will do that too!

Irrussional commented 5 years ago

So it seems to have crashed again, even though it ran for much longer this time. Here's the requested output: 2418145.77 user 6005.51 system 14:52:54 elapsed 519797824 memory
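For anyone reading that report: GNU time's %M field is the maximum resident set size in kilobytes, and %E is elapsed wall-clock time as [hours:]minutes:seconds. The sketch below converts the line into more familiar units. The function name `parse_time_report` is hypothetical, and the parsing assumes exactly the -f format string suggested earlier in this thread.

```python
def parse_time_report(report: str) -> dict:
    """Parse '<U> user <S> system <E> elapsed <M> memory' as printed by GNU time."""
    tokens = report.split()
    # Tokens alternate value, label: pair them up keyed by label.
    values = {label: value for value, label in zip(tokens[0::2], tokens[1::2])}
    # %E is [hours:]minutes:seconds; fold the fields into seconds.
    elapsed = 0.0
    for part in values["elapsed"].split(":"):
        elapsed = elapsed * 60 + float(part)
    return {
        "user_s": float(values["user"]),
        "system_s": float(values["system"]),
        "elapsed_s": elapsed,
        "max_rss_kb": int(values["memory"]),   # %M is reported in kilobytes
    }


if __name__ == "__main__":
    line = "2418145.77 user 6005.51 system 14:52:54 elapsed 519797824 memory"
    r = parse_time_report(line)
    peak_gib = r["max_rss_kb"] / 1024**2
    busy_cores = (r["user_s"] + r["system_s"]) / r["elapsed_s"]
    print(f"peak memory: {peak_gib:.1f} GiB")
    print(f"average busy cores: {busy_cores:.1f}")
```

Read this way, the run peaked at roughly half a terabyte of resident memory before it crashed.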

tseemann commented 5 years ago

Basically, you are trying to use an isolate assembler on a metagenome, and one with poor depth of the target. It will probably never work properly, unfortunately.

souvorov commented 4 years ago

There are two different issues - memory consumption and SKESA's usability for metagenomic samples.

At this point I can neither explain nor reproduce >500GB memory usage for 25GB of reads.

Metagenomic samples are definitely beyond the intended use of SKESA. Still, depending on the circumstances, it can produce some useful output. Although the target genome is small, SKESA knows nothing about it and will attempt to assemble everything in the sample. In this case the genome size for SKESA is a concatenation of all present genomes. According to the above kmer counts, a rough estimate of this effective genome size is ~300 Mbp. There are two factors which affect the assembly - repeats and lack of coverage. SKESA is designed to stop at any repeat it cannot resolve with kmers. It is an issue for isolates, and it becomes increasingly bad for metagenomes of similar species. The 25x coverage is indeed the limit below which the assembly will become more fragmented and eventually disappear. The bottom line is that, even for metagenomes, SKESA should assemble unique sequences with reasonably high coverage.
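The ~300 Mbp figure can be sanity-checked from the counts posted earlier in the thread. The sketch below assumes that the number of distinct kmers above the count threshold approximates the total length of all genomes in the sample, which only holds when repeats are rare. The variable names are illustrative, and the mean multiplicity is a crude average over the whole sample, not the coverage of the target genome.

```python
# Counts copied from the log line posted earlier in this thread.
KMERS_ABOVE_THRESHOLD = 329_726_936
TOTAL_KMERS = 32_277_201_047

# Assumption: each distinct kmer above the threshold marks roughly one
# genomic position, so their count approximates the combined genome length.
effective_genome_mbp = KMERS_ABOVE_THRESHOLD / 1e6

# Crude sample-wide average. TOTAL_KMERS also includes kmers below the
# threshold, so this slightly overstates the true mean multiplicity.
mean_kmer_multiplicity = TOTAL_KMERS / KMERS_ABOVE_THRESHOLD

if __name__ == "__main__":
    print(f"effective genome size: ~{effective_genome_mbp:.0f} Mbp")
    print(f"mean kmer multiplicity: ~{mean_kmer_multiplicity:.0f}x")
```

A high sample-wide average says little about a rare member of the community: abundant organisms or host DNA can dominate the kmer pool while the target genome stays under the coverage limit.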

Removing host reads should help.

tseemann commented 4 years ago

"The bottom line is that, even for metagenomes, SKESA should assemble unique sequences with reasonably high coverage."

Yes I totally agree with that. But unfortunately @Irrussional has low coverage.

Irrussional commented 4 years ago

Thank you very much for the useful clarifications! I assume this issue can be considered "solved".