Irrussional closed this issue 4 years ago
No, you definitely don't need 4TB. Try the option --hash_count without any other memory requirements and let me know what happened. If your genome is estimated at 250Mbp, also use --estimated_kmers 2500.
How much physical memory is installed on your computer?
Okay, so it happened at kmer 59 again.
std::bad_alloc
I am operating a 2TB node, but not sure if I get all of it allocated for me.
If you are on Linux, put /usr/bin/time -f "%U user %S system %E elapsed %M memory" before the skesa command and post what it reports and the command you used.
Are the reads publicly available?
Hi, I am not sure whether I should copy this part, /usr/bin/time -f "%U user %S system %E elapsed %M memory", in full, or do you expect me to fill in some values? If so, do I need to put linux or debian9 for system? What do I put in elapsed? Do I put the amount of memory I allocate later in the skesa command in the memory field? Sorry, I am really not the most experienced user yet :)
No worries :) On Linux, put this line as it is before the skesa command, like this: /usr/bin/time -f "%U user %S system %E elapsed %M memory" skesa ...(the options you use)
It will print out the actual memory and time used by your process.
Okay, rerunning! One thing I noticed, and it keeps happening, is that it reports "Bloom filter false positive rate is too high - increasing the bloom filter size and recalculating"
I am not entirely sure if this is worth noting, but I thought I'd rather let you know.
This message means that the genome size is higher than the estimate the program started with. It is not a problem by itself - the program will eventually figure it out.
1) Did you use --estimated_kmers as I suggested?
2) What kind of genome are we talking about? What is the estimated size of the genome you are assembling?
3) Find the first line after "Bloom filter false positive rate.." which looks like the line below and post the numbers. The second number will tell us what the program thinks about the size of the genome.
Initial kmers: 52438982 Kmers above threshold: 50726454 Total kmers: 1346997926 Hash table size: 78206256(2033.3MB)
4) If I can get the reads I will debug it for you.
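For point 3, a minimal Python sketch of pulling the numbers out of such a log line. The function name `parse_kmer_line` is hypothetical and not part of SKESA, and reading "Kmers above threshold" as roughly one kmer per genome position is only an approximation of the program's size estimate.

```python
import re

def parse_kmer_line(line):
    """Extract the integer fields from a SKESA 'Initial kmers: ...' log line.

    Hypothetical helper; the field names are copied from the sample line above.
    """
    pattern = r"(Initial kmers|Kmers above threshold|Total kmers|Hash table size): (\d+)"
    return {name: int(value) for name, value in re.findall(pattern, line)}

# Sample line quoted in this thread (not the numbers from the failing run).
sample = ("Initial kmers: 52438982 Kmers above threshold: 50726454 "
          "Total kmers: 1346997926 Hash table size: 78206256(2033.3MB)")

counts = parse_kmer_line(sample)

# Assumption: each distinct kmer above threshold marks about one genome position.
approx_genome_mbp = counts["Kmers above threshold"] / 1e6
print(f"Approximate genome size: {approx_genome_mbp:.1f} Mbp")
```

For the sample line this gives a figure of about 50.7 Mbp; the numbers from your own run are the ones worth posting.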
SKESA is a de-novo sequence read assembler for microbial genomes based on DeBruijn graphs
Just to confirm you are trying to extract/assemble a 5 Mbp genome (bacteria I assume) from a metagenomic data set? And that genome is low coverage? How low do you think it is? You will need at least 25x depth across it (@souvorov to confirm). What is in the metagenome - more bacteria? or can you remove human or host DNA first then attempt assembly? Do you have a related reference genome you could use to bait the reads of interest, and try assembling those?
@souvorov 1-2. I probably missed that because my genome is, in fact, 2.5 Mbp at most.
@tseemann A low-coverage ~2.5 Mbp genome from a low-diversity metagenomic dataset (not more than a dozen species). I'm not sure if I can achieve 25x, but we will have to see: I'm currently combining 3 libraries to co-assemble and co-bin, in fact, so fingers crossed. The idea of removing the host reads sounds logical; I'll check with my supervisor why we never discussed that :D Maybe they are sure we can co-assemble anyway. The closest reference genome would probably be at about 50% ANI, so I don't think it's useful; correct me if I am wrong.
You have a very difficult problem recovering a MAG when you don't have much coverage or a good reference. I would use megahit or minia3 before SKESA in your situation. Good luck!
@tseemann I am actually running multiple assemblers to get a feeling for how different tools perform in different situations, thanks for the minia3 suggestion, will do that too!
so it seems to have crashed again, even though it ran for much longer this time
here's the requested output:
2418145.77 user 6005.51 system 14:52:54 elapsed 519797824 memory
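To read this report: with GNU time, %M is the maximum resident set size in kilobytes, %U and %S are CPU seconds, and %E is wall-clock time. The sketch below converts the line above into more familiar units; `parse_time_report` is a hypothetical helper name, and the regular expression assumes exactly the -f format string used in this thread.

```python
import re

def parse_time_report(line):
    """Parse '<U> user <S> system <E> elapsed <M> memory' as printed by
    /usr/bin/time with the format string used in this thread."""
    m = re.search(r"([\d.]+) user ([\d.]+) system ([\d:.]+) elapsed (\d+) memory", line)
    if m is None:
        raise ValueError("line does not match the expected format")
    user_s = float(m.group(1))
    system_s = float(m.group(2))
    # %E is [hours:]minutes:seconds; fold the fields into seconds.
    elapsed_s = 0.0
    for part in m.group(3).split(":"):
        elapsed_s = elapsed_s * 60 + float(part)
    max_rss_kib = int(m.group(4))  # %M is reported in kilobytes (1024 bytes)
    return {
        "user_s": user_s,
        "system_s": system_s,
        "elapsed_s": elapsed_s,
        "peak_gib": max_rss_kib / 1024 ** 2,
        "avg_cores": (user_s + system_s) / elapsed_s,
    }

report = parse_time_report(
    "2418145.77 user 6005.51 system 14:52:54 elapsed 519797824 memory"
)
print(f"peak memory: {report['peak_gib']:.1f} GiB")
print(f"average cores busy: {report['avg_cores']:.1f}")
```

The peak works out to roughly 496 GiB (about 532 GB in decimal units), with around 45 cores busy on average over the run.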
Basically, you are trying to use an isolate assembler on a metagenome, and one with poor depth of the target. It will probably never work properly, unfortunately.
There are two different issues - memory consumption and SKESA's usability for metagenomic samples.
At this point I can neither explain nor reproduce >500GB memory usage for 25GB of reads.
Metagenomic samples are definitely beyond the intended use of SKESA. Still, depending on the circumstances, it can produce some useful output. Although the target genome is small, SKESA knows nothing about it and will attempt to assemble everything in the sample. In this case the genome size for SKESA is a concatenation of all genomes present. According to the above kmer counts, a rough estimate of this effective genome size is ~300Mbp. There are two factors which affect the assembly: repeats and lack of coverage. SKESA is designed to stop at any repeat it cannot resolve with kmers. This is an issue for isolates, and it becomes increasingly bad for metagenomes of similar species. The 25x coverage is indeed the limit below which the assembly will become more fragmented and eventually disappear. The bottom line is that, even for metagenomes, SKESA should assemble unique sequences with reasonably high coverage.
Removing host reads should help.
The bottom line is that, even for metagenomes, SKESA should assemble unique sequences with reasonably high coverage.
Yes I totally agree with that. But unfortunately @Irrussional has low coverage.
Thank you very much for the useful clarifications! I assume this issue can be considered "solved".
Hello! I am trying to co-assemble a low-coverage genome from 3 deeply sequenced libraries, total size a bit less than 25GB. At first I tried a simple run and it returned an error "Memory provided is insufficient"
Then increased the memory availability to 600 GB, which gave a
std::bad_alloc
error too. Then I realized you're giving an example. So, if I take the 16GB per 20x cov 5Mbp genome formula (100Mbp), I'd want 16x250=4TB memory, is this logic correct? That's quite some memory! But I feel like I'm just misunderstanding (I am still very new to bioinformatics).
Anyway, please let me know what you think about this and possibly how to fix it.
Cheers! Artur