Closed mbhall88 closed 3 years ago
The memory requirements increased as a result of no longer masking the founder VCFs. The sparse PRG had a peak RAM of 357 GB and a wallclock runtime of ~13.5 hours. The dense PRG had a peak RAM usage of 370 GB and a wallclock runtime of 44.66 hours.
I am now in the process of running the pandora analysis (#48) again with these non-masked PRGs and our new discover
defaults from https://github.com/rmcolq/pandora/issues/74. The discover part of the pipeline runs fine, but the MSAs are now taking at least a day for quite a lot of samples. These all appear to be stuck on one or two PRGs. Looking at the discover logs for these it looks like it could be due to these loci have a lot of paths added.
I am going to expose the parameter that sets the limit for the maximum number of paths and try this with a smaller value to see if this helps speedup the MSA (augment) process.
Ok just as a reminder, to whisper like satan in your ear, the other alternative is to throw in the bin all local graphs in the mask (ie with too high density if masked positions). Your RAM use and runtime will drop hugely 😍
I can always just go back to masking the variants that go into the PRGs too.
It's like whack-a-mole. Pandora now runs a lot faster, but MSA is slower.
If we mask the variants that go into the PRG, pandora is slower and MSA gets faster...
Pandora won't be slower if the entire local prg is gone. Nothing to map to,no assembly needs to be done, no augmenting, time saved.
In theory we might worry about reads matching to the wrong place,but we're removing all those places too. Try it, I bet you a Thai meal.
Prior to trying out Satan's (@iqbal-lab) suggestion :smiling_imp: here are some stats.
So, if we remove any locus that overlaps with the intervals in the mask we throw away
462 (17.89%) loci removed
990097bp (22.44%) of the genome removed
The build process masking loci has finished with the following stats for the MSA component.
PRG | wallclock time | peak RAM |
---|---|---|
sparse | 0.7h | 180 GB |
dense | 2.97h | 274 GB |
bloody hell its still a lot
After running pandora with the "masked-loci" PRG, on the samples with assemblies
The right two boxes are the "masked-position" PRG (the original approach) for comparison. #nofilter
Masking out loci seems like a bad idea. I know we are happy to lost some recall, but that's a lot of recall lost with no gain in precision. We do gain performance, but I don't know if it is worth the recall lost...
There are two options from here @iqbal-lab :
My knee-jerk reaction is option 2. But I will think about this more over the new couple of days.
There is also a possible third option. Looking at the plots, it seems like our "dense" definition is still now particularly useful. We could potentially reduce the number of samples going into the dense PRGs further...
Hi @mbhall88 great to see results. Must admit my reaction was the reverse of yours. Let me set the scene. First pre masking, we were finding RAM and time were high. Second you masked aggressively, removing any locus that had any mask overlap, dropping genome size by 20%. The actual oxford mask size is, I think, only 10% of the genome. So we lost 10% of genome just due to using a quick and dirty approach, which is fine this was exploring . So, I wanted to see improved performance, a drop in recall of no more than 10% and no worse precision.
So what did we see? The drop in recall was 10%, precisely the same as the drop in size of genome, so that seems fine to me. Precision was unaffected, confirming that removing masked loci does not cause false mapping.
So, my conclusion is, removing masked loci improves ram/speed performance at no cost in precision. It reduces recall in proportion to the amount of non masked genome you remove, as expected . So, if you go back to the start and create loci that don't overlap regions where the mask dominates (density of masked positions is hugh), you will regain the lost 10%, improve performance, and retain precision
Well, the loss is more like 13-14%.
So you're saying you want to try option 1? Or no option and just continue on with the masked loci approach?
I want to try constructing loci such that we don't have a locus at all for regions of h37rv that are full of Oxford mask positions. If we imagine a plot if h37rv, with black bars over the mask,I want to make prg loci of the white stuff. I realise in reality the mask is a spray of dots, so the text below addresses that.
If we took (say) 100bp windows of h37rv and counted how many bases in each 100bp were oxford masked, and then plotted a histogram showing how many windows were completely full of mask, how many were 10%,... 80%, 90% full, I'm expecting most of the windows are empty, and 10% is to be full. I am unclear how much is in the middle. If we look at that plot, we'll see where to draw a line. Say we draw a line at 70% because we find windows either are less than 20% full of mask or more than 70%. Now we have a more blocky mask . So we start merging masked windows. Hopefully this allows us to remove ppes etc without removing too much more genome.
At the end of that you're left with safe genome regions where you can happily start to make loci . I don't see any benefit in your dense prg so I'd not even build it.
The only remaining question is whether it's possible to prevent merging of loci growing to 20kb.
Here are some plots investigating masked windows within the genome. The way these plots work is for a window (w) and a step size (s), we look at the proportion of positions within w that are within the mask and then move the window to the right by s positions. Note: the y-axis is log-scaled
Note the x-axis limit
A further analysis is to look at how much of the genome we lose if we mask at the GFF-level - i.e. before we merge things.
This analysis is done using bedtools subtract -A -a h37rv.gff3 -b compass-mask.bed -f <float>
where -f <float>
indicates the minimum overlap required on the GFF to remove that interval. So 0.1 would mean if 10% of the GFF feature overlaps with the mask, then we remove it. -A
just means remove the entire feature if it passes the overlap criteria.
There are 3978 gene features in the H37Rv GFF file.
We can look at how much of the genome we lose in terms of genes and in terms of positions
The left-most point on the x-axis in these plots (0.0) indicates when we remove a gene when it has any overlap (i.e. even 1bp)
Disclaimer: this analysis does not factor in lost intergenic regions as these aren't in the GFF file
I have just finished working on a prototype run for masking at the GFF-level - including intergenic regions, which were missing from the above analysis.
This prototype allows for specifying the minimum fraction of overlap between the mask and a locus.
Here are the above plots, but this time including intergenic regions too. Of note, we throw away any locus if is smaller than our minimum length (500bp) and has no adjoining neighbouring locus.
I have added the ability to mask the GFF in ccd298f, but there still remains one bit of uncertainty. If we set our minimum overlap fraction to say 0.3 and a locus has 0.25 overlap, it stays in our PRG building process. The next step is to add variants to these loci from the dense/sparse VCFs. It could happen that a VCF position is in the mask, but its locus is missing - in which case we have no problem - or its locus is present. Should we add this position if its locus is present?
You're asking whether to continue to mask masked positions that survive our locus-removal process? So, the argument for retaining it would be to avoid wasting time doing de novo there i guess. I'm neutral, but either way remember these positions for debugging later?
Ok. I'm going to make a first pass without VCF masking and see how that goes.
After running the pandora pipeline with the masked-gff PRG
prg | max. wallclock (h) | mean wallclock (h) | SD wallclock (h) | max. RAM (GB) | mean RAM (GB) | SD RAM (GB) |
---|---|---|---|---|---|---|
sparse | 0.4 | 0.29 | 0.04 | 10.0 | 8.75 | 1.01 |
dense | 1.0 | 0.85 | 0.09 | 15.8 | 13.0 | 1.07 |
SD = standard deviation
prg | max. wallclock (h) | mean wallclock (h) | SD wallclock (h) | max. RAM (GB) | mean RAM (GB) | SD RAM (GB) |
---|---|---|---|---|---|---|
sparse | 2.25 | 0.57 | 0.23 | 2.75 | 1.78 | 0.45 |
dense | 6.17 | 0.81 | 0.67 | 3.0 | 1.93 | 0.48 |
In the following plots the labels equate to:
-N 10
), default discover
parameters (see d15L30P30abun2k15 below) and using the most recent PRG - i.e. the one where we mask at the GFF level with a minimum required overlap with the mask of 30% to remove a locus.discover
parameters (see d15L30P30abun2k15 below) and a PRG where we mask (remove) any locus that overlaps the mask. The difference between this and the masked-gff version is we remove a locus after all of the loci have been merged for overlaps and shortness.-d 15
), maximum candidate region length of 30 (-L 30
), padding of 26bp either side of a candidate region (this param no longer exists as it is now fixed at 2*de novo kmer size), minimum abundance of 2 in the de Bruijn graph --min-abun-dbg 2
, and a de novo k-mer size of 13 --discover-k 13
.-d 15
), maximum candidate region length of 30 (-L 30
), padding of 30bp either side of a candidate region (this param no longer exists as it is now fixed at 2*de novo kmer size), minimum abundance of 2 in the de Bruijn graph --min-abun-dbg 2
, and a de novo k-mer size of 15 --discover-k 15
.Note: N10masked-gff is the most recent result and is the one the above runtimes and RAM usage are relating to.
Relating to https://github.com/rmcolq/pandora/issues/74#issuecomment-718368343