mbhall88 / head_to_head_pipeline

Snakemake pipelines to run the analysis for the Illumina vs. Nanopore comparison.

Mask GFF features for PRG construction #55

Closed mbhall88 closed 3 years ago

mbhall88 commented 3 years ago

Relating to https://github.com/rmcolq/pandora/issues/74#issuecomment-718368343

mbhall88 commented 3 years ago

The memory requirements increased as a result of no longer masking the founder VCFs. The sparse PRG had a peak RAM of 357 GB and a wallclock runtime of ~13.5 hours. The dense PRG had a peak RAM usage of 370 GB and a wallclock runtime of 44.66 hours.

mbhall88 commented 3 years ago

I am now in the process of running the pandora analysis (#48) again with these non-masked PRGs and our new discover defaults from https://github.com/rmcolq/pandora/issues/74. The discover part of the pipeline runs fine, but the MSAs are now taking at least a day for quite a lot of samples. These all appear to be stuck on one or two PRGs. Looking at the discover logs for these, it looks like it could be due to these loci having a lot of paths added.

I am going to expose the parameter that sets the limit for the maximum number of paths and try this with a smaller value to see if it helps speed up the MSA (augment) process.

iqbal-lab commented 3 years ago

Ok just as a reminder, to whisper like Satan in your ear, the other alternative is to throw in the bin all local graphs in the mask (i.e. with too high a density of masked positions). Your RAM use and runtime will drop hugely 😍

mbhall88 commented 3 years ago

I can always just go back to masking the variants that go into the PRGs too.

It's like whack-a-mole. Pandora now runs a lot faster, but MSA is slower.

If we mask the variants that go into the PRG, pandora is slower and MSA gets faster...

iqbal-lab commented 3 years ago

Pandora won't be slower if the entire local PRG is gone. Nothing to map to, no assembly needs to be done, no augmenting, time saved.

In theory we might worry about reads matching to the wrong place, but we're removing all those places too. Try it, I bet you a Thai meal.

mbhall88 commented 3 years ago

Prior to trying out Satan's (@iqbal-lab) suggestion :smiling_imp: here are some stats.

So, if we remove any locus that overlaps at all with the intervals in the mask, we throw away:

- 462 (17.89%) loci
- 990,097 bp (22.44%) of the genome

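For reference, a minimal sketch of how counts like these could be produced; the file names, the simple BED parsing, and the single-chromosome (H37Rv-only) assumption are mine for illustration, not the pipeline's actual code.

```python
def load_bed(path):
    """Read (start, end) intervals from a BED file (single chromosome assumed)."""
    with open(path) as fh:
        return sorted(
            (int(fields[1]), int(fields[2]))
            for fields in (line.split("\t") for line in fh if line.strip())
        )


loci = load_bed("loci.bed")          # hypothetical file: one interval per PRG locus
mask = load_bed("compass-mask.bed")  # hypothetical file: the COMPASS mask intervals


def overlaps_mask(locus):
    """True if the locus overlaps any masked interval at all."""
    start, end = locus
    return any(start < m_end and m_start < end for m_start, m_end in mask)


removed = [locus for locus in loci if overlaps_mask(locus)]
removed_bp = sum(end - start for start, end in removed)
print(f"{len(removed)} ({len(removed) / len(loci):.2%}) loci removed")
print(f"{removed_bp}bp of the genome removed")
```
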
mbhall88 commented 3 years ago

The build process with masked loci has finished, with the following stats for the MSA component.

| PRG | wallclock time (h) | peak RAM (GB) |
| --- | --- | --- |
| sparse | 0.7 | 180 |
| dense | 2.97 | 274 |

iqbal-lab commented 3 years ago

bloody hell, it's still a lot

mbhall88 commented 3 years ago

After running pandora with the "masked-loci" PRG on the samples with assemblies:

Recall

The right two boxes are the "masked-position" PRG (the original approach) for comparison. #nofilter

[plot]

Precision

[plot]

Conclusion

Masking out loci seems like a bad idea. I know we are happy to lose some recall, but that's a lot of recall lost with no gain in precision. We do gain performance, but I don't know if it is worth the recall loss...

There are two options from here @iqbal-lab:

  1. Try out masking at the GFF level
  2. Go back to the original approach of masking out the VCF positions that go into the PRG

My knee-jerk reaction is option 2. But I will think about this more over the next couple of days.

There is also a possible third option. Looking at the plots, it seems like our "dense" definition is still not particularly useful. We could potentially reduce the number of samples going into the dense PRGs further...

iqbal-lab commented 3 years ago

Hi @mbhall88, great to see results. Must admit my reaction was the reverse of yours. Let me set the scene. First, pre-masking, we were finding RAM and time were high. Second, you masked aggressively, removing any locus that had any mask overlap, dropping the genome size by 20%. The actual Oxford mask size is, I think, only 10% of the genome. So we lost an extra 10% of the genome just due to using a quick-and-dirty approach, which is fine since this was exploratory. So, I wanted to see improved performance, a drop in recall of no more than 10%, and no worse precision.

So what did we see? The drop in recall was 10%, precisely the same as the drop in genome size, so that seems fine to me. Precision was unaffected, confirming that removing masked loci does not cause false mapping.

So, my conclusion is: removing masked loci improves RAM/speed performance at no cost in precision. It reduces recall in proportion to the amount of non-masked genome you remove, as expected. So, if you go back to the start and create loci that don't overlap regions where the mask dominates (where the density of masked positions is high), you will regain the lost 10%, improve performance, and retain precision.

mbhall88 commented 3 years ago

Well, the loss is more like 13-14%.

So you're saying you want to try option 1? Or no option and just continue on with the masked loci approach?

iqbal-lab commented 3 years ago

I want to try constructing loci such that we don't have a locus at all for regions of H37Rv that are full of Oxford mask positions. If we imagine a plot of H37Rv with black bars over the mask, I want to make PRG loci of the white stuff. I realise in reality the mask is a spray of dots, so the text below addresses that.

If we took (say) 100 bp windows of H37Rv and counted how many bases in each window were Oxford-masked, and then plotted a histogram showing how many windows were completely full of mask, how many were 10%, ... 80%, 90% full, I'm expecting most of the windows to be empty and about 10% of them to be full. I am unclear how much is in the middle. If we look at that plot, we'll see where to draw a line. Say we draw a line at 70% because we find windows are either less than 20% full of mask or more than 70% full. Now we have a more blocky mask, so we start merging masked windows. Hopefully this allows us to remove PPEs etc. without removing too much more genome.
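
For concreteness, a minimal sketch of this windowing-and-merging idea in Python; the mask file name, the 70% threshold, and the genome length constant are assumptions for illustration only.

```python
import numpy as np

GENOME_LENGTH = 4_411_532  # approximate H37Rv length; an assumption here
WINDOW = 100
FILL_THRESHOLD = 0.7  # where we "draw the line"; to be chosen from the histogram

# Boolean array marking every masked position (mask assumed to be a BED file).
masked = np.zeros(GENOME_LENGTH, dtype=bool)
with open("compass-mask.bed") as fh:  # hypothetical mask file name
    for line in fh:
        _, start, end = line.split("\t")[:3]
        masked[int(start):int(end)] = True

# Fraction of masked bases in each non-overlapping 100 bp window.
n_windows = GENOME_LENGTH // WINDOW
fill = masked[: n_windows * WINDOW].reshape(n_windows, WINDOW).mean(axis=1)

# Histogram of window fill (how many windows are empty, 10% full, ..., full).
counts, edges = np.histogram(fill, bins=10, range=(0, 1))

# Merge consecutive windows above the threshold into "blocky" mask intervals.
blocks, block_start = [], None
for i, fraction in enumerate(fill):
    if fraction >= FILL_THRESHOLD and block_start is None:
        block_start = i * WINDOW
    elif fraction < FILL_THRESHOLD and block_start is not None:
        blocks.append((block_start, i * WINDOW))
        block_start = None
if block_start is not None:
    blocks.append((block_start, n_windows * WINDOW))
```
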

At the end of that you're left with safe genome regions where you can happily start to make loci. I don't see any benefit in your dense PRG, so I'd not even build it.

The only remaining question is whether it's possible to prevent merged loci from growing to 20kb.

mbhall88 commented 3 years ago

Here are some plots investigating masked windows within the genome. The way these plots work is: for a window size (w) and a step size (s), we look at the proportion of positions within w that fall within the mask, then move the window to the right by s positions. Note: the y-axis is log-scaled.
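
A sketch of the proportion calculation behind these plots, assuming `masked` is a boolean per-position mask array like the one built in the sketch above; the cumulative-sum trick is just one way to do it.

```python
import numpy as np

def window_mask_density(masked: np.ndarray, w: int, s: int) -> np.ndarray:
    """Proportion of masked positions in each window of size w, sliding by s."""
    cumulative = np.concatenate(([0], np.cumsum(masked)))
    starts = np.arange(0, masked.size - w + 1, s)
    return (cumulative[starts + w] - cumulative[starts]) / w

# e.g. the first plot below would use window_mask_density(masked, w=100, s=1)
```
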

w=100 s=1

[plot]

w=1000 s=10

[plot]

w=10000 s=100

Note the x-axis limit

[plot]

w=100 s=100

[plot]

mbhall88 commented 3 years ago

A further analysis is to look at how much of the genome we lose if we mask at the GFF level - i.e. before we merge things.

This analysis is done using `bedtools subtract -A -a h37rv.gff3 -b compass-mask.bed -f <float>`, where `-f <float>` indicates the minimum fraction of the GFF feature that must overlap the mask for that feature to be removed. So 0.1 would mean if 10% of the GFF feature overlaps with the mask, then we remove it. `-A` just means remove the entire feature if it passes the overlap criterion.

There are 3978 gene features in the H37Rv GFF file.

We can look at how much of the genome we lose in terms of genes and in terms of positions

The left-most point on the x-axis in these plots (0.0) indicates removing a gene whenever it has any overlap at all with the mask (i.e. even 1bp).
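
A minimal sketch of how this sweep could be scripted, assuming bedtools is on PATH and using the file names above; counting gene features via column 3 of the GFF output is my assumption.

```python
import subprocess

GFF = "h37rv.gff3"
MASK = "compass-mask.bed"

# -f 1e-9 approximates "remove on any overlap at all" (the 0.0 point on the x-axis).
for frac in ["1e-9", "0.1", "0.25", "0.5", "0.75", "1.0"]:
    result = subprocess.run(
        ["bedtools", "subtract", "-A", "-a", GFF, "-b", MASK, "-f", frac],
        capture_output=True, text=True, check=True,
    )
    genes_kept = sum(
        1
        for line in result.stdout.splitlines()
        if not line.startswith("#") and line.split("\t")[2] == "gene"
    )
    print(f"-f {frac}: {3978 - genes_kept} of 3978 gene features removed")
```
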

Genes lost

[plot]

Positions lost

[plot]

Disclaimer: this analysis does not factor in lost intergenic regions as these aren't in the GFF file

mbhall88 commented 3 years ago

I have just finished working on a prototype run for masking at the GFF-level - including intergenic regions, which were missing from the above analysis.
This prototype allows for specifying the minimum fraction of overlap between the mask and a locus.

Here are the above plots, but this time including intergenic regions too. Of note, we throw away any locus if it is smaller than our minimum length (500bp) and has no adjoining neighbouring locus.
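
To make the filtering rule concrete, a rough sketch of the kind of logic described (not the prototype's actual code); interpreting "adjoining neighbouring locus" as a kept locus sharing a boundary is my assumption, as is the 0.3 overlap value.

```python
MIN_LOCUS_LENGTH = 500
MIN_OVERLAP_FRACTION = 0.3  # illustrative value for the minimum mask overlap


def mask_fraction(locus, mask):
    """Fraction of the locus covered by (non-overlapping) mask intervals."""
    start, end = locus
    covered = sum(
        max(0, min(end, m_end) - max(start, m_start)) for m_start, m_end in mask
    )
    return covered / (end - start)


def filter_loci(loci, mask):
    """Drop loci that overlap the mask too much, then drop short orphan loci."""
    kept = [l for l in loci if mask_fraction(l, mask) < MIN_OVERLAP_FRACTION]
    filtered = []
    for i, (start, end) in enumerate(kept):
        too_short = end - start < MIN_LOCUS_LENGTH
        has_neighbour = (i > 0 and kept[i - 1][1] == start) or (
            i + 1 < len(kept) and kept[i + 1][0] == end
        )
        if not too_short or has_neighbour:
            filtered.append((start, end))
    return filtered
```
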

Loci lost

[plot]

Positions lost

[plot]

mbhall88 commented 3 years ago

I have added the ability to mask the GFF in ccd298f, but one bit of uncertainty remains. If we set our minimum overlap fraction to, say, 0.3 and a locus has 0.25 overlap, it stays in our PRG building process. The next step is to add variants to these loci from the dense/sparse VCFs. It could happen that a VCF position is in the mask but its locus is missing - in which case we have no problem - or its locus is present. Should we add this position if its locus is present?
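
To illustrate the case being asked about, a small sketch of how one might classify each masked VCF position; `kept_loci` and the bisect-based lookup are assumptions for the sake of the example.

```python
import bisect

def containing_locus(pos, kept_loci):
    """Return the surviving (start, end) locus containing pos, or None if its locus was removed.

    kept_loci is a sorted list of (start, end) intervals (0-based, half-open).
    """
    starts = [start for start, _ in kept_loci]
    i = bisect.bisect_right(starts, pos) - 1
    if i >= 0 and kept_loci[i][0] <= pos < kept_loci[i][1]:
        return kept_loci[i]
    return None

# For each VCF record whose position falls inside the mask:
#   containing_locus(pos, kept_loci) is None     -> locus already gone, nothing to decide
#   containing_locus(pos, kept_loci) is not None -> the open question: add the variant or not?
```
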

iqbal-lab commented 3 years ago

You're asking whether to continue masking mask positions that survive our locus-removal process? So, the argument for keeping the masking would be to avoid wasting time doing de novo discovery there, I guess. I'm neutral, but either way, remember these positions for debugging later?

mbhall88 commented 3 years ago

Ok. I'm going to make a first pass without VCF masking and see how that goes.

mbhall88 commented 3 years ago

Performance

After running the pandora pipeline with the masked-gff PRG

Update MSAs

| PRG | max. wallclock (h) | mean wallclock (h) | SD wallclock (h) | max. RAM (GB) | mean RAM (GB) | SD RAM (GB) |
| --- | --- | --- | --- | --- | --- | --- |
| sparse | 0.4 | 0.29 | 0.04 | 10.0 | 8.75 | 1.01 |
| dense | 1.0 | 0.85 | 0.09 | 15.8 | 13.0 | 1.07 |

SD = standard deviation

Discover

| PRG | max. wallclock (h) | mean wallclock (h) | SD wallclock (h) | max. RAM (GB) | mean RAM (GB) | SD RAM (GB) |
| --- | --- | --- | --- | --- | --- | --- |
| sparse | 2.25 | 0.57 | 0.23 | 2.75 | 1.78 | 0.45 |
| dense | 6.17 | 0.81 | 0.67 | 3.0 | 1.93 | 0.48 |

Results

In the following plots the labels equate to:

Note: N10masked-gff is the most recent result and is the one the above runtimes and RAM usage relate to.

Recall

[plot]

Precision

[plot]