iqbal-lab-org / pandora

Pan-genome inference and genotyping with long noisy or short accurate reads
MIT License

Acceptable local assembly performance #74

Closed iqbal-lab closed 3 years ago

iqbal-lab commented 5 years ago

@mbhall88 has been running the "latest" Pandora, with the perf changes the three of you produced, on the full graph with a cardio sample. So far it has been running for days, sitting in the "get intervals" stage. @mbhall88 feel free to edit/update with details.

rmcolq commented 5 years ago

Curious to know which bit it is getting stuck in specifically. Also, did we finish adding tests for all of the work? Looking at the latest dev commit, some tests are missing and others have important bits commented out.

mbhall88 commented 5 years ago

@rmcolq can you please link to the examples you are talking about?

mbhall88 commented 5 years ago

Job was killed 😞 Will restart it with some changes I have made.

mbhall88 commented 5 years ago

#14 is related to this and identifies a possible slow spot.

rmcolq commented 5 years ago

If by examples you mean which functions I am referring to as not having full tests: add_pnode_coordinate_pairs seems to have a partial test, with the important bit commented out at https://github.com/rmcolq/pandora/blob/87c0d5d02edee404d183acf1b1940ec219689c3e/test/denovo_discovery/extract_reads_test.cpp#L988, and collect_read_pileups has no test.

Commented out tests that could be adjusted to new code as extra test cases: https://github.com/rmcolq/pandora/blob/87c0d5d02edee404d183acf1b1940ec219689c3e/test/denovo_discovery/extract_reads_test.cpp#L169 https://github.com/rmcolq/pandora/blob/87c0d5d02edee404d183acf1b1940ec219689c3e/test/denovo_discovery/extract_reads_test.cpp#L185

Also noticed we have two add_pnode_coordinate_pairs declarations in the header file; the second, along with save_read_strings_to_denovo_assemble, is junk.

When Zam says you have been "sitting in the 'get intervals' stage", do you know where exactly?

mbhall88 commented 5 years ago

Ok. I have removed the junk.

@rmcolq Regarding the tests, I thought you said you could make some tests quite easily for those functions?

@rmcolq If I jump to the bottom of the log file at the moment, it shows the following

continue checking
check if pnodes defined
check if one is defined
again
continue checking
check if pnodes defined
check if one is defined
again
continue checking
check if pnodes defined
check if one is defined
again
continue checking
check if pnodes defined
check if one is defined
again
continue checking

This code comes from the < operator for GeneIntervalInfo https://github.com/rmcolq/pandora/blob/83e9858c467128d8d528121187fb73969cfe0160/src/denovo_discovery/gene_interval_info.cpp#L6

I am guessing this is being printed out whenever there is some kind of insertion into a map happening?
Therefore I am assuming it is collecting read pileups.
Will try and get some Valgrind CPU assessment on the de novo stuff today.

rmcolq commented 5 years ago

I did indeed. I had lost track. I will write some tests for that.

I think it is, but if you point me at the log I will take a look and see.


mbhall88 commented 5 years ago

Profiled CPU usage on a toy example using callgrind.

valgrind --tool=callgrind /Users/mbhall88/Projects/pandora/cmake-build-debug/pandora map --output_covgs --discover -p toy_prg.fa -r toy_5000_mapped.fastq.gz -o . --genome_size 20000 --genotype

callgrind.out.70583.txt

It segfaulted when it was running local assembly on an interval, but I have results for the code prior to that. It appears that, up until it segfaulted, 98.07% of CPU time was in FastaqHandler::get_id(), and more specifically on this line https://github.com/rmcolq/pandora/blob/fae5c3fd167364e25793ce250d79483fbe09d108/src/fastaq_handler.cpp#L138

Speaking with @rffrancon, it appears this line is not necessary: it seems to be used to clear the istream, but clear() is called on the istream two lines below anyway. Unless there was a specific reason for this line, @rmcolq, I will remove it and rerun the analysis.
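For context, a minimal sketch of the stream-state handling being discussed (this is not pandora's FastaqHandler, just an illustration of why a single clear() before seeking is enough; the removed line is assumed to be a redundant stream reset):

#include <fstream>

// Once a std::istream has hit EOF, its eofbit/failbit must be reset with
// clear() before any seekg(); one clear() immediately before the seek is
// sufficient, so an additional earlier "reset" of the stream only adds
// per-call cost without changing behaviour.
void rewind(std::ifstream& in)
{
    in.clear();                 // reset eofbit/failbit
    in.seekg(0, std::ios::beg); // return to the start of the file
}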

screenshot 2018-10-23 10 03 51

rmcolq commented 5 years ago

Excellent, if it runs without it, and tests pass, do remove it!

ffranr commented 5 years ago

Gramtools uses SeqRead for parsing fastq, see: https://github.com/iqbal-lab-org/gramtools/blob/ad47bb63a09c4b0b1f92f005e2059ce842223567/libgramtools/src/quasimap/quasimap.cpp#L74

And here: https://github.com/iqbal-lab-org/gramtools/blob/ad47bb63a09c4b0b1f92f005e2059ce842223567/libgramtools/src/quasimap/quasimap.cpp#L132

iqbal-lab commented 5 years ago

FWIW, https://github.com/rob-p/FQFeeder

rmcolq commented 5 years ago

At one point I tried FQFeeder - at the time it was slower, but I was doing different things. It may be faster for this.

ffranr commented 5 years ago

Seems very popular: https://github.com/seqan/seqan https://seqan.readthedocs.io/en/seqan-v1.4.2/Tutorial/SequenceFileIO.html

#include <fstream>
#include <iostream>

#include <seqan/seq_io.h>
#include <seqan/sequence.h>

int main(int argc, char const ** argv)
{
    if (argc != 2)
        return 1;  // Invalid number of arguments.

    // Open file and create RecordReader.
    std::fstream in(argv[1], std::ios::binary | std::ios::in);
    seqan::RecordReader<std::fstream, seqan::SinglePass<> > reader(in);

    // Read file record-wise.
    seqan::CharString id;
    seqan::Dna5String seq;
    seqan::CharString qual;
    while (!atEnd(reader))
    {
        if (readRecord(id, seq, qual, reader, seqan::Fastq()) != 0)
            return 1;  // Could not read record from file.

        std::cout << id << "\t" << seq << "\n";
    }

    return 0;
}

rmcolq commented 5 years ago

Now added the missing tests in commit ba4078cb3006e1aa42c9aab60b2e0495bd87be65

mbhall88 commented 5 years ago

Most recent callgrind analysis shows the following call graph

screenshot 2018-10-24 14 55 50

This is obviously only a small portion of the overall call graph, but it is where about half the run-time is going at the moment on my toy dataset. This is the higher-level view

screenshot 2018-10-24 14 58 19

This example runs local assembly on two intervals in total.

Here is the callgrind log file callgrind.out.494.txt

mbhall88 commented 5 years ago

Looking into this a little, it seems - for this toy example - that 21.86% of the time is spent on this line https://github.com/rmcolq/pandora/blob/5786548a1a1111a4990f0b8a6ec3335e4d1a5319/src/minimizer.cpp#L12

@rmcolq Is there a better way of achieving this? A lookup table or approximation perhaps? It is called 6 million times.

ffranr commented 5 years ago

I didn't think of this before now as I don't really use asserts: https://codeyarns.com/2015/03/16/how-to-disable-assert-in-gcc/

I would go with the cmake option.

iqbal-lab commented 5 years ago

Correct me if I'm wrong @rffrancon, but I think Robyn is saying you can set things up so you can compile the code in two ways. In "debug mode" all asserts are there, so you can do unit tests and test runs. In "NDEBUG mode", those extra asserts are stripped out, so they do not slow you down.
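As a minimal illustration of the two build modes (generic C++, not pandora code):

#include <cassert>
#include <iostream>

int main()
{
    int coverage = -1; // deliberately invalid value
    // In a debug build this check runs (and here would abort the program).
    // Compiling with -DNDEBUG expands assert() to a no-op, so the check
    // disappears entirely and costs nothing at runtime.
    assert(coverage >= 0 && "coverage must be non-negative");
    std::cout << "coverage = " << coverage << "\n";
    return 0;
}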

mbhall88 commented 5 years ago

Is this line where I would put the compiler flag? https://github.com/rmcolq/pandora/blob/5786548a1a1111a4990f0b8a6ec3335e4d1a5319/CMakeLists.txt#L24

mbhall88 commented 5 years ago

Compiling with the -DNDEBUG flag changes things quite a bit in terms of the percentage of CPU time spent in certain functions. Only 2.07% of time is now spent initialising minimisers, compared to 25.68% before.

Here is an overview of the call graph now.

screenshot 2018-10-25 10 49 15

In terms of absolute time (averaging 5 runs on the toy dataset for each build), this works out to an 11% speedup on the toy dataset.

mbhall88 commented 5 years ago

Going to close this as speed is now "acceptable", and #150 is now tracking the performance of the map module.

mbhall88 commented 3 years ago

I am reopening this issue to track an investigation I am currently doing to pinpoint concrete examples of where our slow points in local assembly are. This analysis also feeds directly into #195.

Whilst trying to use pandora in the head-to-head analysis (see https://github.com/mbhall88/head_to_head_pipeline/issues/48) I have been rerunning pandora discover a lot. What I have been struggling with is that some samples can take upwards of 4 days to get through all candidate regions, and in some cases they never finish before I kill the job.

I selected an example of a sample that took 4 days to (successfully) run pandora discover - sample mada_1-25. In the head to head analysis, I am running each sample twice with two PRGs of different density - sparse (variants from ~80 samples) and dense (variants from ~400 samples).

I have parsed the log file to get the time local assembly took on each candidate region, along with the region's length, to see whether the length of a candidate region is correlated with its runtime or whether something else is at play.

image

image

Conclusion

The length of a candidate region, at least in this example, does not seem to have a noticeable impact on the amount of time it takes to run local assembly. We can also see that, for this sample, there is a single candidate region that takes nearly all of the time.

75% of all candidate regions take 0.26-0.37 seconds to complete, with the remaining 25% taking next to no time (these are likely regions where there are few or no reads to build the dBG from, or where various other GATB errors occur).

Take a sample with 10,000 candidate regions: with 75% of them running for ~0.3 seconds each (7,500 × 0.3 s ≈ 37.5 minutes), we will spend 37.5 minutes enumerating paths through local assembly graphs. Theoretically, if #195 is implemented, each extra thread should give a proportional reduction in local assembly runtime (see the sketch below).
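A minimal sketch of what the parallelisation in #195 could look like, assuming candidate regions are independent; CandidateRegion and run_local_assembly are hypothetical stand-ins, not pandora's actual types:

#include <vector>

struct CandidateRegion {};                                  // hypothetical stand-in
void run_local_assembly(CandidateRegion&) { /* enumerate paths for one region */ }

void discover_all(std::vector<CandidateRegion>& regions)
{
    // schedule(dynamic): threads pull regions as they finish, so one very slow
    // region does not hold up a whole pre-assigned chunk of the work.
    #pragma omp parallel for schedule(dynamic)
    for (long i = 0; i < static_cast<long>(regions.size()); ++i) {
        run_local_assembly(regions[i]);
    }
}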

However, this does not address the issue of single regions that take 4 days! When I looked at this candidate region more closely, I found it was PE_PGRS54+IGR:3936710-3936876 - interval [1916, 2046) - and for each PRG it produced 14 paths. It tried 4 combinations of start/end kmers before it was able to find any paths. Each of these enumerations took ~12 hours. I calculated the GC-content for this gene and it is one of the genes with 80% GC-content, so I suspect that a huge number of cycles is causing this slowdown. (As an aside, I also saw another example of this with PE_PGRS53, which has 78% GC-content.)

Next step

I will pull out this PRG and the reads that map to it and play around with discover params that can help reduce this runtime.

iqbal-lab commented 3 years ago

Do you keep a count of how many "loops you meet"? Maybe the number of times you meet a kmer you've met before? Could have a ceiling. In principle this can be partially done at construction time, not runtime. We could take the reference for a locus and count how many repeated kmers there are.

iqbal-lab commented 3 years ago

Sorry I accidentally closed the issue

leoisl commented 3 years ago

Do you keep a count of how many "loops you meet"? Maybe the number of times you meet a kmer you've met before? Could have a ceiling. In principle this can be partially done at construction time, not runtime. We could take the reference for a locus and count how many repeated kmers there are.

Sorry for the long message, it seems I get carried away when brainstorming...

We had a talk late Monday UK time / early Tuesday Aus time, and we actually came up with a similar idea, but applied at runtime. We could limit how many times we loop through cycles simply by keeping a count of how many times we have visited each kmer while building a path; if we have visited a kmer, say, 5 times, we don't consider it as a candidate to go in the path again (sort of removing the kmer from the graph). But this was during runtime and during path building... Removing the kmer at construction time has the advantage of being faster (i.e. we won't even loop through repeat kmers, as they will be removed outright), but it might really affect results (e.g. it might be that the true denovo path to be found goes through the repeat, and removing these repeat kmers makes it impossible to find it). A rough sketch of the runtime version is below.
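A minimal sketch of that visit-ceiling idea (this is not pandora's actual DFS; the graph access is abstracted behind a hypothetical neighbours callback):

#include <functional>
#include <string>
#include <unordered_map>
#include <vector>

// Cap how many times any single kmer may appear on the path being built, so
// cycles in repeat-rich graphs cannot be traversed indefinitely.
void enumerate_paths(const std::string& start, const std::string& end,
                     const std::function<std::vector<std::string>(const std::string&)>& neighbours,
                     std::vector<std::vector<std::string>>& paths_out,
                     const std::size_t max_visits_per_kmer = 5,
                     const std::size_t max_path_length = 300)
{
    std::vector<std::string> path{ start };
    std::unordered_map<std::string, std::size_t> visits{ { start, 1 } };

    std::function<void()> dfs = [&]() {
        const auto node = path.back(); // copy: path may reallocate during recursion
        if (node == end) {
            paths_out.push_back(path);
            return;
        }
        if (path.size() >= max_path_length)
            return;
        for (const auto& next : neighbours(node)) {
            if (visits[next] >= max_visits_per_kmer)
                continue; // this kmer has been used too often - treat it as removed
            ++visits[next];
            path.push_back(next);
            dfs();
            path.pop_back();
            --visits[next];
        }
    };
    dfs();
}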

I guess we need to wait for more details on this region from Michael. I would very much like to see the graph (I think GATB can export to GFA? Or just export to a tabular format, and we can view it in Cytoscape). I wonder if we have several loops over loops, etc., there. It is fine that we don't perform well on this: these repeat-induced, cycle-heavy graphs are one of the main sources of difficulty in assembling genomes; everyone struggles...

This was one of my PhD topics: trying to resolve repeats (which induce cycles), but when assembling human alternative splicing events. I did not succeed much, but here are some strategies that I remember:

There are several other ways to tackle repeats/cycles/complex assembly graphs; we might need to research them and pick the one that best fits our data.

iqbal-lab commented 3 years ago

I was simply suggesting counting high-repeat kmers at construction time - I wasn't suggesting removing them from the middle of the graph. If there are many (or any), we could decide this region is too hard to do local assembly on.
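A minimal sketch of that construction-time check, assuming we only have the locus reference sequence; the function name and the threshold are illustrative, not pandora's API:

#include <string>
#include <unordered_map>

// Count kmers in the locus reference and flag the region as too repetitive if
// too large a fraction of its kmers occur more than once.
bool region_too_repetitive(const std::string& ref, const std::size_t k = 15,
                           const double max_repeat_fraction = 0.2)
{
    if (ref.size() < k)
        return false;
    std::unordered_map<std::string, std::size_t> counts;
    for (std::size_t i = 0; i + k <= ref.size(); ++i)
        ++counts[ref.substr(i, k)];
    std::size_t kmers_in_repeats = 0;
    for (const auto& entry : counts)
        if (entry.second > 1)
            kmers_in_repeats += entry.second;
    const auto total_kmers = ref.size() - k + 1;
    return static_cast<double>(kmers_in_repeats) / total_kmers > max_repeat_fraction;
}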

leoisl commented 3 years ago

Oh, I see! It is another way of deciding whether the graph is complex or not. We could increase the k-mer size and recompute this stat instead of directly abandoning the assembly - do you think it is worth it?

iqbal-lab commented 3 years ago

The mccortex read-coherence thing is expensive and uses I/O. Right now I'm thinking more in terms of discarding candidate regions if they are too repeaty. They would also be difficult regions for mapping/normal approaches anyway. No need for tonnes of machinery if this is just losing a small amount of genome.

iqbal-lab commented 3 years ago

The cost/benefit is all about how many regions there are like this.

iqbal-lab commented 3 years ago

It's important to avoid trying too hard to squeeze out tiny increases in theoretical recall.

iqbal-lab commented 3 years ago

But my idea of looking at the locus might or might not spot Michael's problem region, and it might spot loads of others that local assembly copes with fine. We need data really.

leoisl commented 3 years ago

It's important to avoid trying too hard to squeeze out tiny increases in theoretical recall.

I do agree with this, especially given my recent experience on the pandora paper. Finding the correct denovo path for this single region among several thousand other regions won't really make a difference to the results... I have changed my mind: rather than trying to deal with these complicated regions, we should identify and skip them.

But my idea of looking at the locus might or might not spot Michael's problem region, and it might spot loads of others that local assembly copes with fine. We need data really.

Yeah, I think we need a measure of the complexity of a region that correlates with the time spent on it. Maybe a high GC-content could flag a complex region that we then don't process (although I don't think this was the single high-GC-content region), or maybe the number of repeated kmers, or the number of cycles, etc...

IIRC, Michael also suggested a timeout on the finding-denovo-paths algorithm. I was previously opposed to this, as things can become non-reproducible, there are other external factors that can impact runtime (e.g. filesystem latency, CPU load, etc.), and we might end up missing a bunch of regions... but rethinking it, a timeout is indeed an option, as almost everything takes <5 s and a single region takes 48 h (not sure this is the behaviour of other datasets). A rough sketch of the idea is below.
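A minimal sketch of that wall-clock timeout (illustrative only, not pandora code):

#include <chrono>

// The path-enumeration loop would check elapsed wall-clock time and abandon
// the candidate region once a budget is exceeded. As noted above, this trades
// reproducibility for a hard bound on runtime.
class Deadline {
public:
    explicit Deadline(const std::chrono::seconds budget)
        : start(std::chrono::steady_clock::now()), budget(budget) {}

    bool expired() const
    {
        return std::chrono::steady_clock::now() - start > budget;
    }

private:
    std::chrono::steady_clock::time_point start;
    std::chrono::seconds budget;
};

// Usage inside the enumeration loop:
//     Deadline deadline { std::chrono::seconds(600) };
//     ...
//     if (deadline.expired()) { /* give up on this candidate region */ }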

iqbal-lab commented 3 years ago

Not sure if this link will work: https://bacteria.ensembl.org/Mycobacterium_tuberculosis_h37rv/Location/View?r=Chromosome:3936710-3936876;site=ensemblthis;db=core

So attaching a screenshot; that's a lot of consecutive G/C.

Screenshot 2020-10-27 at 16 28 31

mbhall88 commented 3 years ago

No, we do not currently keep track of how many times we visit a kmer. I could definitely implement this but would need to alter the data structure for our DFS tree. I will come back to this...

@leoisl unfortunately GATB doesn't output GFA, only HDF5. There is a hack I can do to get a cortex graph, but for the moment I have been working on other avenues as this is very time consuming.

Thanks for the ideas and discussion! I think the node visiting is a great idea.


For the long-running candidate region I have been prodding at, I've found out some more information. If I change the distance for merging candidate regions to 15 (instead of 22) then we have no runtime issues. This causes the problem region to be broken into three separate regions, each of which completes quickly.

Digging in a little deeper I suspect it is the region near the beginning of the original longer region that is causing problems. When I zoom in on that region it has a GC content of ~90%!! https://bacteria.ensembl.org/Mycobacterium_tuberculosis_h37rv/Location/View?db=core;g=Rv3508;r=Chromosome:3932920-3933052;t=CCP46330

Mycobacterium_tuberculosis_H37Rv_Chromosome3932920_3933052

If I clean the GATB graph (with the merge distance at the original 22), discover does complete on that long-running region in 6 minutes, but abandons the region as there are too many paths.

If I use a discover kmer size of 15 (with the original merge distance) it completes quickly also, but can't find a path between any anchor combination for that region.

Conclusion

I am going to look at the impact of reducing the merging distance to 15, of cleaning the GATB graph, and of --discover-k 15, all on precision/recall.

leoisl commented 3 years ago

Nice investigation; this cleared up a lot of doubts! Anyway, I think you will be able to tweak some parameters here and there and run de novo discovery on these samples in a reasonable amount of time, but I think this issue might still come up for us, or for a user, in the future.

Should we still think of a way to flag a graph as too complex, and of strategies to deal with it (i.e. try to automate what Michael is doing)? I think simply skipping the region, and warning the user how many regions were skipped, would already be a good first solution. Other solutions include, I think, doing what you are about to test: reducing the merging distance and checking whether the resulting subgraphs are no longer complex, or cleaning, or increasing k (it will depend on what worked for you).

mbhall88 commented 3 years ago

Update from overnight: I reran the pipeline with --discover-k 11 --merge-dist 15. This completed all but one sample in perfectly reasonable time and actually gave us a recall and precision boost compared to --merge-dist 22. I am now trying with --clean-dbg --discover-k 13 --merge-dist 15 to see what impact (if any) cleaning the GATB graphs has on results.


Re: our ongoing discussion about the PE_PGRS54 locus

I think we actually make this complex region even more difficult for ourselves due to the way we select variants when building our Mtb PRG (https://github.com/mbhall88/head_to_head_pipeline/issues/10).

We apply some filtering to the VCFs for the sparse and dense PRGs - see https://github.com/mbhall88/head_to_head_pipeline/blob/0323cbb8b088c357e39d62489721682e0fc8b707/data/H37Rv_PRG/Snakefile#L262-L298

The two bits of filtering that are important for the current issue are:

Now, normally, these are totally sane filters. But we know that pe/ppe genes commonly sit inside these masks. (I also include FILTER in this, as the clockwork VCFs have a FILTER class for positions in the mask.) So whilst we think we're doing ourselves a favour by not including variants in the mask, we are actually removing complexity from these regions, with the end result that de novo struggles: it is triggered a lot and has next to no prior information in the PRG (this is almost the same issue that was being caused by https://github.com/mbhall88/head_to_head_pipeline/issues/53 and https://github.com/mbhall88/head_to_head_pipeline/issues/54). To back up this claim, there are only 2 variants in the PRG for this PE_PGRS54 locus, but there are lots of variants in the CRyPTIC VCFs that look reasonable and are all in the mask.

Summary

I don't think we should be masking at the PRG construction stage - this should be a downstream process.

mbhall88 commented 3 years ago

Update from overnight: I reran the pipeline with --discover-k 11 --merge-dist 15. This completed all but one sample in perfectly reasonable time and actually gave us a recall and precision boost compared to --merge-dist 22. I am now trying with --clean-dbg --discover-k 13 --merge-dist 15 to see what impact (if any) cleaning the GATB graphs has on results.

Cleaning the local assembly (GATB) graph seems to hit us pretty hard in terms of recall.

The fourth pair of boxes from the left is the run with cleaning

image

However, we do gain precision - but bear in mind this is without any post-filtering.

image


It is beginning to look like the de novo k-mer size isn't as influential as I first thought (at least for Mtb). I'm still exploring parameters though.

leoisl commented 3 years ago

That looks like the classic result of cleaning DBGs: lower recall, higher precision. It is interesting nonetheless; I'd propose adding this to your Mtb paper (at least as supplementary material... not sure it fits the paper though... but I bet there will be use cases where we prefer better precision over recall, and this shows a way of getting it...).

iqbal-lab commented 3 years ago

  1. How is recall better with no prg?
  2. Interesting about clean dbg,what does it precisely mean?
  3. All of this is with abundance threshold at default?

mbhall88 commented 3 years ago
1. How is recall better with no prg?

That's COMPASS

2. Interesting about clean dbg,what does it precisely mean?

https://github.com/rmcolq/pandora/blob/0f1210e6927b905c0a1fe03612ed067d73471851/src/denovo_discovery/local_assembly.cpp#L256-L272

3. All of this is with abundance threshold at default?

The abundance threshold is indicated by abdunx in the names on the x-axis

mbhall88 commented 3 years ago

I am now confident I have explored the discover parameters thoroughly enough. Based on the following recall plot, I am going to put in a PR to change the default pandora discover parameters to --discover-k 15 --merge-dist 15 -L 30 (the boxes on the far right of the plot below). In addition, I will remove the --padding option and hard-code the padding to twice the discover-k value.

image

From the point of view of performance, the summary statistics of wall-time for the discover jobs for the newly proposed defaults (with 8 threads) are:

count    600.000000
mean       0.730133
std        0.378366
min        0.480000
25%        0.570000
50%        0.630000
75%        0.712500
max        4.330000