superphy / prairiedog

next-gen pangenome graphs for predictive genomics

Pangenome graph creation optimizations #64

Open kevinkle opened 5 years ago

kevinkle commented 5 years ago

Using objects instead of strings for graph creation slowed performance, probably because of the additional allocations.

2019-06-19 14:33:34 panther prairiedog[21158] DEBUG Done graphing SRR3295769.fasta, covering 4964075 kmers in 521.0180611610413 s

This compares to ~460 s from https://github.com/superphy/prairiedog/issues/53#issuecomment-501390129

Performance was much worse (1000+ s) before we changed LGGraph.add_edge() to return an Edge object only optionally, controlled by the echo arg.
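For anyone reading later, here is a rough illustration of the echo pattern. This is not the actual LGGraph code: `ToyGraph`, `Edge`, the tuple storage, and `add_kmer_path` are hypothetical stand-ins. The idea is just to avoid building a wrapper object on the hot path unless the caller asks for one.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Edge:
    """Hypothetical edge wrapper; the real prairiedog Edge may differ."""
    src: str
    tgt: str
    value: int


class ToyGraph:
    """Hypothetical in-memory stand-in for a graph backend."""

    def __init__(self) -> None:
        self._edges: List[Tuple[str, str, int]] = []

    def add_edge(self, src: str, tgt: str, value: int,
                 echo: bool = False) -> Optional[Edge]:
        # Store the edge as plain values; no wrapper object is allocated here.
        self._edges.append((src, tgt, value))
        # Only build and return an Edge when the caller explicitly asks for it.
        if echo:
            return Edge(src, tgt, value)
        return None

    def __len__(self) -> int:
        return len(self._edges)


def add_kmer_path(graph: ToyGraph, kmers: List[str]) -> int:
    """Link consecutive kmers without requesting Edge objects back."""
    for i in range(len(kmers) - 1):
        graph.add_edge(kmers[i], kmers[i + 1], i)  # echo defaults to False
    return len(graph)
```

Callers that only insert edges leave `echo` at its default, while the few that need the object pass `echo=True`.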

kevinkle commented 5 years ago

Back to ~426 s after https://github.com/superphy/prairiedog/pull/69

2019-06-20 09:12:06 panther prairiedog[16565] DEBUG Done graphing SRR3295769.fasta, covering 4964075 kmers in 426.83602833747864 s
Current graph size is 1.1517066955566406 GB
2019-06-20 09:19:05 panther prairiedog[16565] DEBUG Done graphing SRR3665189.fasta, covering 4975087 kmers in 415.99421286582947 s
Current graph size is 2.240581512451172 GB
rule 'pangenome' on Kmer 3 / 10
2019-06-20 09:19:08 panther prair
kevinkle commented 5 years ago

We should take a look at the different LemonGraph options.

kevinkle commented 5 years ago

Note that nosync just sets MDB_NOSYNC (per https://sourcegraph.com/github.com/NationalSecurityAgency/lemongraph/-/blob/lib/db.c#L48 ), which is probably suitable for our use case. See https://news.ycombinator.com/item?id=18411474

kevinkle commented 5 years ago

This is with the new mapsize, nosync=True, noreadahead=True. It seems slower.

2019-06-24 10:25:49 panther prairiedog[15246] DEBUG 4700000/4899264, 95%
2019-06-24 10:26:02 panther prairiedog[15246] DEBUG 4800000/4899264, 97%
2019-06-24 10:26:15 panther prairiedog[15246] DEBUG Done graphing SRR2407793.fasta, covering 4899264 kmers in 615.9522776603699 s
Current graph size is 20.06531524658203 GB
rule 'pangenome' on Kmer 12 / 100
kevinkle commented 5 years ago

With the old mapsize and the other options unchanged:

2019-06-24 10:51:35 panther prairiedog[20555] DEBUG 4900000/4975087, 98%
2019-06-24 10:51:44 panther prairiedog[20555] DEBUG Done graphing SRR3665189.fasta, covering 4975087 kmers in 565.5920946598053 s
Current graph size is 3.167652130126953 GB
rule 'pangenome' on Kmer 3 / 100
2019-06-24 10:42:18 panther prairiedog[20555] DEBUG Done graphing SRR3295769.fasta, covering 4964075 kmers in 574.8278570175171 s
Current graph size is 1.5720138549804688 GB
rule 'pangenome' on Kmer 2 / 100
kevinkle commented 5 years ago

old mapsize, readahead=False, nosync=True

2019-06-24 11:15:45 panther prairiedog[20806] DEBUG Done graphing SRR3665189.fasta, covering 4975087 kmers in 554.5166437625885 s
Current graph size is 3.1686248779296875 GB
rule 'pangenome' on Kmer 3 / 100
kevinkle commented 5 years ago

Old mapsize, readahead=False, nosync=False. I wonder if something else changed between here and https://github.com/superphy/prairiedog/pull/69

2019-06-24 11:36:14 panther prairiedog[20943] DEBUG Done graphing SRR3665189.fasta, covering 4975087 kmers in 567.0542962551117 s
Current graph size is 3.1681747436523438 GB
rule 'pangenome' on Kmer 3 / 100
kevinkle commented 5 years ago

Regarding the above: since then we added filename + contig header props to each edge and increased mapsize. Will leave it running under the profiler and see.
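A minimal sketch of how a run could be profiled with the standard-library cProfile. `build_graph()` here is a hypothetical stand-in for the real graphing step, and `profile_call()` is just an illustrative helper, not something in prairiedog.

```python
import cProfile
import io
import pstats
from typing import Callable, Dict, List, Tuple


def build_graph(kmers: List[str]) -> Dict[Tuple[str, str], int]:
    """Hypothetical stand-in for the graphing step being profiled."""
    edges: Dict[Tuple[str, str], int] = {}
    for i in range(len(kmers) - 1):
        edges[(kmers[i], kmers[i + 1])] = i
    return edges


def profile_call(func: Callable, *args, top: int = 10) -> Tuple[object, str]:
    """Run func under cProfile; return its result and a text report."""
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args)
    buf = io.StringIO()
    stats = pstats.Stats(profiler, stream=buf)
    # Sort by cumulative time so the most expensive call paths come first.
    stats.sort_stats("cumulative").print_stats(top)
    return result, buf.getvalue()


if __name__ == "__main__":
    _, report = profile_call(build_graph, ["ATG", "TGC", "GCA"])
    print(report)
```

Sorting by cumulative time should make it easier to see whether the extra time goes into setting edge props or somewhere else.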

kevinkle commented 5 years ago

PyPy, with the additional genome and contig prop metadata on each edge:

2019-07-03 11:12:40 panther prairiedog[22137] DEBUG 4800000/4800480, 99%
2019-07-03 11:12:40 panther prairiedog[22137] DEBUG Done graphing SRR1060582.fasta, covering 4800480 kmers in 231.11508917808533 s
Current graph size is 1.553009033203125 GB
rule 'pangenome' on Kmer 2 / 2
2019-07-03 11:16:37 panther prairiedog[22137] DEBUG 4800000/4871878, 98%
2019-07-03 11:16:40 panther prairiedog[22137] DEBUG Done graphing SRR3295722.fasta, covering 4871878 kmers in 239.99927473068237 s

Without edge props:

2019-07-03 11:20:41 panther prairiedog[22894] DEBUG 4800000/4800480, 99%
2019-07-03 11:20:41 panther prairiedog[22894] DEBUG Done graphing SRR1060582.fasta, covering 4800480 kmers in 151.0755934715271 s
Current graph size is 1.1490974426269531 GB
2019-07-03 11:23:19 panther prairiedog[22894] DEBUG 4800000/4871878, 98%
2019-07-03 11:23:21 panther prairiedog[22894] DEBUG Done graphing SRR3295722.fasta, covering 4871878 kmers in 159.93281745910645 s
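
To put a number on the difference, here is a small sketch that pulls the kmer count and elapsed time out of the `Done graphing` lines above and compares the two runs. The regex and the helper names are my own assumptions about the log format, based only on the lines pasted in this thread.

```python
import re
from typing import Dict, Tuple

# Matches the tail of a line such as:
# "... DEBUG Done graphing X.fasta, covering 4800480 kmers in 231.1 s"
DONE_RE = re.compile(
    r"Done graphing (?P<name>\S+), covering (?P<kmers>\d+) kmers "
    r"in (?P<secs>\d+(?:\.\d+)?) s"
)


def parse_done_lines(log: str) -> Dict[str, Tuple[int, float]]:
    """Map each graphed file to (kmer count, elapsed seconds)."""
    runs: Dict[str, Tuple[int, float]] = {}
    for m in DONE_RE.finditer(log):
        runs[m.group("name")] = (int(m.group("kmers")), float(m.group("secs")))
    return runs


def slowdown(with_props: str, without_props: str) -> Dict[str, float]:
    """Ratio of elapsed time with edge props to without, per shared file."""
    a = parse_done_lines(with_props)
    b = parse_done_lines(without_props)
    return {name: a[name][1] / b[name][1] for name in a if name in b}


# Log lines copied from the two runs above.
WITH_PROPS = """
2019-07-03 11:12:40 panther prairiedog[22137] DEBUG Done graphing SRR1060582.fasta, covering 4800480 kmers in 231.11508917808533 s
2019-07-03 11:16:40 panther prairiedog[22137] DEBUG Done graphing SRR3295722.fasta, covering 4871878 kmers in 239.99927473068237 s
"""
WITHOUT_PROPS = """
2019-07-03 11:20:41 panther prairiedog[22894] DEBUG Done graphing SRR1060582.fasta, covering 4800480 kmers in 151.0755934715271 s
2019-07-03 11:23:21 panther prairiedog[22894] DEBUG Done graphing SRR3295722.fasta, covering 4871878 kmers in 159.93281745910645 s
"""

for name, ratio in slowdown(WITH_PROPS, WITHOUT_PROPS).items():
    print(f"{name}: {ratio:.2f}x slower with edge props")
```

For both genomes this works out to roughly a 1.5x slowdown from the edge props under PyPy.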