Parallel estimation of covariance matrix

joezuntz commented 2 years ago

For the CosmoDC2 analysis estimate_multi_cov was taking two hours, being single threaded.

This adds two options for parallel calculations, SMP or MPI. The user can pass in either a number of processes or a communicator. The difference processes each generate different rows of the design matrix, and the total is then summed together.

This involves moving the pair generation inside the design matrix calculation, because the generators can't be pickled.

In the associated tests the tolerances for the bootstrap calculation have to be low because there is a random component to the test. There doesn't seem to be a way to reset the RNG to produce the same indices for different processes, at least in the SMP case.

joezuntz commented 2 years ago

@rmjarvis This is passing all tests at last!

joezuntz commented 2 years ago

Thanks! I'll have a go at this.

joezuntz commented 2 years ago

Hi Mike,

I had a quick go at the pair iterators, and have done two of them (see code below) but haven't quite got capacity to do the other two since they seems a bit more complex. So I'd like to take you up on your offer to look at this later!

I've addressed your other notes I hope.

import collections
import numpy as np
from treecorr.util import lazy_property

class PairIterator(collections.abc.Iterator):
    def __init__(self, results, npatch1, npatch2, index):
        self.results = results
        self.npatch1 = npatch1
        self.npatch2 = npatch2
        self.index = index

    def __iter__(self):
        self.gen = iter(self.make_gen())
        return self

    def __next__(self):
        return next(self.gen)

    @classmethod
    def make_pair_lists(cls, corrs):
        return [[cls(c.results, c.npatch1, c.npatch2, i) for i in range(c.npatch1)] for c in corrs]

class JackknifePairIterator(PairIterator):
    def make_gen(self):
        if self.npatch2 == 1:
            gen = ((j,k) for j,k in self.results.keys() if j!=self.index)
        elif self.npatch1 == 1:
            gen = ((j,k) for j,k in self.results.keys() if k!=self.index)
        else:
            # For each i:
            #    Select all pairs where neither is i.
            assert self.npatch1 == self.npatch2
            gen = ((j,k) for j,k in self.results.keys() if j!=self.index and k!=self.index)
        return gen

class SamplePairIterator(PairIterator):
    def make_gen(self):
        if self.npatch2 == 1:
            # k = 0 here
            gen = ((j,k) for j,k in self.results.keys() if j==self.index)
        elif self.npatch1 == 1:
            # j = 0 here
            gen = ((j,k) for j,k in self.results.keys() if k==self.index)
        else:
            assert self.npatch1 == self.npatch2
            # Note: It's not obvious to me a priori which of these should be the right choice.
            #       Empirically, they both underestimate the variance, but the second one
            #       does so less on the tests I have in test_patch.py.  So that's the one I'm
            #       using.
            # For each i:
            #    Select all pairs where either is i.
            #vpairs = [ [(j,k) for j,k in self.results.keys() if j==i or k==i]
            #           for i in range(self.npatch1) ]
            # For each i:
            #    Select all pairs where first is i.
            gen = ((j,k) for j,k in self.results.keys() if j==self.index)

        return gen

rmjarvis commented 2 years ago

Thanks Joe!

rmjarvis / TreeCorr

Parallel estimation of covariance matrix #138