starsimhub / starsim

Starsim disease modeling framework

http://starsim.org

MIT License

14 stars 8 forks source link

Modulus-based random numbers induces unwanted correlations #507

Closed daniel-klein closed 3 months ago

daniel-klein commented 5 months ago

We use r = (r1 + r2) % 1 for transmission, where r1 is a random number from the first agent and r2 is a random number from the second agent. r is indeed uniformly distributed on the interval from [0,1). However, when reusing the same per-agent random draws to test transmission on multiple network edges, we encounter correlations. This makes intuitive sense because a network can have order N^2 edges, but we're only drawing order N random numbers.

The correlation can be seen with as few as N=4 nodes:

import numpy as np

N = 4
p = 0.5
R = 1000000

r = np.random.rand(4, R)
for method in ['indep', 'modulo']:
    if method=='modulo':
        p01 = (r[0,:] + r[1,:]) % 1
        p02 = (r[0,:] + r[2,:]) % 1
        p23 = (r[2,:] + r[3,:]) % 1
        p13 = (r[1,:] + r[3,:]) % 1
    else:
        p01 = r[0,:]
        p02 = r[1,:]
        p23 = r[2,:]
        p13 = r[3,:]

    alone = np.count_nonzero(p13<p) / R
    combined = np.count_nonzero((p13<p) & (p01<p) & (p02<p) & (p23<p)) / np.count_nonzero((p01<p) & (p02<p) & (p23<p))
    print(method, alone, combined)

Output shows the bias:

indep 0.50018 0.5008594843094144
modulo 0.49979 0.668084936217859

daniel-klein commented 5 months ago

I propose that we restore the previous node-based acquisition method.

cliffckerr commented 5 months ago

I think this is a tricky issue. There are three things I'm still unclear on:

Whether this would be an issue in practice for large networks;
Whether the node-based acquisition method actually solves the issue;
How this issue relates to constructing CRN-safe networks in the first place.

1. Does it matter in practice?

The induced correlations between pairs of agents are proportional to 1/√n. With 1000 agents, they're on the order of 1-3%:

import numpy as np
import sciris as sc
import pylab as pl

n = 1000

r1 = np.random.rand(n,n)
v2a = np.random.rand(n)
v2b = np.random.rand(n).reshape((n,1))
r2a = np.tile(v2a, (n,1))
r2b = np.tile(v2b, (1,n))
r2 = (r2a + r2b) % 1.0

sc.options(dpi=200)
pl.figure(figsize=(12,8))
for i,r in enumerate([r1, r2]):

    mean0 = r.mean(axis=0)
    mean1 = r.mean(axis=1)
    std0 = mean0.std()
    std1 = mean1.std()

    if i == 0:
        x0 = x1 = np.arange(n)
        title = 'Truly random'
        xlabel = 'Index'
    else:
        x0 = v2a
        x1 = v2b
        title = 'Modulus approach'
        xlabel = 'Value'

    pl.subplot(2,3,1+i*3)
    pl.imshow(r)
    pl.title(title)

    pl.subplot(2,3,2+i*3)
    pl.scatter(x0, mean0)
    pl.xlabel(xlabel)
    pl.title(f'Array mean, axis=0; std={std0:n}')

    pl.subplot(2,3,3+i*3)
    pl.scatter(x1, mean1)
    pl.xlabel(xlabel)
    pl.title(f'Array mean, axis=1; std={std1:n}')

sc.figlayout()
pl.show()

So with a more realistic sim size -- 100,000 agents, say -- the correlations would only be about 0.2%. That might still matter if it were affecting the mean, but I'm not sure that I can think of a case where a small correlation between pairs of agents would matter. (Also: I confess I still don't understand the mathematical reason for where these correlations are coming from; I don't see a reason why, say, (0.1+r)%1, (0.5+r)%1, or (0.9+r)%1 should be distributed diffrently from each other.)

2. Would the node-based approach solve this?

The modulus method is drawing the same number of bits' worth of random numbers as the node-based approach -- 2N rather than N². So I believe that the node-based approach would have the same induced correlations; it's just that they're more hidden since the per-edge probability is never explicitly calculated.

To be more specific, going back to the node-based transmission code (commit 46addb1), the probability per node is calculated as p_acq_node = 1-np.prod(1-probs, axis=1), and then a single random number is drawn against this (cases = r[people.slot[uids]] < p_acq_node). So this will induce correlations between pairs of agents, just as with the modulus-based approach -- in fact the correlations should be even greater, since we're only using N random numbers to calculate transmission events, instead of 2N. (The other N random numbers are used to allocate a source agent if a transmission occurs.)

3. Can we solve networks?

CRN is based on the assumption that each agent carries around their own personal RNGs, which they use to collapse probabilities into events. But this assumption fails for interactions, since there are O(N²) pairwise interactions, whereas we have only O(N) RNGs. This is why the only CRN-safe network currently is EmbeddingNet, which is deterministic.

Is this a problem? I'm not sure. Where does the source of randomness in an interaction between two agents come from, other than the two agents themselves? We intentionally induce correlations all the time (e.g. using np.random.choice() to guarantee a node degree). So while assuming that the stochasticity between pairs of agents is due to 2N rather than N² independent sources is a different assumption, it's not clear to me that it maps less well onto the real world. In fact: could use the modulus approach to make the networks CRN safe??

Appendix: quantifying the difference

It's a bit hard to find an at-scale statistic that shows the issue; the clearest illustration I've come up with is that if you compute a histogram of the differences between adjacent points, it's very clearly different:

import numpy as np
import sciris as sc
import pylab as pl

nreps = 3
n = 100
density = 1.0
sc.options(dpi=200)
pl.figure(figsize=(3*nreps, 10))
modulo = 'Modulus'
truly = 'Truly random'
kw = dict(alpha=0.7)

def diff_hist(arr, thresh=1e-3):
    a1diff = np.abs(np.diff(arr, axis=0))
    a2diff = np.abs(np.diff(arr, axis=1))
    vals = sc.cat(a1diff.flatten(), a2diff.flatten())
    vals = sc.sanitize(vals, replacenans=False)
    vals = vals[vals>thresh]
    hist, edges = np.histogram(vals, bins=n)
    return edges[:-1], hist

for rep in range(nreps):
    r1 = np.random.rand(n,n)
    v2a = np.random.rand(n)
    v2b = np.random.rand(n).reshape((n,1))
    r2a = np.tile(v2a, (n,1))
    r2b = np.tile(v2b, (1,n))
    r2 = (r2a + r2b) % 1.0

    mask = np.random.rand(n,n) > density
    r1[mask] = np.nan
    r2[mask] = np.nan

    pl.subplot(4,nreps,rep+nreps*0+1)
    pl.plot(v2a, 'o-', **kw)
    pl.plot(v2b, 'o-', **kw)
    pl.title('Input arrays for modulus')

    pl.subplot(4,nreps,rep+nreps*1+1)
    pl.imshow(r1)
    pl.title(truly)

    pl.subplot(4,nreps,rep+nreps*2+1)
    pl.imshow(r2)
    pl.title(modulo)

    pl.subplot(4,nreps,rep+nreps*3+1)
    pl.plot(*diff_hist(r2), 'o-', label=modulo, **kw)
    pl.plot(*diff_hist(r1), 'o-', label=truly, **kw)
    pl.legend()
    pl.title('Histograms of differences')

sc.figlayout()
pl.show()

The truly-random case is a straight line, while the modulus case has large deviations above and below the line. But how best to quantify that?

cliffckerr commented 4 months ago

Yet another investigation

I tried to capture the three methods a different way: in terms of repeated draws, and the probability after all edges are summed (which more accurately captures the transmission process).

Code

Run with python -i rnd_test.py for best results; you might want to change/comment out the sc.options(dpi=200) line.

import numpy as np
import sciris as sc
import pylab as pl

n = 50
seed = 1
reps = 1001
pause = 0.5
nframes = 20
nbins = n

np.random.seed(seed)
all_diffs = sc.objdict(Random=[], Modulo=[], Roulette=[])
row_diffs = sc.dcp(all_diffs)

# Choose logarithmically spaced frames to plot
frames = np.linspace(0, np.log(reps), nframes)
frames = np.exp(frames).astype(int) - 1

sc.options(dpi=200)
fig = pl.figure(figsize=(12,10))
sc.figlayout()

for rep in range(reps):

    # Truly random
    A = np.random.rand(n,n)

    # Modulo
    b1 = np.random.rand(n,1)
    b2 = np.random.rand(1,n)
    B1 = np.tile(b1, (1,n))
    B2 = np.tile(b2, (n,1))
    B = (B1 + B2) % 1.0

    # Roulette
    c1 = np.random.rand(n,1)
    C1 = np.tile(c1, (1,n))
    C2 = np.random.rand(n,n)
    C2 = C2*n/C2.sum(axis=0)
    C = C1*C2

    arrdict = sc.objdict(Random=A, Modulo=B, Roulette=C)

    for i,key,arr in arrdict.enumitems():

        v1 = np.diff(arr, axis=0)
        v2 = np.diff(arr, axis=1)
        v = sc.cat(v1.flatten(), v2.flatten()).tolist()
        all_diffs[key] += v

        rd = np.diff(arr.sum(axis=1)).tolist()
        row_diffs[key] += rd

    if rep in frames:
        fig.clear()
        for i,key,arr in arrdict.enumitems():

            pl.subplot(3,3,1+i*3)
            pl.imshow(arr, cmap='turbo')
            t = f'{key}: rep={rep}\nmean={arr.mean():0.3f}, std={arr.std():0.3f}\nmin={arr.min():0.3f}, max={arr.max():0.3f}'
            pl.title(t)

            pl.subplot(3,3,2+i*3)
            hist, bins = np.histogram(all_diffs[key], bins=nbins)
            w = np.diff(bins).mean()
            pl.bar(bins[:-1], hist, width=w, facecolor='k', lw=0)
            pl.title(f'{key}: all diffs rep={rep}')
            sc.commaticks()

            pl.subplot(3,3,3+i*3)
            rhist, rbins = np.histogram(row_diffs[key], bins=nbins)
            w = np.diff(rbins).mean()
            pl.bar(rbins[:-1], rhist, width=w, facecolor='k', lw=0)
            pl.title(f'{key}: row diffs rep={rep}')
            sc.commaticks()

        pl.pause(pause)

Results

(On the final iteration)

Summary

I believe this shows that in the limit, all three approaches have different statistical distributions. However, all three are similar in terms of first-order statistics. The modulo method has higher-order statistics closer to the truly random method compared to the roulette method.

daniel-klein commented 4 months ago

Alight, I can confirm that the modulo-based approach to transmission does indeed induce unwanted correlations. Code in the crn_paper and copied below:

import os
import networkx as nx
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
from pandas.util import hash_pandas_object
import hashlib
import sciris as sc
from pathlib import Path

# PARAMETERS
n = 8 # Number of nodes
n_sources = 2 # Number of sources (seed infections)
reps = 10_000_000
edge_prob = 0.7 # Edge probability
trans_prob = 0.4
seed = 1

lbl = f'n{n}_s{n_sources}_r{reps}_ep{edge_prob}_tp{trans_prob}_seed{seed}'
figdir = os.path.join('figs', 'TransExploration', lbl)
Path(figdir).mkdir(parents=True, exist_ok=True)

sc.options(dpi=200)

def hash(df):
    #return int(hashlib.sha256(hash_pandas_object(df, index=True).values).hexdigest(), 16)
    return hashlib.sha256(hash_pandas_object(df, index=True).values).hexdigest()[:6]

def random(G):
    src = nx.get_node_attributes(G, 'infected')
    infected = []
    tx = []

    # Randomize in case a node is infected multiple times
    el = list(G.edges())
    np.random.shuffle(el)

    for (n1,n2) in el:
        # n1 --> n2
        if src[n1] and (not src[n2]) and (n2 not in infected) and (np.random.rand() < trans_prob):
            tx.append((n1, n2))
            infected.append(n2)
        # n2 --> n1
        if src[n2] and (not src[n1]) and (n1 not in infected) and (np.random.rand() < trans_prob):
            tx.append((n2, n1))
            infected.append(n1)
    return tx

def modulo(G):
    src = nx.get_node_attributes(G, 'infected')
    infected = []
    tx = []
    r1 = np.random.rand(n)
    r2 = np.random.rand(n)
    el = list(G.edges())
    np.random.shuffle(el)
    for (n1,n2) in el:
        # n1 --> n2
        if (src[n1]) and (not src[n2]) and (n2 not in infected) and (((r1[n1]+r2[n2])%1) < trans_prob):
            tx.append((n1, n2))
            infected.append(n2)
        # n2 --> n1
        if (src[n2]) and (not src[n1]) and (n1 not in infected) and (((r1[n2]+r2[n1])%1) < trans_prob):
            tx.append((n2, n1))
            infected.append(n1)
    return tx

def roulette(G):
    src = nx.get_node_attributes(G, 'infected')
    tx = []
    for n2 in G.nodes():
        # All nodes --> n2
        if not src[n2]:
            srcs = [n for n in G.neighbors(n2) if src[n]] #[src[n] for n in G.neighbors(n2)] # Count infected neighbors
            cnt = len(srcs)
            if np.random.rand() < 1-(1-trans_prob)**cnt:
                n1 = np.random.choice(srcs)
                tx.append((n1, n2))
    return tx

def transmit(G, trans_fn):
    # Align transmissions from tx_in if provided
    txs = {}
    counts = {}

    for _ in np.arange(reps):
        txl = trans_fn(G)
        tx = pd.DataFrame(txl, columns=['src', 'dst']).sort_values(['src', 'dst']).reset_index(drop=True)

        h = hash(tx)

        if h not in txs:
            txs[h] = tx

        counts[h] = counts.get(h, 0) + 1

    df = pd.DataFrame(counts.values(), index=pd.Index(counts.keys(), name='Hash'), columns=['Counts'])
    return txs, df

# Build the graph
G = nx.random_graphs.erdos_renyi_graph(n=n, p=edge_prob, seed=seed)

# Seed infections
infected = {i:False for i in range(n)}
sources = np.random.choice(a=range(n), size=n_sources, replace=False)
for source in sources:
    infected[source] = True
nx.set_node_attributes(G, infected, 'infected')

# Do transmissions via each method in parallel
results = sc.parallelize(transmit, iterkwargs=[{'trans_fn':random}, {'trans_fn':modulo}, {'trans_fn':roulette}], kwargs={'G':G}, die=True, serial=False)
tx, cnt = zip(*results)

df = pd.concat(cnt, axis=1) \
    .fillna(0) \
    .astype(int)
df.columns = ['Random', 'Modulo', 'Roulette']

# Manipulate results
df.reset_index(inplace=True)
dfm = df.melt(id_vars='Hash', var_name='Method', value_name='Count')

# Plot
g = sns.barplot(data=dfm, x='Hash', y='Count', hue='Method')
plt.xticks(rotation=90)
g.figure.tight_layout()
sc.savefig(os.path.join(figdir, 'Bar.png'), g.figure)

txc = tx[0].copy()
for i in range(1, len(tx)):
    txc.update(tx[i])
for h, v in txc.items():
    print(f'\nUnique transmission tree #{h}')
    print(v)
sc.savejson(os.path.join(figdir, 'Transmissions.json'), txc)

f = plt.figure(figsize=(12,8))
pos = nx.spring_layout(G, seed=3113794652)  # positions for all nodes
ec = nx.draw_networkx_edges(G, pos, alpha=0.2)
colors = ['red' if infected else 'blue' for infected in nx.get_node_attributes(G, 'infected').values()]
nc = nx.draw_networkx_nodes(G, pos, nodelist=G.nodes(), node_color=colors, node_size=100)
nx.draw_networkx_edges(G, pos, width=1.0, alpha=0.5)
nx.draw_networkx_labels(G, pos)
sc.savefig(os.path.join(figdir, 'Network.png'), f)

plt.show()

This code produces the following figures: Network Fig 1 shows the transmission network consisting of 8 nodes, two of which are initially infected

BarHighRes Fig 2 shows the frequency of transmission trees for each of three transmission algorithms: Random, Modulo, and Roulette. The x-axis is a hash representing each unique transmission pattern, e.g. (1-->4, 5-->0, 5-->7). The transmissions associated with each hash are available in Transmissions.json.

What we observe in the bar frequencies is that Random (blue) and Roulette (green) have very similar frequencies across all possible transmission patterns, but Modulo (orange) has a different frequency pattern. This difference indicates a statistical bias in transmission.

These results come from repeating transmission on the above network 10M times, so it is unlikely that these differences are caused by stochastic fluctuations. (We could perform a statistical test, but no need.)

Note that in my testing, there are some simple configurations in which the modulo approach matches the other two. For example, if testing with just 5 nodes, the output matches the other transmission algorithms. But even with n=8 nodes as shown here, the bias is already clear.

Unless I have made a mistake in the implementation, I fear this study suggests that we need to revert to the "roulette-based" approach to transmission.

cc @cliffckerr

cliffckerr commented 4 months ago

TLDR: you're right, something is going on here!

I adapted your code to plot in a different way; definitely some performance optimizations to be had here (e.g. parallelize by reps instead of by function), but haven't implemented those for now (but did do some Sciris simplifications that I couldn't resist :D):

Results

Code

import numpy as np
import pandas as pd
import networkx as nx
import matplotlib.pyplot as plt
from pandas.util import hash_pandas_object
import sciris as sc

# PARAMETERS
n = 8 # Number of nodes
n_sources = 2 # Number of sources (seed infections)
reps = 100_000
edge_prob = 0.7 # Edge probability
trans_prob = 0.4
seed = 3

lbl = f'n{n}_s{n_sources}_r{reps}_ep{edge_prob}_tp{trans_prob}_seed{seed}'
figdir = sc.path('figs', 'TransExploration', lbl)
figdir.mkdir(parents=True, exist_ok=True)

sc.options(dpi=200)

def hash(df):
    #return int(hashlib.sha256(hash_pandas_object(df, index=True).values).hexdigest(), 16)
    return sc.sha(hash_pandas_object(df, index=True).values).hexdigest()[:6]

def random(G):
    src = nx.get_node_attributes(G, 'infected')
    infected = []
    tx = []

    # Randomize in case a node is infected multiple times
    el = list(G.edges())
    np.random.shuffle(el)

    for (n1,n2) in el:
        # n1 --> n2
        if src[n1] and (not src[n2]) and (n2 not in infected) and (np.random.rand() < trans_prob):
            tx.append((n1, n2))
            infected.append(n2)
        # n2 --> n1
        if src[n2] and (not src[n1]) and (n1 not in infected) and (np.random.rand() < trans_prob):
            tx.append((n2, n1))
            infected.append(n1)
    return tx

def modulo(G):
    src = nx.get_node_attributes(G, 'infected')
    infected = []
    tx = []
    r1 = np.random.rand(n)
    r2 = np.random.rand(n)
    el = list(G.edges())
    np.random.shuffle(el)
    for (n1,n2) in el:
        # n1 --> n2
        if (src[n1]) and (not src[n2]) and (n2 not in infected) and (((r1[n1]+r2[n2])%1) < trans_prob):
            tx.append((n1, n2))
            infected.append(n2)
        # n2 --> n1
        if (src[n2]) and (not src[n1]) and (n1 not in infected) and (((r1[n2]+r2[n1])%1) < trans_prob):
            tx.append((n2, n1))
            infected.append(n1)
    return tx

def roulette(G):
    src = nx.get_node_attributes(G, 'infected')
    tx = []
    for n2 in G.nodes():
        # All nodes --> n2
        if not src[n2]:
            srcs = [n for n in G.neighbors(n2) if src[n]] #[src[n] for n in G.neighbors(n2)] # Count infected neighbors
            cnt = len(srcs)
            if np.random.rand() < 1-(1-trans_prob)**cnt:
                n1 = np.random.choice(srcs)
                tx.append((n1, n2))
    return tx

def transmit(G, trans_fn):
    # Align transmissions from tx_in if provided
    txs = {}
    counts = {}

    for _ in np.arange(reps):
        txl = trans_fn(G)
        tx = pd.DataFrame(txl, columns=['src', 'dst']).sort_values(['src', 'dst']).reset_index(drop=True)

        h = hash(tx)

        if h not in txs:
            txs[h] = tx

        counts[h] = counts.get(h, 0) + 1

    df = pd.DataFrame(counts.values(), index=pd.Index(counts.keys(), name='Hash'), columns=['Counts'])
    return txs, df

T = sc.timer()

# Build the graph
G = nx.random_graphs.erdos_renyi_graph(n=n, p=edge_prob, seed=seed)

# Seed infections
infected = {i:False for i in range(n)}
sources = np.random.choice(a=range(n), size=n_sources, replace=False)
for source in sources:
    infected[source] = True
nx.set_node_attributes(G, infected, 'infected')

# Do transmissions via each method in parallel
results = sc.parallelize(transmit, iterkwargs=[{'trans_fn':random}, {'trans_fn':modulo}, {'trans_fn':roulette}], kwargs={'G':G}, die=True, serial=False)
tx, cnt = zip(*results)

df = pd.concat(cnt, axis=1) \
    .fillna(0) \
    .astype(int)
df.columns = ['Random', 'Modulo', 'Roulette']

# Manipulate results
df.reset_index(inplace=True)
dfm = df.melt(id_vars='Hash', var_name='Method', value_name='Count')

# Alternate plotting
keys = ['Random', 'Modulo', 'Roulette']
colors = {'Random':'k', 'Modulo':'slategray', 'Roulette':'red'}
z = sc.objdict()
for k in keys:
    z[k] = dfm[dfm.Method==k].Count.values

order = np.argsort(z.Random)

fig = plt.figure()
for k in keys:
    plt.plot(z[k][order], 'o-', alpha=0.7, label=k, c=colors[k])
plt.legend()

f = plt.figure(figsize=(12,8))
pos = nx.spring_layout(G, seed=3113794652)  # positions for all nodes
ec = nx.draw_networkx_edges(G, pos, alpha=0.2)
colors = ['red' if infected else 'blue' for infected in nx.get_node_attributes(G, 'infected').values()]
nc = nx.draw_networkx_nodes(G, pos, nodelist=G.nodes(), node_color=colors, node_size=100)
nx.draw_networkx_edges(G, pos, width=1.0, alpha=0.5)
nx.draw_networkx_labels(G, pos)
sc.savefig(figdir/'Network.png', f)

T.toc()
plt.show()

cliffckerr commented 4 months ago

I still wasn't sure whether these differences would be enough to affect the overall results, but it seems like they are, unfortunately. With a (super)simple ABM approach, the modulo method results in fewer infections, which would be as expected if there are induced correlations between different possible transmission events:

Results

Code

import numpy as np
import sciris as sc
import pylab as pl
# import numba as nb

# @nb.njit # No faster -- not sure why not
def get_probs(n, inds, trg, trans_prob):
    probs = np.ones(n)
    trg_inds = trg[inds]
    for trg_ind in trg_inds:
        probs[trg_ind] *= (1-trans_prob)
    probs = 1 - probs
    return probs

# PARAMETERS
p = sc.objdict(
    n = 100, # Number of nodes
    n_sources = 2, # Number of sources (seed infections)
    edge_prob = 0.1, # Edge probability
    trans_prob = 0.05,
    seed = 6,
    npts = 10,
)

class Sim:
    def __init__(self, p, method='random', repseed=0):
        self.p = p
        self.repseed = repseed
        self.method = method
        self.initialize()
        self.make_network()

    def initialize(self):
        np.random.seed(self.p.seed+self.repseed)
        n = self.p.n
        self.uids = np.arange(n)
        self.inf = np.zeros(n, dtype=int)
        self.inf[0:self.p.n_sources] = 1
        self.xpos = np.random.rand(n)
        self.ypos = np.random.rand(n)
        self.t = 0
        self.tvec = np.arange(p.npts+1)
        self.n_infected = p.n_sources*np.ones(p.npts+1)

    def make_network(self):
        p = self.p
        n_possible = p.n**2
        n_edges = np.random.binomial(n_possible, p.edge_prob)
        p1 = np.random.choice(self.uids, size=n_edges)
        p2 = np.random.choice(self.uids, size=n_edges)
        self.src = sc.cat(p1, p2) # Bidirectional transmission
        self.trg = sc.cat(p2, p1)

    def get_edges(self):
        possible = (self.inf[self.src]==1) * (self.inf[self.trg]==0)
        inds = sc.findinds(possible)
        return inds

    def trans_rand(self, inds):
        r = np.random.rand(len(inds))
        trans = self.p.trans_prob > r
        trans_inds = inds[trans]
        inf_inds = self.trg[trans_inds]
        return inf_inds

    def trans_modulo(self, inds):
        r1 = np.random.rand(self.p.n)
        r2 = np.random.rand(self.p.n)
        # np.random.shuffle(r2)
        r1 = r1[self.src[inds]]
        r2 = r2[self.trg[inds]]
        r = (r1 + r2) % 1.0
        trans = self.p.trans_prob > r
        trans_inds = inds[trans]
        inf_inds = self.trg[trans_inds]
        return inf_inds

    def trans_roulette(self, inds):
        trans_prob = self.p.trans_prob
        n = self.p.n
        trg = self.trg
        probs = get_probs(n, inds, trg, trans_prob)
        r = np.random.rand(self.p.n)
        inf_inds = probs > r
        return inf_inds

    def step(self):
        self.t += 1
        inds = self.get_edges()
        if self.method == 'random':
            inf_inds = self.trans_rand(inds)
        elif self.method == 'roulette':
            inf_inds = self.trans_roulette(inds)
        elif self.method == 'modulo':
            inf_inds = self.trans_modulo(inds)
        else:
            raise Exception('Method not found')
        self.inf[inf_inds] = True
        self.n_infected[self.t] = self.inf.sum()

    def run(self):
        for t in range(self.p.npts):
            self.step()
            if self.n_infected[self.t] == self.p.n:
                self.n_infected[self.t+1:] = self.p.n
                break
        return

    def plot_network(self):
        fig = pl.figure()
        x = self.xpos
        y = self.ypos
        inf = self.inf == 1
        ninf = self.inf == 0
        pl.scatter(x[ninf], y[ninf], c='green')
        pl.scatter(x[inf], y[inf], c='red')
        for p1,p2 in zip(self.src, self.trg):
            pl.plot([x[p1], x[p2]], [y[p1], y[p2]], alpha=0.2, c='k', lw=0.1)
        return fig

    def plot(self):
        fig = pl.figure()
        pl.plot(self.tvec, self.n_infected)
        return fig

def run_sim(sim):
    sim.run()
    return sim

class MultiSim:
    def __init__(self, p, method='random', reps=10):
        self.p = p
        self.method = method
        self.reps = reps
        self.initialize()

    def initialize(self):
        self.sims = [Sim(p=self.p, method=self.method, repseed=rep) for rep in range(self.reps)]

    def run(self):
        self.sims = sc.parallelize(run_sim, self.sims)
        self.results = np.zeros((self.p.npts+1, self.reps))
        for i,sim in enumerate(self.sims):
            self.results[:,i] = sim.n_infected

    def plot_all(self):
        pl.figure()
        pl.plot(self.results, alpha=0.2)

    def plot_stats(self, ax=None):
        if ax is None:
            fig,ax = pl.subplots(1)
        pl.sca(ax)
        qs = [0, 0.0025, 0.01, 0.05, 0.1, 0.25]
        res = self.results
        mean = res.mean(axis=1)
        std = res.std(axis=1)
        x = np.arange(len(mean))
        for q in qs:
            low = np.quantile(res, q, axis=1)
            high = np.quantile(res, 1-q, axis=1)
            pl.fill_between(x, low, high, alpha=0.1, facecolor='red')
        pl.plot(mean, 'o-', c='k', lw=2)
        pl.xlabel('Timestep')
        pl.ylabel('Number infected')
        pl.title(f'Method: {self.method}; stats: μ={mean.mean():0.4f}, σ={std.mean():0.4f}')
        pl.xlim(left=0, right=p.npts)
        pl.ylim(bottom=0, top=p.n)

if __name__ == '__main__':
    sc.options(dpi=200)
    T = sc.timer()
    fig, axes = pl.subplots(1,3, figsize=(18,8))
    for m,method in enumerate(['random', 'roulette', 'modulo']):
        msim = MultiSim(p=p, method=method, reps=1_000)
        T.tic()
        msim.run()
        T.toc(method)
        msim.plot_stats(ax=axes[m])
    sc.figlayout()
    pl.show()

Next step: figure out where exactly these correlations are coming from, and if there's an easy way to prevent them. If not, next step is trying to find a version of the roulette method that has negligible performance penalty.

daniel-klein commented 4 months ago

Thanks for the deep investigation, @cliffckerr!

Agree something is off with the modulo approach for transmission on complex networks (it seems fine on small/simple networks). My explorations to determine the source of the correlation led to pyramid and larger structures of connected nodes wherein non-random patterns result from reusing random draws along closed loops of 4+ nodes. But it wasn't a full investigation.

Regarding performance, my final attempt at the "roulette" (that's not the right name for this approach) implementation seemed to have performance on par with random transmission. I don't recall seeing much of a performance penalty, as with the earlier implementations. Again, wasn't a full analysis, but it didn't seem too bad. My understanding is that the desire to replace it with the modulo approach was more about code simplicity, but it has been long enough now that I don't recall if there was also a performance difference.

I have started reimplementing the "acquisition-based" approach (better name?), should have time this quite week to finish and then we can look at performance in PR.

cliffckerr commented 4 months ago

Now I'm confused again. I would've expected the matrix to have the right mean, but the row-sum counts to be off due to induced correlations. But they match!

These results compute the NxN transmission matrix for the truly-random and modulo case. Rows are targets, columns are sources. A target agent is "infected" if at least one edge (entry in that row) is above the threshold (i.e. if the sum of the row is greater than 0).

Based on the investigations above, I had expected that the statistics would match for the full matrix in terms of the number of above-threshold edges, but that the number of infected agents wouldn't match due to induced correlations (N^2 vs. 2N). But in fact they both match:

This makes me very confused, because now I'm not sure what was causing the difference in the "ABM" results above. Conceptually I thought this case and that case were equivalent.

"""
What -- this shows no difference in the number of infections!!
"""

import numpy as np
import pylab as pl
import sciris as sc

n = 100
reps = 10_000
nc = 4
thresh = 1-(0.5/n)
sc.options(dpi=200)
pl.figure(figsize=(16,12))

d = sc.objdict(defaultdict=sc.autolist)

def matsort(arr):
    s0 = arr.sum(axis=0)
    s1 = arr.sum(axis=1)
    o0 = np.argsort(s0)
    o1 = np.argsort(s1)
    arr = arr[o1,:][:,o0]
    return arr

for rep in range(reps):

    # Truly random
    A = np.random.rand(n,n)
    A = matsort(A)

    # Modulo
    b1 = np.random.rand(n,1)
    b2 = np.random.rand(1,n)
    B1 = np.tile(b1, (1,n))
    B2 = np.tile(b2, (n,1))
    B = (B1 + B2) % 1.0
    B = matsort(B)

    # Thresholded
    At = A>thresh
    Bt = B>thresh
    v = sc.objdict()
    v.Asum = A.sum()
    v.Bsum = B.sum()
    v.Atsum = (At.sum(axis=1) > 0).sum()
    v.Btsum = (Bt.sum(axis=1) > 0).sum()

    for k,val in v.items():
        d[k] += val

#%% Plotting
for i,k,val in d.enumitems():
    pl.subplot(2,2,i+1)
    pl.hist(val, bins=int(np.sqrt(reps)))
    pl.title(f'k {np.array(val).mean():n}')

sc.figlayout()
pl.show()

cliffckerr commented 4 months ago

And indeed, there is no measurable impact on Starsim results between v0.5.2 (modulo) and v0.5.3 (roulette).

This script stores 10,000 simulations from each branch (modulo vs. roulette) and compares the outputs (specifically sim.summary). All the results match to within ±2 SEM (even without doing a correction for multiple comparisons):

Code:

"""
To use:
1. Run this script on `main` to generate `rng_stat_modulo.json`
2. Change to `revert-transmission` and run the script again to generate `rng_stat_roulette.json`
3. Run a third time to plot the results

This script should take about 2 minutes to run on `main` and 10 minutes to run on `revert-transmission`.
"""
import os
import numpy as np
import sciris as sc
import starsim as ss
import pylab as pl

reps = 10_000
keys = {'0.5.2':'modulo', '0.5.3':'roulette'}
filenames = {k:f'rng_stat_{k}.json' for k in keys.values()}

def make_sim(seed):
    pars = dict(
        rand_seed = seed,
        n_agents = 1000,
        end = 2020,
        diseases = ['sir', 'sis'],
        networks = 'random',
        verbose = 0,
    )
    sim = ss.Sim(pars)
    return sim

def run_sim_summary(seed):
    sim = make_sim(seed)
    sim.run()
    return sim.summary

def plot_sim():
    sim = make_sim()
    fig = sim.run().plot()
    return fig

def run_sims():
    ver = ss.__version__
    sc.heading(f'Running {ver}...')
    T = sc.timer()
    reslist = sc.parallelize(run_sim_summary, np.arange(reps))
    res = dict()
    for k in reslist[0].keys():
        res[k] = []
    for entry in reslist:
        for k in res.keys():
            res[k].append(entry[k])
    key = keys[ver]
    filename = filenames[key]
    sc.savejson(filename, res)
    T.toc()
    return

def get_raw():
    raw = sc.objdict()
    for k,filename in filenames.items():
        if not os.path.exists(filename):
            run_sims()
            print('Now change branch (if needed) and re-run')
            break
        raw[k] = sc.loadjson(filename)
    return raw

def get_res(raw):
    res = sc.objdict()
    for k,v in raw.items():
        res[k] = sc.objdict()
        for k2,v2 in v.items():
            mean = np.mean(v2)
            sem = np.std(v2)/np.sqrt(len(v2))
            res[k][k2] = sc.objdict(mean=mean, sem=sem)
    return res

def plot(raw, res):
    keys = res[0].keys()
    nkeys = len(keys)

    sc.options(dpi=150)
    fig = pl.figure(figsize=(16,16))
    count = 0
    for i,k in enumerate(keys):
        mod = res.modulo[k]
        rou = res.roulette[k]
        z = 2
        modmin = mod.mean - z*mod.sem
        modmax = mod.mean + z*mod.sem
        roumin = rou.mean - z*rou.sem
        roumax = rou.mean + z*rou.sem
        statsig = True if (modmax < roumin) or (modmin > roumax) else False
        for i2,k2 in enumerate(res.keys()):
            count += 1
            this = res[k2][k]
            title = f'{k2}\n{k}\n{this.mean:n}±{2*this.sem:n}'
            if statsig:
                title += '\n(mismatch!)'
            pl.subplot(nkeys//2,4,count)
            pl.hist(raw[k2][k], bins=int(np.sqrt(reps)))
            pl.title(title)

    sc.figlayout()
    pl.show()
    return fig

if __name__ == '__main__':
    raw = get_raw()
    res = get_res(raw)
    fig = plot(raw, res)

I feel we're still not quite at the bottom of this, but I'm feeling reassured that there doesn't seem to be any difference in the final results, even though I agree that some of the intermediate calculation results differ.

daniel-klein commented 4 months ago

I suspect the correlations of the modulo approach are lost in the random mixing network in the above examples.

RomeshA commented 4 months ago

I had a quick look at this, my impression is that the issue is that the output of (a+b)%1 is continuous in the sense that a small change to a means there is a small change in (a+b)%1. So if you take Cliff's matrix, and sort it by the values of the first random, you get something like

If that's the case, then we might have better results with a function that behaves more like a hash, where a small difference in a or b results in large change to the output. Just by trial and error, something like this might be suitable:

def f(a,b):
    a1 = (a*1e9).astype(int)
    b1 = (b*1e9).astype(int)
    c = np.bitwise_xor(a1*b1, a1-b1)
    c = c.astype(float)
    c = c-c.min()
    c = c/c.max()
    return c

For Cliff's test this gives

The sorted matrix looks like

For Dan's original test

indep 0.49965 0.49917657968789975
modulo 0.500639 0.6689087176373789
xor 0.500523 0.5012711661150285

For Cliff's network

The distribution does seem pretty uniform

    a = np.random.rand(1000000)
    b = np.random.rand(1000000)
    a1 = (a*1e9).astype(int)
    b1 = (b*1e9).astype(int)
    c = np.bitwise_xor(a1*b1, a1-b1)
    c = c.astype(float)
    c = c-c.min()
    c = c/c.max()
    x,y = np.histogram(a, 1000)
    pl.plot(y[1:],x, label='a', alpha=0.5)
    x,y = np.histogram(b, 1000)
    pl.plot(y[1:],x, label='b', alpha=0.5)
    x,y = np.histogram(c, 1000)
    pl.plot(y[1:],x, label='c', alpha=0.5)
    pl.legend()

And performance wise it's actually a little faster than the original modulus code. I haven't had time to try all of the other tests above and admittedly the selection of that function is a bit arbitrary - but this might be a promising avenue to explore

cliffckerr commented 4 months ago

@RomeshA , this is amazing.

cliffckerr commented 4 months ago

Although ... it looks like the resultant distribution isn't uniformly distributed?