popsim-consortium / demes-paper

Manuscript for the demes package for demographic model specification.

Discussion of compression of YAMLs plus multidoc #21

Closed: jeromekelleher closed this issue 3 years ago

jeromekelleher commented 3 years ago

Maybe mention that this will compress well/easily due to the information redundancy. Could also mention multi-document YAML files here. We should definitely hack some support for this into demes-python to get an idea of what multidoc files look like from an API perspective.
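For reference, a multi-document stream just concatenates resolved models separated by `---` lines. An illustrative sketch (field names follow the Demes spec; the values are made up):

```yaml
time_units: generations
demes:
  - name: A
    epochs:
      - {start_size: 1000, end_time: 0}
---
time_units: generations
demes:
  - name: A
    epochs:
      - {start_size: 5000, end_time: 0}
```

The repeated top-level structure across documents is exactly the redundancy that should compress well.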

Digression: assuming no compression, I expect one multi-document file will be smaller than an equivalent number of separate files, because the space consumed by a file is an integer multiple of the filesystem's block size and practical models are very likely smaller than the filesystem's block size (often 4kb).

_Originally posted by @grahamgower in https://github.com/popsim-consortium/demes-paper/pull/20#discussion_r655141925_
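The block-size point is easy to check empirically (an illustrative sketch, not from the thread): a file's allocated disk space is rounded up to the filesystem block size, so many tiny YAML files can each occupy a full 4 KiB block, while a single multi-document file pays that rounding cost only once.

```python
import os
import tempfile

# Compare a small file's apparent size with its allocated disk space.
# POSIX reports allocation via st_blocks, counted in 512-byte units.
with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "model.yaml")
    with open(path, "w") as f:
        f.write("description: tiny model\n" * 10)  # 240 bytes of content
    st = os.stat(path)
    print("apparent size:", st.st_size)      # 240
    print("allocated:", st.st_blocks * 512)  # typically 4096 on ext4
```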

jeromekelleher commented 3 years ago

So, what I think we want for the paper is a quick exploration of these issues on a realistic-ish example.

Any ideas of what we should do?

grahamgower commented 3 years ago

Here's some code to get the ball rolling.

# multidoc.py
import sys
import itertools

import numpy as np
import demes

def piecewise_constant(rng, max_epochs=40, T_max=1_000_000, N_min=10, N_max=1_000_000):
    """
    One deme model with piecewise-constant sizes. The number of epochs
    and their time spans are random variables, as are the sizes in each epoch.
    Deme sizes in each epoch are simulated as a random walk.
    """
    num_epochs = rng.integers(low=2, high=max_epochs)
    T = None
    while T is None or len(T) > len(set(T)):
        T = rng.integers(low=1, high=T_max, size=num_epochs - 1)
    T = sorted(T, reverse=True) + [0]

    N = [rng.integers(low=N_min, high=N_max)]
    for j in range(num_epochs - 1):
        N_low = max(N_min, N[j] // 2)  # integer division: rng.integers expects integer bounds
        N_high = min(N_max, N[j] * 2)
        N.append(rng.integers(low=N_low, high=N_high))

    assert len(T) == num_epochs
    assert len(N) == num_epochs

    b = demes.Builder()
    b.add_deme(
        "A",
        epochs=[dict(start_size=size, end_time=time) for size, time in zip(N, T)],
    )
    return b.resolve()

def IM(rng, num_demes=3, N_min=10, N_max=100_000, T_max=100_000, mig_max=0.05):
    """
    Isolation with migration model where the split time,
    the (constant) population size, and the (symmetric) migration rate
    are random variables.
    """
    N = rng.integers(low=N_min, high=N_max)
    T = rng.integers(low=1, high=T_max)
    mig_rate = rng.uniform(low=0, high=mig_max)
    b = demes.Builder()
    b.add_deme("ancestral", epochs=[dict(start_size=N, end_time=T)])
    for j in range(num_demes):
        b.add_deme(f"deme{j}", ancestors=["ancestral"], epochs=[dict(start_size=N)])
    b.add_migration(demes=[f"deme{j}" for j in range(num_demes)], rate=mig_rate)
    return b.resolve()

if __name__ == "__main__":
    if len(sys.argv) != 2 or sys.argv[1] not in ("read", "write_pc", "write_im"):
        print(f"usage: {sys.argv[0]} {{read | write_pc | write_im}}")
        sys.exit(1)

    cmd = sys.argv[1]
    rng = np.random.default_rng(1234)
    num_reps = 50_000

    if cmd == "read":
        list(demes.load_all(sys.stdin))
    elif cmd == "write_pc":
        models = map(piecewise_constant, itertools.repeat(rng, num_reps))
        demes.dump_all(models, sys.stdout)
    elif cmd == "write_im":
        models = map(IM, itertools.repeat(rng, num_reps))
        demes.dump_all(models, sys.stdout)

Piecewise constant.

t490:tmp $ time python multidoc.py write_pc > pc.yaml

real    0m59,752s
user    0m58,757s
sys     0m1,670s
t490:tmp $ time python multidoc.py read < pc.yaml

real    0m54,024s
user    0m53,894s
sys     0m0,964s
t490:tmp $ gzip -c pc.yaml > pc.yaml.gz
t490:tmp $ ls -l pc.yaml*
-rw-r--r-- 1 grg grg 46588948 Jun 23 17:53 pc.yaml
-rw-r--r-- 1 grg grg  8761466 Jun 23 19:48 pc.yaml.gz
t490:tmp $ ls -lh pc.yaml*
-rw-r--r-- 1 grg grg  45M Jun 23 17:53 pc.yaml
-rw-r--r-- 1 grg grg 8,4M Jun 23 19:48 pc.yaml.gz

Isolation with migration.

t490:tmp $ time python multidoc.py write_im > im.yaml

real    0m40,738s
user    0m39,950s
sys     0m1,627s
t490:tmp $ time python multidoc.py read < im.yaml

real    0m30,290s
user    0m30,277s
sys     0m0,875s
t490:tmp $ gzip -c im.yaml > im.yaml.gz
t490:tmp $ ls -l im.yaml*
-rw-r--r-- 1 grg grg 21853267 Jun 23 19:49 im.yaml
-rw-r--r-- 1 grg grg  1656024 Jun 23 19:51 im.yaml.gz
t490:tmp $ ls -lh im.yaml*
-rw-r--r-- 1 grg grg  21M Jun 23 19:49 im.yaml
-rw-r--r-- 1 grg grg 1,6M Jun 23 19:51 im.yaml.gz
jeromekelleher commented 3 years ago

Very nice, thanks @grahamgower. For comparison, what's the absolute minimum space we could store this information in? I guess for the pc model it's the size of the N and T arrays, which is 2 * 50_000 * 4 bytes = ~400 KB. So the gzipped output is only a small constant factor more than this, I think?

grahamgower commented 3 years ago

The PC model above has on average ~20 epochs, so 20 * 2 * 50_000 * 4 bytes = ~8 MB. Those 4-byte integers are also compressible though, so the theoretical minimum size is less than this.

I've also compressed using xz -9e, which achieves a world-class compression ratio (presumably quite close to the theoretical best ratio based on the information content).

t490:tmp $ ls -lh im.yaml*
-rw-r--r-- 1 grg grg  21M Jun 23 19:49 im.yaml
-rw-r--r-- 1 grg grg 1,6M Jun 23 19:51 im.yaml.gz
-rw-r--r-- 1 grg grg 886K Jun 24 11:53 im.yaml.xz
t490:tmp $ ls -lh pc.yaml*
-rw-r--r-- 1 grg grg  45M Jun 23 17:53 pc.yaml
-rw-r--r-- 1 grg grg 8,4M Jun 23 19:48 pc.yaml.gz
-rw-r--r-- 1 grg grg 6,2M Jun 24 11:54 pc.yaml.xz
t490:tmp $ ls -l im.yaml*
-rw-r--r-- 1 grg grg 21853267 Jun 23 19:49 im.yaml
-rw-r--r-- 1 grg grg  1656024 Jun 23 19:51 im.yaml.gz
-rw-r--r-- 1 grg grg   906724 Jun 24 11:53 im.yaml.xz
t490:tmp $ ls -l pc.yaml*
-rw-r--r-- 1 grg grg 46588948 Jun 23 17:53 pc.yaml
-rw-r--r-- 1 grg grg  8761466 Jun 23 19:48 pc.yaml.gz
-rw-r--r-- 1 grg grg  6430972 Jun 24 11:54 pc.yaml.xz
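The ~8 MB back-of-envelope figure can be sketched directly (assumed parameters, not the exact models above): pack ~20 epochs x 2 values (size and time) x 50,000 models as 4-byte integers, then see how far below the raw size a general-purpose compressor gets, since even uniform random integers below 10^6 leave the high bits of each int32 mostly zero.

```python
import gzip

import numpy as np

# Pack 50,000 models x 20 epochs x 2 int32 values (size N, time T).
rng = np.random.default_rng(1234)
num_models, num_epochs = 50_000, 20
N = rng.integers(10, 1_000_000, size=(num_models, num_epochs), dtype=np.int32)
T = rng.integers(0, 1_000_000, size=(num_models, num_epochs), dtype=np.int32)
raw = N.tobytes() + T.tobytes()
print(len(raw))                    # 8_000_000 bytes, i.e. ~8 MB
print(len(gzip.compress(raw, 9)))  # smaller: the int32s themselves compress
```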
jeromekelleher commented 3 years ago

OK, great. If we're compressing down to roughly the size of the minimal raw binary representation, then that's pretty good. We're not trying to claim this is the most compact possible representation, just that it's reasonably compact.

jeromekelleher commented 3 years ago

Can you add a quick discussion of this to the paper? We're only looking for two or three sentences, really. There's a placeholder for it.