tskit-dev / tszip

Gzip-like compression for tskit tree sequences
https://tszip.readthedocs.io/
MIT License
4 stars 7 forks source link

cli now supports decompressing to stdout #44

Closed aabiddanda closed 3 years ago

aabiddanda commented 3 years ago

Inspired by gzip where one can decompress to stdout I've added a flag to the cli -c that allows one to decompress the tree-sequence to stdout and redirect to a file of their choosing. This was inspired by a use-case where I wanted to decompress a single simulation to multiple test cases.

I'm happy for this edit to be rewritten or a broader discussion of whether this belongs in the CLI.

jeromekelleher commented 3 years ago

Don't worry about the CI tests failing @aabiddanda, there's a bit of breakage in our CI which I'm sorting out in #45. Just get the tests passing locally and it'll be fine.

codecov-commenter commented 3 years ago

Codecov Report

Merging #44 (db80be0) into main (55bc8c4) will decrease coverage by 0.59%. The diff coverage is 75.00%.

Impacted file tree graph

@@            Coverage Diff             @@
##             main      #44      +/-   ##
==========================================
- Coverage   97.70%   97.10%   -0.60%     
==========================================
  Files           6        6              
  Lines         305      311       +6     
  Branches       55       57       +2     
==========================================
+ Hits          298      302       +4     
- Misses          5        6       +1     
- Partials        2        3       +1     
Impacted Files Coverage Δ
tszip/cli.py 98.01% <75.00%> (-1.99%) :arrow_down:

Continue to review full report at Codecov.

Legend - Click here to learn more Δ = absolute <relative> (impact), ø = not affected, ? = missing data Powered by Codecov. Last update 55bc8c4...db80be0. Read the comment docs.

aabiddanda commented 3 years ago

Is there a clear way in python to obtain the path to file-like objects for stdout? Building the tests using outfile = pathlib.Path('/dev/stdout') was substantially more efficient to avoid fileno errors.

jeromekelleher commented 3 years ago

I had a play with this @aabiddanda and it really is quite fiddly to get stdout diverted correctly here. This is what I came up with:

def capture_output(func, *args, binary=False, **kwargs):
    """
    Runs the specified function and arguments, and returns the
    tuple (stdout, stderr) as strings.
    """
    with tempfile.TemporaryDirectory() as tmpdir:
        stdout_path = pathlib.Path(tmpdir) / "stdout"
        stderr_path = pathlib.Path(tmpdir) / "stderr"
        mode = "wb+" if binary else "w+"
        saved_stdout = sys.stdout
        saved_stderr = sys.stderr
        with open(stdout_path, mode) as stdout, open(stderr_path, mode) as stderr:
            try:
                sys.stdout = stdout
                sys.stderr = stderr
                with mock.patch("signal.signal"):
                    func(*args, **kwargs)
                stdout.seek(0)
                stderr.seek(0)
                stdout_output = stdout.read()
                stderr_output = stderr.read()
            finally:
                sys.stdout = saved_stdout
                sys.stderr = saved_stderr
    return stdout_output, stderr_output

by using a real file, we avoid the fileno problem and so we can work directly with sys.stdout. We need the binary argument as otherwise we try to interpret the tskit output as unicode and it borks. There's probably a more elegant way to do it, but this works :smile:

aabiddanda commented 3 years ago

Main TODO left on the testing front is to address the following: """ Add a test to make sure that we're correctly outputting a stream of multiple tree sequences to stdout when we have multiple input files, which can be read by tskit. (tskit.load() should consume them one-by-one from a file). """

I'll be addressing this shortly

aabiddanda commented 3 years ago

Upon some reflection/checking it seems like this type of reading multiple tree-sequences from a single file (compressed or otherwise) is a no-go and this is not a feature in tskit.load. Unless I am missing something very obvious here. I could see two ways to go:

  1. Put in an error if trying to decompress more than 1 file to stdout (saying that this is not currently supported)
  2. Wait till tskit.load supports loading multiple tree-sequences from the same file to merge this PR in.

The first seems more achievable quickly ... Example code illustrating this issue:

import msprime
import tskit
import numpy as np
import os

print(f"tskit version: {tskit.__version__}")
print(f"msprime version: {msprime.__version__}")

# simulate two different tree sequences and dump to different text files
ts1 = msprime.simulate(10, mutation_rate=10, random_seed=1)
ts2 = msprime.simulate(20, mutation_rate=10, random_seed=2)
ts1.dump('t1.trees')
ts2.dump('t2.trees')

os.system('cat t1.trees t2.trees > combined.trees')
ts_joint = tskit.load('combined.trees')
# Error: it only matches the first entry ... not both
assert ts_joint.tables == ts1.tables

Based on the C API it seems like it is there but perhaps this feature has not been propagated to the Python API yet by wrapping the loadf function (at least as of tskit v0.3.7)? Would appreciate any thoughts on this as it seems like a relatively important design choice.

jeromekelleher commented 3 years ago

It works all right @aabiddanda, there's just a slight difference in how you get the tree sequences files from the combined one:

os.system('cat t1.trees t2.trees > combined.trees')
with open("combined.trees", "r") as f:
    ts1a = tskit.load(f)
    ts2a = tskit.load(f)

assert ts1a == ts1
assert ts2a == ts2

So, tskit will read complete tree sequences sequentially from a stream, but if you point it to a file path it'll just read the first.

aabiddanda commented 3 years ago

Ok should be all set to go on this one and its a nice clean additional feature (the tutorial on squashing is very helpful for someone like me that tends to over-commit!). Thanks for the tips & tricks @jeromekelleher!

jeromekelleher commented 3 years ago

Excellent, thanks @aabiddanda! Squashing is a real git super-power I think - makes it look like you get everything exactly right first time :wink: