Closed aabiddanda closed 3 years ago
Don't worry about the CI tests failing @aabiddanda, there's a bit of breakage in our CI which I'm sorting out in #45. Just get the tests passing locally and it'll be fine.
Merging #44 (db80be0) into main (55bc8c4) will decrease coverage by
0.59%
. The diff coverage is75.00%
.
@@ Coverage Diff @@
## main #44 +/- ##
==========================================
- Coverage 97.70% 97.10% -0.60%
==========================================
Files 6 6
Lines 305 311 +6
Branches 55 57 +2
==========================================
+ Hits 298 302 +4
- Misses 5 6 +1
- Partials 2 3 +1
Impacted Files | Coverage Δ | |
---|---|---|
tszip/cli.py | 98.01% <75.00%> (-1.99%) |
:arrow_down: |
Continue to review full report at Codecov.
Legend - Click here to learn more
Δ = absolute <relative> (impact)
,ø = not affected
,? = missing data
Powered by Codecov. Last update 55bc8c4...db80be0. Read the comment docs.
Is there a clear way in python
to obtain the path to file-like objects for stdout
? Building the tests using outfile = pathlib.Path('/dev/stdout')
was substantially more efficient to avoid fileno
errors.
I had a play with this @aabiddanda and it really is quite fiddly to get stdout diverted correctly here. This is what I came up with:
def capture_output(func, *args, binary=False, **kwargs):
"""
Runs the specified function and arguments, and returns the
tuple (stdout, stderr) as strings.
"""
with tempfile.TemporaryDirectory() as tmpdir:
stdout_path = pathlib.Path(tmpdir) / "stdout"
stderr_path = pathlib.Path(tmpdir) / "stderr"
mode = "wb+" if binary else "w+"
saved_stdout = sys.stdout
saved_stderr = sys.stderr
with open(stdout_path, mode) as stdout, open(stderr_path, mode) as stderr:
try:
sys.stdout = stdout
sys.stderr = stderr
with mock.patch("signal.signal"):
func(*args, **kwargs)
stdout.seek(0)
stderr.seek(0)
stdout_output = stdout.read()
stderr_output = stderr.read()
finally:
sys.stdout = saved_stdout
sys.stderr = saved_stderr
return stdout_output, stderr_output
by using a real file, we avoid the fileno problem and so we can work directly with sys.stdout. We need the binary
argument as otherwise we try to interpret the tskit output as unicode and it borks. There's probably a more elegant way to do it, but this works :smile:
Main TODO left on the testing front is to address the following: """ Add a test to make sure that we're correctly outputting a stream of multiple tree sequences to stdout when we have multiple input files, which can be read by tskit. (tskit.load() should consume them one-by-one from a file). """
I'll be addressing this shortly
Upon some reflection/checking it seems like this type of reading multiple tree-sequences from a single file (compressed or otherwise) is a no-go and this is not a feature in tskit.load. Unless I am missing something very obvious here. I could see two ways to go:
The first seems more achievable quickly ... Example code illustrating this issue:
import msprime
import tskit
import numpy as np
import os
print(f"tskit version: {tskit.__version__}")
print(f"msprime version: {msprime.__version__}")
# simulate two different tree sequences and dump to different text files
ts1 = msprime.simulate(10, mutation_rate=10, random_seed=1)
ts2 = msprime.simulate(20, mutation_rate=10, random_seed=2)
ts1.dump('t1.trees')
ts2.dump('t2.trees')
os.system('cat t1.trees t2.trees > combined.trees')
ts_joint = tskit.load('combined.trees')
# Error: it only matches the first entry ... not both
assert ts_joint.tables == ts1.tables
Based on the C API it seems like it is there but perhaps this feature has not been propagated to the Python API yet by wrapping the loadf
function (at least as of tskit v0.3.7)? Would appreciate any thoughts on this as it seems like a relatively important design choice.
It works all right @aabiddanda, there's just a slight difference in how you get the tree sequences files from the combined one:
os.system('cat t1.trees t2.trees > combined.trees')
with open("combined.trees", "r") as f:
ts1a = tskit.load(f)
ts2a = tskit.load(f)
assert ts1a == ts1
assert ts2a == ts2
So, tskit will read complete tree sequences sequentially from a stream, but if you point it to a file path it'll just read the first.
Ok should be all set to go on this one and its a nice clean additional feature (the tutorial on squashing is very helpful for someone like me that tends to over-commit!). Thanks for the tips & tricks @jeromekelleher!
Excellent, thanks @aabiddanda! Squashing is a real git super-power I think - makes it look like you get everything exactly right first time :wink:
Inspired by
gzip
where one can decompress to stdout I've added a flag to the cli-c
that allows one to decompress the tree-sequence to stdout and redirect to a file of their choosing. This was inspired by a use-case where I wanted to decompress a single simulation to multiple test cases.I'm happy for this edit to be rewritten or a broader discussion of whether this belongs in the CLI.