tskit-dev / tszip

Gzip-like compression for tskit tree sequences
https://tszip.readthedocs.io/
MIT License

Support opening from URLs #66

Open benjeffery opened 2 years ago

benjeffery commented 2 years ago

Discussion at https://github.com/tskit-dev/tskit/issues/1566#issuecomment-1148407759

hyanwong commented 2 years ago

If this is implemented, perhaps we should put a note in the tskit docs for tskit.load and ts.dump saying that these methods are primarily intended for local files, and that if you want to make a tree sequence file available for download on the internet, or to download one remotely, you are recommended to use tszip to (de)compress and load from URLs?

jeromekelleher commented 2 years ago

To implement this we'd need to

  1. Update load_zarr to read from a path or a file. If it's a file, we'd have to copy it to a local file first and then feed the path to zarr.ZipStore (as a path is all it supports).
  2. Maybe use fsspec to do URL loading for us (as this is already a dependency of zarr)

It'll be fiddly, unfortunately.
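A rough sketch of step 1, using only the standard library. This is an illustration under assumptions, not tszip's code: `as_local_path` and `load_path_or_file` are hypothetical names, and `loader` stands in for whatever path-only consumer (such as the zip store) sits underneath. URL handling via fsspec is left out.

```python
import contextlib
import os
import shutil
import tempfile


@contextlib.contextmanager
def as_local_path(source, suffix=".tsz"):
    """Yield a filesystem path for `source` (hypothetical helper).

    `source` may be a path (str or os.PathLike) or a binary file-like
    object. File-like objects are copied to a temporary file first,
    because the consumer is assumed to accept only paths.
    """
    if isinstance(source, (str, os.PathLike)):
        # Already a path: pass it through untouched.
        yield os.fspath(source)
        return
    with tempfile.TemporaryDirectory() as tmpdir:
        path = os.path.join(tmpdir, "copy" + suffix)
        with open(path, "wb") as out:
            shutil.copyfileobj(source, out)
        # The temporary directory is removed once the caller is done.
        yield path


def load_path_or_file(source, loader):
    """Call `loader(path)` on a path, or on a local copy of a file object."""
    with as_local_path(source) as path:
        return loader(path)
```

The copy is deleted as soon as `loader` returns, so this only works if the loader reads everything it needs before returning.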

hyanwong commented 2 years ago

To increase the fiddliness, it would be really helpful, I think, to be able to show progress when downloading, if at all possible. Even if we don't know the file size beforehand, something that tells the user that the session hasn't just stalled is pretty useful for teaching purposes.
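For the unknown-size case, a minimal sketch of a chunked download that reports progress either way. The names `download_with_progress` and `print_progress` are made up for illustration; only the standard library is used, and the reporter receives `total=None` when the server sends no Content-Length header.

```python
import urllib.request


def download_with_progress(url, dest, report, chunk_size=1 << 16):
    """Stream `url` into the file `dest`, calling `report(done, total)`
    after every chunk. `total` is None if the size is not known.
    Returns the number of bytes written.
    """
    done = 0
    with urllib.request.urlopen(url) as response, open(dest, "wb") as out:
        length = response.headers.get("Content-Length")
        total = int(length) if length is not None else None
        while True:
            chunk = response.read(chunk_size)
            if not chunk:
                break
            out.write(chunk)
            done += len(chunk)
            report(done, total)
    return done


def print_progress(done, total):
    """Text reporter: a percentage if the size is known, else a byte count."""
    if total:
        print(f"\r{done / total:6.1%} of {total} bytes", end="", flush=True)
    else:
        print(f"\r{done} bytes downloaded", end="", flush=True)
```

Something like `download_with_progress(url, "local.tsz", print_progress)` would then show a running byte count even when no percentage is possible.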

jeromekelleher commented 2 years ago

That's surely feature creep - why not put in a bash cell that does the download to a local file using curl?

hyanwong commented 2 years ago

ISWYM about feature creep. But how many tskit users (not devs) know about curl and bash? And do we even want them to know about that before they get started? We provide progress bars for tsinfer to give feedback too.

I guess this could be a Zarr thing anyway. Presumably remote access to data, and feedback about time to complete is on their agenda?

jeromekelleher commented 2 years ago

> ISWYM about feature creep. But how many tskit users (not devs) know about curl and bash?

They don't need to for your use case though, right? Either way, it's just a cell in the notebook that they execute, which leads to you having a TreeSequence object loaded.

hyanwong commented 2 years ago

Yep, at the moment I'm just doing this in a cell:

import urllib.request
from tqdm import tqdm
import tszip

class DownloadProgressBar(tqdm):
    def update_to(self, b=1, bsize=1, tsize=None):
        if tsize is not None:
            self.total = tsize
        self.update(b * bsize - self.n)

url = "https://zenodo.org/record/5512994/files/hgdp_tgp_sgdp_high_cov_ancients_chr2_q.dated.trees.tsz"
with DownloadProgressBar(unit='B', unit_scale=True,
        miniters=1, desc=url.split('/')[-1]) as t:
    temporary_filename, _ = urllib.request.urlretrieve(url, reporthook=t.update_to)
ts_2q = tszip.decompress(temporary_filename)
urllib.request.urlcleanup() # remove temporary_filename

But it would be much cleaner to wrap that somehow:

ts_2q = tszip.decompress(url=url)

jeromekelleher commented 2 years ago

I just tried out my bash magic idea and it didn't work because there's no "live" update from the cell, and so you only get the download progress at the very end. So you would have to do this via a Python package of some sort.

hyanwong commented 2 years ago

The tqdm code above works a treat. But it's still a bit verbose, and users might baulk at having to understand it. It's not that satisfying to say "just paste this code and ignore how it works". So anything that would help wrap this into a more terse and comprehensible syntax would be good, I think. Perhaps @benjeffery has a good suggestion (he usually does!). Personally I don't think it's too bad to have tqdm as a tszip dependency. You could imagine, for instance, defining something like the DownloadProgressBar class as a tszip helper:

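import urllib.request
import tszip

url = "https://zenodo.org/record/5512994/files/hgdp_tgp_sgdp_high_cov_ancients_chr2_q.dated.trees.tsz"
# tszip.progressbar() is the proposed helper, not an existing tszip function
with tszip.progressbar() as pbar:
    tmpname, _ = urllib.request.urlretrieve(url, reporthook=pbar.update)
    ts = tszip.decompress(tmpname)
    urllib.request.urlcleanup()  # remove the temporary download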

is already a lot cleaner IMO. But maybe there is an even terser way to do it?

jeromekelleher commented 2 years ago

It makes no sense to add a general progress bar UI to a package that's for compressing tskit tree sequences. What you're looking for is a python package that does a download with an integrated progress bar (which I agree would be very useful):

import yanspackage

url = "https://zenodo.org/record/5512994/files/hgdp_tgp_sgdp_high_cov_ancients_chr2_q.dated.trees.tsz"
filename = yanspackage.download(url, progress="notebook")

hyanwong commented 2 years ago

OK, but I'm thinking of tszip as not just a compression package, but "a package for compressing and decompressing tree sequences, including from remote sites". Maybe that's feature creep, but as I say, it would be useful for teaching (and probably research too).

(FWIW for learning stuff, I rather dislike having to download files separately, before doing analysis, then fiddling around with coding where the files are stored, etc. I would much prefer it to appear as if I have streamed the download directly into the variables in my python session, and not have to think about clearing up disk space afterwards, or dealing with tmp directories. Perhaps I'm unusual like that, though?)
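To make that concrete, here is a small sketch of a helper that hides the temporary file entirely. `load_from_url` is a hypothetical name and not part of tszip. It assumes the loader reads the whole file into memory before returning, since the download is deleted straight afterwards.

```python
import os
import tempfile
import urllib.parse
import urllib.request


def load_from_url(url, loader, reporthook=None):
    """Download `url` to a temporary file, return `loader(path)`, and
    remove the temporary file before returning (hypothetical helper).

    `reporthook` is passed straight to urllib.request.urlretrieve, so a
    tqdm-style callback taking (blocks, block_size, total_size) works.
    """
    with tempfile.TemporaryDirectory() as tmpdir:
        # Keep the original file name in case the loader looks at it.
        name = os.path.basename(urllib.parse.urlsplit(url).path) or "download"
        path = os.path.join(tmpdir, name)
        urllib.request.urlretrieve(url, filename=path, reporthook=reporthook)
        return loader(path)
```

With tszip installed, `ts = load_from_url(url, tszip.decompress)` would be the one-liner, and the `update_to` method of the DownloadProgressBar class earlier in the thread could be passed as `reporthook`.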