Open benjeffery opened 2 years ago
If this is implemented, perhaps we should put a note in the tskit
docs for tskit.load
and ts.dump
to say that these methods are primarily intended for use on local files, and if your intention is to make a tree sequence file available for download on the internet, or to download one remotely, you are recommended to use tszip
to (de)compress and load from URLs?
To implement this we'd need to
It'll be fiddly, unfortunately.
To increase the fiddlyness, it would be really helpful, I think, to be able to show progress when downloading, if at all possible. Even if we don't know the file size beforehand, something that tells the user that the session hasn't just stalled is pretty useful for teaching purposes.
That's surely feature creep - why not put in a bash cell that does the download to a local file using curl?
ISWYM about feature creep. But how many tskit users (not devs) know about curl and bash? And do we even want them to know about that before they get started? We provide progress bars for tsinfer to give feedback too.
I guess this could be a Zarr thing anyway. Presumably remote access to data, and feedback about time to complete is on their agenda?
ISWYM about feature creep. But how many tskit users (not devs) know about curl and bash?
They don't need to for your use case though right, either way it's just a cell in the notebook that they execute which leads to you having a TreeSequence object loaded.
Yep, at the moment I'm just doing this in a cell:
import urllib.request
from tqdm import tqdm
import tszip
class DownloadProgressBar(tqdm):
def update_to(self, b=1, bsize=1, tsize=None):
if tsize is not None:
self.total = tsize
self.update(b * bsize - self.n)
url = "https://zenodo.org/record/5512994/files/hgdp_tgp_sgdp_high_cov_ancients_chr2_q.dated.trees.tsz"
with DownloadProgressBar(unit='B', unit_scale=True,
miniters=1, desc=url.split('/')[-1]) as t:
temporary_filename, _ = urllib.request.urlretrieve(url, reporthook=t.update_to)
ts_2q = tszip.decompress(temporary_filename)
urllib.request.urlcleanup() # remove temporary_filename
But it would be much cleaner to wrap that somehow
ts_2q = tszip.decompress(url=url)
I just tried out by bash magic idea and it'd didn't work because there's no "live" update from the cell, and so you only get the download progress at the very end. So you would have to do this via a python package of some sort.
The tqdm code above works a treat. But it's still a bit verbose, and users might baulk at having to understand it. It's not that satisfying to say "just paste this code and ignore how it works". So anything that would help wrap this into a more terse and comprehensible syntax would be good, I think. Perhaps @benjeffery has a good suggestion (he usually does!). Personally I don't think it's too bad to have tqdm as a tszip dependency. You could imagine, for instance, defining something like the DownloadProgressBar class as a tszip helper:
url = "https://zenodo.org/record/5512994/files/hgdp_tgp_sgdp_high_cov_ancients_chr2_q.dated.trees.tsz"
with tszip.progressbar() as pbar:
tmpname, _ = urllib.request.urlretrieve(url, reporthook=pbar.update)
ts = tszip.decompress(tmpname)
urllib.request.urlcleanup()
is already a lot cleaner IMO. But maybe there is an even terser way to do it?
It makes no sense to add a general progress bar UI to a package that's for compressing tskit tree sequences. What you're looking for is a python package that does a download with an integrated progress bar (which I agree would be very useful):
import yanspackage
url = "https://zenodo.org/record/5512994/files/hgdp_tgp_sgdp_high_cov_ancients_chr2_q.dated.trees.tsz"
filename = yanspackage.download(url, progress="notebook")
OK, but I'm thinking of tszip
as not just a compression package, but "a package for compressing and decompressing tree sequences, including from remote sites". Maybe that's feature creep, but as I say, it would be useful for teaching (and probably research too).
(FWIW for learning stuff, I rather dislike having to download files separately, before doing analysis, then fiddling around with coding where the files are stored, etc. I would much prefer it to appear as if I have streamed the download directly into the variables in my python session, and not have to think about clearing up disk space afterwards, or dealing with tmp directories. Perhaps I'm unusual like that, though?)
Discussion at https://github.com/tskit-dev/tskit/issues/1566#issuecomment-1148407759