Open hyanwong opened 3 years ago
NB: this would mean that content in e.g. jupyterbooks could simply be pasted into the user's jupyter session and (as long as the right libraries were installed), the examples should all run. It's less error-prone and fiddly than trying to run the jupyterbook e.g. in binder (see https://github.com/tskit-dev/tutorials/issues/114)
The right way to do this I think would be to make the string form of the file input support URLs, which we intercept in Python. BUT this would be quite a lot of work to do well (it's simple to do badly) so I'm not super keen on it right now.
Just to say that I have wanted to do this for a while, for other purposes too (not just the tutorials), so I am keen to get something like this done. But maybe others could weigh in if they have needed to do this in the past (as it could just be me).
Out of interest, why would it be a lot of work to do well? I would have thought the standard URL libraries would do most of the hard lifting?
Presumably the workaround for the moment is simply e.g.
import urllib.request
file = urllib.request.urlopen('file:///Users/yan/Documents/GitHub/tutorials/data/basics.trees')
ts = tskit.load(file)
Is there any reason that we shouldn't be doing this as a workaround?
I've been thinking about this, and actually I do think that the form load(url="http://...")
is worth considering, for two reasons. Firstly the silly one, that you can still load a file starting with "http" in this case. Secondly, perhaps a more sensible one, that it makes it much easier to document and code without complications, as you can simply pass the url
parameter to urllib.request.urlopen
. In fact, I think that makes it about a 4 line addition to tskit, and to unit test it you can simply try it with a file:///
url, which should be pretty simple. Then we simply say "anything passed as a url parameter is passed on to urllib.request to open" (or something like that. I think this would tick the box of "doing it well" without having to check on the format of the string, which, as @jeromekelleher points out, is a bit icky.
It's about much more than just the parameters @hyanwong - once you start interacting with the network all sorts of extra complications come in, like http proxies, error handling, retries, etc etc etc. The nominal case is trivial - it's the error and edge cases that are really tricky. These are hard to implement and especially hard to test. I've done it several times and don't particularly want to spend the time doing it now.
import urllib.request
file = urllib.request.urlopen('file:///Users/yan/Documents/GitHub/tutorials/data/basics.trees')
ts = tskit.load(file)
This seems like a perfectly reasonable thing to do to me (assuming it works).
It's about much more than just the parameters @hyanwong - once you start interacting with the network all sorts of extra complications come in, like http proxies, error handling, retries, etc etc etc. The nominal case is trivial - it's the error and edge cases that are really tricky. These are hard to implement and especially hard to test. I've done it several times and don't particularly want to spend the time doing it now.
Right, that sounds reasonable. I thought that we could simply say that we are handing all this stuff over to urllib.request.urlopen
and therefore if it doesn't work (e.g. problems with proxies etc) it's not something we are responsible for. But maybe that's a cop out.
I'd be much more motivated to do this if tskit had any sense of being able to use part of a file, as zarr
for example does. tskit
needs the whole file to do anything, so you have to download it all anyway. The tutorials should also be using the most common way of loading a .trees
- from disk. Adding url
arguments at that point adds confusion.
Pointing out that urlopen
can be used in the load
docs seems a good idea though.
Annoyingly I find that urlopen
doesn't actually work for http(s) urls:
import urllib.request
file = urllib.request.urlopen("https://raw.github.com/tskit-dev/tutorials/main/data/basics.trees")
tskit.load(file)
Gives FileFormatError: File not in KAS format
, presumably because, although file.fileno()
is valid here, which I think is the requirement tested in KAS, it's not actually stored on disk quite like a normal file, and the part of the tskit C library that loads from a file doesn't know what to make of it. So yes, as Jerome says, this is more complex than I had hoped, and can be kicked down the line, I guess. Shame.
Note that I presume this means that there's no way to load in tree sequence from a stream of data: we require a file on disk, not e.g. some data spooled to us by a DB or equivalent.
We have pre-exisiting tests for sockets: https://github.com/tskit-dev/tskit/blob/main/python/tests/test_fileobj.py#L289 and streams: https://github.com/tskit-dev/tskit/blob/main/python/tests/test_fileobj.py#L246
Oh, useful, thanks. Maybe it's because I'm on OS X. Weird that OS X doesn't do file descriptors properly though.
Although actually I still get the _tskit.FileFormatError: File not in KAS format
on holly, so I guess I'm doing something wrong here.
Digging a little more. I would have thought that if I pass a fileno to tskit.load
it should "just work", regardless of where the fileno comes from, but tskit seems to distinguish a fileno from a local file with one from urllib:
f = open("tmp.trees", "rb")
ts = tskit.load(f.fileno()) # works
f = urllib.request.urlopen("https://raw.github.com/tskit-dev/tutorials/main/data/basics.trees")
ts = tskit.load(f.fileno()) # fails
The failure is:
---------------------------------------------------------------------------
FileFormatError Traceback (most recent call last)
/usr/local/lib/python3.9/site-packages/tskit/trees.py in load(cls, file_or_path)
3480 ts = _tskit.TreeSequence()
-> 3481 ts.load(file)
3482 return TreeSequence(ts)
FileFormatError: File not in KAS format
During handling of the above exception, another exception occurred:
TypeError Traceback (most recent call last)
<ipython-input-131-02f25af848d6> in <module>
3
4 f = urllib.request.urlopen("https://raw.github.com/tskit-dev/tutorials/main/data/basics.trees")
----> 5 ts = tskit.load(f.fileno()) # fails
/usr/local/lib/python3.9/site-packages/tskit/trees.py in load(file)
2829 :rtype: :class:`tskit.TreeSequence`
2830 """
-> 2831 return TreeSequence.load(file)
2832
2833
/usr/local/lib/python3.9/site-packages/tskit/trees.py in load(cls, file_or_path)
3483 except exceptions.FileFormatError as e:
3484 # TODO Fix this for new file semantics
-> 3485 formats.raise_hdf5_format_error(file_or_path, e)
3486 finally:
3487 if local_file:
/usr/local/lib/python3.9/site-packages/tskit/formats.py in raise_hdf5_format_error(filename, original_exception)
271 h5py = get_h5py()
272 try:
--> 273 with h5py.File(filename, "r") as root:
274 version = tuple(root.attrs["format_version"])
275 raise exceptions.VersionTooOldError(
/usr/local/lib/python3.9/site-packages/h5py/_hl/files.py in __init__(self, name, mode, driver, libver, userblock_size, swmr, rdcc_nslots, rdcc_nbytes, rdcc_w0, track_order, fs_strategy, fs_persist, fs_threshold, **kwds)
410 name = repr(name).encode('ASCII', 'replace')
411 else:
--> 412 name = filename_encode(name)
413
414 if track_order is None:
/usr/local/lib/python3.9/site-packages/h5py/_hl/compat.py in filename_encode(filename)
17 filenames in h5py for more information.
18 """
---> 19 filename = fspath(filename)
20 if sys.platform == "win32":
21 if isinstance(filename, str):
TypeError: expected str, bytes or os.PathLike object, not int
~The HTTPResponse
doesn't support seek
, so won't work. Not all fileno's are equal.~
EDIT: This was very wrong.
Ah, that explains it. Didn't know we used seek
when reading in, thanks.
It looks like that file is a hdf5 file, not a kastore file. Non-seek file objects should work just fine through kastore.
It looks like that file is a hdf5 file, not a kastore file. Non-seek file objects should work just fine through kastore.
I'm pretty sure it's a kastore file:
>>> f = urllib.request.urlopen("https://raw.github.com/tskit-dev/tutorials/main/data/basics.trees")
>>> f.read()[0:10]
b'\x89KAS\r\n\x1a\n\x01\x00'
(same as you get when read()
ing from an msprime-generated tree sequence.
Well something funny is going on, because seek()
shouldn't be needed. File streaming support depends on avoiding seek calls.
stdin doesn't support seek()
for example, and this works just fine
$ python -c "import tskit, sys; tskit.load(sys.stdin)" < basics.trees
Ah, I understand now. After failing to read the kastore, tskit tries to load as h5py, and that's when it tries to seek. So yes, something weird here.
I've tracked this down a bit - as tskit's python interface is taking the file descriptor of the HTTPResponse
and then reopening it in the C API, it is getting the encrypted data! Fixing this in the general case is interesting - maybe creating a proxy file descriptor that is passed to the C API?
Not a high priority I think, but nicely done tracking it down. This stuff is always more complicated than it seems...
No, not planning to fix soon. Tracked it down as unexplained behaviour is always concerning!
Nicely triaged, @benjeffery, thanks. Agree about low priority.
Just coming back to this as a result of workshops and pyodide. It's more than likely that online tree sequences will be posted in .tsz
format (e.g. the unified genealogies on zenodo), so perhaps it's more of a priority to allow tszip.decompress(url="https://...")
I don't know if that's any easier to sort than tskit.load(url=...)
: I assume not.
For reference, here's what I'm doing as a workaround:
import tszip
import urllib.request
temporary_filename, _ = urllib.request.urlretrieve(
"https://zenodo.org/record/5495535/files/hgdp_tgp_sgdp_chr22_q.dated.trees.tsz"
)
ts = tszip.decompress(temporary_filename)
urllib.request.urlcleanup() # should remove temporary_filename
Adding url support to tszip sounds like a great idea - zarr already supports some forms of url access so we should hopefully be able to tap into that.
For teaching purposes, and possibly for other reasons too, I think it would be really useful to be able to load a tree sequence from a URL. IMO the nicest way to make this easy for a user would be to have a
url
argument totskit.load
:I would hope that this would mainly be used for small tree sequences, although I suppose it could be an easy way for someone to load up larger ones e.g. from zenodo - depends how long that would take I guess.
I suspect implementing this using the
urllib.request
library would be quite easy, although I don't know how we would unit test it - probably mock theurllib.request.urlopen
function when testing, or something.