tskit-dev / tskit

Population-scale genomics
MIT License
155 stars 73 forks source link

Support loading from URLs in Python #1566

Open hyanwong opened 3 years ago

hyanwong commented 3 years ago

For teaching purposes, and possibly for other reasons too, I think it would be really useful to be able to load a tree sequence from a URL. IMO the nicest way to make this easy for a user would be to have a url argument to tskit.load:

tskit.load(url="https://tskit.dev/tutorials/data/basics.trees")

I would hope that this would mainly be used for small tree sequences, although I suppose it could be an easy way for someone to load up larger ones e.g. from zenodo - depends how long that would take I guess.

I suspect implementing this using the urllib.request library would be quite easy, although I don't know how we would unit test it - probably mock the urllib.request.urlopen function when testing, or something.

hyanwong commented 3 years ago

NB: this would mean that content in e.g. jupyterbooks could simply be pasted into the user's jupyter session and (as long as the right libraries were installed), the examples should all run. It's less error-prone and fiddly than trying to run the jupyterbook e.g. in binder (see https://github.com/tskit-dev/tutorials/issues/114)

jeromekelleher commented 3 years ago

The right way to do this I think would be to make the string form of the file input support URLs, which we intercept in Python. BUT this would be quite a lot of work to do well (it's simple to do badly) so I'm not super keen on it right now.

hyanwong commented 3 years ago

Just to say that I have wanted to do this for a while, for other purposes too (not just the tutorials), so I am keen to get something like this done. But maybe others could weigh in if they have needed to do this in the past (as it could just be me).

Out of interest, why would it be a lot of work to do well? I would have thought the standard URL libraries would do most of the hard lifting?

hyanwong commented 3 years ago

Presumably the workaround for the moment is simply e.g.

import urllib.request
file = urllib.request.urlopen('file:///Users/yan/Documents/GitHub/tutorials/data/basics.trees')
ts = tskit.load(file)

Is there any reason that we shouldn't be doing this as a workaround?

hyanwong commented 3 years ago

I've been thinking about this, and actually I do think that the form load(url="http://...") is worth considering, for two reasons. Firstly the silly one, that you can still load a file starting with "http" in this case. Secondly, perhaps a more sensible one, that it makes it much easier to document and code without complications, as you can simply pass the url parameter to urllib.request.urlopen. In fact, I think that makes it about a 4 line addition to tskit, and to unit test it you can simply try it with a file:/// url, which should be pretty simple. Then we simply say "anything passed as a url parameter is passed on to urllib.request to open" (or something like that. I think this would tick the box of "doing it well" without having to check on the format of the string, which, as @jeromekelleher points out, is a bit icky.

jeromekelleher commented 3 years ago

It's about much more than just the parameters @hyanwong - once you start interacting with the network all sorts of extra complications come in, like http proxies, error handling, retries, etc etc etc. The nominal case is trivial - it's the error and edge cases that are really tricky. These are hard to implement and especially hard to test. I've done it several times and don't particularly want to spend the time doing it now.

jeromekelleher commented 3 years ago
import urllib.request
file = urllib.request.urlopen('file:///Users/yan/Documents/GitHub/tutorials/data/basics.trees')
ts = tskit.load(file)

This seems like a perfectly reasonable thing to do to me (assuming it works).

hyanwong commented 3 years ago

It's about much more than just the parameters @hyanwong - once you start interacting with the network all sorts of extra complications come in, like http proxies, error handling, retries, etc etc etc. The nominal case is trivial - it's the error and edge cases that are really tricky. These are hard to implement and especially hard to test. I've done it several times and don't particularly want to spend the time doing it now.

Right, that sounds reasonable. I thought that we could simply say that we are handing all this stuff over to urllib.request.urlopen and therefore if it doesn't work (e.g. problems with proxies etc) it's not something we are responsible for. But maybe that's a cop out.

benjeffery commented 3 years ago

I'd be much more motivated to do this if tskit had any sense of being able to use part of a file, as zarr for example does. tskit needs the whole file to do anything, so you have to download it all anyway. The tutorials should also be using the most common way of loading a .trees - from disk. Adding url arguments at that point adds confusion.

Pointing out that urlopen can be used in the load docs seems a good idea though.

hyanwong commented 3 years ago

Annoyingly I find that urlopen doesn't actually work for http(s) urls:

import urllib.request
file = urllib.request.urlopen("https://raw.github.com/tskit-dev/tutorials/main/data/basics.trees")
tskit.load(file)

Gives FileFormatError: File not in KAS format, presumably because, although file.fileno() is valid here, which I think is the requirement tested in KAS, it's not actually stored on disk quite like a normal file, and the part of the tskit C library that loads from a file doesn't know what to make of it. So yes, as Jerome says, this is more complex than I had hoped, and can be kicked down the line, I guess. Shame.

Note that I presume this means that there's no way to load in tree sequence from a stream of data: we require a file on disk, not e.g. some data spooled to us by a DB or equivalent.

benjeffery commented 3 years ago

We have pre-exisiting tests for sockets: https://github.com/tskit-dev/tskit/blob/main/python/tests/test_fileobj.py#L289 and streams: https://github.com/tskit-dev/tskit/blob/main/python/tests/test_fileobj.py#L246

hyanwong commented 3 years ago

Oh, useful, thanks. Maybe it's because I'm on OS X. Weird that OS X doesn't do file descriptors properly though.

hyanwong commented 3 years ago

Although actually I still get the _tskit.FileFormatError: File not in KAS format on holly, so I guess I'm doing something wrong here.

hyanwong commented 3 years ago

Digging a little more. I would have thought that if I pass a fileno to tskit.load it should "just work", regardless of where the fileno comes from, but tskit seems to distinguish a fileno from a local file with one from urllib:

f = open("tmp.trees", "rb")
ts = tskit.load(f.fileno())  # works

f = urllib.request.urlopen("https://raw.github.com/tskit-dev/tutorials/main/data/basics.trees")
ts = tskit.load(f.fileno())  # fails

The failure is:

---------------------------------------------------------------------------
FileFormatError                           Traceback (most recent call last)
/usr/local/lib/python3.9/site-packages/tskit/trees.py in load(cls, file_or_path)
   3480             ts = _tskit.TreeSequence()
-> 3481             ts.load(file)
   3482             return TreeSequence(ts)

FileFormatError: File not in KAS format

During handling of the above exception, another exception occurred:

TypeError                                 Traceback (most recent call last)
<ipython-input-131-02f25af848d6> in <module>
      3 
      4 f = urllib.request.urlopen("https://raw.github.com/tskit-dev/tutorials/main/data/basics.trees")
----> 5 ts = tskit.load(f.fileno())  # fails

/usr/local/lib/python3.9/site-packages/tskit/trees.py in load(file)
   2829     :rtype: :class:`tskit.TreeSequence`
   2830     """
-> 2831     return TreeSequence.load(file)
   2832 
   2833 

/usr/local/lib/python3.9/site-packages/tskit/trees.py in load(cls, file_or_path)
   3483         except exceptions.FileFormatError as e:
   3484             # TODO Fix this for new file semantics
-> 3485             formats.raise_hdf5_format_error(file_or_path, e)
   3486         finally:
   3487             if local_file:

/usr/local/lib/python3.9/site-packages/tskit/formats.py in raise_hdf5_format_error(filename, original_exception)
    271     h5py = get_h5py()
    272     try:
--> 273         with h5py.File(filename, "r") as root:
    274             version = tuple(root.attrs["format_version"])
    275             raise exceptions.VersionTooOldError(

/usr/local/lib/python3.9/site-packages/h5py/_hl/files.py in __init__(self, name, mode, driver, libver, userblock_size, swmr, rdcc_nslots, rdcc_nbytes, rdcc_w0, track_order, fs_strategy, fs_persist, fs_threshold, **kwds)
    410                 name = repr(name).encode('ASCII', 'replace')
    411             else:
--> 412                 name = filename_encode(name)
    413 
    414             if track_order is None:

/usr/local/lib/python3.9/site-packages/h5py/_hl/compat.py in filename_encode(filename)
     17     filenames in h5py for more information.
     18     """
---> 19     filename = fspath(filename)
     20     if sys.platform == "win32":
     21         if isinstance(filename, str):

TypeError: expected str, bytes or os.PathLike object, not int
benjeffery commented 3 years ago

~The HTTPResponse doesn't support seek, so won't work. Not all fileno's are equal.~

EDIT: This was very wrong.

hyanwong commented 3 years ago

Ah, that explains it. Didn't know we used seek when reading in, thanks.

grahamgower commented 3 years ago

It looks like that file is a hdf5 file, not a kastore file. Non-seek file objects should work just fine through kastore.

hyanwong commented 3 years ago

It looks like that file is a hdf5 file, not a kastore file. Non-seek file objects should work just fine through kastore.

I'm pretty sure it's a kastore file:

>>> f = urllib.request.urlopen("https://raw.github.com/tskit-dev/tutorials/main/data/basics.trees")
>>> f.read()[0:10]
b'\x89KAS\r\n\x1a\n\x01\x00'

(same as you get when read()ing from an msprime-generated tree sequence.

grahamgower commented 3 years ago

Well something funny is going on, because seek() shouldn't be needed. File streaming support depends on avoiding seek calls.

grahamgower commented 3 years ago

stdin doesn't support seek() for example, and this works just fine

$ python -c "import tskit, sys; tskit.load(sys.stdin)" < basics.trees
benjeffery commented 3 years ago

Ah, I understand now. After failing to read the kastore, tskit tries to load as h5py, and that's when it tries to seek. So yes, something weird here.

benjeffery commented 3 years ago

I've tracked this down a bit - as tskit's python interface is taking the file descriptor of the HTTPResponse and then reopening it in the C API, it is getting the encrypted data! Fixing this in the general case is interesting - maybe creating a proxy file descriptor that is passed to the C API?

jeromekelleher commented 3 years ago

Not a high priority I think, but nicely done tracking it down. This stuff is always more complicated than it seems...

benjeffery commented 3 years ago

No, not planning to fix soon. Tracked it down as unexplained behaviour is always concerning!

hyanwong commented 3 years ago

Nicely triaged, @benjeffery, thanks. Agree about low priority.

hyanwong commented 2 years ago

Just coming back to this as a result of workshops and pyodide. It's more than likely that online tree sequences will be posted in .tsz format (e.g. the unified genealogies on zenodo), so perhaps it's more of a priority to allow tszip.decompress(url="https://...") I don't know if that's any easier to sort than tskit.load(url=...): I assume not.

For reference, here's what I'm doing as a workaround:

import tszip
import urllib.request
temporary_filename, _ = urllib.request.urlretrieve(
    "https://zenodo.org/record/5495535/files/hgdp_tgp_sgdp_chr22_q.dated.trees.tsz"
)
ts = tszip.decompress(temporary_filename)
urllib.request.urlcleanup() # should remove temporary_filename
jeromekelleher commented 2 years ago

Adding url support to tszip sounds like a great idea - zarr already supports some forms of url access so we should hopefully be able to tap into that.

benjeffery commented 2 years ago

https://github.com/tskit-dev/tszip/issues/66