bigBed tileset service methods and tests

higlass / clodius

Clodius is a tool for breaking up large data sets into smaller tiles that can subsequently be displayed using an appropriate viewer.

MIT License

39 stars 21 forks source link

bigBed tileset service methods and tests #113

Closed alexpreynolds closed 4 years ago

alexpreynolds commented 5 years ago

Description

What was changed in this pull request?

Added routines to generate tiles from pybbi-mediated interval queries to (particularly BED12-formatted) bigBed files. Added unit tests.

Tile queries are of the forms:

x.${z}.${y}
x.${z}.${y}.significant
x.${z}.${y}|max:A
x.${z}.${y}|min:A|max:B
x.${z}.${y}.significant|max:A
x.${z}.${y}.significant|min:A|max:B

The definitions of z and y represent tile zoom level and tile position.

The significant range key is the default, i.e. leaving it out does queries based on score significance stored in the bigBed score column (if present).

What is meant by "significant" is that, for the specified zoom level and tile position, a quantile is generated which returns the score that would threshold a maximum number of elements, either specified by max in the tile query, or by the constant MAX_ELEMENTS defined in clodius/tiles/bigbed.py.

If a tile query returns more than the selected maximum number of elements, then the set of elements is thresholded by the quantile-derived score, and a uniformly random sample is drawn from the set of elements, after the score threshold is applied, to ensure that the set is no larger than the requested maximum number of elements.

Why is it necessary?

To begin to add support for bigBed queries by higlass-server, and ultimately gene annotation track visualizations in the higlass plugin.

Checklist

[x] Unit tests added or updated
[x] Updated CHANGELOG.md

alexpreynolds commented 5 years ago

I ended up reverting most of these commits as they broke serving elements through higlass-server. Still investigating.

alexpreynolds commented 5 years ago

Possible code style issues?

flake8
./clodius/tiles/bigbed.py:6:1: F401 're' imported but unused
./clodius/tiles/bigbed.py:7:1: F401 'sys' imported but unused
./clodius/tiles/bigbed.py:41:5: F841 local variable 'n_bins' is assigned to but never used
./clodius/tiles/bigbed.py:45:9: F841 local variable 'clen' is assigned to but never used
./clodius/tiles/bigbed.py:46:1: W293 blank line contains whitespace
./clodius/tiles/bigbed.py:60:1: W293 blank line contains whitespace
./clodius/tiles/bigbed.py:64:1: W293 blank line contains whitespace
./clodius/tiles/bigbed.py:72:1: W293 blank line contains whitespace
./clodius/tiles/bigbed.py:76:1: W293 blank line contains whitespace
./clodius/tiles/bigbed.py:79:1: W293 blank line contains whitespace
./clodius/tiles/bigbed.py:83:1: W293 blank line contains whitespace
./clodius/tiles/bigbed.py:88:1: W293 blank line contains whitespace
./clodius/tiles/bigbed.py:92:54: E226 missing whitespace around arithmetic operator
./clodius/tiles/bigbed.py:98:1: W293 blank line contains whitespace
./clodius/tiles/bigbed.py:114:1: W293 blank line contains whitespace
./clodius/tiles/bigbed.py:124:1: W293 blank line contains whitespace
./clodius/tiles/bigbed.py:140:1: W293 blank line contains whitespace
./clodius/tiles/bigbed.py:168:1: W293 blank line contains whitespace
./clodius/tiles/bigbed.py:171:1: W293 blank line contains whitespace
./clodius/tiles/bigbed.py:188:1: W293 blank line contains whitespace
./clodius/tiles/bigbed.py:192:45: W291 trailing whitespace
./clodius/tiles/bigbed.py:196:1: W293 blank line contains whitespace
./clodius/tiles/bigbed.py:226:1: W293 blank line contains whitespace
./clodius/tiles/bigbed.py:239:1: W293 blank line contains whitespace
./test/tiles/bigbed_test.py:3:1: F401 'numpy as np' imported but unused
./test/tiles/bigbed_test.py:4:1: F401 'base64' imported but unused
./test/tiles/bigbed_test.py:5:1: F401 'sys' imported but unused
./test/tiles/bigbed_test.py:7:1: E302 expected 2 blank lines, found 1
./test/tiles/bigbed_test.py:12:1: W293 blank line contains whitespace
./test/tiles/bigbed_test.py:19:1: W293 blank line contains whitespace
./test/tiles/bigbed_test.py:31:33: E231 missing whitespace after ','
./test/tiles/bigbed_test.py:31:40: E231 missing whitespace after ','
The command "./travis_test.sh" exited with 1.

pkerpedjiev commented 5 years ago

Looks much better 👍 Thanks for making all those changes.

I just added one more comment about documenting the return type of the tile function. It looks like you're returning an array for each chromosome which differs from what we do in the traditional gene annotations track where we just return one array containing all the entries (i.e. not broken down by chromosome).

alexpreynolds commented 5 years ago

I added an explanatory docstring and added exception handling for scoreless or non-integer score data in the source bigBed.

I do appear to be returning an array for each chromosome. I am not intentionally doing this, but this appears to be a consequence of how querying is done here on queries spanning chromosomes, and that starts and ends are provided for each chromosome:

https://github.com/alexpreynolds/clodius/blob/a87fb82077a11c19cb557d7ed3a62a9d6c9ebbf8/clodius/tiles/bigbed.py#L177-L191

I had tried collapsing this list-of-lists to one list before passing it to higlass-server, but there are problems on the higlass-server side, where I have difficulties getting Redis to correctly deserialize what it caches.

A collapsed list which is get from Redis looks like it is split into separate characters:

https://github.com/alexpreynolds/higlass-server/blob/bigbed-support/tilesets/views.py#L437-L438

If I set a list-of-lists as they are, then that can be correctly get via Redis later on in the block of code linked above.

Here is where the data are set if not previously cached:

https://github.com/alexpreynolds/higlass-server/blob/bigbed-support/tilesets/views.py#L479-L487

When later retrieved, either through a direct query from the bigBed file, or deserialized from Redis, I do indeed collapse after the fact:

https://github.com/alexpreynolds/higlass-server/blob/bigbed-support/tilesets/views.py#L489-L491

This collapsed list does look like what the current beddb gene annotation track parses, which is why I collapse it here.

I'm happy to do this a different and less convoluted way, but I would ask for advice in how to collapse the list-of-lists on the clodius side, in such a way that Redis can serialize and deserialize it properly?

pkerpedjiev commented 4 years ago

Would you be able to post the error that redis throws? It's strange that it can deserialize a list of lists but not a list.

Have you had a chance to look at how clodius.tiles.beddb formats its return values? https://github.com/higlass/clodius/blob/268f4adc5e662b3e2779d6b8a07f59a8e0b79325/clodius/tiles/beddb.py#L74

IMO, if possible, it would be better to do the collapsing in clodius so that higlass-server just responds to request without further formatting returned tiles.

alexpreynolds commented 4 years ago

Wouter Meuleman has asked that I examine an aggregation method that is applied at higher zoom levels (levels closer to whole-genome scale) — perhaps something that counts intervals over bins and which the higlass BED12 plug-in (PR forthcoming) would switch over from rendering exon blocks to rendering as a 1D density-like heatmap or line plot.

This behavior would be somewhat similar to what a mode of the UCSC browser does, except that it appears to do this by estimated base coverage and not interval coverage, which would not make sense here.

I'm not sure if you'd prefer leaving this PR open as I work on this, or if you'd like this PR to be wrapped in sooner?

In my opinion, having this PR wrapped in would make it easier to get the currently-open higlass-server PR worked in (and to be able to submit a PR for the higlass bigBed-BED12 visualization piece, itself).

I can then add the aggregation feature after, as the next stage of bigBed support.

(I guess I'm asking if you folks are generally satisfied with this PR as it is now?)

pkerpedjiev commented 4 years ago

Sorry, completely lost track of this PR. I just merged it and will cut a new release once I finish replying here :-) As always, please feel free to send reminders if I don't respond to something within a day or two.

For the other aggregation method, I was playing around with something even simpler than displaying interval density. What it does is merge intervals that are closer than some number of pixels apart visually. It's simple in concept and best demonstrated with an example:

https://observablehq.com/d/cee1a89063441de0

The logic is that the vast majority of the time you only care if there is a feature somewhere and not what the density of intervals is. This aggregation method introduces new features of type "filler" which just denote regions with features in them. So if you're zoomed way out and some features are hidden, the filler region that contains them will still be shown indicating that if you zoom in, you'll find more data.

This is already implemented for the Genbank data fetcher in higlass here: https://github.com/higlass/higlass/blob/49e52d1c77ca7fbd2a74b711f36ccbc24c7a46b3/app/scripts/data-fetchers/genbank-fetcher.js#L14

There is also a PR open to update the Gene Annotations track to support this datatype but I haven't merged it yet: https://github.com/higlass/higlass/pull/735

Take a look and let me know what you think.