Support CREMI-style pre- and post-synaptic site labels

clbarnes commented 6 years ago

Further to our discussion yesterday, I thought I'd add this to the board.

The CREMI format stores locations of point annotations for pre- and post-synaptic types; BigCAT displays these as points and arrows as below.

While the implementation wouldn't need to be the same (CREMI's depends on HDF5's special datatypes for strings), some way of getting point annotations to show up would be very helpful. My use case, for example, takes fairly large regions of image data where a few synapses of interest have been pulled from CATMAID. I put them into CREMI format with @funkey's cremi_python and use them to navigate and show where those synapses are.

We discussed possibly storing them as attributes in JSON to save delving into N5's custom data types, which are not supported by, for example, z5py.

Additionally, BigCAT only shows one edge per presynaptic site, whereas in CATMAID polyadic synapses are represented as a single presynaptic site and many postsynaptic sites. If the relaxed form could be supported in an eventual paintera implementation, that'd be great!

[Image: arrows]

hanslovsky commented 6 years ago

I talked to @ssinhaleite and she said that @funkey preferred not to replicate the BigCAT annotations one to one, but I don't know what changes he has in mind. @funkey, please comment. As far as I am concerned:

- locations need to be stored as xyz rather than zyx
- colors of pre-/post should be configurable
- in-memory representation should be unaware of storage backend
- show (selected) partners in 3D viewer
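
A minimal sketch of the first and third points, assuming a hypothetical `SiteAnnotation` record (the class and the `from_zyx` helper are illustrative names, not part of Paintera or BigCAT). The in-memory type holds xyz coordinates only, and a backend that stores zyx converts at its own boundary:

```python
from dataclasses import dataclass
from typing import Sequence, Tuple


@dataclass(frozen=True)
class SiteAnnotation:
    """Hypothetical in-memory record; it knows nothing about N5, HDF5 or JSON."""
    annotation_id: int
    kind: str  # e.g. "presynaptic_site" or "postsynaptic_site"
    xyz: Tuple[float, float, float]


def from_zyx(annotation_id: int, kind: str, zyx: Sequence[float]) -> SiteAnnotation:
    """Convert a location stored in zyx order into the xyz in-memory form."""
    z, y, x = zyx
    return SiteAnnotation(annotation_id, kind, (x, y, z))


site = from_zyx(1, "presynaptic_site", (30.0, 20.0, 10.0))
print(site.xyz)  # (10.0, 20.0, 30.0)
```

Keeping the axis swap in a single reader function means the viewer code never has to know which order a given file used.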

clbarnes commented 6 years ago

It'd help me out if the label ID applied to the cleft could somehow be linked to the pre/ post/ pre-post edge, although I appreciate that that might be usage-specific (right now I'm only labelling the cleft as it pertains to one pre-post edge, but plenty of people would want to label the whole cleft, etc.).

hanslovsky commented 6 years ago

If I understand correctly, you would like to associate pre/post with a cleft id so you could do things like selecting all pre/post for a specific cleft. That requires just one additional number (and not a String or any structured data). It sounds reasonable but would require that

- clefts have individual labels
- these labels are unique
- whenever clefts are split (e.g. wrong merge in automatic segmentation), the associated pre/post would need to be updated.

Not all datasets may have individually labeled clefts (e.g. just foreground vs background label). In those cases, the associated id could default to an invalid value.
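
As a rough illustration of the idea above: the sentinel value, the dict layout and both function names below are assumptions made for the sketch, not an existing Paintera API.

```python
# Assumed sentinel for "no cleft associated": the largest uint64 value.
INVALID_CLEFT_ID = 2**64 - 1

sites = [
    {"id": 1, "kind": "presynaptic_site", "cleft_id": 7},
    {"id": 2, "kind": "postsynaptic_site", "cleft_id": 7},
    {"id": 3, "kind": "presynaptic_site", "cleft_id": INVALID_CLEFT_ID},
]


def sites_for_cleft(sites, cleft_id):
    """Return all pre/post sites associated with one cleft label."""
    if cleft_id == INVALID_CLEFT_ID:
        return []  # the sentinel never selects anything
    return [s for s in sites if s["cleft_id"] == cleft_id]


def relabel_after_split(sites, old_cleft_id, new_cleft_id_for_site):
    """After a cleft is split, reassign every site that pointed at it.

    `new_cleft_id_for_site` maps site id -> new cleft id; sites missing
    from the mapping fall back to the invalid sentinel.
    """
    for s in sites:
        if s["cleft_id"] == old_cleft_id:
            s["cleft_id"] = new_cleft_id_for_site.get(s["id"], INVALID_CLEFT_ID)


print([s["id"] for s in sites_for_cleft(sites, 7)])  # [1, 2]
```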

axtimwalde commented 6 years ago

Here are my 5ct:

The BigCAT annotations by @funkey are not bad at all and I believe that we should support this mode of annotation because it does not depend on anything else than somebody looking at the data and identifying synaptic partners by local cues.
Association with segmentation and clefts as necessary for analysis, partner prediction training and what not is something that we were able to extract from this simpler form of annotation, but it should come after this exists. I agree that we should keep an open mind about associating annotations with additional properties, e.g. references to synapse ids and segment ids.
The simplest way in N5 to store String attributes is as attributes (i.e. JSON), instead of a dataset. An attribute that is a list of (uint64, String) tuples is no problem (except for the HDF5 backend), and that would solve the double referencing hop in the current HDF5 backend. Since Strings are Strings, storing them as Strings is not bad, i.e. JSON is ok. The downside of this is that it will not scale forever (but possibly further than HDF5). In particular GSON uses int32 indexed lists of objects to parse JSON which would allow us to store a maximum of 2^31 synaptic partners (which, BTW, for a fly, is just fine).
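
A small sketch of what such an attribute could look like, using only Python's standard `json` module. The attribute name `annotation_types` and the pair layout are assumptions for illustration; no N5 library is involved here.

```python
import json

# Hypothetical attribute: a list of [id, type] pairs.
attributes = {
    "annotation_types": [
        [1, "presynaptic_site"],
        [2, "postsynaptic_site"],
        [2**63 + 5, "postsynaptic_site"],  # an id above the int64 range
    ]
}

encoded = json.dumps(attributes)
decoded = json.loads(encoded)

# JSON has no tuple type, so the pairs come back as two-element lists.
types_by_id = {ann_id: ann_type for ann_id, ann_type in decoded["annotation_types"]}
print(types_by_id[1])  # presynaptic_site
```

Python's `json` module round-trips arbitrarily large integers exactly. A parser that reads every JSON number as a double would not preserve ids above 2^53, so the uint64 range is worth checking against whichever reader is used.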

clbarnes commented 6 years ago

An alternative to the CREMI-style 1-D dataset of "pre/postsynaptic_site" strings would be a 1-D uint dataset encoding them as 0/1, and then a dict in the attributes JSON mapping those integers to their string representation. That saves space and keeps the JSON readable, as well as sidestepping the GSON limitation (large as it may be). It also means it would be trivial to extend the annotation system if someone wanted to include other features like vesicles, mitochondria and so on.

Comments would need to be stored in the JSON, though (but there may be fewer comments than annotations).

In fact, the annotation ID array (currently 1-D) could just be extended to being 2-by-n so that each annotation ID is stored with its type (encoded as an integer).
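
The integer encoding described in this comment might look roughly like this. Plain Python lists stand in for the dataset, and the attribute name `annotation_classes` is an assumption for the sketch.

```python
import json

# Hypothetical attribute: integer class codes mapped to class names.
annotation_classes = {0: "presynaptic_site", 1: "postsynaptic_site"}

# 2-by-n id array: row 0 holds annotation ids, row 1 the class code of each.
ids = [
    [101, 102, 103],
    [0, 1, 1],
]

# JSON object keys are always strings, so convert them back to int on load.
stored = json.dumps({"annotation_classes": annotation_classes})
loaded = {int(k): v for k, v in json.loads(stored)["annotation_classes"].items()}


def class_names(ids, classes):
    """Map each annotation id to the name of its class."""
    annotation_ids, codes = ids
    return {a: classes[c] for a, c in zip(annotation_ids, codes)}


print(class_names(ids, loaded))
```

Adding a new feature type, such as mitochondria, would then only need one more entry in the mapping; the id array layout stays the same.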

ssinhaleite commented 6 years ago

If I am not mistaken, what @funkey meant is that the annotations are currently too tightly tied to synapses with pre-post information. We talked about having a more generic abstraction where we could overcome this. Maybe using a graph model (with predefined schemes?) that could be selected by the user (depending on what he/she wants to annotate).

Discussing what the (annotation) necessities are and what we want to offer can help to create a clean/smart implementation, instead of ending up refactoring everything later on.

clbarnes commented 6 years ago

I agree, a more generic approach where you can have arbitrary classes of point annotations would be nice. Being able to have arbitrary associations between those point annotations again would allow paintera to be generic over its use cases - having a menu to tell paintera how to render those associations from a small set of possibilities (e.g. with/without an arrowhead) would be very flexible.

In CATMAID, we do skeletonised annotation of neurites, and so with this generic implementation of annotations and annotation-associations, we would be able to port not only synapse information, but also entire skeletons into paintera, which would drastically improve the ease of turning sparse skeletonisations into volumetric labels. This would be really helpful for another couple of people in our lab as we're just starting off a project for flood-filling sparse annotations into volumetric labels, which will need training data and manual corrections.

I anticipate something along these lines (dimensions in N5 order):

- /annotations/ids dataset, uint64, 2 x n
  - ID of the annotation, and ID of its class
  - As an attribute on this dataset, annotation_classes, an object which maps integer IDs to class names (like "presynaptic_site", "mitochondrion", "treenode")
- /annotations/locations dataset, float, 3 x n
  - Locations of the annotations, matched by index
- /annotations/associations dataset, uint64, 4 x n
  - ID of the association, ID of the first annotation, ID of the second annotation, ID of how the second is related to the first
  - As an attribute on this dataset, association_classes, an object which maps integer IDs to association class names (like "postsynaptic_to", "abutting", "adjacent_treenode")
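
To make the proposed layout concrete, here is a sketch in which plain Python lists stand in for the three datasets (one inner list per row of the 2 x n, 3 x n and 4 x n arrays). The values, and the `location_of` and `arrows` helpers, are invented for illustration. It shows one polyadic synapse: a single presynaptic site with two postsynaptic partners.

```python
# /annotations/ids: annotation id, class id
ids = [[1, 2, 3],
       [0, 1, 1]]
annotation_classes = {0: "presynaptic_site", 1: "postsynaptic_site"}

# /annotations/locations: x, y, z, matched to ids by column index
locations = [[10.0, 14.0, 9.0],
             [20.0, 22.0, 25.0],
             [5.0, 5.0, 6.0]]

# /annotations/associations: association id, first annotation,
# second annotation, class of how the second relates to the first
associations = [[100, 101],
                [1, 1],
                [2, 3],
                [0, 0]]
association_classes = {0: "postsynaptic_to"}


def location_of(annotation_id):
    """Look up the xyz location of an annotation by its id."""
    column = ids[0].index(annotation_id)
    return tuple(axis[column] for axis in locations)


def arrows(relation="postsynaptic_to"):
    """Return (first_xyz, second_xyz) pairs for one association class."""
    result = []
    for _assoc_id, first, second, cls in zip(*associations):
        if association_classes[cls] == relation:
            result.append((location_of(first), location_of(second)))
    return result


for start, end in arrows():
    print(start, "->", end)
```

Because both arrows share the same first annotation, the one-edge-per-presynaptic-site restriction does not arise in this layout.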

There could be another dataset for matching labels to annotations or annotation associations. For example, a synaptic cleft would want to be associated with a single presynaptic site (in the insect), but the postsynaptic surface area as it pertains to a single presynaptic site would want to be associated with a pre-post edge.

The class name space would be arbitrary and paintera could, initially by default or eventually through configuration, render specific classes in particular ways (like arrows for "postsynaptic_to").

This is obviously recapitulating some of the design concepts from CATMAID, which shouldn't shock you as I'm in the Cardona lab. Storing things in vectors is never going to be as efficient as using a database (although given how small point annotations are, it's not out of the question to use in-memory sqlite tables in the implementation).

I've got a bit carried away and have focused more on the file format than the paintera implementation, but I think that they're pretty tightly coupled, as bigCAT/ CREMI are.

funkey commented 6 years ago

What I had in mind was a generic graph implementation for annotations. Users can then instantiate concrete graphs by adding properties (directed, tree-shaped, bipartite, etc.) for a particular purpose, e.g.:

- directed, bipartite, 1-to-1: For CREMI style partner annotations
- directed, bipartite, 1-to-many: For star-shaped FIB25 style partner annotations
- directed, tree: For skeletons
- undirected, chains: For microtubules
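
One way to read "instantiating a concrete graph by adding properties" is as a set of constraint checks over a generic edge list. The sketch below covers only the two bipartite cases, and every function name in it is hypothetical:

```python
from collections import Counter


def is_directed_bipartite(edges, sources, targets):
    """Every edge must run from a node in `sources` to a node in `targets`."""
    return sources.isdisjoint(targets) and all(
        u in sources and v in targets for u, v in edges
    )


def is_one_to_many(edges, sources, targets):
    """Bipartite, and each target has at most one incoming edge (star-shaped)."""
    in_degree = Counter(v for _, v in edges)
    return is_directed_bipartite(edges, sources, targets) and all(
        n == 1 for n in in_degree.values()
    )


def is_one_to_one(edges, sources, targets):
    """One-to-many, and each source also has at most one outgoing edge."""
    out_degree = Counter(u for u, _ in edges)
    return is_one_to_many(edges, sources, targets) and all(
        n == 1 for n in out_degree.values()
    )


# A polyadic synapse: presynaptic node 1 with postsynaptic nodes 2 and 3.
polyadic = [(1, 2), (1, 3)]
print(is_one_to_many(polyadic, {1}, {2, 3}))  # True
print(is_one_to_one(polyadic, {1}, {2, 3}))   # False
```

Tree and chain checks would follow the same pattern, with degree and connectivity constraints in place of the source/target split.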

clbarnes commented 6 years ago

That definitely has nice features in terms of being both generic and specific enough for different applications. My (possibly selfish) wish as part of that would be for nodes to be able to belong to more than one graph (so that a skeleton node could also be a postsynaptic partner, for example).

I guess in the N5 scheme, an edge list would be the best way to serialise such a graph? Although different storage paradigms would be optimal for different types and densities of graphs, I suppose.

axtimwalde commented 5 years ago

Just for completeness, I believe it would be great if this feature incorporated an index of all edges intersecting a block, similar to the index of labels present in a block, i.e. block_id -> [edge_id,...].
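
A sketch of such an index, under two stated assumptions: a block id is taken to be the block's grid position, and an edge is assigned to every block touched by its axis-aligned bounding box. The bounding box is a conservative superset of the blocks a diagonal edge actually crosses; an exact version would need a line/grid traversal instead. Function names are illustrative.

```python
import math
from collections import defaultdict
from itertools import product


def candidate_blocks(p, q, block_size):
    """Grid positions of all blocks overlapping the bounding box of edge p-q."""
    lo = [math.floor(min(a, b) / s) for a, b, s in zip(p, q, block_size)]
    hi = [math.floor(max(a, b) / s) for a, b, s in zip(p, q, block_size)]
    return product(*(range(l, h + 1) for l, h in zip(lo, hi)))


def build_block_index(edges, block_size):
    """Build block position -> [edge_id, ...] from edge_id -> (p, q)."""
    index = defaultdict(list)
    for edge_id, (p, q) in edges.items():
        for block in candidate_blocks(p, q, block_size):
            index[block].append(edge_id)
    return dict(index)


edges = {
    7: ((1.0, 1.0, 1.0), (5.0, 1.0, 1.0)),   # crosses a block boundary in x
    8: ((9.0, 9.0, 9.0), (10.0, 9.0, 9.0)),  # stays inside one block
}
index = build_block_index(edges, (4, 4, 4))
print(index)
```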

saalfeldlab / paintera

Support CREMI-style pre- and post-synaptic site labels #44