openworm / tracker-commons

Compilation of information and code bases related to open-source trackers for C. elegans

There is no consistent way to specify contour information #132

Open Ichoran opened 7 years ago

Ichoran commented 7 years ago

Many trackers that produce skeleton information also produce contour information, yet WCON provides no way to share that information across trackers.

Additionally, some skeleton-finding routines produce points that are suitable for approximation with a spline while others (chiefly erosion + pruning methods) do not. There is no way to indicate with metadata what the qualities of the skeleton (or contour) are.

The data format should be extended to allow both of these.

For contours, one possibility is to simply list all the points. This isn't particularly compact, however. An alternative is to pack pixel traversals bitwise into a binary array and then Base64 encode that array.
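A minimal sketch of the bit-packing idea in Python (the 2-bit move table and function name here are placeholders, not anything agreed in this thread):

```python
import base64

# Hypothetical 2-bit codes for the four 4-connected pixel moves.
MOVES = {(1, 0): 0b00, (-1, 0): 0b01, (0, 1): 0b10, (0, -1): 0b11}

def pack_traversals(steps):
    """Pack a list of (dx, dy) unit moves, 2 bits each,
    least-significant bits first, then Base64-encode the bytes."""
    buf = bytearray((len(steps) * 2 + 7) // 8)
    for i, step in enumerate(steps):
        buf[i // 4] |= MOVES[step] << (2 * (i % 4))
    return base64.b64encode(bytes(buf)).decode("ascii")
```

Each Base64 character then carries three pixel transitions, so a contour of a few hundred pixels fits in a short string.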

One possibility for a contour object that covers all of these possibilities is as follows:

"contour" indicates a contour around the animal; this is an optional parameter for readers and writers to support.

It is either an object, or, if there is more than one timepoint, an array of objects of the same length as the number of timepoints.

Each object has either a "points" field or a "px" field (or both).

If "points" exists, is either an array of numbers or two arrays of numbers, the latter being the two sides of the animal. Optionally, a "known" key can be either "head" or "d/v". If "d/v", "points" starts at the head and traces around the dorsal side, then ventral side (split between one or two arrays). If "head", "points" starts at the head, but it is not known whether it is traversing the dorsal or ventral side. If the key is missing, one should not assume either the dorsal or ventral side is known. Coordinates are relative to centroid and/or origin as with skeletons.

If "px" exists, it is an array with six or eight elements. First, the known value, either "" or "head" or "d/v", with the same meaning as for "known" in the "points" case, or an empty string if nothing is known. Second, the distance (in real space) per pixel edge. Third and fourth, the x,y coordinates of the first pixel in the outline (in real coordinates, not relative to centroid or skeleton). Fifth, the number of pixels traversals packed into the sixth element. Sixth, a Base64 encoding of a binary array where the transitions from one pixel to the next are encoded, least significant bit first, in a binary byte array; 00 means +x +y, 01 means +x-y, 10 means -x+y, and 11 means -x-y tracing of 4-connected pixels. If there is a seventh element, is the number of pixels in the return contour which must be present in the eighth element, and the sixth only traced half the contour.

This is a pretty complicated specification. I am already using something very much like the "px" encoding on my own data, and I'm happy to keep that under an @XJ flag if this seems like too much.

One could also assume that what is known is fixed and put it into metadata instead of repeating it for every timepoint for every animal.

@MichaelCurrie @cheelee @JimHokanson @aexbrown @ver228 - What do you all think?

ver228 commented 7 years ago

I would prefer the "points" field (probably because it is simpler). My problem is that the array becomes quite complicated by default. Something like this, I guess:

"contour" : [ [[[dx1,...],[dy1,...]], [[vx1,...],[vy1,...]]], [[[dx1,...],[dy1,...]], [[vx1,...],[vy1,...]]], ...]

I am a bit less comfortable with the "px" format. It is a clever way to encode the info, but the part that worries me is having a custom encoding in the file. In a way it defeats the purpose of a human-readable format. If you have to specify an encoding, I would prefer to use something more widely used, like BSON. It is more likely that people will have access to a format used in industry rather than something defined in-house.

@Ichoran I wouldn't say there is no consistent way to specify this information, but rather there is not really a very compact way to specify the information. We need 5 pieces of information: 1) coordinates, 2) head position, 3) tail position, 4) ventral side, 5) dorsal side.

I would propose to store the information in the names of optional fields. If you have a single contour, use the fields "cnx", "cny". If you have a dorsal and a ventral contour, you can use the fields "vcnx", "vcny", "dcnx", "dcny". For the head orientation I would use the same field as in the WCON specification for the skeletons (L: first x-y; R: last x-y; ?: unknown). In the case of a single contour ("cnx", "cny"), the orientation can be specified in the "ventral" field of the WCON specification, and I would add an optional field "tail" with the corresponding index in the contour.
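A data record under this naming scheme might look something like the following (all numbers and the id are made up for illustration, and the surrounding WCON record structure is only approximated):

```json
{
  "id": "worm_1",
  "t": 1.5,
  "x": [12.1, 12.2, 12.3],
  "y": [5.0, 5.1, 5.2],
  "dcnx": [12.0, 12.2, 12.4],
  "dcny": [4.9, 5.1, 5.3],
  "vcnx": [12.4, 12.2, 12.0],
  "vcny": [5.3, 5.1, 4.9],
  "head": "L",
  "ventral": "CW"
}
```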

Ichoran commented 7 years ago

Base64 is extremely widely used, far more so than BSON. Writing another spec to use a binary format is possible but beyond the scope of this project currently. BSON doesn't have an arbitrary-bit-depth array packing format either, so it'd have to be exactly as complicated as here, except that you wouldn't need to specify "Base64" once you decide how to embed a contour walk into bytes.

But I'm completely happy with keeping contour information as an @XJ-specific thing. It just means that if anyone else wants contours out of my raw data they'll have to use the Scala implementation (or preprocess the data with Choreography). I really cannot afford the larger file size; this reduces it by about 10x.

Optional fields are okay too. Maybe use "p" for perimeter? I agree about using the skeleton orientation field. So maybe "px", "py" for a single perimeter. I would resolve head/tail more simply: the skeleton can be used to cut your px, py contour in half. If you don't want to store the entire skeleton, you can just store head and tail points to disambiguate the ends. Then the existing fields are adequate to recover all the information.
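The skeleton-based cut could work roughly like this (a sketch; the function names are mine, not from any implementation, and it uses nearest-point matching rather than exact coordinate equality):

```python
def split_contour(px, py, head, tail):
    """Split a closed contour into two sides at the contour points
    nearest the skeleton's head and tail endpoints."""
    def nearest(pt):
        return min(range(len(px)),
                   key=lambda i: (px[i] - pt[0]) ** 2 + (py[i] - pt[1]) ** 2)
    hi, ti = nearest(head), nearest(tail)
    a, b = sorted((hi, ti))
    pts = list(zip(px, py))
    side1 = pts[a:b + 1]                 # one side, endpoints included
    side2 = pts[b:] + pts[:a + 1]        # wrap around for the other side
    return side1, side2
```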

Ichoran commented 7 years ago

@MichaelCurrie - Do you want to weigh in? I expect that this optional feature isn't going to be all that optional for your implementation if you're working with the people who want it :P

As I understand Avelino's proposal, the data would be treated just like x and y, so the implementation should not be too difficult. (At worst, abstract out the x and y normalization by centroid and/or origin and apply that to the perimeter points.)

MichaelCurrie commented 7 years ago

This sounds good to me. I haven't looked at this specific proposal in detail yet but I very much agree with the idea of a native contour format since the OpenWorm, Kerr, Schafer Lab, and Brown Lab use cases all use them :)

ver228 commented 7 years ago

@Ichoran sorry for the late answer; I have had quite a lot of work recently.

First, one question: you mention preprocessing data with Choreography. What is Choreography? Is it custom-made software? Do you have a link to the repository?

I have mixed feelings about the px encoding. For one, it is a pretty cool way to store data. It took me a while to understand it, but I see how it saves a lot of space. (However, I still haven't understood the 7th and 8th elements in the array; I thought the number of pixels would be encoded in the 5th element, the number of traversals.)

Now, my problem is that since it is a custom encoding, we should include the tools not only to read but also to write that format. I mean that if somebody has an array of contour pixels, the tool should be able to encode it into the Base64 traversal array; otherwise people will not really use the encoding, and it would remain a custom lab feature. Base64 might be very common, but if I am understanding correctly, having three pixel transitions encoded in each character as displacements from previous displacements is not.

I agree that we could make the storage more compact by using the fields "px" and "py". However, I am not sure about storing the head/tail break using only the skeleton. I would prefer to add a tail index field, something like "ti". My reason is that I don't trust the equality operator for floats: sometimes it returns false if the values differ by an epsilon. Obviously this can be solved by rounding or by looking for the minimal distance, but it is annoying. In my data, I always store a "normalized" contour with exactly 49 points per side. In my case I would be happy to define a global variable "tail_index".

On another topic: the contours must store their clockwise/counterclockwise orientation, right? It can be deduced with the shoelace formula.
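For reference, the shoelace (signed area) check might look like this; a positive signed area means counterclockwise in a y-up coordinate system (a sketch, not code from any WCON implementation):

```python
def signed_area(xs, ys):
    """Shoelace formula: 2*A = sum over i of x_i*y_{i+1} - x_{i+1}*y_i,
    with indices wrapping around the closed contour."""
    n = len(xs)
    s = sum(xs[i] * ys[(i + 1) % n] - xs[(i + 1) % n] * ys[i] for i in range(n))
    return s / 2.0

def is_counterclockwise(xs, ys):
    # Sign convention assumes y increases upward; image coordinates flip it.
    return signed_area(xs, ys) > 0
```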

Now, if I am understanding correctly, we could have the two ways to store contours that Rex proposed, with slight modifications: 1) There would be "px" and "py" instead of points, and there might be a "tail_index" or "ti", probably only global? 2) What would be the new name for "px", then? "pix", from pixels? What if somebody wants to use this encoding to store skeleton information? Would that be allowed?

Ichoran commented 7 years ago

Choreography is the post-processing software that takes whatever is analyzed online by the Multi-Worm Tracker and produces derived features. The released version doesn't handle WCON, so it's not very useful to you. (It's here.)

I guess it's okay to have a tail index.

We can call the pixel walk "walk". Skeletons are forbidden because it's a stupid format for skeletons: none of the decent skeleton-finding algorithms for worms actually use the image-morphology "skeleton", so the skeleton isn't pixelated. And we can drop the 7th and 8th entries for the return contour if we're cutting based on the skeleton, and the first entry if we're using the same conventions as for skeletons. So it'd look like:

"walk": [0.0275, 151.215, 281.155, 62, "asvjao834fLIAfeWhah8E"]
Ichoran commented 7 years ago

I will have a draft Scala implementation ready tomorrow, assuming that the cold that I'm catching isn't too bad.

Ichoran commented 7 years ago

Took longer than I was expecting, due in part to bugs in unit conversions and difficulty with normalization with centroids. https://github.com/openworm/tracker-commons/pull/136 contains a draft.

Ichoran commented 7 years ago

Now there is a specification! Please see #146 - does this look okay to you @ver228 ?

ver228 commented 7 years ago

The specification is great for me. I have already implemented the changes to our data. It would be great if you could try this example file.

It seems that I can read the data using the scala implementation as:

import org.openworm.trackercommons._
val fname = "./asic-1 (ok415) on food L_2010_07_08__11_46_40___7___5.wcon"
val worms = ReadWrite.read(fname).right.get

However, when I try to extract the data I receive the following error:

val data = worms.combined() match {
  case Right(ds) => ds
  case Left(err) => println(err); throw new Exception
}

<console>:16: error: value combined is not a member of org.openworm.trackercommons.DataSet
       val data = worms.combined() match {

If I execute worms.data I get a java.lang.OutOfMemoryError: Java heap space.

Hopefully @Ichoran could give me some advice.

Ichoran commented 7 years ago

@ver228 - I'll have a look at this in the next day or so. The latest Scala version should handle it fine, but I was reworking combined, so it doesn't exist in the version you're using. You can increase the heap space available to the JVM by passing -J-Xmx3G or somesuch as an argument to the sbt or scala commands, or -Xmx3G as an argument to java alone, depending on how you're running things.