shanaxel42 commented 5 years ago

Objective

As a spaceTX data analyst with an experiment that does not have a matching number of rounds and channels, I want starfish to support this data model, so that I can create a starfish benchmarking pipeline

Acceptance Criteria

Simone can process his data with starfish
Brian Long can process his data with starfish

Notes

Encapsulates work that needs to be done to support data coming in from Ola, Simone, Brian Long, and presumably RNAscope users

ambrosejcarr commented 5 years ago

@ttung identified that putting all the images into auxiliary_images is a feasible work-around for the time being.

This is correct, but I believe we need to be able to provide first-class support for SpaceTx assays. In particular, sequential smFISH seems likely to be a popular assay, and so it feels illogical to treat it as an edge case and solve the issue with a work around.

Although changes to this part of the format don't necessarily interact with the code base beyond the Experiment and Codebook classes, we're still in a position where we might be able to get away with deprecating support for older image formats, so I think it would be prudent to make a decision on this soon, before more external groups pick up SpaceTx Format.

ttung commented 5 years ago

Thesis

We should not be opinionated about how people organize their data on disk. We should make it easy to load data regardless of how it is organized. Therefore, we should support the following models:

aux tiles stored inside the primary image collection.
a primary image collection that is not uniformly sized on round/ch/z.
a primary image that resides in multiple collections.

Data model

When people process their data, we should be opinionated about the data structure they load into. We should advocate that an ImageStack represent a primary image or an auxiliary image, or subsets (ideally, all the tiles that are physically aligned in X-Y space) of one. An ImageStack should not represent hold data from both a primary image and an auxiliary image, even if they are aligned in X-Y space.

Examples

Loading data when aux tiles are stored inside the primary image

Assuming ch1 contains the nuclei stains, and ch0, 2, 3 contains the primary image.

primary_image = exp['fov_001'].get_images(FieldOfView.PRIMARY_IMAGES, {Axes.CH: {0, 2, 3}})
nuclei_image = exp['fov_001'].get_images(FieldOfView.PRIMARY_IMAGES, {Axes.CH: {1}})

Loading data when a primary image collection is not uniformly sized on round/ch/z

Loading each round into a separate imagestack.

round_0 = exp['fov_001'].get_images(FieldOfView.PRIMARY_IMAGES, {Axes.ROUND: {0}})
round_1 = exp['fov_001'].get_images(FieldOfView.PRIMARY_IMAGES, {Axes.ROUND: {1}})

We should also support loading all the data as a single ImageStack. In this case, the incomplete data should be written as NaN. This has memory utilization implications, but we leave that to the user.

primary_image = exp['fov_001'].get_images(FieldOfView.PRIMARY_IMAGES)

Loading data when a primary image that resides in multiple collections.

round_0 = exp['fov_001'].get_images('round_0')
round_1 = exp['fov_001'].get_images('round_1')
round_2 = exp['fov_001'].get_images('round_2')

combiner = Filter.Combiner()
primary_image = combiner.run(round_0, round_1, round_2)

It may be necessary to provide a map between input round/channel/z values and output round/channel/z values.

What's missing

We need crop-on-load in get_image/ get_images (#1259)
We need verify that we can load an ragged tensor into an ImageStack. Missing values need to be NaN and not 0s as they would be today.
We need to build a Combiner filter.

ambrosejcarr commented 5 years ago

Was the first example meant to be the following?

primary_image = exp['fov_001'].get_images(FieldOfView.PRIMARY_IMAGES, {Axes.CH: {0, 2, 3}})
nuclei_image = exp['fov_001'].get_images(FieldOfView.PRIMARY_IMAGES, {Axes.CH: {1}})

For Combiner I'd suggest the method live somewhere outside Filter but I'm on board with the concept.

Everything else looks 👍

ttung commented 5 years ago

Oh yeah. Whoops copy paste. :)

dganguli commented 5 years ago

What does the Combiner do? Does it create a single ImageStack out of a set of ImageStacks?

dganguli commented 5 years ago

Modulo my last comment, this all seems very thoughtful and I'm totally on-board.

ttung commented 5 years ago

What does the Combiner do? Does it create a single ImageStack out of a set of ImageStacks?

It creates a single ImageStack out of a set of ImageStacks.

dganguli commented 5 years ago

what concrete use-case are you thinking of for

a primary image that resides in multiple collections.

which leads to the desire for a Combiner? Apologies for being slow here. By concrete use-case, I mean is there a group/assay that requires this and why?

ttung commented 5 years ago

a primary image that resides in multiple collections.

I don't think this is how people should organize data, but my thesis is that we shouldn't be opinionated about how people organize their data as long as it's labeled correctly. We should just provide tooling for people to ingest that data into starfish regardless of any questionable choices they may have made in data organization. :)

dganguli commented 5 years ago

definitely agree with you there -- just wondering if anyone at spaceJam or anyone you've encountered has actually organized their data/processing in this way? or is 3. purely theoretical? i know for 1. and 2. i've seen actual examples, but for 3. i haven't.

ttung commented 5 years ago

(3) isn't purely theoretical. We advised this as a stopgap measure for uncoded assays while we figure out how to support things better. I'm hoping people don't use it longer than necessary, but we've opened the door.

dganguli commented 5 years ago

Got it. I still don't get what problem (3) solves for un-coded assays though? Can you explain it to me?

dganguli commented 5 years ago

Update: @ttung explained this to me offline. Basically, if I'm trying to load data into an ImageStack, where the individual tiles, across FOVs (for example) arise from different sizes (x, y, z) and there are some rounds/channels missing, you can still load all this into an ImageStack. Combiner will handle the logic of how this will actually all work.

zperova commented 5 years ago

@ttung @ambrosejcarr What are your thoughts on supporting contributors with stitched images since most of the data stored/processed in lab is stitched.

ttung commented 5 years ago

@zperova: short term solution is probably to build tooling to split a stitched image with overlaps.

berl commented 5 years ago

Sorry for chiming in so late, but this does sound really great.

a few questions/comments

How will this relate to AlignedImageStacks? Do those still exist? Will these methods have any constraints on the alignment?
How is a user with this kind of data going to run a pipeline on it? heavy use of the Combiner and get_images()? before of a more "normal" pipeline?
Will the output(s) (e.g. of the Combiner still know about / log the original round/channel/aux identity?
this will address half of #1121
many of the published methods that leverage barcoding (MERFISH, starmap, SeqFISH) also include non-barcoded parts of the experiment. Although the current idea doesn't fix the composite codebook problem (other half of #1121) it should allow that kind of data to live happily in starfish

ttung commented 5 years ago

@berl thanks for looking at this. Your input is really important, so I didn't want to get started without first getting it. :)

How will this relate to AlignedImageStacks? Do those still exist? Will these methods have any constraints on the alignment?

ImageStacks will still be aligned on X-Y. FieldOfView.get_images(..) will take a set of selections for each of r/ch/z (and these selections can be unspecified, which means "all"), and return a set of ImageStack objects. Each of those will be aligned on X-Y.

Imagine that you have R={0, 1, 2, 3, 5, 6}. {0, 2} are aligned. {1} is by itself. {3, 5, 6} are aligned.

FieldOfView.get_images(r={0, 2, 5}) will yield two ImageStacks. The first will consist of R={0, 2}. The second will consist of R={5}.

How is a user with this kind of data going to run a pipeline on it? heavy use of the Combiner and get_images()? before of a more "normal" pipeline?

Combiner is not going to be encouraged as it has some poor memory characteristics. Eventually when we build better lazy-loading semantics, it might work better.

We will encourage users to organize their data such that the primary_image is one Collection (and whether the auxiliary images are in that Collection, we're less opinionated about).

Will the output(s) (e.g. of the Combiner still know about / log the original round/channel/aux identity?

If the original r/ch/aux identity is unique, the Combiner should preserve them.

For instance, ImageStack 1 has R={0, 2, 3} and ImageStack 2 has R={4, 5}. If you combine them, we should get R={0, 2, 3, 4, 5}.

However, if you have ImageStack 1 has R={0, 2, 3} and ImageStack 2 has R={2, 5}, the user will have to provide a mapping between input R and output R, which should be recorded in provenance logging.

this will address half of #1121 many of the published methods that leverage barcoding (MERFISH, starmap, SeqFISH) also include non-barcoded parts of the experiment. Although the current idea doesn't fix the composite codebook problem (other half of #1121) it should allow that kind of data to live happily in starfish

I don't get what that's all about, but @dganguli is going to explain it to me tomorrow. :)

njmei commented 5 years ago

This system sounds good and looks like it should have the flexibility to deal with all sorts of interesting data arrangements.

Practically, will anything be changing for TileFetchers and getting external data into spacetx format? It looks like the proposed system deals with weird data organization strictly after conversion to starfish format (e.g. separation of images into AUX and PRIMARY tiles)?

ttung commented 5 years ago

TileFetchers should not be affected by these changes.

joshmoore commented 5 years ago

https://github.com/spacetx/starfish/issues/1322#issuecomment-491938207 We advised this as a stopgap measure for uncoded assays while we figure out how to support things better

As a side-note, use of spacetx-writer in these cases is based on what is (confusingly) called the FileStitcher API which matches your Combiner. You can see https://docs.openmicroscopy.org/bio-formats/6.0.1/developers/file-reader.html#file-reading-extras for more info. We've never been able to get away from it. Looking forward to what you come up with!

ttung commented 5 years ago

We're removing the ability to load a ragged array (i.e., an array where the cardinality of round/channel/zplane is not uniform) into a single ImageStack. Instead, the prescribed method will be to load the data by round, and process them that way. See the block of text that has been struck out in https://github.com/spacetx/starfish/issues/1322#issuecomment-491466070

njmei commented 5 years ago

@ttung

Hmm... so stumbled upon a wrinkle while trying to figure out how to coerce starfish into converting data into starfish format on a per round basis that relates to this issue and dredges #1121 back from the grave to a certain extent as well.

The problem: Starfish does not appear to be able to convert standalone rounds (that know their own round index) without resorting to significant trickery and losing round information that (probably) requires external assistance to restore.

An example: Say we are running a small 3round experiment and would like starfish to start converting each round as it comes off the microscope.

Since we are advised to treat each round as a separate experiment:

For the round 0 'experiment', write_experiment_json requests the round 0, channel 0 tile and is able to retrieve it. It is able to successfully fetch tiles for all other channels in the round as well.
For the round 1 'experiment', write_experiment_json requests the round 0, channel 0 tile and breaks because there are no round 0 tiles to be found.
The round 2 'experiment' is going to have the same problem as round 1

Even, if one implements the ugly hack of essentially making each round think it is the first round, you are left with a lot of metadata that contains 'wrong' information that requires external modification to resolve in order to piece together a full working experiment again.

A possible solution: Primary and Auxiliary image dimensions could be Iterables (or some similar data type) instead of ints. Using the per round analysis example, the 3rd round when calling write_experiment_json() should report its 'dimensions' to be:

primary_image_dimensions[Axes.ROUND] = [2]
primary_image_dimensions[Axes.CH] = [0, 1, 2, 3]
primary_image_dimensions[Axes.ZPLANE] = [0, 1, 2, ....]

So maybe TileFetchers or at least write_experiment_json() should be affected by this rework?

ttung commented 5 years ago

Yes, I agree that write_experiment_json() should take iterables instead of the cardinality of each dimension. You can see that build_image() has already been refactored to do this, and it's just a matter of finishing this work. :)

ttung commented 5 years ago

1374 adds this. I needed it for something else, so it seemed pretty high-leverage. :)

njmei commented 5 years ago

@ttung Sorry to keep adding more to your plate regarding this issue. Adding some formal examples to add to the slack convo we had yesterday:

Currently:

def build_image(
        fovs: Sequence[int],
        rounds: Sequence[int],
        chs: Sequence[int],
        zplanes: Sequence[int],
        image_fetcher: TileFetcher,
        default_shape: Optional[Mapping[Axes, int]]=None,
        axes_order: Sequence[Axes]=DEFAULT_DIMENSION_ORDER,
) -> Collection:

Some examples of things starfish conversion can't handle:

Example 1:
Round 1:
    - pri_probe_000 (channel 0)
    - pri_probe_001 (channel 1)
    - pri_probe_002 (channel 2)
    - DAPI (channel 3)

Round 3:
    - pri_probe_003 (channel 0)
    - pri_probe_004 (channel 1)
    - pri_probe_005 (channel 2)
    - pri_probe_006 (channel 3)

So being able specify that rounds are [1, 3] is awesome! But chs in build_image() need to be a function of round as chs cannot be both [0, 1, 2] and [0, 1, 2, 3].

Example 2:
Round 0:
    - pri_probe_000 (channel 0)
    - pri_probe_001 (channel 1)
    - pri_probe_002 (channel 2)
    - "no probe" (channel 3)
Round 1:
    - pri_probe_003 (channel 0)
    - pri_probe_004 (channel 1)
    - "no probe" (channel 2)
    - pri_probe_005 (channel 3)

Same problem here... chs are [0, 1, 2] in round 0 but [0, 1, 3] in round 1

I think one 'could' go more general than this and have things like fov, or z sequences as a function of each other or as a function of round/channel, but I think those are going to be very rare. The above situations I think could be fairly common. But this is probably a question to ask other collaborators...

ttung commented 5 years ago

We just need to accept a generator that produces (fov, round, ch, z).

neuromusic commented 4 years ago

since @ttung and @shanaxel42 implemented support for ragged and composite codebooks, I think we can close this. feel free to open a new issue if there are other ideas discussed here that we should track

spacetx / starfish

(composite, ragged) experiment and codebook support #1322

Objective

Acceptance Criteria

Notes

Thesis

Data model

Examples

Loading data when aux tiles are stored inside the primary image

Loading data when a primary image collection is not uniformly sized on round/ch/z

Loading data when a primary image that resides in multiple collections.

What's missing

1374 adds this. I needed it for something else, so it seemed pretty high-leverage. :)