> @ttung identified that putting all the images into `auxiliary_images` is a feasible work-around for the time being.
This is correct, but I believe we need to be able to provide first-class support for SpaceTx assays. In particular, sequential smFISH seems likely to be a popular assay, and so it feels illogical to treat it as an edge case and solve the issue with a work around.
Although changes to this part of the format don't necessarily interact with the code base beyond the `Experiment` and `Codebook` classes, we're still in a position where we might be able to get away with deprecating support for older image formats, so I think it would be prudent to make a decision on this soon, before more external groups pick up SpaceTx Format.
We should not be opinionated about how people organize their data on disk. We should make it easy to load data regardless of how it is organized. Therefore, we should support the following models:

1. The primary image and the auxiliary images interleaved in a single collection.
2. Each round stored as its own collection.
3. A primary image that resides in multiple collections.
When people process their data, we should be opinionated about the data structure they load into. We should advocate that an `ImageStack` represent a primary image or an auxiliary image, or a subset (ideally, all the tiles that are physically aligned in X-Y space) of one. An `ImageStack` should not hold data from both a primary image and an auxiliary image, even if they are aligned in X-Y space.
Assuming ch1 contains the nuclei stain, and ch0, 2, and 3 contain the primary image:

```python
primary_image = exp['fov_001'].get_images(FieldOfView.PRIMARY_IMAGES, {Axes.CH: {0, 2, 3}})
nuclei_image = exp['fov_001'].get_images(FieldOfView.PRIMARY_IMAGES, {Axes.CH: {1}})
```
Loading each round into a separate ImageStack:

```python
round_0 = exp['fov_001'].get_images(FieldOfView.PRIMARY_IMAGES, {Axes.ROUND: {0}})
round_1 = exp['fov_001'].get_images(FieldOfView.PRIMARY_IMAGES, {Axes.ROUND: {1}})
```
We should also support loading all the data as a single ImageStack. In this case, the incomplete data should be written as `NaN`. This has memory-utilization implications, but we leave that to the user.

```python
primary_image = exp['fov_001'].get_images(FieldOfView.PRIMARY_IMAGES)
```
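For illustration, here is a minimal numpy sketch of the NaN-fill semantics; the shapes and the `tiles` mapping are made up for the example, not starfish API:

```python
import numpy as np

# Hypothetical ragged acquisition: 3 rounds x 4 channels x 1 zplane of 4x4
# tiles, where some (round, channel) combinations were never acquired.
tiles = {(0, 0): np.ones((4, 4)), (0, 1): np.ones((4, 4)), (1, 0): np.ones((4, 4))}

# Pre-fill the full (r, ch, z, y, x) hyperrectangle with NaN, then copy in
# the tiles that actually exist.
stack = np.full((3, 4, 1, 4, 4), np.nan, dtype=np.float32)
for (r, ch), tile in tiles.items():
    stack[r, ch, 0] = tile
```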
Alternatively, when each round lives in its own collection, load each collection separately and combine:

```python
round_0 = exp['fov_001'].get_images('round_0')
round_1 = exp['fov_001'].get_images('round_1')
round_2 = exp['fov_001'].get_images('round_2')

combiner = Filter.Combiner()
primary_image = combiner.run(round_0, round_1, round_2)
```
It may be necessary to provide a map between input round/channel/z values and output round/channel/z values.
- `get_image` / `get_images` (#1259)
- `Combiner` filter

Was the first example meant to be the following?
```python
primary_image = exp['fov_001'].get_images(FieldOfView.PRIMARY_IMAGES, {Axes.CH: {0, 2, 3}})
nuclei_image = exp['fov_001'].get_images(FieldOfView.PRIMARY_IMAGES, {Axes.CH: {1}})
```
For `Combiner`, I'd suggest the method live somewhere outside `Filter`, but I'm on board with the concept.
Everything else looks 👍
Oh yeah. Whoops copy paste. :)
What does the Combiner do? Does it create a single ImageStack out of a set of ImageStacks?
Modulo my last comment, this all seems very thoughtful and I'm totally on-board.
> What does the Combiner do? Does it create a single ImageStack out of a set of ImageStacks?
It creates a single ImageStack out of a set of ImageStacks.
> a primary image that resides in multiple collections

What concrete use-case are you thinking of for this, which leads to the desire for a Combiner? Apologies for being slow here. By concrete use-case, I mean: is there a group/assay that requires this, and why?
I don't think this is how people should organize data, but my thesis is that we shouldn't be opinionated about how people organize their data as long as it's labeled correctly. We should just provide tooling for people to ingest that data into starfish regardless of any questionable choices they may have made in data organization. :)
Definitely agree with you there -- just wondering if anyone at SpaceJam, or anyone you've encountered, has actually organized their data/processing in this way? Or is (3) purely theoretical? I know for (1) and (2) I've seen actual examples, but for (3) I haven't.
(3) isn't purely theoretical. We advised this as a stopgap measure for uncoded assays while we figure out how to support things better. I'm hoping people don't use it longer than necessary, but we've opened the door.
Got it. I still don't get what problem (3) solves for un-coded assays though? Can you explain it to me?
Update: @ttung explained this to me offline. Basically, if I'm trying to load data into an ImageStack where the individual tiles, across FOVs (for example), have different sizes (x, y, z) and some rounds/channels are missing, you can still load all of this into an ImageStack. Combiner will handle the logic of how this actually works.
@ttung @ambrosejcarr What are your thoughts on supporting contributors with stitched images, since most of the data stored/processed in labs is stitched?
@zperova: short term solution is probably to build tooling to split a stitched image with overlaps.
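A rough sketch of what such splitting tooling might do, assuming a plain 2-D numpy image and made-up tile/overlap sizes (this is not existing starfish API):

```python
import numpy as np

def split_with_overlap(stitched: np.ndarray, tile: int = 2048, overlap: int = 256):
    """Yield (y, x, tile) triples covering a stitched 2-D image with overlapping tiles.

    Tiles at the right/bottom edges may be smaller than `tile`.
    """
    step = tile - overlap
    for y in range(0, max(stitched.shape[0] - overlap, 1), step):
        for x in range(0, max(stitched.shape[1] - overlap, 1), step):
            yield y, x, stitched[y:y + tile, x:x + tile]
```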
Sorry for chiming in so late, but this does sound really great. A few questions/comments:

1. How will this relate to `AlignedImageStack`s? Do those still exist? Will these methods have any constraints on the alignment?
2. How is a user with this kind of data going to run a pipeline on it? Heavy use of the `Combiner` and `get_images()` before a more "normal" pipeline?
3. Will the output(s) (e.g. of the `Combiner`) still know about / log the original round/channel/aux identity in starfish?
@berl thanks for looking at this. Your input is really important, so I didn't want to get started without first getting it. :)
> How will this relate to AlignedImageStacks? Do those still exist? Will these methods have any constraints on the alignment?
`ImageStack`s will still be aligned on X-Y. `FieldOfView.get_images(..)` will take a set of selections for each of r/ch/z (and these selections can be unspecified, which means "all"), and return a set of `ImageStack` objects. Each of those will be aligned on X-Y.
Imagine that you have R={0, 1, 2, 3, 5, 6}. {0, 2} are aligned. {1} is by itself. {3, 5, 6} are aligned.
`FieldOfView.get_images(r={0, 2, 5})` will yield two ImageStacks. The first will consist of R={0, 2}; the second will consist of R={5}.
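As a toy sketch of that grouping logic (the `alignment_groups` table and `group_selection` helper are hypothetical illustrations, not starfish API):

```python
# Hypothetical alignment groups for the example above: rounds that share
# X-Y registration live in the same group.
alignment_groups = [{0, 2}, {1}, {3, 5, 6}]

def group_selection(requested: set):
    """Partition a round selection by alignment group; each part -> one ImageStack."""
    for group in alignment_groups:
        part = group & requested
        if part:
            yield part

print(list(group_selection({0, 2, 5})))  # [{0, 2}, {5}] -> two ImageStacks
```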
> How is a user with this kind of data going to run a pipeline on it? Heavy use of the Combiner and get_images() before a more "normal" pipeline?
`Combiner` is not going to be encouraged, as it has some poor memory characteristics. Eventually, when we build better lazy-loading semantics, it might work better.
We will encourage users to organize their data such that the primary_image is one Collection (and whether the auxiliary images are in that Collection, we're less opinionated about).
> Will the output(s) (e.g. of the Combiner) still know about / log the original round/channel/aux identity?
If the original r/ch/aux identities are unique, the Combiner should preserve them. For instance, if ImageStack 1 has R={0, 2, 3} and ImageStack 2 has R={4, 5}, combining them should yield R={0, 2, 3, 4, 5}.
However, if ImageStack 1 has R={0, 2, 3} and ImageStack 2 has R={2, 5}, the user will have to provide a mapping between input R and output R, which should be recorded in provenance logging.
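Concretely, such a mapping might look like the following sketch; `Combiner` is the proposal from this thread (not implemented), the `round_maps` argument is a hypothetical way to express the relabeling, and `stack_1`/`stack_2` are assumed to have been loaded as above:

```python
# ImageStack 1 has R={0, 2, 3}; ImageStack 2 has R={2, 5}, so R=2 collides.
# Hypothetical per-stack maps from input round to output round; the chosen
# relabeling would be recorded in the provenance log.
round_maps = [
    {0: 0, 2: 2, 3: 3},  # stack 1 keeps its round labels
    {2: 4, 5: 5},        # stack 2's R=2 is relabeled to the unused R=4
]

combiner = Filter.Combiner()
combined = combiner.run(stack_1, stack_2, round_maps=round_maps)
```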
This will address half of #1121: many of the published methods that leverage barcoding (MERFISH, STARmap, seqFISH) also include non-barcoded parts of the experiment. Although the current idea doesn't fix the composite-codebook problem (the other half of #1121), it should allow that kind of data to live happily in starfish.
I don't get what that's all about, but @dganguli is going to explain it to me tomorrow. :)
This system sounds good and looks like it should have the flexibility to deal with all sorts of interesting data arrangements.
Practically, will anything be changing for `TileFetcher`s and getting external data into SpaceTx format? It looks like the proposed system deals with weird data organization strictly after conversion to SpaceTx format (e.g. separation of images into AUX and PRIMARY tiles)?
`TileFetcher`s should not be affected by these changes.
From https://github.com/spacetx/starfish/issues/1322#issuecomment-491938207:

> We advised this as a stopgap measure for uncoded assays while we figure out how to support things better
As a side-note, use of `spacetx-writer` in these cases is based on what is (confusingly) called the `FileStitcher` API, which matches your `Combiner`. You can see https://docs.openmicroscopy.org/bio-formats/6.0.1/developers/file-reader.html#file-reading-extras for more info. We've never been able to get away from it. Looking forward to what you come up with!
We're removing the ability to load a ragged array (i.e., an array where the cardinality of round/channel/zplane is not uniform) into a single ImageStack. Instead, the prescribed method will be to load the data by round, and process them that way. See the block of text that has been struck out in https://github.com/spacetx/starfish/issues/1322#issuecomment-491466070
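For example, a per-round workflow under the selection API proposed in this thread might look like the following sketch; the 3-round count is assumed, and the clip filter is just a stand-in for whatever per-round processing is needed:

```python
from starfish import Experiment, FieldOfView
from starfish.image import Filter
from starfish.types import Axes

exp = Experiment.from_json('experiment.json')
fov = exp['fov_001']

# Process the ragged acquisition one round at a time instead of forcing
# everything into a single ImageStack.
clip = Filter.Clip(p_min=50, p_max=100)
processed = [
    clip.run(fov.get_images(FieldOfView.PRIMARY_IMAGES, {Axes.ROUND: {r}}))
    for r in range(3)
]
```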
@ttung Hmm... I stumbled upon a wrinkle while trying to figure out how to coerce starfish into converting data into SpaceTx format on a per-round basis. It relates to this issue and dredges #1121 back from the grave to a certain extent as well.
The problem: starfish does not appear to be able to convert standalone rounds (that know their own round index) without resorting to significant trickery and losing round information that (probably) requires external assistance to restore.
An example: say we are running a small 3-round experiment and would like starfish to start converting each round as it comes off the microscope.
Since we are advised to treat each round as a separate experiment:
For the round 0 'experiment', `write_experiment_json` requests the (round 0, channel 0) tile and is able to retrieve it. It is able to successfully fetch tiles for all other channels in the round as well.
For the round 1 'experiment', `write_experiment_json` requests the (round 0, channel 0) tile and breaks, because there are no round 0 tiles to be found.
The round 2 'experiment' is going to have the same problem as round 1.
Even if one implements the ugly hack of making each round think it is the first round, you are left with a lot of metadata containing 'wrong' information that requires external modification before a full working experiment can be pieced together again.
A possible solution: primary and auxiliary image dimensions could be `Iterable`s (or some similar data type) instead of `int`s. Using the per-round analysis example, the 3rd round, when calling `write_experiment_json()`, should report its 'dimensions' to be:

```python
primary_image_dimensions[Axes.ROUND] = [2]
primary_image_dimensions[Axes.CH] = [0, 1, 2, 3]
primary_image_dimensions[Axes.ZPLANE] = [0, 1, 2, ....]
```
So maybe `TileFetcher`s, or at least `write_experiment_json()`, should be affected by this rework?
Yes, I agree that `write_experiment_json()` should take iterables instead of the cardinality of each dimension. You can see that `build_image()` has already been refactored to do this, and it's just a matter of finishing this work. :)
@ttung Sorry to keep adding more to your plate regarding this issue. Adding some formal examples to supplement the Slack convo we had yesterday:
Currently:

```python
def build_image(
    fovs: Sequence[int],
    rounds: Sequence[int],
    chs: Sequence[int],
    zplanes: Sequence[int],
    image_fetcher: TileFetcher,
    default_shape: Optional[Mapping[Axes, int]] = None,
    axes_order: Sequence[Axes] = DEFAULT_DIMENSION_ORDER,
) -> Collection:
    ...
```
Some examples of things starfish conversion can't handle:
Example 1:

```
Round 1:
- pri_probe_000 (channel 0)
- pri_probe_001 (channel 1)
- pri_probe_002 (channel 2)
- DAPI (channel 3)

Round 3:
- pri_probe_003 (channel 0)
- pri_probe_004 (channel 1)
- pri_probe_005 (channel 2)
- pri_probe_006 (channel 3)
```
So being able to specify that `rounds` is `[1, 3]` is awesome! But `chs` in `build_image()` needs to be a function of round, as `chs` cannot be both `[0, 1, 2]` and `[0, 1, 2, 3]`.
Example 2:

```
Round 0:
- pri_probe_000 (channel 0)
- pri_probe_001 (channel 1)
- pri_probe_002 (channel 2)
- "no probe" (channel 3)

Round 1:
- pri_probe_003 (channel 0)
- pri_probe_004 (channel 1)
- "no probe" (channel 2)
- pri_probe_005 (channel 3)
```
Same problem here... `chs` are `[0, 1, 2]` in round 0 but `[0, 1, 3]` in round 1.
I think one could go more general than this and have things like fov or z sequences be a function of each other, or of round/channel, but I think those cases will be very rare. The above situations, I think, could be fairly common. But this is probably a question to ask other collaborators...
We just need to accept a generator that produces (fov, round, ch, z).
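A sketch of what such a generator could look like for Example 1, where the channel set is a function of the round; `chs_by_round` and `tiles_present` are hypothetical names, not starfish API:

```python
from typing import Iterator, Sequence, Tuple

# Hypothetical per-round channel sets from Example 1: round 1 reserves
# channel 3 for DAPI, round 3 uses all four channels for primary probes.
chs_by_round = {1: [0, 1, 2], 3: [0, 1, 2, 3]}

def tiles_present(
    fovs: Sequence[int], zplanes: Sequence[int]
) -> Iterator[Tuple[int, int, int, int]]:
    """Yield every (fov, round, ch, zplane) tuple present in the acquisition."""
    for fov in fovs:
        for r, chs in chs_by_round.items():
            for ch in chs:
                for z in zplanes:
                    yield fov, r, ch, z

# e.g. list(tiles_present(fovs=[0], zplanes=[0]))
```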
Since @ttung and @shanaxel42 implemented support for ragged and composite codebooks, I think we can close this. Feel free to open a new issue if there are other ideas discussed here that we should track.
**Objective**

As a SpaceTx data analyst with an experiment that does not have a matching number of rounds and channels, I want starfish to support this data model, so that I can create a starfish benchmarking pipeline.

**Acceptance Criteria**

**Notes**

Encapsulates work that needs to be done to support data coming in from Ola, Simone, Brian Long, and presumably RNAscope users.