pachterlab / SpatialFeatureExperiment

Extension of SpatialExperiment with sf
https://pachterlab.github.io/SpatialFeatureExperiment
Artistic License 2.0

Add readXenium #10

Closed by lambdamoses 9 months ago

lambdamoses commented 1 year ago

Should be super easy. The harder part is to add Xenium toy dataset for unit testing.

alikhuseynov commented 1 year ago

Once PR #12 for readVizgen() is finalized, I can make readXenium() in a similar way, this time a base R version. Unless you are already working on it? Cheers

lambdamoses commented 1 year ago

I haven't started working on it yet; still working on the paper. If you would like to work on readXenium(), that would be great.

alikhuseynov commented 1 year ago

Ok, great! I can get my hands on that; we have a few Xenium datasets to test that function on as well.

lambdamoses commented 1 year ago

I'm playing around with some Xenium data from the 10X website. Looks like the z coordinates of the transcript spots are continuous so I can't simply select a z-plane. Now I'm debating what to do with the z coordinates.

BTW, have you started working on readCosMX? If not then I'll start writing it. Some wet lab colleagues said that Xenium and CosMX data quality and their curated gene panels are pretty bad. I wonder what you think of the data quality.

alikhuseynov commented 1 year ago

Yeah, the Xenium output is a bit different. I can try to make it work as for readVizgen (using your last commits). I'm using the public Xenium_V1_FFPE_Human_Brain_Healthy_With_Addon outs from 10X as test data. For Seurat, someone from 10X implemented ReadXenium & LoadXenium. For transcripts, only the 2D xy coords are read in (without z-plane selection). The only filtering they do is on the phred-scaled qv column, default qv >= 20; this you already have in formatTxSpots. Support for the Xenium format would need to be added by skipping the z argument selection, i.e. if the z coords are non-integers. Are you still working on addTxSpots?
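
The qv filter described above could be sketched like this (a sketch only; the file name `transcripts.csv.gz` and the coordinate column names are my assumptions about the Xenium transcript table, not confirmed in this thread):

```r
# Hedged sketch of the default qv >= 20 filter; file and column names
# are assumptions for illustration.
tx <- read.csv("transcripts.csv.gz")
tx <- tx[tx$qv >= 20, ]  # phred-scaled quality cutoff, as in Seurat's reader
# z coords are continuous, so skip z-plane selection when they are non-integer
if (!all(tx$z_location == floor(tx$z_location)))
    message("non-integer z coordinates; skipping z-plane selection")
```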

Cell segmentations don't have a z column, only cell_id and xy coords. For the tissue image, the DAPI morphology_mip.ome.tif can be used (it's a maximum intensity projection of the 3D z-stack).

Also, I think for the count matrices we would need to use altExp for each "Feature type" (Gene Expression, Negative Control Codeword, Negative Control Probe, Unassigned Codeword, Deprecated Codeword), since each of them has a different number of genes/probes. Now the question is: what format should the altExp be? An SFE object with "centroids" in colGeometry and optionally transcripts in rowGeometry for each altExp? This is also relevant for the SFE <-> Seurat converter (which I will try to finalize this week).
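
If altExp is used, the split could look roughly like this (a sketch; the rowData column name `Type` for the feature-type annotation is my assumption):

```r
library(SingleCellExperiment)
# move every non-"Gene Expression" feature type into its own altExp;
# "Type" is an assumed rowData column holding the 10X "Feature type"
sce <- splitAltExps(sce, rowData(sce)$Type, ref = "Gene Expression")
altExpNames(sce)  # the negative control and codeword categories
```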

I haven't touched readCosMX, go ahead 👍 As for data quality, we recently started providing a Xenium service. As far as I know, a number of research groups here already use and prefer that technology; maybe quality depends on which tissue is used as well. And Xenium will have a 5K gene panel soon. We don't use Nanostring tech at all.

lambdamoses commented 1 year ago

So far I'm not dropping z (df2sf supports 3D geometries when 3 columns are specified for the coordinates), but the user can easily do so with st_zm. I don't have to match what Seurat is doing. I feel like the z coordinates could be interesting to plot at some point. I'm still working on addTxSpots, which is just to add the output of formatTxSpots to the SFE object. I think feature type can simply be a column in rowData, so anything that is not gene expression can be used for QC and later dropped after QC. I'm not sure about altExp.
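
Dropping z with st_zm is indeed a one-liner; a minimal self-contained illustration:

```r
library(sf)
# a 3D point, like what df2sf builds when x, y, z columns are specified
p <- st_sfc(st_point(c(100, 200, 3.7)))
st_zm(p)  # drops the z dimension, leaving POINT (100 200)
```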

As for data quality, we recently started providing Xenium service.

So are you in a core facility? Also I heard about that 10X vs. Nanostring lawsuit that banned sales of CosMX in Germany.

alikhuseynov commented 1 year ago

Yeah, I get it. Support for plotting 3D coords would be interesting; I think using interactive plots like plotly would be nice. I have worked before with surfaces using rgl & mesh3d objects of volumetric (CT) data. For Visium data, I was using one gene's expression at a time as a surface topography (also using a Gaussian smoothing kernel), where the z elevation was the expression magnitude. Some time ago I tried a 3D UMAP with ~150 single cells using plotly; it was pretty nice, but demanding on computer graphics. So probably something like that could be done.

Ok, once you are ready with addTxSpots I will try readVizgen from your last commit. rowData sounds good; I will try to use it for all the non-gene-expression stuff. Thanks!

As for data quality, we recently started providing Xenium service.

So are you in a core facility? Also I heard about that 10X vs. Nanostring lawsuit that banned sales of CosMX in Germany.

yes, we are part of core facility for spatial and single cell techs.

lambdamoses commented 1 year ago

Can you write unit tests and examples for your readXenium and Seurat conversion pull requests? I have too many other things to write for Bioc 3.18, so it would be great if you write these unit tests. That would involve creating small toy datasets for testing and for the examples in the function documentation.

I think I'll write new unit tests for readVizgen since I wrote a lot of code for formatting the transcript spots.

alikhuseynov commented 1 year ago

Can you write unit tests and examples for your readXenium and Seurat conversion pull requests? I have too many other things to write for Bioc 3.18, so it would be great if you write these unit tests. That would involve creating small toy datasets for testing and for the examples in the function documentation.

I think I'll write new unit tests for readVizgen since I wrote a lot of code for formatting the transcript spots.

Sounds good. I don't have much experience with unit tests, but I could try to do that. I always test functionality on my side (on a few different datasets) before submitting PRs.

lambdamoses commented 1 year ago

Here's a guide to get started: https://r-pkgs.org/testing-basics.html I think writing unit tests is worth the trouble: it's easier to run them after writing new code, and to check whether existing code is broken by the new code or a bug fix, than to run less formal tests.
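
For example, a minimal testthat file in the spirit of that guide might look like this (a sketch; the toy-dataset path is hypothetical):

```r
# tests/testthat/test-readXenium.R (sketch; the toy dataset path is made up)
library(testthat)
test_that("readXenium returns an SFE with cell geometries", {
    sfe <- readXenium("extdata/xenium_toy")
    expect_s4_class(sfe, "SpatialFeatureExperiment")
    expect_true("cell" %in% colGeometryNames(sfe))
})
```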

alikhuseynov commented 1 year ago

As for images from the Xenium output, the problem is the ome.tif format, e.g. morphology_mip.ome.tif. terra::rast cannot load it: "Cannot open TIFF file due to missing codec JP2000. (GDAL error 1)". I'm thinking of having an internal function to load the ome.tif and convert it to SpatRaster. SpatialOmicsOverlay might help here. Any suggestions?

lambdamoses commented 1 year ago

Looks like it's related to the compression scheme. I got an error in tiff::readTIFF: "Compression scheme 34712 tile decoding is not implemented". However, I can open the ome.tiff file with Bio-Formats in FIJI and with the Python tifffile package. I'm not sure what to do. The whole point of terra is not to load the entire thing into memory, while FIJI and tifffile do load the image into memory.

lambdamoses commented 1 year ago

I tried RBioFormats, which is used behind the scenes in SpatialOmicsOverlay. It works. In case of limited RAM, one can read a lower resolution, which is sufficient for plotting in most cases. One can also use the subset argument of the read.image function to only read in a small subset when plotting only a few cells. Then I need to modify the SFE package so the SFE class won't always use terra.

alikhuseynov commented 1 year ago

Seems that the ome.tif compression scheme is not yet implemented in the R TIFF readers.

The whole point of terra is not to load the entire thing into memory, while FIJI and tifffile do load the image into memory.

Yeah, exactly that. What if we implement an internal converter from ome.tif to .tif using RBioFormats or similar, and export the converted .tif (low-resolution images) to the Xenium output directory, then load the object and add images with terra as usual? Otherwise, use RBioFormats only when ome.tif files are present.

alikhuseynov commented 1 year ago

readXenium is kind of ready (without adding images); I tested it locally on some datasets and it works well. I can add it to the seurat_v4 branch. The remaining issue is adding images: I haven't figured out how to efficiently load the ome.tif file. It doesn't work with SpatialOmicsOverlay, since that package is tailored to Nanostring data. It works with RBioFormats alone, but terra::rast won't load it after converting to .tif; magick can load it, but that would be yet another dependency.

```r
# e.g.
library(SpatialOmicsOverlay)
library(RBioFormats)
library(magick)  # for image_read
library(terra)   # for rast
img_fn <- "./morphology_focus.ome.tif"
xml <- xmlExtraction(ometiff = img_fn)      # works!
scanMeta <- parseScanMetadata(omexml = xml) # doesn't work (Nanostring-specific)
# this works
img <- read.image(file = img_fn,
                  resolution = 6,
                  #read.metadata = FALSE,
                  #normalize = FALSE
                  )
# then export the image as plain .tif
write.image(img, file = gsub("ome.", "", img_fn, fixed = TRUE), force = TRUE)
# reading with R `magick` works
img_test <- image_read(path = gsub("ome.", "", img_fn, fixed = TRUE))
# then we need to convert to SpatRaster
img_test <- img_test |>
    as.raster() |> as.matrix() |> rast()
plot(img_test) # works
```

Do you have an example of reading/converting a Xenium output ome.tif file like morphology_focus.ome.tif using RBioFormats and loading it with terra? Any suggestions on how we should deal with loading Xenium images? Thanks

alikhuseynov commented 1 year ago

Making a toy dataset for Xenium would be a bit hard, e.g. because of the zarr.zip files etc. I think one can just keep the files that are used in readXenium and ignore the rest?
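
Building the toy dataset could then be as simple as copying only the files readXenium reads and truncating the large tables, e.g. (a sketch; the file name is an assumption):

```r
# sketch: truncate the transcript table for a small toy dataset
tx <- read.csv("transcripts.csv.gz")
write.csv(head(tx, 1000), gzfile("toy/transcripts.csv.gz"),
          row.names = FALSE)
```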

alikhuseynov commented 1 year ago

I'm going to add an internal function that loads the ome.tif with RBioFormats and saves it as .tif at a resolution selected in read.image. It seems to load with terra::rast afterwards.

alikhuseynov commented 1 year ago

It's accessible now -> SpatialFeatureExperiment:::readXenium. The only issue is that when plotting cell polygons and the image with Voyager, the image doesn't align 100% with colGeometry. And I couldn't find which json file to use for the bbox, as for readVizgen. For some datasets it overlays well, for others not really; see below.

```r
# plot segmentations
options(repr.plot.height = 5, repr.plot.width = 10)
pl1 <- plotSpatialFeature(sfe, features = rownames(sfe)[1],
                          size = 4,
                          colGeometryName = "cell",
                          dark = TRUE,
                          image_id = imgData(sfe)$image_id[1]) &
    Seurat::DarkTheme()
pl1
```

xenium_V1_FFPE_Human_Brain_Healthy_With_Addon -> some shift [screenshot]

/xenium_prerelease_sept15_hBreast/hBreast_ffpe_small/ -> overlays well [screenshot]

bbox_use <- c(xmin = 2000, ymin = -2500, xmax = 2500, ymax = -2000) [screenshot]

alikhuseynov commented 1 year ago

I tested it on an earlier dataset, Xenium_FFPE_Human_Breast_Cancer_Rep1 from the 10X Xenium paper. It works; the image overlaps well with the cell and nuclei polygons.

[screenshot]

bbox_use <- c(xmin = 6400, ymin = -3500, xmax = 7000, ymax = -2500) [screenshot]

I think the mismatch between image and cell/nuclei polygons that occurs for the latest datasets has to do with some shifts that displace the segmentation masks from the image they came from. I haven't figured out when it occurs. So using extent <- colGeometry(sfe, 1) |> st_geometry() |> st_bbox() should be fine, since the images overlap with the cell/nuclei polygons, at least for these 2 public datasets.

alikhuseynov commented 12 months ago

Seems for some new datasets (standard XeniumRanger outs) the .ome.tif images are not well aligned with the segmentation polygons. Xenium Explorer can be used to do the image registration (affine) and the transformation matrix can be saved (all manual, though). I think image registration needs to be applied before adding to imgData; already existing packages could be used, which would mean adding yet another dependency.

lambdamoses commented 12 months ago

I think a manual step is acceptable; for Visium, there's also the option to manually align the spots and the image and to manually find which spots are in tissue in the Loupe browser. Or maybe use ImageJ for the registration. You don't have to stick to terra for the image here; it's very hacky to rotate the image with terra, plus terra has trouble reading the .ome.tif file anyway. If still using terra, then I'm thinking about using an engineering CRS and define a math transform in WKT, or do something with GDAL; in geography rotation is more commonly performed in CRS transformations and those CRS's are already pre-defined. Unfortunately R isn't great for image processing. The only R package on CRAN or Bioc that does affine registration I know of is RNiftyReg. It's fine to add to Suggests.

lambdamoses commented 12 months ago

Making a toy dataset for Xenium would be a bit hard, e.g. because of the zarr.zip files etc. I think one can just keep the files that are used in readXenium and ignore the rest?

Sounds good. The purpose of the toy dataset is just to test readXenium so we don't need the other stuff.

alikhuseynov commented 12 months ago

Making a toy dataset for Xenium would be a bit hard, e.g. because of the zarr.zip files etc. I think one can just keep the files that are used in readXenium and ignore the rest?

Sounds good. The purpose of the toy dataset is just to test readXenium so we don't need the other stuff.

yeah, only those relevant to readXenium👍

alikhuseynov commented 12 months ago

I think a manual step is acceptable; for Visium, there's also the option to manually align the spots and the image and to manually find which spots are in tissue in the Loupe browser. Or maybe use ImageJ for the registration. You don't have to stick to terra for the image here; it's very hacky to rotate the image with terra, plus terra has trouble reading the .ome.tif file anyway. If still using terra, then I'm thinking about using an engineering CRS and define a math transform in WKT, or do something with GDAL; in geography rotation is more commonly performed in CRS transformations and those CRS's are already pre-defined. Unfortunately R isn't great for image processing. The only R package on CRAN or Bioc that does affine registration I know of is RNiftyReg. It's fine to add to Suggests.

Yeah, I agree; I think keeping terra is ok for now. I might have found a way around it (using affine transforms in Rvcg and/or Morpho) to register the image to colGeometry and then use that transform or ext for the SpatRaster; I need to test it and will let you know if it works. RNiftyReg is an option. Fortunately, the colGeometry and rowGeometry always overlap. I think the image registration in the Xenium pipeline doesn't always work well, which creates those shifts for some samples.

lambdamoses commented 11 months ago

About the image alignment issue: Xenium Explorer must have the info to align the cell segmentation and the image. I think I found where that info resides: in the pixel_size field of the experiment.xenium file, which is basically JSON. That's the pixel size (in µm) of the morphology.ome.tif image file (the highest resolution), which has the same xy dimensions as the other two morphology image files. The bounding box of the cell segmentation doesn't match the dimensions of the image because much of the image doesn't have cells. So unless you want to load other images, registration isn't necessary.
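
A sketch of pulling that field out (experiment.xenium is JSON, so jsonlite can read it; field names other than pixel_size are not shown here):

```r
library(jsonlite)
meta <- fromJSON("experiment.xenium")
px_size <- meta$pixel_size  # µm per pixel in the full-res morphology image
# segmentation coordinates (µm) / px_size gives pixel coordinates in
# morphology.ome.tif, which is how image and geometries can be aligned
```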

lambdamoses commented 11 months ago

I played around with RBioFormats; we can read a subset of the image. I think I'll write another S4 class for Bio-Formats (I'll call it BioFormatsImage), inheriting from VirtualSpatialImage in SpatialExperiment, with an optional added slot for the extent. Then only the subset within the extent, at the desired resolution, is loaded into memory when needed, such as when plotting, especially at high resolution, so we don't load the entire high-resolution image into memory. Otherwise only the file path and the extent are in memory. This would require more changes in SFE and Voyager to accept images other than SpatRaster, but I think it's worth it so SFE and Voyager can work more smoothly with ome.tiff and Bio-Formats in general.
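
A rough sketch of what such a class could look like (slot names are my guesses, not the final design):

```r
library(SpatialExperiment)  # provides VirtualSpatialImage
setClass("BioFormatsImage",
         contains = "VirtualSpatialImage",
         slots = c(path = "character",  # ome.tiff file on disk
                   ext = "numeric"))    # optional extent; pixels stay on disk
# pixels would be read lazily when needed, e.g. with
# RBioFormats::read.image(path, resolution = ..., subset = ...)
```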

alikhuseynov commented 11 months ago

Regarding the image and segmentation mismatch, after some back-and-forth with colleagues here, we found out the following: in both micron and pixel space, segmentation and transcript coordinates match the image well at full resolution. The OME-TIFF comes with up to 8 image resolutions. When read with RBioFormats::read.image, the OME image metadata needs to be read separately. Things match well (no shifts) for low-resolution images when some cell/nuclei segmentations are present at the image borders (towards the 0,0 corner).

The bounding box of the cell segmentation doesn't match the dimensions of the image because much of the image doesn't have cells

Exactly that 👍

So, things get shifted for low-resolution images only when no segmentations are present at the borders. Here the transform

```r
c(0, full_res_img_width * px_size_micron,
  0, full_res_img_height * px_size_micron)
```

is needed to set the extent of the low-resolution image to match the polygon geometries in micron space. The relevant information for that transform needs to be extracted from the metadata using read.omexml.

```r
# e.g.
meta_ome <-
    read.omexml(grep("_mip.ome.tif", img_fn, value = TRUE)) |>
    strsplit(x = _, split = " ") |> unlist() |>
    grep("PhysicalSize|Size[XY]", x = _, value = TRUE)
# here we get px_size_micron (PhysicalSizeX/Y), full_res_img_width (SizeX)
# and full_res_img_height (SizeY)
meta_ome
#> 'PhysicalSizeX="0.2125"' 'PhysicalSizeXUnit="µm"' 'PhysicalSizeY="0.2125"'
#> 'PhysicalSizeYUnit="µm"' 'SizeX="36955"' 'SizeY="27282"'
```

I can implement that now within readXenium. Also, after the image is read with img <- read.image(...), it can be directly converted to a SpatRaster with img@.Data |> t() |> rast(), the corrected ext added, and then (optionally) saved as .tif with terra::writeRaster. That would be a solution for now; creating the new S4 class BioFormatsImage is a good idea to make things more efficient for ome.tif files. As for the resolution in read.image, for a larger/full dataset I could only load resolution = 4 or lower; it won't load higher ones. The morphology.ome.tif is a 3D stacked image; morphology_mip.ome.tif is the MIP of that 3D image and comes at the corresponding full resolution as well, so we don't need to load the large 3D image unless really needed.
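
Putting the pieces together, the extent fix might look like this (a sketch; it assumes px_size_micron, SizeX and SizeY were parsed from the OME-XML as above, and img was read with read.image):

```r
library(terra)
r <- img@.Data |> t() |> rast()          # low-res image -> SpatRaster
ext(r) <- c(0, SizeX * px_size_micron,   # xmin, xmax (µm)
            0, SizeY * px_size_micron)   # ymin, ymax (µm)
# optionally persist for fast reloading with terra later:
# writeRaster(r, "morphology_lowres.tif")
```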

lambdamoses commented 11 months ago

For the highest resolution, this works:

```r
library(RBioFormats)
img <- read.image("morphology_mip.ome.tif", normalize = FALSE, resolution = 1,
                  subset = list(x = 12000:13000, y = 12000:13000))
```

I think the high resolution is important when plotting a small number of cells, very zoomed in. I used the human skin dataset from the 10X website. The original file is over 300 MB.

alikhuseynov commented 11 months ago

Potentially, adding support for pixel space too, i.e. when unit = "pixel", would be useful, I think. I could work on that later.

lambdamoses commented 9 months ago

Pretty much done in https://github.com/pachterlab/SpatialFeatureExperiment/commit/5d1282ce1024760720bfc606ab936f752bf3b28d, though it needs more extensive testing.