pachterlab / voyager

From geospatial to spatial -omics
https://pachterlab.github.io/voyager/
Artistic License 2.0
79 stars 9 forks source link

read/load image-based spatial transcriptomics #2

Closed alikhuseynov closed 1 year ago

alikhuseynov commented 1 year ago

Hi there,

Thanks for implementing this nice package, bridging geospatial data analysis with spatial omics analysis! :) Few questions, however these things are probably on your todo list already:

As for Vizgen/Merscope, recent version provides only Cellpose output, eg :

cellpose_mosaic_space.parquet # mols coords in px
cellpose_micron_space.parquet # mols coords in µm
cellpose_cell_by_gene.csv
cellpose_cell_metadata.csv

Something that I added using sfarrow for Seurat obj. for data loading for segs

Thanks A.

lambdamoses commented 1 year ago

Thanks for checking out this package. Since I already have the script to read Vizgen data, I do have a plan to turn that into a function, hopefully for the next Bioconductor release, version 3.17. Or would you like to pull request? I agree that it would be nice to have a function to convert from Seurat object. Since we have a lot to do for version 3.17, the Seurat part might come in 3.18. Those functions will go into the SpatialFeatureExperiment package instead of Voyager since they have more to do with the SFE object than the ESDA.

alikhuseynov commented 1 year ago

ok thanks, I will see what I can do this month for PR regarding read/load Vizgen data for SpatialFeatureExperiment repo. A wrapper to convert from Seurat object to SFE would be super handy for users (will work on that too). best

lambdamoses commented 1 year ago

Thank you very much for the PR! Since that's for the SFE repo, I'll close this issue.

lambdamoses commented 1 year ago

Are you still going to do the PR in the SFE repo? The next Bioconductor release is coming up. If not, then I'll add the Vizgen function myself.

alikhuseynov commented 1 year ago

sorry, I won't manage to do this on time. If you like, take a look at my modified version of loading Vizgen data using Seurat https://github.com/alikhuseynov/add-on_R/blob/develop/R/read_load_vizgen_v2.R Thanks!

lambdamoses commented 1 year ago

None of the example datasets on Vizgen's website has the cellpose parquet files. Can you share an example of the parquet file? This is for testing and examples.

alikhuseynov commented 1 year ago

yes, but it's large ~600MB, do you have a place where I can drop it?

lambdamoses commented 1 year ago

Can you upload it to this Box folder? https://caltech.app.box.com/f/3cc55ae6f7ad472abdd27acd660d8b90 I just need to take a look at the format so I can create a small toy example from a different dataset. Thanks!

alikhuseynov commented 1 year ago

ok, done! I found a smaller one ~37MB.. To load and view the whole file using sfarrow::st_read_parquet() I'm preparing to submit a PR to Seurat for handling parquet file best

lambdamoses commented 1 year ago

Thank you!

lambdamoses commented 1 year ago

My own function is getting very different from the one in Seurat. I'm looking at your code, and found this when one cell has multiple polygons:

https://github.com/alikhuseynov/add-on_R/blob/4569a7f51a5669f181179940c7f02207d9c2221e/R/read_load_vizgen_v2.R#L267-L281

                       if (which(unlist(test.segs) > 1) %>% any) {
                         segs.art.index <- which(unlist(test.segs) > 1)
                         message(segs.art.index %>% length, 
                                 " Cells have > 1 segmentaion boundary artifacts", "\n",
                                 "..removing artifacts", "\n",
                                 "..keeping cell boundaries with maximum coords")
                         # usually artifacts have small boundaries/coords
                         # find cell boundaries with maximum coords
                         for (i in segs.art.index %>% seq) { 
                           dims <- lapply(segs[[segs.art.index[i]]][[1]] %>% seq(), 
                                          function(d) { dim(segs[[segs.art.index[i]]][[1]][[d]])[1] } )
                           # get & keep boundaries with maximum coords
                           maxs.segs <- which(unlist(dims) == max(unlist(dims)))  
                           segs[[segs.art.index[i]]][[1]] <- segs[[segs.art.index[i]]][[1]][maxs.segs]
                         }

I'm not sure if I entirely agree with this approach, because upon inspection, in some cases Cellpose somehow assigned two cells to one ID but both pieces are small so can both be artifact, like this:

Screenshot 2023-04-14 at 2 12 09 AM

The smaller piece doesn't quite look like artifact. But in some cases, the small bit does look like artifact:

Screenshot 2023-04-14 at 2 12 18 AM

I think I can add another argument for minimum cell size and issue a warning if some cells have multiple pieces that are all larger than the minimum size. I also check if the cell polygons are valid geometries and issue a warning if anything is invalid.

alikhuseynov commented 1 year ago

yeah :) I know that problem.. Sometimes Cellpose output has multiple polygons for same cell (it's not always the case, depends on the sample and how well the segmentation worked). In this case I thought to just get the largest polygon as single cell boundary. I think Seurat object won't allow multiple polygons for same cell ID. Your suggestion of a new argument for minimum cell size would be a good idea, plus warnings etc.. Thanks!

btw, What's is different in the output from your own function as compared to mine/Seurat?

lambdamoses commented 1 year ago

Mine returns a SpatialFeatureExperiment object, and I read the image with the terra package so the image is not loaded into memory unless necessary, and when plotting a larger image, it is downsampled to make plotting faster. Haven't pushed it yet.

MULTIPOLYGONs are technically allowed but it doesn't make biological sense for whole cell segmentation. Technically allowed because there can be other cell level geometries that have multiple pieces, like endoplasmic reticulum segmentation.

Also I'm not reading in the transcript spots yet. For MERFISH datasets the csv file can be over 30 GB in size and I don't know how to deal with the size and I'm not sure how useful it is to load it. I can do spatial point process analyses on the transcript spots, but I don't know what to do with the large number of cells both computationally and statistically. I need to look into sedona for on disk representation of Simple Features before dealing with the transcript spots.

alikhuseynov commented 1 year ago

you mean detected_transcripts.csv from Vizgen? In our case single file was like 700MB.

We are trying out Segmentation-free approach (don't have a push yet), where only molecule coordinates are used (no image or segmentations, only x, y, gene coordinates and molecule name) The idea is to use some form of quantization, ie Spatial binning (borrowed from geospatial stats) on molecule coordinates, given a desired bin_size.

This makes it:

See the attached example seg_free_mols

lambdamoses commented 1 year ago

OK, so there are cases where that file isn't too large and more manageable in memory. I was a bit intimidated by the example datasets on Vizgen's website. But still, I think I should consider sedona because it's only helpful to plot the transcript spots when zoomed into small areas so it would be nice to not load the whole thing into memory, because the analyses are typically performed at the cell or in your case bin level where the location of transcript spots within cells is irrelevant. Besides spatial point process analysis, plotting is the only case I can think of where the transcript spot geometry is relevant. I don't think I'm able to do the sedona thing for Bioconductor 3.17. Maybe I can do it for 3.18 or 3.19.

alikhuseynov commented 1 year ago

Ok, thanks. I agree with you, many things are mostly for visualization. In 3.17, will you have a function to convert SpatialFeatureExperiment object to Seurat and back?

lambdamoses commented 1 year ago

Unfortunately, I don't think I have enough time to write the Seurat conversion function for 3.17. I still have to write new vignettes for the new functionalities before April 26.

lambdamoses commented 1 year ago

Here's my function: https://github.com/pachterlab/SpatialFeatureExperiment/blob/e0e9618492bf3de5407caa0d6b6933954f5850d6/R/read.R#L279