Closed nzhang89 closed 2 months ago
@nzhang89 thanks for posting an issue about this - I heard Vizgen was planning to modify the output segmentation format, but I have not yet seen what these files look like. Any chance you could share some data with me in this format? My email is ahartman@nygenome.org. I can then go ahead and update the LoadVizgen function to support cell boundaries in parquet format.
In the meantime, the steps for creating a Segmentation
object, which stores the boundary information in a Seurat object, are relatively straightforward. You'll want to construct a data.frame with x, y, and cell columns. Each row represents a vertice in a segmentation for a given cell, so you'll have many rows per cell. You can then run segs <- CreateSegmentation(df)
to generate a Segmentation
object. Please let me know if you have any questions!
Thanks for addressing this - having exactly the same issue here.
Looking at Vizgen's latest user guide: https://vizgen.com/resources/merscope-instrument/?_gl=1labipg_upMQ.._gaMTA1NzM5MzYyNi4xNjgwMjQ5NjM2_ga_4DCLSHRRJC*MTY4MDI0OTYzNi4xLjAuMTY4MDI0OTYzNi4wLjAuMA..
it seems that with MERSCOPE instrument software v232 or later, they've moved to parquet files for the cell boundaries instead of the hdf5 format.
Hey @andrewjkwok, thank you for your patience and pointing me to the user guide. I've reached out to Vizgen for additional information, and I'll update you here once I have a solution
Vizgen now has Cellpose segmentation output stored as .parquet
. I had to modify ReadVizgen()
and LoadVizgen()
.
See this script
and example to make Seurat obj of Vizgen data
@AustinHartman probably I should submit PR at some point, unless you guys implementing similar stuff already?
Additionally, there were some issue using subset()
#6683 and #6767
in order to solve that as intermediate solution, I implemented this script
Hope this helps.
best
@alikhuseynov thank you so much! A PR to handle parquet files would be much appreciated
@AustinHartman sure, I like Seurat package and it's always nice to contribute. I will try to work on that.
PR submitted #7190
Thnkas @alikhuseynov for the update. I installed the branch that you updated, but unfortunately encounter an error sayign:
Error: file must be a "InputStream"
Do you have any idea what the issue might be? When I look at the traceback, this is what I get:
8: stop(assertError(attr(res, "msg")))
7: assert_that(inherits(object, class), msg = msg)
6: assert_is(file, "InputStream")
5: make_readable_file(file, mmap)
4: arrow::ParquetFileReader$create(dsn, ...)
3: sfarrow::st_read_parquet(file2read)
2: ReadVizgen(data.dir = data.dir, ...)
1: LoadVizgen(data.dir = "/mnt/z/Users/Andrew_K/Projects/spatial_transcriptome/Data/Test_MERFISH_run_16032023/202303171059_20230317_VMSC04301_firstcopy/region_0/")
How does the content look like of these directories?
./region_0/
directory.parquet
filesCould you post your LoadVizgen()
run/script?
thanks
My ./region_0/
directory:
202303171059_20230317_VMSC04301_firstcopy_region_0.vzg cell_by_gene.csv cell_metadata_truncated.csv images
cell_boundaries.parquet cell_metadata.csv detected_transcripts.csv summary_20230317.png
I'm not sure about the directory containing the Cellpose output - the .parquet file seems to be in the ./region_0/
directory?
My script currently involves nothing other than loading the libraries and running the command!
options(Seurat.object.assay.version = 'v5')
library(Seurat)
library(future)
library(ggplot2)
library(magrittr)
plan("multisession", workers = 10)
vizgen.obj <- LoadVizgen(data.dir = "/mnt/z/Users/Andrew_K/Projects/spatial_transcriptome/Data/Test_MERFISH_run_16032023/202303171059_20230317_VMSC04301_firstcopy/region_0/")
in our case the Vizgen directory content looks like this:
region_0/
├── xxx_region_0.vzg
├── Cellpose_polyT_CB3
├── detected_transcripts.csv
├── images
└── summary.png
region_0/Cellpose_polyT_CB3/
├── xxx_region_0_cellpose.vzg
├── cellpose_cell_by_gene.csv
├── cellpose_cell_metadata.csv
├── cellpose_micron_space.parquet
└── cellpose_mosaic_space.parquet
I would assume that your cell_boundaries.parquet
is the same as cellpose_micron_space.parquet
?
Since I don't have access to the test data, could you run the following and see if the content of your cell_boundaries.parquet
is more or less similar?
file2read <- "./region_0/cell_boundaries.parquet"
test_parq <- sfarrow::st_read_parquet(file2read)
# in my case the output looks like that:
test_parq %>% str
Classes ‘sf’ and 'data.frame': 559790 obs. of 10 variables:
$ ID : int 0 312 104 156 260 208 52 105 209 53 ...
$ EntityID :integer64 112833000000100001 112833000000100001 112833000000100001 112833000000100001 112833000000100001 112833000000100001 112833000000100001 112833000000100002 ...
$ ZIndex : int 3 6 4 1 0 5 2 4 5 2 ...
$ Type : chr "cell" "cell" "cell" "cell" ...
$ ZLevel : num 6 10.5 7.5 3 1.5 9 4.5 7.5 9 4.5 ...
$ Name : 'vctrs_unspecified' logi NA NA NA NA NA NA ...
$ ParentID : 'vctrs_unspecified' logi NA NA NA NA NA NA ...
$ ParentType : 'vctrs_unspecified' logi NA NA NA NA NA NA ...
$ X__index_level_0__: int 0 348 116 174 290 232 58 117 233 59 ...
$ Geometry :sfc_MULTIPOLYGON of length 559790; first list element: List of 1
..$ :List of 1
.. ..$ : num [1:12, 1:2] 6805 6807 6808 6809 6809 ...
..- attr(*, "class")= chr [1:3] "XY" "MULTIPOLYGON" "sfg"
- attr(*, "sf_column")= chr "Geometry"
- attr(*, "agr")= Factor w/ 3 levels "constant","aggregate",..: NA NA NA NA NA NA NA NA NA
..- attr(*, "names")= chr [1:9] "ID" "EntityID" "ZIndex" "Type" ...
# then filter for mid Z plane
test_parq %>% filter(ZIndex == 3) %>% pull(Geometry) %>% str
sfc_MULTIPOLYGON of length 79970; first list element: List of 1
$ :List of 1
..$ : num [1:12, 1:2] 6805 6807 6808 6809 6809 ...
- attr(*, "class")= chr [1:3] "XY" "MULTIPOLYGON" "sfg"
Huh this is strange. Sorry for the poor formatting earlier. This is the directory structure:
region_0/
├── 202303171059_KoLab20230317_VMSC04301_region_0.vzg
├── cell_boundaries.parquet
├── cell_by_gene.csv
├── cell_metadata.csv
├── cell_metadata_truncated.csv
├── detected_transcripts.csv
├── images
│ ├── manifest.json
│ ├── micron_to_mosaic_pixel_transform.csv
│ ├── mosaic_DAPI_z0.tif
│ ├── mosaic_DAPI_z1.tif
│ ├── mosaic_DAPI_z2.tif
│ ├── mosaic_DAPI_z3.tif
│ ├── mosaic_DAPI_z4.tif
│ ├── mosaic_DAPI_z5.tif
│ ├── mosaic_DAPI_z6.tif
│ ├── mosaic_PolyT_z0.tif
│ ├── mosaic_PolyT_z1.tif
│ ├── mosaic_PolyT_z2.tif
│ ├── mosaic_PolyT_z3.tif
│ ├── mosaic_PolyT_z4.tif
│ ├── mosaic_PolyT_z5.tif
│ └── mosaic_PolyT_z6.tif
└── summary_20230317.png
1 directory, 23 files
So it seems our directory structures are a bit different, as the parquet file is directly in my ./region_0
/ directory?
The parquet file seems to be more or less similar?:
> test_parq %>% str
Classes ‘sf’ and 'data.frame': 1018750 obs. of 10 variables:
$ ID : int 0 1 2 3 4 5 6 7 8 9 ...
$ EntityID :integer64 1771792400012100003 1771792400012100004 1771792400012100006 1771792400012100008 1771792400012100009 1771792400012100015 1771792400012100016 1771792400012100018 ...
$ ZIndex : int 0 0 0 0 0 0 0 0 0 0 ...
$ Type : chr "cell" "cell" "cell" "cell" ...
$ ZLevel : num 1.5 1.5 1.5 1.5 1.5 1.5 1.5 1.5 1.5 1.5 ...
$ Name : 'vctrs_unspecified' logi NA NA NA NA NA NA ...
$ ParentID : 'vctrs_unspecified' logi NA NA NA NA NA NA ...
$ ParentType : 'vctrs_unspecified' logi NA NA NA NA NA NA ...
$ X__index_level_0__: int 0 1 2 3 4 5 6 7 8 9 ...
$ Geometry :sfc_MULTIPOLYGON of length 1018750; first list element: List of 1
..$ :List of 1
.. ..$ : num [1:98, 1:2] 1501 1501 1502 1506 1507 ...
..- attr(*, "class")= chr [1:3] "XY" "MULTIPOLYGON" "sfg"
- attr(*, "sf_column")= chr "Geometry"
- attr(*, "agr")= Factor w/ 3 levels "constant","aggregate",..: NA NA NA NA NA NA NA NA NA
..- attr(*, "names")= chr [1:9] "ID" "EntityID" "ZIndex" "Type" ...
> test_parq %>% filter(ZIndex == 3) %>% pull(Geometry) %>% str
sfc_MULTIPOLYGON of length 152345; first list element: List of 2
$ :List of 1
..$ : num [1:5, 1:2] 1539 1539 1539 1540 1539 ...
$ :List of 1
..$ : num [1:89, 1:2] 1501 1502 1503 1506 1509 ...
- attr(*, "class")= chr [1:3] "XY" "MULTIPOLYGON" "sfg"
No worries, sound good. Vizgen did updates on Merscope output as far as we know.
Also, note that Vizgen is using only one z-plane, the middle plane of the tissue for Cellpose segmentation. So, the cell boundaries in all $ZIndex
are the same.
I will add support for reading any single .parquet
file in the given directory. For now, if you want, just rename the cell_boundaries.parquet
to cellpose_micron_space.parquet
.
Then run this chunk to load data using BiocParallel
instead of future
### make sure all libs are installed a priori!
## load libs ----
suppressPackageStartupMessages({
library(ggplot2)
library(Seurat)
library(dplyr)
library(magrittr)
library(BiocParallel)
})
# set params
mol.type <- "microns"
coord.space <- "micron"
z.stack <- 3L # z stack to use
dir_use <- "./region_0/"
# load object
obj.vz <-
LoadVizgen(data.dir = dir_use,
fov = "test.sample",
assay = "Vizgen",
use.cellpose.out = TRUE, # if TRUE and ./Cellpose dir exists
metadata = c("volume", "fov"), # add cell volume info
mol.type = mol.type, # molecule coords in µm space
coord.space = coord.space,
type = c("segmentations", "centroids", "boxes"), # type of cell spatial coord matrices
z = z.stack,
add.zIndex = TRUE, # add z slice section to the object
update.object = TRUE,
use.BiocParallel = TRUE, # using `BiocParallel` instead of `future`
workers.total = 10, # cores for `BiocParallel` processing
DTthreads.pct = NULL # percentage of total threads to use for `data.table::fread`
)
just add the support #7190, seems to work.
just re-install: remotes::install_github(repo = 'alikhuseynov/seurat', ref = 'feat/vizgen')
Thanks for such quick responses.
I've reinstalled as you've instructed. The issue coming up now is:
Error in is.empty(.) : could not find function "is.empty"
> traceback()
3: files2scan %>% grep(coord.space, ., value = TRUE) %>% {
if (is.empty(.)) {
files2scan %>% grep(".parquet$", ., value = TRUE)
}
else {
(.)
}
}
2: ReadVizgen(data.dir = data.dir, ...)
1: LoadVizgen(data.dir = dir_use, fov = "test.sample", assay = "Vizgen",
use.cellpose.out = TRUE, metadata = c("volume", "fov"), mol.type = mol.type,
coord.space = coord.space, type = c("segmentations", "centroids",
"boxes"), z = z.stack, add.zIndex = TRUE, update.object = TRUE,
use.BiocParallel = TRUE, workers.total = 10, DTthreads.pct = NULL)
The current directory structure is:
region_0/
├── 202303171059_KoLab20230317_VMSC04301_region_0.vzg
├── cell_by_gene.csv
├── cell_metadata.csv
├── cell_metadata_truncated.csv
├── cellpose_micron_space.parquet
├── detected_transcripts.csv
├── images
│ ├── manifest.json
│ ├── micron_to_mosaic_pixel_transform.csv
│ ├── mosaic_DAPI_z0.tif
│ ├── mosaic_DAPI_z1.tif
│ ├── mosaic_DAPI_z2.tif
│ ├── mosaic_DAPI_z3.tif
│ ├── mosaic_DAPI_z4.tif
│ ├── mosaic_DAPI_z5.tif
│ ├── mosaic_DAPI_z6.tif
│ ├── mosaic_PolyT_z0.tif
│ ├── mosaic_PolyT_z1.tif
│ ├── mosaic_PolyT_z2.tif
│ ├── mosaic_PolyT_z3.tif
│ ├── mosaic_PolyT_z4.tif
│ ├── mosaic_PolyT_z5.tif
│ └── mosaic_PolyT_z6.tif
└── summary_20230317.png
1 directory, 23 files
i.e. I've renamed my parquet file to what you've suggested?
Ok, given the last fix, you don't need to rename the .parquet
file, it can be cell_boundaries.parquet
or cellpose_micron_space.parquet
is.empty()
is a useful function from spatstat.geom
package, in my case (spatstat.geom_3.0-3) is loaded but not attached (see session info below) once I load Seurat package. Try to install spatstat.geom
first, then clean your session, load libraries and run the script. Then it should work.
below is my session info, just in case.
sessionInfo()
R version 4.2.1 (2022-06-23)
Platform: x86_64-conda-linux-gnu (64-bit)
Running under: CentOS Linux 7 (Core)
Matrix products: default
BLAS/LAPACK: /media/../yyy/lib/libopenblasp-r0.3.21.so
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] SeuratObject_4.1.3 Seurat_4.3.0.9002
loaded via a namespace (and not attached):
[1] nlme_3.1-159 spatstat.sparse_3.0-0 matrixStats_0.62.0
[4] RcppAnnoy_0.0.19 RColorBrewer_1.1-3 httr_1.4.4
[7] repr_1.1.4 sctransform_0.3.5 tools_4.2.1
[10] utf8_1.2.2 R6_2.5.1 irlba_2.3.5
[13] KernSmooth_2.23-20 uwot_0.1.14 lazyeval_0.2.2
[16] colorspace_2.0-3 sp_1.5-0 gridExtra_2.3
[19] tidyselect_1.2.0 compiler_4.2.1 progressr_0.11.0
[22] cli_3.4.1 spatstat.explore_3.0-5 plotly_4.10.0
[25] scales_1.2.1 spatstat.data_3.0-0 lmtest_0.9-40
[28] ggridges_0.5.4 pbapply_1.7-0 goftest_1.2-3
[31] stringr_1.4.1 pbdZMQ_0.3-7 digest_0.6.29
[34] spatstat.utils_3.0-1 base64enc_0.1-3 pkgconfig_2.0.3
[37] htmltools_0.5.3 parallelly_1.32.1 fastmap_1.1.0
[40] htmlwidgets_1.5.4 rlang_1.1.1 shiny_1.7.2
[43] generics_0.1.3 zoo_1.8-10 jsonlite_1.8.0
[46] spatstat.random_3.0-1 ica_1.0-3 dplyr_1.1.2
[49] magrittr_2.0.3 patchwork_1.1.2 Matrix_1.5-1
[52] Rcpp_1.0.9 IRkernel_1.3 munsell_0.5.0
[55] fansi_1.0.3 abind_1.4-5 reticulate_1.26
[58] lifecycle_1.0.3 stringi_1.7.8 MASS_7.3-57
[61] Rtsne_0.16 plyr_1.8.7 grid_4.2.1
[64] parallel_4.2.1 listenv_0.8.0 promises_1.2.0.1
[67] ggrepel_0.9.2 crayon_1.5.2 deldir_1.0-6
[70] miniUI_0.1.1.1 lattice_0.20-45 IRdisplay_1.1
[73] cowplot_1.1.1 splines_4.2.1 tensor_1.5
[76] pillar_1.9.0 igraph_1.3.5 uuid_1.1-0
[79] spatstat.geom_3.0-3 future.apply_1.9.0 reshape2_1.4.4
[82] codetools_0.2-18 leiden_0.4.2 glue_1.6.2
[85] evaluate_0.16 data.table_1.14.6 png_0.1-7
[88] vctrs_0.6.2 httpuv_1.6.5 polyclip_1.10-0
[91] gtable_0.3.1 RANN_2.6.1 purrr_0.3.4
[94] tidyr_1.2.0 scattermore_0.8 future_1.28.0
[97] ggplot2_3.4.0 mime_0.12 xtable_1.8-4
[100] later_1.3.0 survival_3.4-0 viridisLite_0.4.1
[103] tibble_3.2.1 cluster_2.1.4 globals_0.16.1
[106] fitdistrplus_1.1-8 ellipsis_0.3.2 ROCR_1.0-11
Very sorry but it still throws an error, this time like this...
Error in segs[[i]][[1]] : subscript out of bounds
> traceback()
5: segs[[i]][[1]] %>% length
4: FUN(X[[i]], ...)
3: lapply(segs %>% seq, function(i) segs[[i]][[1]] %>% length)
2: ReadVizgen(data.dir = data.dir, ...)
1: LoadVizgen(data.dir = dir_use, fov = "test.sample", assay = "Vizgen",
use.cellpose.out = TRUE, metadata = c("volume", "fov"), mol.type = mol.type,
coord.space = coord.space, type = c("segmentations", "centroids",
"boxes"), z = z.stack, add.zIndex = TRUE, update.object = TRUE,
use.BiocParallel = TRUE, workers.total = 10, DTthreads.pct = NULL)
Very happy to share data if you would prefer to best troubleshoot?
yes, it would then be easier to debug. if you can upload & share on gdrive (alikhuseyno/at/gmail.com) or any link where I can download these files:
detected_transcripts.csv
cell_by_gene.csv
cell_metadata.csv
cell_boundaries.parquet
there will be more updates on that PR #7190 Thanks
I think the problem might come from your .parquet
file, that is, it gives 2 lists in the Geometry variable, where it should be 1 list.
> test_parq %>% filter(ZIndex == 3) %>% pull(Geometry) %>% str
sfc_MULTIPOLYGON of length 152345; first list element: List of 2
$ :List of 1
..$ : num [1:5, 1:2] 1539 1539 1539 1540 1539 ...
$ :List of 1
..$ : num [1:89, 1:2] 1501 1502 1503 1506 1509 ...
- attr(*, "class")= chr [1:3] "XY" "MULTIPOLYGON" "sfg"
Shared - please let me know if there are issues accessing the data.
got it! As far as I see, there are number of segmentation artifacts and empty polygon information. This is a very useful case test, since I never saw such an output from Vizgen. I have to implement few sanity/cleaning steps before extracting segmented cell boundaries, then it should work. will let you know ;) thanks
I'm also getting the same issues as above
Error in segs[[i]][[1]] : subscript out of bounds
yes, I'm working on the fix, which will be available asap.
The problem is that .parquet
file in this case has "artifacts", a number of empty and nested lists for single cell segmentation/polygon boundaries.
ideally one should have single polygon for a single segmented cell ID:
List of 1
$ :List of 1
..$ : num [1:26, 1:2] 2751 2753 2754 2757 2759 ...
- attr(*, "class")= chr [1:3] "XY" "MULTIPOLYGON" "sfg"
However, some have these, ie multiple polygons for single cell.
List of 4
$ :List of 1
..$ : num [1:7, 1:2] 1513 1513 1512 1511 1512 ...
$ :List of 1
..$ : num [1:5, 1:2] 1525 1525 1526 1526 1525 ...
$ :List of 2
..$ : num [1:82, 1:2] 1495 1496 1497 1497 1499 ...
..$ : num [1:17, 1:2] 1507 1510 1512 1514 1516 ...
$ :List of 1
..$ : num [1:6, 1:2] 1498 1497 1497 1498 1498 ...
- attr(*, "class")= chr [1:3] "XY" "MULTIPOLYGON" "sfg"
#..or this
List of 2
$ :List of 1
..$ : num [1:5, 1:2] 1539 1539 1539 1540 1539 ...
$ :List of 1
..$ : num [1:89, 1:2] 1501 1502 1503 1506 1509 ...
- attr(*, "class")= chr [1:3] "XY" "MULTIPOLYGON" "sfg"
It's hard to tell which one of those multiple polygons would be the actual segmented cell. Right now, I'm keeping the single polygon with maximum points for a single cell. Suggestions on which polygons to choose are welcome. Thanks
Yah I think max points would be fine, would love to try the fix!
Thanks, Chang
The fix is ready, try out. re-install: remotes::install_github(repo = 'alikhuseynov/seurat', ref = 'feat/vizgen')
I will be adding support according to suggestions in PR #7190
These packages pkgs <- c("data.table", "arrow", "sfarrow", "tidyverse", "furrr", "BiocParallel")
should be installed, else the function will ask you to install them.
The test Vizgen data is from @andrewjkwok, this part will generate the object:
## make sure all libs are installed a priori!
# load libs ----
suppressPackageStartupMessages({
library(ggplot2)
library(Seurat)
library(dplyr)
library(magrittr)
library(BiocParallel)
library(progressr)
library(spatstat)
})
# helper function to return Seurat object metadata
callmeta <- function (object = NULL) {
return(object@meta.data)
}
# set params
mol.type <- "microns"
coord.space <- "micron"
z.stack <- 3L # z stack to use
dir_use <- "./merfish_seurat_debug/" # or something like "./region_0/"
##
start.time <- Sys.time()
obj <-
LoadVizgen(data.dir = dir_use,
fov = "merfish.test",
assay = "Vizgen",
metadata = c("volume", "fov"), # add cell volume info
mol.type = mol.type, # molecule coords in µm space
coord.space = coord.space,
type = c("segmentations", "centroids"), # type of cell spatial coord matrices
z = z.stack,
add.zIndex = TRUE, # add z slice section to a cell
update.object = TRUE,
use.BiocParallel = TRUE,
workers.MulticoreParam = 14, # for `BiocParallel` processing
use.furrr = TRUE, # when working with list use furrr or purrr
)
end.time <- Sys.time()
message("Time taken to load object = ",
round(end.time - start.time, digits = 2), " ",
attr(c(end.time - start.time), "units"))
Result, it took ~ 5 mins with workers.MulticoreParam = 14
(jupyter session RAM - 180GB)
obj
obj %>% callmeta %>% str
An object of class Seurat
140 features across 232695 samples within 1 assay
Active assay: Vizgen (140 features, 0 variable features)
1 spatial field of view present: merfish.test
'data.frame': 232695 obs. of 6 variables:
$ orig.ident : Factor w/ 1 level "SeuratProject": 1 1 1 1 1 1 1 1 1 1 ...
$ nCount_Vizgen : num 11 69 73 9 14 1 23 3 2 26 ...
$ nFeature_Vizgen: int 7 17 25 6 13 1 12 3 2 6 ...
$ z : int 3 3 3 3 3 3 3 3 3 3 ...
$ volume : num 9961 10654 3499 4381 2164 ...
$ fov : int 217 217 217 217 217 217 217 217 217 217 ...
Hmm I'm getting >>> using segmentations Error in molecules %iff% list(molecules = CreateMolecules(coords = molecules, : object 'mol.type' not found
even though I set it to 'microns'
strange, could you please list the data.dir
(eg "./region_0/"
), to see what files are in there?
202304040410_gw23-121422-thalamusap0903-verymedial_VMSC01801_region_0.vzg cell_boundaries.parquet cell_by_gene.csv cell_metadata.csv detected_transcripts.csv images summary.png
if you run this:
# set params
mol.type <- "microns"
coord.space <- "micron"
z.stack <- 3L # z stack to use
dir_use <- "./merfish_seurat_debug/"
start.time <- Sys.time()
data <-
ReadVizgen(data.dir = dir_use,
#ReadVizgen_opt(data.dir = dir_use,
metadata = c("volume", "fov"), # add cell volume info
mol.type = mol.type, # molecule coords in µm space
coord.space = coord.space,
type = c("segmentations", "centroids"), # type of cell spatial coord matrices
z = z.stack,
use.BiocParallel = TRUE,
workers.MulticoreParam = 14, # for `BiocParallel` processing
use.furrr = TRUE)
data %>% str
The output of data
should be similar to this one, which includes $microns
List of 6
$ transcripts :Formal class 'dgCMatrix' [package "Matrix"] with 6 slots
.. ..@ i : int [1:3141093] 33 46 67 74 97 103 120 3 6 20 ...
.. ..@ p : int [1:637986] 0 0 7 24 49 55 68 68 68 69 ...
.. ..@ Dim : int [1:2] 140 637985
.. ..@ Dimnames:List of 2
.. .. ..$ : chr [1:140] "4732456N10Rik" "Ace2" "Adora2a" "Aldh1l1" ...
.. .. ..$ : chr [1:637985] "1771792400012100003" "1771792400012100004" "1771792400012100006" "1771792400012100008" ...
.. ..@ x : num [1:3141093] 1 2 1 1 1 3 2 7 3 1 ...
.. ..@ factors : list()
$ segmentations:Classes ‘data.table’ and 'data.frame': 4214160 obs. of 3 variables:
..$ x : num [1:4214160] 1501 1502 1503 1506 1509 ...
..$ y : num [1:4214160] 41.3 43 43.5 41.2 39.7 ...
..$ cell: chr [1:4214160] "1771792400012100003" "1771792400012100003" "1771792400012100003" "1771792400012100003" ...
..- attr(*, ".internal.selfref")=<externalptr>
$ centroids :'data.frame': 637985 obs. of 3 variables:
..$ x : num [1:637985] 1519 1511 1573 1577 1536 ...
..$ y : num [1:637985] 38 50.3 58.7 72.2 65.6 ...
..$ cell: chr [1:637985] "1771792400012100003" "1771792400012100004" "1771792400012100006" "1771792400012100008" ...
$ microns :'data.frame': 3058829 obs. of 3 variables:
..$ x : num [1:3058829] 97.3 24.2 -23.8 28.1 -22 ...
..$ y : num [1:3058829] 4312 4412 4238 4239 4254 ...
..$ gene: chr [1:3058829] "Ace2" "Adora2a" "Aldh1l1" "Aldh1l1" ...
$ metadata :'data.frame': 637985 obs. of 2 variables:
..$ volume: num [1:637985] 2728 9961 10654 3499 4381 ...
..$ fov : int [1:637985] 217 217 217 217 217 217 217 217 217 217 ...
$ zIndex :'data.frame': 637985 obs. of 2 variables:
..$ z : int [1:637985] 3 3 3 3 3 3 3 3 3 3 ...
..$ cell: chr [1:637985] "1771792400012100003" "1771792400012100004" "1771792400012100006" "1771792400012100008" ...
Ok that works, thanks!
Ok that works, thanks!
could you list it here?
List of 6
$ transcripts :Formal class 'dgCMatrix' [package "Matrix"] with 6 slots
.. ..@ i : int [1:2121672] 3 4 8 13 21 27 33 38 45 46 ...
.. ..@ p : int [1:94402] 0 45 109 199 237 324 372 453 552 645 ...
.. ..@ Dim : int [1:2] 180 94401
.. ..@ Dimnames:List of 2
.. .. ..$ : chr [1:180] "TLL1" "GADD45B" "SLC32A1" "PFN1" ...
.. .. ..$ : chr [1:94401] "1952334800040100001" "1952334800040100002" "1952334800040100003" "1952334800040100004" ...
.. ..@ x : num [1:2121672] 1 1 1 1 14 1 1 1 1 1 ...
.. ..@ factors : list()
$ segmentations:Classes 'data.table' and 'data.frame': 891093 obs. of 3 variables:
..$ x : num [1:891093] 5279 5279 5280 5281 5283 ...
..$ y : num [1:891093] -50.4 -49.8 -49 -48.7 -49.4 ...
..$ cell: chr [1:891093] "1952334800040100041" "1952334800040100041" "1952334800040100041" "1952334800040100041" ...
..- attr(*, ".internal.selfref")=<externalptr>
$ centroids :'data.frame': 94401 obs. of 3 variables:
..$ x : num [1:94401] 5279 5279 5288 5272 5295 ...
..$ y : num [1:94401] -67.8 -50.7 -46.9 -32 -13.1 ...
..$ cell: chr [1:94401] "1952334800040100001" "1952334800040100002" "1952334800040100003" "1952334800040100004" ...
$ microns :'data.frame': 3286088 obs. of 3 variables:
..$ x : num [1:3286088] 85.3 85.3 87.4 85.6 86.5 ...
..$ y : num [1:3286088] 5609 5609 5609 5610 5611 ...
..$ gene: chr [1:3286088] "SLC32A1" "SLC32A1" "SLC32A1" "SLC32A1" ...
$ metadata :'data.frame': 94401 obs. of 2 variables:
..$ volume: num [1:94401] 223 335 572 167 974 ...
..$ fov : int [1:94401] 816 816 816 816 816 816 816 816 816 816 ...
$ zIndex :'data.frame': 94401 obs. of 2 variables:
..$ z : int [1:94401] 3 3 3 3 3 3 3 3 3 3 ...
..$ cell: chr [1:94401] "1952334800040100001" "1952334800040100002" "1952334800040100003" "1952334800040100004" ...
hmm, interesting :)
does this part work too? Given the data
you already read.
assay <- "Vizgen"
mol.type <- "microns"
segs <- CreateSegmentation(data[["segmentations"]])
cents <- CreateCentroids(data[["centroids"]])
segmentations.data <- list(centroids = cents, segmentation = segs)
coords <- CreateFOV(coords = segmentations.data,
type = c("segmentation", "centroids"),
molecules = data[[mol.type]],
assay = assay)
Yes this works coords looks like this:
Spatial coordinates for 94401 cells and 180 molecules
First 10 molecules: ADGRL2, ADGRV1, APOLD1, ARX, ASCL1, BCAS1
Default segmentation boundary: centroids
1 other segmentation boundaries present: segmentation
Associated assay: Vizgen
Key: Vizgen_
then, it looks good.
could you post your previous LoadVizgen()
run?
Hmm so it actually works fine now, IDK why it was failing earlier. Thanks for the quick support!
ok, great!
@alikhuseynov Thank you so much for the support. I can confirm that it's working now for me with the parquet file.
However, I am now running into a problem trying to load data from the old format in - I wanted to test my data against some of Vizgen's released data (https://info.vizgen.com/mouse-brain-map), and with the current LoadVizgen
function, I get the following error:
Error in sfarrow::st_read_parquet(file2read) :
object 'file2read' not found
> traceback()
4: make_readable_file(file, mmap)
3: arrow::ParquetFileReader$create(dsn, ...)
2: sfarrow::st_read_parquet(file2read)
1: ReadVizgen(data.dir = "/mnt/z/Users/Andrew_K/Projects/NTS_projection_transcriptome_2023/Data/BrainReceptorShowcase/Slice1/Replicate1/")
Is it possible for the read function to support backward compatibility?
@alikhuseynov Thank you so much for the support. I can confirm that it's working now for me with the parquet file.
However, I am now running into a problem trying to load data from the old format in - I wanted to test my data against some of Vizgen's released data (https://info.vizgen.com/mouse-brain-map), and with the current
LoadVizgen
function, I get the following error:Error in sfarrow::st_read_parquet(file2read) : object 'file2read' not found > traceback() 4: make_readable_file(file, mmap) 3: arrow::ParquetFileReader$create(dsn, ...) 2: sfarrow::st_read_parquet(file2read) 1: ReadVizgen(data.dir = "/mnt/z/Users/Andrew_K/Projects/NTS_projection_transcriptome_2023/Data/BrainReceptorShowcase/Slice1/Replicate1/")
Is it possible for the read function to support backward compatibility?
Thanks, glad it works. Your shared data was very useful for tests.
I haven't yet tested the older Vizgen data (ie .hdf5
cell boundaries). The new fix should be ready next week, which among some optimization will also handle data if no .parquet
file exists.
For now, set use.cellpose.out = FALSE
in LoadVizgen()
and try to load.
Great, thanks. I tried setting use.cellpose.out = FALSE
as suggested - it only works if I also set use.BiocParallel = FALSE
, which unfortunately means it takes quite a while to load the dataset in, but it does work!
@alikhuseynov Thank you so much for the support. I can confirm that it's working now for me with the parquet file.
However, I am now running into a problem trying to load data from the old format in - I wanted to test my data against some of Vizgen's released data (https://info.vizgen.com/mouse-brain-map), and with the current
LoadVizgen
function, I get the following error:Error in sfarrow::st_read_parquet(file2read) : object 'file2read' not found > traceback() 4: make_readable_file(file, mmap) 3: arrow::ParquetFileReader$create(dsn, ...) 2: sfarrow::st_read_parquet(file2read) 1: ReadVizgen(data.dir = "/mnt/z/Users/Andrew_K/Projects/NTS_projection_transcriptome_2023/Data/BrainReceptorShowcase/Slice1/Replicate1/")
Is it possible for the read function to support backward compatibility?
I also got this error when there were multiple parquet files in the folder.
I was able to get it to work with hdf5 files by just using use_cellpose_output = FALSE
. I did get this warning: Warning: MulticoreParam() not supported on Windows, use SnowParam()
, loading each file, but everything loaded fine in the end.
Hmm so it actually works fine now, IDK why it was failing earlier. Thanks for the quick support!
I had this same issue, and eventually realized I had mol.type = 'microns' as an environment variable when testing later on, which fixed the LoadVizgen() function.
I'm going to push for a fix later today, that will solve above mentioned issues.
use.cellpose.out
arg will be removed completely. If no .parquet
file(s) are present, then .hdf5
files are loaded from ./cell boundaries
directory (if present) using use.BiocParallel = TRUE
by default.
MulticoreParam() not supported on Windows, use SnowParam()
are you running things on Windows laptop?
For multiple .parquet
files, you mean something like this?
cellpose_micron_space.parquet
cellpose_mosaic_space.parquet
if yes, that will be solved too.
..will do few tests on my side, before committing Thanks
Yes, I was running it on a Windows laptop.
For the multiple .parquet files, I had a subset of the data in a folder of the directory for testing. This lead to multiple file2read
and an error when the file wasn't located in the directory (but a folder within the directory). I only wanted to read the parquet file in the current directory.
I see, yeah.. that was done for the situation when one has .parquet
file(s) in a sub directory, eg Cellpose_polyT_CB3
:
in ./region_0
├── xxx_region_0.vzg
├── Cellpose_polyT_CB3
├── detected_transcripts.csv
├── images
└── summary.png
I will try to add support to like: .."look for parquet file only in the current dir, if not found, look in the sub dir"
I just did the last fix, please try it out. It did work on my side for old data (ie .hdf5
files) with optimized BiocParallel
, and also on newer Vizgen outputs. It should also support parallelization on windows machine now.
If any bugs or something I missed, let me know ;)
Test run would be this
## test =================================================
## make sure all libs are installed a priori!
# load libs ----
suppressPackageStartupMessages({
library(ggplot2)
library(Seurat)
library(dplyr)
library(magrittr)
library(BiocParallel)
library(progressr)
library(spatstat)
})
# helper function to return Seurat object metadata
callmeta <- function (object = NULL) {
return(object@meta.data)
}
# set directory to read from
dir_use <- "./merfish_seurat_debug/" # or something like "./region_0/"
##
start.time <- Sys.time()
obj <-
LoadVizgen(data.dir = dir_use,
fov = "merfish.test",
assay = "Vizgen",
metadata = c("volume", "fov"), # add cell volume info
type = c("segmentations", "centroids"), # type of cell spatial coord matrices
z = 3L,
add.zIndex = TRUE, # add z slice section to a cell
update.object = TRUE,
use.BiocParallel = TRUE,
workers.MulticoreParam = 14, # for `BiocParallel` processing
verbose = TRUE
)
end.time <- Sys.time()
message("Time taken to load object = ",
round(end.time - start.time, digits = 2), " ",
attr(c(end.time - start.time), "units"))
obj
obj %>% callmeta %>% str
Thanks for the update.
I installed the latest version with remotes::install_github(repo = 'alikhuseynov/seurat', ref = 'feat/vizgen')
.
This somehow now throws up an error:
Cell segmentations are found in `.parquet` file
>>> using micron space coordinates
Error in files[["spatial"]] : subscript out of bounds
> traceback()
2: ReadVizgen(data.dir = data.dir, mol.type = mol.type, filter = filter,
z = z, verbose = verbose, ...)
1: LoadVizgen(data.dir = dir_use, fov = "merfish.test", assay = "Vizgen",
metadata = c("volume", "fov"), mol.type = mol.type, type = c("segmentations",
"centroids"), z = z.stack, add.zIndex = TRUE, update.object = TRUE,
use.BiocParallel = TRUE, workers.MulticoreParam = 14, use.furrr = TRUE,
verbose = TRUE)
Incidentally I keep also getting an error saying that there's no default argument for verbose
, but that's minor. Thanks again in advance for the support of this!
Thanks for the update.
I installed the latest version with
remotes::install_github(repo = 'alikhuseynov/seurat', ref = 'feat/vizgen')
.This somehow now throws up an error:
Cell segmentations are found in `.parquet` file >>> using micron space coordinates Error in files[["spatial"]] : subscript out of bounds > traceback() 2: ReadVizgen(data.dir = data.dir, mol.type = mol.type, filter = filter, z = z, verbose = verbose, ...) 1: LoadVizgen(data.dir = dir_use, fov = "merfish.test", assay = "Vizgen", metadata = c("volume", "fov"), mol.type = mol.type, type = c("segmentations", "centroids"), z = z.stack, add.zIndex = TRUE, update.object = TRUE, use.BiocParallel = TRUE, workers.MulticoreParam = 14, use.furrr = TRUE, verbose = TRUE)
Incidentally I keep also getting an error saying that there's no default argument for
verbose
, but that's minor. Thanks again in advance for the support of this!
You are welcome, I want to make sure it's robust and always works.
Does your dir_use
has these files?
If it's another dataset, those files should be present there.
merfish_seurat_debug/
├── cell_boundaries.parquet
├── cell_by_gene.csv
├── cell_metadata.csv
└── detected_transcripts.csv
files[["spatial"]]
is the basically the cell_metadata.csv
. Error in files[["spatial"]] : subscript out of bounds
would mean that it could not find cell_metadata.csv
file or multiple cell_metadata.csv
files exist inside dir_use
.
Also you don't have to provide mol.type
(it's set to microns
by default) and z
as environmental variables or inside the LoadVizgen()
, unless you want to use other than z = 3L
slice.
It works on my side for a number of runs/datasets and there no warning for verbose
, so not sure yet why it happens.
Try to run as in my comment above and let me know.
Thanks
..for sanity check, could you run the following and see if all 3 files are present?
Got it - turns out exactly as you predicted there was an extra cell metadata file (had tried to edit a version) - works now after deleting that file!
Thank you for your continuous and dedicated efforts of incorporating more and more useful features to Seurat. We recently received our first batch of vizgen data, and I would like to process and visualize them using Seurat.
But the folder structure and files we received are a little different from the public vizgen data. We do not have a cell_boundaries folder with a bunch of h5 files. Instead, we have a single cell_boundaries.parquet file. In this situation, can I still use the cell boundary/segmentation information?
Thank you for your help.