Open cstrlln opened 4 months ago
The output does not match the standard output format of any commercial technology, unless you use Xenium Ranger to convert it and then use readXenium
. The output includes the gene count matrix and cell metadata as csv files when the options are enabled. Use data.table to read it if it's large. Then reformat the gene count matrix data frame into a dgCMatrix
with cell IDs as column names and gene IDs or symbols as row names. The cell segmentation output is GeoJSON, which can be read with sf::st_read()
, which will give you an sf
data frame. Then call the SpatialFeatureExperiment()
constructor. See the SFE vignette for an example of calling the constructor: https://pachterlab.github.io/SpatialFeatureExperiment/articles/SFE.html#object-construction
Thanks I gave it a try and haven't been able to create the SFE object.
Here is what I did:
library(SpatialFeatureExperiment)
cellmeta <- data.table::fread('data/cell-metadata.csv.gz')
expcounts <- data.table::fread('data/expected-counts.csv.gz')
segmentation <- geojsonsf::geojson_sf('data/cell-polygons.geojson')
###
mat <- as.matrix(expcounts)
mat <- as(t(mat), "dgCMatrix")
sfe <- SpatialFeatureExperiment(list(counts = mat), colData = cellmeta,
colGeometries = list(foo = segmentation))
This is the error I get:
> sfe <- SpatialFeatureExperiment(list(counts = mat), colData = cellmeta,
+ colGeometries = list(foo = segmentation))
Error in wk_handle.wk_wkb(wkb, s2_geography_writer(oriented = oriented, :
Loop 0 is not valid: Edge 34 crosses edge 36
In addition: Warning messages:
1: In st_is_longlat(x) :
bounding box has potentially an invalid value range for longlat data
2: In st_is_longlat(x) :
bounding box has potentially an invalid value range for longlat data
This is the error I get if I try without the geojson file:
> sfe <- SpatialFeatureExperiment(list(counts = mat), colData = cellmeta)
Error in dimnames(x) <- dn :
length of 'dimnames' [2] not equal to array extent
I checked for dimnames and dim and it looks fine:
> str(dimnames(mat))
List of 2
$ : chr [1:1193] "Uba52" "Carmn" "Eif5a" "Ccnd1" ...
$ : chr [1:2061] "1" "2" "3" "4" ...
> dim(mat)
[1] 1193 2061
I have tried a couple of things but can't get it to work, would it be possible to send you the files so that you can take a look?
Regarding this error:
Error in wk_handle.wk_wkb(wkb, s2_geography_writer(oriented = oriented, :
Loop 0 is not valid: Edge 34 crosses edge 36
In addition: Warning messages:
1: In st_is_longlat(x) :
bounding box has potentially an invalid value range for longlat data
2: In st_is_longlat(x) :
bounding box has potentially an invalid value range for longlat data
Did sf::st_read("data/cell-polygons.geojson")
not work? An annoying thing about using geospatial packages is that they often assume a coordinate reference system (CRS), which is irrelevant to the histological space. Basically CRS's are different ways the 3D globe is projected to 2D maps. It seems that in your case, geojsonsf
gave your data a CRS (must be 4326) which makes sf
think that your values are longitudes and latitudes. You can do st_crs(segmentation) <- NA
to remove the CRS. st_read
doesn't give you a CRS by default.
Regarding the error:
Error in dimnames(x) <- dn :
length of 'dimnames' [2] not equal to array extent
You haven't supplied the spatialCoords
or spatialCoordsNames
argument. The default for spatialCoordsNames
is c("x", "y")
. If spatialCoords
is not specified, then spatialCoordsNames
should be found among the columns of colData
. I suppose I should update that example in the vignette for a more general case.
What's the output of traceback()
after you get this error? I don't know where it comes from. It might be caused by say the length of the column names or row names in dn
doesn't match the number of columns or rows in the array. It can be when the number of rows in segmentation
or number of rows in cellmeta
doesn't match the number of columns in mat
. From your output, it seems that you didn't assign the cell IDs to the column names of mat
. Assuming that columns in mat
already match the rows in cellmeta
and both cellmeta
and segmentation
have a column cell_id
for the cell ID. If they don't match, you can match them by segmentation <- segmentation[match(segmentation$cell_id, cellmeta$cell_id),]
.
Another possible cause is that segmentation
has geometry type POLYGON
but some cells have multiple polygons, causing segmentation
to have more rows than cellmeta
. If that's the case, then use segmentation2 <- sf::aggregate(segmentation, list(cell_id = segmentation$cell_id), unique)
to convert to MULTIPOLYGON
where each cell has one geometry. I think I'll add a function to the SFE package to deal with scenarios like this.
See the Geocomputation with R book to learn more about operations on sf
data frames and geometries.
Thanks I'll try this, I guess I just need to understand the object construction better.
Are there any columns that are looked for that have to have harcoded values, like cell_id?
The column name "cell_id" is just an example. Any name can be used, but there should be a column for cell IDs to match the cell metadata to the gene count matrix and to the cell segmentation polygons, unless you are already very sure that they are all in the same order in the cell metadata, gene count matrix, and polygons.
It does seem that there is a slight missmatch between the cell id in the raw cellmeta and the names in the count matrix and segmentation.
The rownames of expcounts (genes in columns and cells in rows in raw file) are a cell name starting in "1":
> head(rownames(expcounts))
[1] "1" "2" "3" "4" "5" "6"
Whereas in cellmeta the 'cell' value starts at '0'
> head(cellmeta)
cell centroid_x centroid_y centroid_z fov cluster volume population
<int> <num> <num> <num> <int> <int> <num> <int>
1: 0 353.22656 293.0781 4.765625 99 1 320.0 27
2: 1 213.03372 403.7463 4.519795 99 1 852.5 186
3: 2 77.85435 142.6783 6.038043 99 0 1150.0 508
4: 3 319.43000 233.6000 4.500000 99 1 500.0 80
5: 4 114.46059 110.4803 6.422414 99 0 507.5 197
6: 5 165.97500 464.0200 5.937500 99 1 500.0 122
And the values in segmentation cell column are also starting at '0', so I'll try to match them all. Also, sf::st_read didnt work I think because I was trying to read the .gz file directly(?), thats why I moved to geojson_sf and you are correct that it added a CRS. Also the geojson file I used is made of multipolygons:
> head(segmentation)
Simple feature collection with 6 features and 1 field
Geometry type: MULTIPOLYGON
Dimension: XY
Bounding box: xmin: 68 ymin: 104 xmax: 359 ymax: 470
Geodetic CRS: WGS 84
cell geometry
1 0 MULTIPOLYGON (((349 287, 34...
2 1 MULTIPOLYGON (((204 402, 20...
3 2 MULTIPOLYGON (((70 138, 70 ...
4 3 MULTIPOLYGON (((314 233, 31...
5 4 MULTIPOLYGON (((110 111, 11...
6 5 MULTIPOLYGON (((161 463, 16...
Thanks for the feedback I'll try again tomorrow with your suggestions.
Also, sf::st_read didnt work I think because I was trying to read the .gz file directly(?)
I think it's probably because of the gz. Do cellmeta and segmentation have the same number of rows, which should be the same as the number of columns in the gene count matrix? Also, when the rows are cells, the gene count matrix csv file might have an unnamed column for cell ID which are row names of the matrix. When it's unnamed, data.table reads it into R as the first column with name V1
and whatever row names you get are just the default which is 1:nrow(df)
.
@cstrlln did it work for you to make the SFE object?
@alikhuseynov I got side tracked trying another tool, so didn't test the suggestions yet but plan to try in the next couple weeks. I really think it might be the cell names though.
@alikhuseynov I got side tracked trying another tool, so didn't test the suggestions yet but plan to try in the next couple weeks. I really think it might be the cell names though.
ok, if you send us the output files or deposit them somewhere where we can download them, I will try to help making that SFE object.
could you suggest how to appropriately import the output data from proseg into voyager? https://github.com/dcjones/proseg