proseg output support - Githubissues

cstrlln commented 4 months ago

could you suggest how to appropriately import the output data from proseg into voyager? https://github.com/dcjones/proseg

lambdamoses commented 4 months ago

The output does not match the standard output format of any commercial technology, unless you use Xenium Ranger to convert it and then use readXenium. The output includes the gene count matrix and cell metadata as csv files when the options are enabled. Use data.table to read it if it's large. Then reformat the gene count matrix data frame into a dgCMatrix with cell IDs as column names and gene IDs or symbols as row names. The cell segmentation output is GeoJSON, which can be read with sf::st_read(), which will give you an sf data frame. Then call the SpatialFeatureExperiment() constructor. See the SFE vignette for an example of calling the constructor: https://pachterlab.github.io/SpatialFeatureExperiment/articles/SFE.html#object-construction

cstrlln commented 4 months ago

Thanks I gave it a try and haven't been able to create the SFE object.

Here is what I did:

library(SpatialFeatureExperiment)

cellmeta <- data.table::fread('data/cell-metadata.csv.gz')

expcounts <- data.table::fread('data/expected-counts.csv.gz')

segmentation <- geojsonsf::geojson_sf('data/cell-polygons.geojson')

###

mat <- as.matrix(expcounts)
mat <- as(t(mat), "dgCMatrix")

sfe <- SpatialFeatureExperiment(list(counts = mat), colData = cellmeta,
                                colGeometries = list(foo = segmentation))

This is the error I get:

> sfe <- SpatialFeatureExperiment(list(counts = mat), colData = cellmeta,               
+                                 colGeometries = list(foo = segmentation))
Error in wk_handle.wk_wkb(wkb, s2_geography_writer(oriented = oriented,  : 
  Loop 0 is not valid: Edge 34 crosses edge 36                          
In addition: Warning messages:                                          
1: In st_is_longlat(x) :                                                
  bounding box has potentially an invalid value range for longlat data  
2: In st_is_longlat(x) :                                                
  bounding box has potentially an invalid value range for longlat data

This is the error I get if I try without the geojson file:

> sfe <- SpatialFeatureExperiment(list(counts = mat), colData = cellmeta)
Error in dimnames(x) <- dn : 
  length of 'dimnames' [2] not equal to array extent

I checked for dimnames and dim and it looks fine:

> str(dimnames(mat))                                                      
List of 2
 $ : chr [1:1193] "Uba52" "Carmn" "Eif5a" "Ccnd1" ...
 $ : chr [1:2061] "1" "2" "3" "4" ...
> dim(mat)
[1] 1193 2061

I have tried a couple of things but can't get it to work, would it be possible to send you the files so that you can take a look?

lambdamoses commented 4 months ago

Regarding this error:

Error in wk_handle.wk_wkb(wkb, s2_geography_writer(oriented = oriented,  : 
  Loop 0 is not valid: Edge 34 crosses edge 36                          
In addition: Warning messages:                                          
1: In st_is_longlat(x) :                                                
  bounding box has potentially an invalid value range for longlat data  
2: In st_is_longlat(x) :                                                
  bounding box has potentially an invalid value range for longlat data

Did sf::st_read("data/cell-polygons.geojson") not work? An annoying thing about using geospatial packages is that they often assume a coordinate reference system (CRS), which is irrelevant to the histological space. Basically CRS's are different ways the 3D globe is projected to 2D maps. It seems that in your case, geojsonsf gave your data a CRS (must be 4326) which makes sf think that your values are longitudes and latitudes. You can do st_crs(segmentation) <- NA to remove the CRS. st_read doesn't give you a CRS by default.

Regarding the error:

Error in dimnames(x) <- dn : 
  length of 'dimnames' [2] not equal to array extent

You haven't supplied the spatialCoords or spatialCoordsNames argument. The default for spatialCoordsNames is c("x", "y"). If spatialCoords is not specified, then spatialCoordsNames should be found among the columns of colData. I suppose I should update that example in the vignette for a more general case.

What's the output of traceback() after you get this error? I don't know where it comes from. It might be caused by say the length of the column names or row names in dn doesn't match the number of columns or rows in the array. It can be when the number of rows in segmentation or number of rows in cellmeta doesn't match the number of columns in mat. From your output, it seems that you didn't assign the cell IDs to the column names of mat. Assuming that columns in mat already match the rows in cellmeta and both cellmeta and segmentation have a column cell_id for the cell ID. If they don't match, you can match them by segmentation <- segmentation[match(segmentation$cell_id, cellmeta$cell_id),].

Another possible cause is that segmentation has geometry type POLYGON but some cells have multiple polygons, causing segmentation to have more rows than cellmeta. If that's the case, then use segmentation2 <- sf::aggregate(segmentation, list(cell_id = segmentation$cell_id), unique) to convert to MULTIPOLYGON where each cell has one geometry. I think I'll add a function to the SFE package to deal with scenarios like this.

See the Geocomputation with R book to learn more about operations on sf data frames and geometries.

cstrlln commented 4 months ago

Thanks I'll try this, I guess I just need to understand the object construction better.

Are there any columns that are looked for that have to have harcoded values, like cell_id?

lambdamoses commented 4 months ago

The column name "cell_id" is just an example. Any name can be used, but there should be a column for cell IDs to match the cell metadata to the gene count matrix and to the cell segmentation polygons, unless you are already very sure that they are all in the same order in the cell metadata, gene count matrix, and polygons.

cstrlln commented 4 months ago

It does seem that there is a slight missmatch between the cell id in the raw cellmeta and the names in the count matrix and segmentation.

The rownames of expcounts (genes in columns and cells in rows in raw file) are a cell name starting in "1":

> head(rownames(expcounts))
[1] "1" "2" "3" "4" "5" "6"

Whereas in cellmeta the 'cell' value starts at '0'

> head(cellmeta)                                                                  
    cell centroid_x centroid_y centroid_z   fov cluster volume population
   <int>      <num>      <num>      <num> <int>   <int>  <num>      <int>
1:     0  353.22656   293.0781   4.765625    99       1  320.0         27
2:     1  213.03372   403.7463   4.519795    99       1  852.5        186
3:     2   77.85435   142.6783   6.038043    99       0 1150.0        508
4:     3  319.43000   233.6000   4.500000    99       1  500.0         80
5:     4  114.46059   110.4803   6.422414    99       0  507.5        197
6:     5  165.97500   464.0200   5.937500    99       1  500.0        122

And the values in segmentation cell column are also starting at '0', so I'll try to match them all. Also, sf::st_read didnt work I think because I was trying to read the .gz file directly(?), thats why I moved to geojson_sf and you are correct that it added a CRS. Also the geojson file I used is made of multipolygons:

> head(segmentation)                                                              
Simple feature collection with 6 features and 1 field
Geometry type: MULTIPOLYGON
Dimension:     XY
Bounding box:  xmin: 68 ymin: 104 xmax: 359 ymax: 470
Geodetic CRS:  WGS 84
  cell                       geometry
1    0 MULTIPOLYGON (((349 287, 34...
2    1 MULTIPOLYGON (((204 402, 20...
3    2 MULTIPOLYGON (((70 138, 70 ...
4    3 MULTIPOLYGON (((314 233, 31...
5    4 MULTIPOLYGON (((110 111, 11...
6    5 MULTIPOLYGON (((161 463, 16...

Thanks for the feedback I'll try again tomorrow with your suggestions.

lambdamoses commented 4 months ago

Also, sf::st_read didnt work I think because I was trying to read the .gz file directly(?)

I think it's probably because of the gz. Do cellmeta and segmentation have the same number of rows, which should be the same as the number of columns in the gene count matrix? Also, when the rows are cells, the gene count matrix csv file might have an unnamed column for cell ID which are row names of the matrix. When it's unnamed, data.table reads it into R as the first column with name V1 and whatever row names you get are just the default which is 1:nrow(df).

alikhuseynov commented 3 months ago

@cstrlln did it work for you to make the SFE object?

cstrlln commented 3 months ago

@alikhuseynov I got side tracked trying another tool, so didn't test the suggestions yet but plan to try in the next couple weeks. I really think it might be the cell names though.

alikhuseynov commented 3 months ago

@alikhuseynov I got side tracked trying another tool, so didn't test the suggestions yet but plan to try in the next couple weeks. I really think it might be the cell names though.

ok, if you send us the output files or deposit them somewhere where we can download them, I will try to help making that SFE object.

pachterlab / voyager

proseg output support #32