waldronlab / VisiumIO

Import spaceranger output and 10X spatial data
0 stars 0 forks source link

Suggestions on: Keep ensembl gene ID at read-in, automate non-unique gene renaming, add default rownames(), accept .h5 format :) #4

Open estellad opened 3 months ago

estellad commented 3 months ago

Hi Marcel,

Thank you for the package! There has been a long-lasting issue with reading in Visium /outs, where there are non-unique gene names. There are 18085 genes, but 18082 unique gene names. Some of the downstream analyses do require unique gene names.

> tail(sort(table(rowData(vis)$Symbol)))

 ZYG11B     ZYX   ZZEF1  HSPA14    TBCE TMSB15B 
            1           1            1             2            2               2 

In Seurat, this was solved by adding a ".1" to one of the duplicates in the following genes "HSPA14", "TBCE", "TMSB15B". They are distinguishable based on their Ensembl IDs, which were preserved during the read-in process.

Previously with SpatialExperimen::read10xVisium(), where the Ensembl ID was preserved, I had a workaround for this issue by mimicking Seurat's approach:

rowData(vis)$gene_id <- rownames(vis)
rownames(vis) <- rowData(vis)$symbol

rownames(vis) <- ifelse(rowData(vis)$gene_id == "ENSG00000285053",  "TBCE", rownames(vis))
rownames(vis) <- ifelse(rowData(vis)$gene_id == "ENSG00000284770",  "TBCE.1", rownames(vis))
rownames(vis) <- ifelse(rowData(vis)$gene_id == "ENSG00000284024",  "HSPA14", rownames(vis))
rownames(vis) <- ifelse(rowData(vis)$gene_id == "ENSG00000187522",  "HSPA14.1", rownames(vis))
rownames(vis) <- ifelse(rowData(vis)$gene_id == "ENSG00000158427",  "TMSB15B", rownames(vis))
rownames(vis) <- ifelse(rowData(vis)$gene_id == "ENSG00000269226",  "TMSB15B.1", rownames(vis))

With VisiumIO::TENxVisium(), however, I noticed that Ensembl ID is lost or not retrieved from the gene count matrix folder during the read-in.

vis <- TENxVisium(
  spacerangerOut = vis_path, processing = "filtered", images = "lowres"
)
vis <- import(vis)
vis
class: SpatialExperiment 
dim: 18085 4992 
metadata(0):
assays(1): counts
rownames: NULL
rowData names(1): Symbol
colnames(4992): AACACCTACTATCGAA-1 AACACGTGCATCGCAC-1 ...
  TGTTGGCCAGACCTAC-1 TGTTGGCCTACACGTG-1
colData names(4): in_tissue array_row array_col sample_id
reducedDimNames(0):
mainExpName: NULL
altExpNames(0):
spatialCoords names(2) : pxl_col_in_fullres pxl_row_in_fullres
imgData names(4): sample_id image_id data scaleFactor

Therefore, I cannot rename them based on Ensembl ID anymore. Could you please automate this renaming process internally, like Seurat, so that users do not have to do this manually every time? Given these 3 genes are not highly expressed, I excluded them in my analysis for now.

In addition, a small inconvenience is the rownames() of the object is left as NULL, but I would prefer the default to have something, either Ensembl ID or Symbol.

I also noticed that VisiumIO::TENxVisium() cannot take a gene count .h5 file, as an alternative to the gene count matrix folder. I find this alternative helpful. I made it flexible in my package SpatialExperimentIO::readXeniumSXE(). Would you consider implementing this alternative again, as SpatialExperimen::read10xVisium() did before?

Also looping in @drighelli and @HelenaLC

Thank you for your time!

Sincerely, Estella

drighelli commented 3 months ago

My two cents: it would be ok to append the .1 to the duplicates, but maybe it could be nice to print a warning or just a message alerting the user of this behaviour.

Also, wouldn't it be faster to do something like :

rownames(vis)[duplicated(vis)] <- paste0(rownames(vis)[duplicated(vis)], ".1")
message("Found sum(duplicated(vis)) duplicated rownames, appending .1 to them")

I have another question, are we sure that we can always find only 1 duplicate for each gene? if there are more?

Maybe it could be better to generalize this with an increasing number from 1 to n, where n is the number of duplicates per each duplicated rowname.

About the .h5 matrix, it would be nice to have the possibility to load it instead of the csv matrix.

Dario

HelenaLC commented 3 months ago

FYI: base R make.unique() is made to do this...

> make.unique(c("a", "a", "a"))
[1] "a"   "a.1" "a.2"

...and yes, I'd be in favor of 1) keeping both gene symbols and ensembl IDs as rowData(), and 2) taking care of setting & unique-ifying rownames() for the user from the symbols (with a warning when duplicates were found, as Dario suggested).

LiNk-NY commented 2 months ago

Hi Estella! @estellad Can you try this again with TENxIO version 1.7.7 installed? The latest devel version is available on GitHub until it gets built on the BBS.

BiocManager::install("waldronlab/TENxIO")
LiNk-NY commented 2 months ago

About the .h5 matrix, it would be nice to have the possibility to load it instead of the csv matrix.

@drighelli Thanks! This is not supported in VisiumIO

LiNk-NY commented 2 months ago

Note. By default, I'm reading in the first column in the features.tsv.gz file which is typically the ID column with ensembl IDs. And HDF5Array::TENxMatrix uses the ensembl IDs as rownames as well.

estellad commented 2 months ago

Hi Marcel @LiNk-NY ,

I tried to install TENxIO and then re-install VisiumIO.

BiocManager::install("waldronlab/TENxIO")
BiocManager::install("VisiumIO", force = TRUE)

The imported Visium object now has gene Ensembl ID as the default rownames and gene Symbol stored in rowData(). 👍

Here are still some action items 😄 :

Thank you! Estella

LiNk-NY commented 2 months ago

Hi Estella, @estellad

  • [ ] The duplicate gene Symbols have not been fixed with make.unique(), and no warning message on the duplicated gene names are given.

This is actually implemented here https://github.com/waldronlab/TENxIO/blob/7b86cc13a75a3bebc4def176ac5ca334bcadf9a9/R/TENxH5-class.R#L338-L343 but it only happens with rownames. Do you mean going into rowData()$Symbols and running make.unique() on it? It think it would be better not to munge the data unless necessary but I could add a helper function to do that if needed..

  • [ ] I tried to remove the Matrix folder in /outs, but VisiumIO::TENxVisium() still does not take only .h5 as input. Do you plan on linking the function to TENxIO::TENxH5(). It should be a very simple update but would accept another file format.

Do you have a reproducible example? I have tested this with the examples in the package. See : https://github.com/waldronlab/VisiumIO/blob/25324bb14f6160ec8b62f5d1041a70284ff04054/R/TENxVisium-class.R#L175-L180

Thanks!

-Marcel