Open estellad opened 3 months ago
My two cents: it would be ok to append the .1 to the duplicates, but maybe it could be nice to print a warning or just a message alerting the user of this behaviour.
Also, wouldn't it be faster to do something like :
rownames(vis)[duplicated(vis)] <- paste0(rownames(vis)[duplicated(vis)], ".1")
message("Found sum(duplicated(vis)) duplicated rownames, appending .1 to them")
I have another question, are we sure that we can always find only 1 duplicate for each gene? if there are more?
Maybe it could be better to generalize this with an increasing number from 1 to n, where n is the number of duplicates per each duplicated rowname.
About the .h5 matrix, it would be nice to have the possibility to load it instead of the csv matrix.
Dario
FYI: base R make.unique()
is made to do this...
> make.unique(c("a", "a", "a"))
[1] "a" "a.1" "a.2"
...and yes, I'd be in favor of 1) keeping both gene symbols and ensembl IDs as rowData()
, and 2) taking care of setting & unique-ifying rownames()
for the user from the symbols (with a warning when duplicates were found, as Dario suggested).
Hi Estella! @estellad
Can you try this again with TENxIO version 1.7.7
installed? The latest devel version is available on GitHub until it gets built on the BBS.
BiocManager::install("waldronlab/TENxIO")
About the .h5 matrix, it would be nice to have the possibility to load it instead of the csv matrix.
@drighelli
Thanks! This is not supported in VisiumIO
Note. By default, I'm reading in the first column in the features.tsv.gz
file which is typically the ID
column with ensembl IDs. And HDF5Array::TENxMatrix
uses the ensembl IDs as rownames
as well.
Hi Marcel @LiNk-NY ,
I tried to install TENxIO
and then re-install VisiumIO
.
BiocManager::install("waldronlab/TENxIO")
BiocManager::install("VisiumIO", force = TRUE)
The imported Visium object now has gene Ensembl ID as the default rownames and gene Symbol stored in rowData(). 👍
Here are still some action items 😄 :
[ ] The duplicate gene Symbols have not been fixed with make.unique()
, and no warning message on the duplicated gene names are given.
[ ] I tried to remove the Matrix folder in /outs
, but VisiumIO::TENxVisium()
still does not take only .h5 as input. Do you plan on linking the function to TENxIO::TENxH5()
. It should be a very simple update but would accept another file format.
Thank you! Estella
Hi Estella, @estellad
- [ ] The duplicate gene Symbols have not been fixed with
make.unique()
, and no warning message on the duplicated gene names are given.
This is actually implemented here https://github.com/waldronlab/TENxIO/blob/7b86cc13a75a3bebc4def176ac5ca334bcadf9a9/R/TENxH5-class.R#L338-L343
but it only happens with rownames
. Do you mean going into rowData()$Symbols
and running make.unique()
on it? It think it would be better not to munge the data unless necessary but I could add a helper function to do that if needed..
- [ ] I tried to remove the Matrix folder in /outs, but VisiumIO::TENxVisium() still does not take only .h5 as input. Do you plan on linking the function to TENxIO::TENxH5(). It should be a very simple update but would accept another file format.
Do you have a reproducible example? I have tested this with the examples in the package. See : https://github.com/waldronlab/VisiumIO/blob/25324bb14f6160ec8b62f5d1041a70284ff04054/R/TENxVisium-class.R#L175-L180
Thanks!
-Marcel
Hi Marcel,
Thank you for the package! There has been a long-lasting issue with reading in Visium /outs, where there are non-unique gene names. There are 18085 genes, but 18082 unique gene names. Some of the downstream analyses do require unique gene names.
In Seurat, this was solved by adding a ".1" to one of the duplicates in the following genes "HSPA14", "TBCE", "TMSB15B". They are distinguishable based on their Ensembl IDs, which were preserved during the read-in process.
Previously with
SpatialExperimen::read10xVisium()
, where the Ensembl ID was preserved, I had a workaround for this issue by mimicking Seurat's approach:With
VisiumIO::TENxVisium()
, however, I noticed that Ensembl ID is lost or not retrieved from the gene count matrix folder during the read-in.Therefore, I cannot rename them based on Ensembl ID anymore. Could you please automate this renaming process internally, like Seurat, so that users do not have to do this manually every time? Given these 3 genes are not highly expressed, I excluded them in my analysis for now.
In addition, a small inconvenience is the
rownames()
of the object is left as NULL, but I would prefer the default to have something, either Ensembl ID or Symbol.I also noticed that
VisiumIO::TENxVisium()
cannot take a gene count .h5 file, as an alternative to the gene count matrix folder. I find this alternative helpful. I made it flexible in my packageSpatialExperimentIO::readXeniumSXE()
. Would you consider implementing this alternative again, asSpatialExperimen::read10xVisium()
did before?Also looping in @drighelli and @HelenaLC
Thank you for your time!
Sincerely, Estella