Open antonioggsousa opened 2 years ago
Please refer to the documentation for a user guide for the required preprocessing steps and the API documentation for the different metrics and wrappers. Also check the scib-pipeline for the exact workflow that we used in the paper. This might help address some of your questions.
Conversion with R is something that might need to be updated at some point.
Hi,

Thank you to all the developers of `scIB` for the great package and for suggesting useful metrics to compare the performance of integration methods.

I've been trying to run the function `scib.metrics.metrics_fast()` (from `scIB`) for a while, but I'm having trouble understanding which data layers the `anndata` objects need to hold, depending on the method and parameters used during integration.

I should say that I'm mostly an `R` user and I'm interested in comparing the performance of `Seurat` and `Scanorama` across multiple data sets. I would like to run these methods following the parameters used in the paper Benchmarking atlas-level data integration in single-cell genomics - Fig.3b:

- Seurat v3 RPCA: RPCA + HVG + scaling (full gene exp. matrix)
- Scanorama: HVG + scaling (embedding)
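
To make clear what I mean by "HVG + scaling" in the settings above, here is a minimal sketch of my reading of it: select the variable genes first, subset the normalized matrix to them, and only then scale, separately for each batch. This is an assumption on my side, not code from `scIB` or `scib-pipeline`. The function name `hvg_then_scale_per_batch` is hypothetical, and ranking genes by plain variance is only a simplified stand-in for a real HVG method.

```python
import numpy as np


def hvg_then_scale_per_batch(X, batches, n_top=2000):
    """Select top-variance genes, subset, then z-score each batch separately.

    X is a cells x genes array of normalized counts and `batches` holds one
    batch label per cell. Ranking by plain variance is a simplified stand-in
    for a real HVG selection method (assumption for illustration only).
    """
    X = np.asarray(X, dtype=float)
    batches = np.asarray(batches)
    n_top = min(n_top, X.shape[1])

    # Indices of the n_top most variable genes, kept in original gene order.
    hvg_idx = np.sort(np.argsort(X.var(axis=0))[::-1][:n_top])
    X_hvg = X[:, hvg_idx]

    scaled = np.empty_like(X_hvg)
    for b in np.unique(batches):
        mask = batches == b
        mean = X_hvg[mask].mean(axis=0)
        std = X_hvg[mask].std(axis=0)
        std[std == 0] = 1.0  # avoid division by zero for constant genes
        scaled[mask] = (X_hvg[mask] - mean) / std
    return scaled, hvg_idx


# Toy data: 40 cells, 10 genes, two batches of 20 cells each.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(40, 10)) * np.arange(1, 11)
batch_toy = np.array(["a"] * 20 + ["b"] * 20)
scaled_toy, hvg_toy = hvg_then_scale_per_batch(X_toy, batch_toy, n_top=4)
```

After this step every selected gene has mean 0 and unit variance within each batch, which is what I understand by "scaled counts for the HVG by batch".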

Although I know a bit of `python`, I often prefer to avoid working with it, since I usually find myself making simple mistakes, which is quite frustrating. Additionally, I'm less familiar with general `python` packages (`numpy`, `pandas`, etc.) as well as with the more specific scRNA-seq package `scanpy`, which makes the code less readable for me. For all these reasons, I decided to run `Seurat` and `Scanorama` (with `reticulate`) in `R` and use the final result to compare the performance with `scIB`, so that I keep the flexibility to change parameters in R for the integration tasks while minimizing potential errors when using python. I should also say that I'm not using the snakemake workflow `scib-pipeline` because I have my own workflow and I don't know how hard it would be to adapt it to run the integration methods as implemented in `scib-pipeline`.

In order to try to reproduce the Seurat v3 RPCA method and parameters, I tried the following (please see a minimal reproducible example below). Briefly, the data set was split by the `batch` variable, 2K HVG were found and the data were scaled. If I understood correctly, there isn't a wrapper function for the method `Seurat` in the package `scIB`, since `Seurat` is implemented in `R`. After reading the code from the repository `scib-pipeline`, I understood that the conversion between `anndata` and `Seurat` objects is made with the `as.SingleCellExperiment()` and `as.Seurat()` functions. My understanding is that the `@data` slot from the `Seurat` object will hold the scaled counts, and these are for all the genes, contrary to the python integration methods, where the HVG are found first, the table of normalized counts is subset to the HVG and only then is scaling performed. On the other hand, in `Seurat` the scaled counts in the `@data` slot will be scaled again for the HVG exported in `RDS` format. Is my interpretation correct?

In the case of `Scanorama`, it seems simpler, since from the code it seems that the input provided (considering the parameters mentioned above) is a table of scaled counts for the HVG by `batch` (see short example below). Is my interpretation correct?

Finally, I'm assessing the performance with
`scIB` (see below). Here, it seems that I need to fix the type of the `category` variables and compute the KNN graph with `scanpy`, because `SeuratDisk` seems unable to save the `connectivities` of this graph to the right place, `anndata.obsp["connectivities"]` (perhaps due to incompatibility between versions). Here, my understanding is the following:

- unintegrated anndata object: it needs to hold normalized counts in the `adata.X` layer and it needs to contain cell labels (depending on the metrics that I want to assess).
- integrated anndata object (if `type_="full"`): it needs to hold the corrected scaled counts after integration, in the case of `Seurat` at least, in the `adata.X` layer (there will be as many genes as specified in the HVG parameter).
- integrated anndata object (if `type_="embed"`): it needs to hold scaled counts in the `adata.X` layer (there will be as many genes as specified in the HVG parameter) and the `embedding` in `adata_int.obsm["X_pca"]` or `adata_int.obsm["X_emb"]` (because line).

Just to summarise and rephrase the questions above (which might be simpler):

- Which is the exact input for `Seurat` and `Scanorama` if I'm interested in using the top-performing parameters obtained in the paper (which were, for these methods: Seurat v3 RPCA (RPCA + HVG + scaling - full gene exp. matrix); Scanorama (HVG + scaling - embedding))?
- What type of information do the unintegrated and integrated `anndata` objects need to provide to the function `scib.me.metrics_fast()`, depending on whether I use type equal to `full` or `embed`?

Sorry for the long post.
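
To make the second question concrete, here is a minimal sketch of the layout I am currently assuming, written as a small check. It encodes only my own understanding from this post, not the documented requirements of `scIB`. The helper name `check_layout` and its arguments are hypothetical, and plain `SimpleNamespace` stand-ins are used instead of real `anndata` objects so that the snippet runs without `scanpy`.

```python
from types import SimpleNamespace

import numpy as np


def check_layout(adata, adata_int, batch_key, label_key, type_, embed="X_pca"):
    """Return a list of problems with the layout assumed in this post.

    Hypothetical helper: it only checks that both objects carry the batch and
    label columns, hold the same number of cells, and, for type_="embed",
    that the integrated object has the embedding in `.obsm[embed]`.
    """
    problems = []
    for name, obj in (("adata", adata), ("adata_int", adata_int)):
        for key in (batch_key, label_key):
            if key not in obj.obs:
                problems.append(f"{name}.obs is missing '{key}'")
    if adata.X.shape[0] != adata_int.X.shape[0]:
        problems.append("objects hold different numbers of cells")
    if type_ == "embed" and embed not in adata_int.obsm:
        problems.append(f"adata_int.obsm is missing '{embed}'")
    return problems


# Stand-ins for anndata objects: 5 cells, 100 genes before and 20 HVG after.
obs = {"batch": ["a", "a", "b", "b", "b"], "cell_type": ["T", "B", "T", "B", "T"]}
unintegrated = SimpleNamespace(X=np.zeros((5, 100)), obs=obs, obsm={})
integrated = SimpleNamespace(
    X=np.zeros((5, 20)), obs=obs, obsm={"X_emb": np.zeros((5, 2))}
)

issues_emb = check_layout(
    unintegrated, integrated, "batch", "cell_type", "embed", embed="X_emb"
)
issues_pca = check_layout(unintegrated, integrated, "batch", "cell_type", "embed")
```

With these stand-ins the first call finds nothing to report, while the second one flags the missing `X_pca` entry, which is the kind of mismatch I would like to rule out before computing the metrics.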
Thank you in advance for the great package and any feedback that you may give.
Best regards,
António