scCOVID-19 / COVIDPBMC

Multi-omics profiling of peripheral immune system response to SARS-CoV-2 infection
GNU General Public License v3.0
40 stars 15 forks source link

Replicate UMAP projection of initial clustering [Stephenson et al. (2021) - Fig. 1b] #4

Closed DylanMannKrzisnik closed 2 years ago

DylanMannKrzisnik commented 2 years ago

I am attempting to replicate the UMAP projection show in Figure 1b of Stephenson et al. (2021).

The data I am using is imported from 'covid_portal_210320_with_raw.h5ad' (I believe 'haniffa21.processed.h5ad' is equivalent) into an annotated data variable using scanpy. This variable contains the UMAP projection (i.e. adata.obsm['X_umap']) which indeed matches the aforementioned Figure 1b - so far so good.

However, I would like to recreate the steps which lead to the UMAP projection. I am mainly drawing inspiration from Stephenson et al.'s article as well as this script (which is part of this repo) and replicating the steps using the imported adata variable. So far, I am unsuccessful.

I figured the second-easiest thing to do would be to calculate connectivities on the pre-computed Harmony PCA basis followed by UMAP:

sc.pp.neighbors(adata, n_neighbors=10, n_pcs=40, use_rep = 'X_pca_harmony')
sc.tl.umap(data)

However, this fails to recreate the original figure. My impression is that I'm not using the right data. For instance, the annotated data I'm using originally has 647,366 cells whereas the caption of Figure 1b mentions the use of 781,123 cells. Moreover, it does not seem possible to simply run this script due to the data files that that script uses, which I could not find.

My general question is how would I recreate Figure 1b, asides from simply plotting adata.obsm['X_umap'] ? Your advice is appreciated.

zktuong commented 2 years ago

hi @DylanMannKrzisnik, I'll try and answer your questions:

1) the cell number discrepancies is because during the various rounds of QC/clean up/annotations, there are cells/droplets that we decide later on are more likely to be doublets or 'ambiguous' and we subsequently remove them from the analysis (and also from the final umap in Fig 1). 2) the current umap on the portal and portal corresponds to fig1b and would have been computed using the combined data prior to removing the ambiguous droplets. We took an iterative sub-clustering and clean up approach for the subsequent figures which are absent of these ambiguous droplets 3) as the first umap is only for visualising the broad structure of the data, we did not feel there was a need to recompute the umap as it would not impact on how we subsequently analysed the data. 4) I think there's no guarantee you would arrive at the same umap coordinates even if we were to provided the data with doublets because of machine differences.

DylanMannKrzisnik commented 2 years ago

Hello @zktuong, thank you for the clarifications, much appreciated. I agree with your points, such that it is improbable that I would exactly recreate the UMAP projection for the reasons you mention.

I suppose then that the UMAP coordinates stored in the annotated data object are simply the projections of the retained samples onto the UMAP embedding initially computed using the combined data?

In terms of not being to replicate Figure 1B, I mean this in a broader sense (I should have been clearer). Meaning, I would have expected mostly non-overlapping clusters for each cell-type, not that the clusters would have to look identical to those of Figure 1B from the original publication. However, if one compares the UMAP projection included in the annotated object (top plot) and those created by running sc.tl.umap on the Harmony PCA coordinates (bottom plot) there are striking dissimilarities.

Perhaps I am using a different version of Scanpy (amongst other librairies) which could explain such differences. Or perhaps the removal of the ambiguous droplets make a significant difference (I would doubt it). I've tested different parameter settings for UMAP, Harmony PCA, PCA, etc. which all yield similar star-shaped patterns as shown in the bottom plot. Perhaps there are some parameter settings which have not been specified in this script? Maybe the files into which data is saved in the script are available somewhere (i.e. 'combined_dec.h5ad' and 'combined_harmonyv3.h5ad')?

Thank you for your help and patience!

Fig1b_original Fig1b_recreate

zktuong commented 2 years ago

I suppose then that the UMAP coordinates stored in the annotated data object are simply the projections of the retained samples onto the UMAP embedding initially computed using the combined data?

yes that's correct

I've tested different parameter settings for UMAP, Harmony PCA, PCA, etc. which all yield similar star-shaped patterns as shown in the bottom plot.

yes that's a known issue with a version of scanpy/umap: https://github.com/scverse/scanpy/issues/2101

just downgrade/upgrade both scanpy and umap accordingly and it should resolve (e.g. scanpy==1.8.2 and umap-learn==0.5.1) works for me

DylanMannKrzisnik commented 2 years ago

Indeed, that seems to have been the issue. After updating to python 3.8.10, scanpy 1.9.1 and umap-learn 0.5.3, I get the following plot:

GetImage

Thanks alot for your help!

zktuong commented 2 years ago

no worries. I'll close this issue, and the other one as well now!