niaid / dsb

Normalize CITEseq Data

Should we scale the data? #34

Closed fredust closed 2 years ago

fredust commented 2 years ago

Hello, thank you for designing this great package. I noticed that in the end-to-end vignettes you also include ScaleData in the Seurat WNN workflow using dsb normalized data. However, in this issue https://github.com/niaid/dsb/issues/4#issuecomment-618435439 you mention not to run ScaleData. Does that mean you now recommend scaling the data before downstream analysis? I have another question: if I perform a subset analysis of the dataset (for example, exploring the T cell subset), I don't need to normalize the CITE-seq data again, is that correct? Is re-scaling the data all that is needed for the subset analysis?

Thank you very much.

MattPM commented 2 years ago

Hi @fredust

I'd generally not recommend re-normalizing the data when you subset to a certain cell population. There is no universally right answer on whether to scale data for a downstream analysis task; it depends on how you want to interpret the results. For example, suppose you fit a linear model of protein expression in subset A vs. other cell types and get an effect size of 1. If you scaled the data, that means the average expression in subset A was 1 standard deviation higher than the mean across all other cells; if you didn't scale, it means the dsb normalized value was 1 unit greater than the mean across cells. The same applies after subsetting: you interpret effect sizes (or visualizations in a heatmap, on a 2D dimensionality reduction, etc.) relative to the other cells included in the analysis.
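A toy illustration of the interpretation point above (all variable names here are made up for the example): the only difference between the two fits is the unit of the coefficient.

```r
# dsb-normalized expression of one protein for six hypothetical cells,
# with an indicator for membership in "subset A"
expr <- c(4.1, 3.8, 5.2, 6.0, 6.3, 5.9)
in_A <- c(0, 0, 0, 1, 1, 1)

# Unscaled: the in_A coefficient is in dsb normalized units
coef(lm(expr ~ in_A))["in_A"]

# Scaled: the same coefficient is now in standard deviations of expr
coef(lm(scale(expr) ~ in_A))["in_A"]
```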

Regarding multimodal clustering, you can scale the data or not, it is up to you. If you check out the end to end workflow for WNN, there are two methods listed:

Method 1 – Seurat WNN default with PCA on dsb normalized protein
Method 2 – Seurat WNN with dsb normalized protein directly, without PCA

In method 2, even though ScaleData is run, this is only to prevent an error in Seurat; if you look at the actual input values, pseudo, you will see they are the dsb normalized values: s@reductions$pdsb@cell.embeddings = pseudo

You could also use the scaled values of the proteins there. This is a hack I made up, but it worked better for me than PCA, so I included it in the vignette. If you are using one of the lyophilized panels with 200+ proteins, using PCA (method 1) might work well. Hope that helps.
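The "method 2" hack described above can be sketched as follows. This is a minimal sketch assuming a Seurat object `s` with dsb normalized values in the data slot of a "CITE" assay and an existing "pca" reduction on RNA; the reduction name `pdsb` and the `dims.list` ranges are illustrative, not prescribed.

```r
library(Seurat)

# ScaleData is run only to satisfy Seurat's internal checks; the scaled
# values are not what gets clustered
DefaultAssay(s) <- "CITE"
s <- ScaleData(s, assay = "CITE", verbose = FALSE)

# Build a "pseudo" dimensionality reduction whose embeddings are simply
# the dsb normalized values themselves (one "dimension" per protein)
pseudo <- t(GetAssayData(s, assay = "CITE", slot = "data"))
colnames(pseudo) <- paste0("pseudo_", 1:ncol(pseudo))
s[["pdsb"]] <- CreateDimReducObject(
  embeddings = pseudo, key = "pseudo_", assay = "CITE"
)

# Run WNN using the RNA PCA and the dsb pseudo-reduction
s <- FindMultiModalNeighbors(
  s, reduction.list = list("pca", "pdsb"),
  dims.list = list(1:30, 1:ncol(pseudo))
)
```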

fredust commented 2 years ago


Thank you very much. I used TotalSeq-C, so it's a panel with 131 antibodies + 6 isotype controls. I chose Method 1 before because it was similar to the original Seurat multimodal method, but I will try Method 2 to see if it works better.

RoseString commented 10 months ago

Just want to share that for my data (also ~130 antibodies), the UMAP generated with Method 1 was very strange (long, oval-shaped clusters).

(screenshot: UMAP from Method 1 showing elongated oval-shaped clusters)

What worked much better for me was to first identify the top (e.g., 30) variable proteins using FindVariableFeatures(), and then continue with Method 2 (no ScaleData or PCA).

(screenshot: UMAP from Method 2 restricted to the top 30 variable proteins)
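The variable-protein variant described above can be sketched like this; it is an assumption-laden sketch (object `s`, assay name "CITE", reduction name `pdsb`, and the choice of 30 features are all illustrative), combining FindVariableFeatures() with the method 2 pseudo-reduction.

```r
library(Seurat)

DefaultAssay(s) <- "CITE"

# Rank proteins by variability and keep the top 30 (example cutoff)
s <- FindVariableFeatures(s, assay = "CITE", nfeatures = 30)
top_prot <- VariableFeatures(s, assay = "CITE")

# Feed only the variable proteins' dsb values into the pseudo-reduction
pseudo <- t(GetAssayData(s, assay = "CITE", slot = "data")[top_prot, ])
colnames(pseudo) <- paste0("pseudo_", seq_along(top_prot))
s[["pdsb"]] <- CreateDimReducObject(
  embeddings = pseudo, key = "pseudo_", assay = "CITE"
)
```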