theislab / scib-reproducibility

Additional code and analysis from the single-cell integration benchmarking project
https://theislab.github.io/scib-reproducibility/
MIT License

unintegrated HVG only #23

Closed: wconnell closed 1 year ago

wconnell commented 1 year ago

Hi, I am curious whether you show results for unintegrated data subset to highly variable genes (HVGs)?

For example, I ran my own analysis, subsetting the Human Immune dataset to 1200 HVGs. The normalization was slightly different: I took the 'counts' layer and applied library size normalization (instead of scanorama) followed by a log1p transformation. This seems to remove most batch effects by itself. The left screenshot is from the data without subsetting; the right screenshot is after subsetting to 1200 HVGs.
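A minimal sketch of the preprocessing described above, assuming a dense cells-by-genes count matrix. The function name `lognorm_and_select_hvg`, the `target_sum` of 1e4, and the toy data are assumptions for illustration, and ranking genes by variance is a simplified stand-in for a real HVG method. In practice one would more likely use `scanpy.pp.normalize_total`, `scanpy.pp.log1p`, and `scanpy.pp.highly_variable_genes`.

```python
import numpy as np


def lognorm_and_select_hvg(counts, n_top=1200, target_sum=1e4):
    """Library-size normalize, log1p-transform, and keep the n_top most variable genes.

    counts: cells x genes matrix of raw counts (dense).
    target_sum: per-cell total after normalization (assumed value, not from the thread).
    """
    counts = np.asarray(counts, dtype=float)

    # Library size normalization: scale every cell to the same total count.
    lib_size = counts.sum(axis=1, keepdims=True)
    lib_size[lib_size == 0] = 1.0  # avoid division by zero for empty cells
    normed = counts / lib_size * target_sum

    # log1p transformation.
    logged = np.log1p(normed)

    # Simplified HVG selection: rank genes by variance of the log-normalized values.
    gene_var = logged.var(axis=0)
    n_top = min(n_top, logged.shape[1])
    top_idx = np.sort(np.argsort(gene_var)[::-1][:n_top])
    return logged[:, top_idx], top_idx


# Toy data standing in for the 'counts' layer (hypothetical, not the Human Immune dataset).
rng = np.random.default_rng(0)
toy_counts = rng.poisson(lam=2.0, size=(50, 200))
X_hvg, hvg_idx = lognorm_and_select_hvg(toy_counts, n_top=20)
```

Running the metrics on both the full log-normalized matrix and `X_hvg` would give the two unintegrated baselines being compared here.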

The "AVG-" metrics are an average of the individual metrics for bio/batch. I don't think these are exactly the same aggregate metrics as in your analysis, but simply subsetting to HVGs seems to increase each score a lot. Did you look at scores relative to this baseline?
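To make the "AVG-" aggregation concrete, here is a small sketch of a plain unweighted mean per category. The metric names and values are hypothetical placeholders, not the numbers from the screenshots, and this is not necessarily how the benchmark itself aggregates scores.

```python
from statistics import mean

# Hypothetical metric values, for illustration only.
bio_metrics = {"NMI": 0.70, "ARI": 0.60, "cell_type_ASW": 0.50}
batch_metrics = {"batch_ASW": 0.80, "graph_connectivity": 0.90}

# Unweighted mean of the individual metrics in each category.
avg_bio = mean(list(bio_metrics.values()))
avg_batch = mean(list(batch_metrics.values()))

print(f"AVG-bio: {avg_bio:.2f}, AVG-batch: {avg_batch:.2f}")
```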

[Screenshots: left, metrics for the data without subsetting; right, metrics after subsetting to 1200 HVG]

lazappi commented 1 year ago

Hi @wconnell

I think you are right, we only considered the full feature set for the unintegrated data. From anecdotal evidence I have seen, feature selection by itself can sometimes do a decent job of removing differences between batches, but it depends on the dataset, the number of batches, how features are selected, the strength of the batch effects, etc.

I will also add that interpreting the raw metric scores can be difficult and you often need scores from several different methods/pre-processing steps/parameter sets to see what a "good" or "bad" score is.
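One common way to put raw scores in context is to rescale each metric across all the runs being compared, so that a score is read relative to the others. The sketch below uses min-max scaling; the run names and values are hypothetical, and the choice of min-max scaling is an assumption for illustration rather than something stated in this thread.

```python
def min_max_scale(scores):
    """Rescale one metric to [0, 1] across runs (methods or pre-processing variants)."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        # All runs scored the same, so the metric carries no relative information.
        return {name: 0.0 for name in scores}
    return {name: (value - lo) / (hi - lo) for name, value in scores.items()}


# Hypothetical raw scores for a single metric, for illustration only.
raw = {"unintegrated_full": 0.40, "unintegrated_hvg": 0.55, "method_A": 0.70}
scaled = min_max_scale(raw)
```

With only two runs, the scaled values are always 0 and 1, which is one reason several methods or parameter sets are needed before a score can be called "good" or "bad".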

wconnell commented 1 year ago

Thanks for the reply!