scverse / muon

muon is a multimodal omics Python framework
https://muon.scverse.org/
BSD 3-Clause "New" or "Revised" License

Discrete values after using the CLR normalization pt.pp.clr #144

Open LucHendriks opened 5 months ago

LucHendriks commented 5 months ago

Description

Question on the output of CLR normalization of protein data. When we apply muon.prot.pp.clr() to our protein counts and plot the resulting values per protein, the output shows bands of discrete values in the low range. See the images below for the raw data and for the data normalized with muon's CLR function.

[Two screenshots, 2024-06-27: raw data and data normalized with muon's CLR function]

To Reproduce

The analysis was run on a random subset of the data because the original dataset is large. The data is a 10x Genomics CITE-seq dataset with 137 proteins.

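import numpy as np
from muon import prot as pt

# `mdata` is the MuData object loaded earlier; it holds the 'prot' modality

# Check the total number of observations
n_obs = mdata['prot'].n_obs

# Determine the size of the subsample (capped at the number of observations)
subsample_size = min(100000, n_obs)

# Randomly select the observations, without replacement
np.random.seed(123)
sample_indices = np.random.choice(n_obs, subsample_size, replace=False)

# Create the subsample as an independent copy
subsample = mdata['prot'][sample_indices, :].copy()

# Normalize a copy and store the result in a separate layer
normalized_counts = pt.pp.clr(subsample, inplace=False)
subsample.layers['clr_dev'] = normalized_counts.X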

Expected behaviour

After a log transformation we would normally expect continuous-looking data, not the discrete values observed here in the lower range. Could this be due to 0 values not being handled correctly?
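To make the question concrete, here is a minimal NumPy sketch on synthetic data. It assumes a Seurat-style CLR of the form log1p(x / exp(mean(log1p(x)))) computed per protein; that formula is an assumption for illustration and has not been checked line by line against the linked source. The helper `clr_sketch` and the Poisson counts are hypothetical and are not part of muon. Under this assumption every integer count of a protein maps to exactly one output value, so a low-abundance protein can only take a handful of distinct values, and zeros stay at exactly 0.

```python
import numpy as np


def clr_sketch(x, axis=0):
    """Hypothetical Seurat-style CLR: log1p(x / exp(mean(log1p(x)))) along `axis`."""
    x = np.asarray(x, dtype=float)
    # One scaling factor per protein (axis=0 averages over cells)
    geo_mean = np.exp(np.log1p(x).mean(axis=axis, keepdims=True))
    return np.log1p(x / geo_mean)


rng = np.random.default_rng(0)
# Synthetic counts: 1000 cells x 3 proteins with low, medium and high abundance
counts = rng.poisson(lam=[0.5, 5.0, 200.0], size=(1000, 3))
normalized = clr_sketch(counts, axis=0)

# Look at the low-abundance protein only
low_raw = counts[:, 0]
low_clr = normalized[:, 0]

# The transform is strictly increasing within a protein, so the number of
# distinct values is the same before and after normalization
n_unique_raw = np.unique(low_raw).size
n_unique_clr = np.unique(low_clr).size

# Zero counts map to log1p(0) = 0 exactly
zeros_preserved = bool(np.all(low_clr[low_raw == 0] == 0.0))

# Spacing between neighbouring bands, from the lowest band upwards
band_gaps = np.diff(np.unique(low_clr))

print(n_unique_raw, n_unique_clr, zeros_preserved)
print(band_gaps)
```

If the real function behaves like this sketch, the bands would be a consequence of the integer input rather than of mishandled zeros: the spacing between bands is widest near zero and shrinks as counts grow, which is why the high range looks continuous.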

System

Additional context https://github.com/scverse/muon/blob/94917d23291f329a19b3c282276c960d414319ad/muon/_prot/preproc.py#L201-L240