openproblems-bio / openproblems

Formalizing and benchmarking open problems in single-cell genomics
MIT License
290 stars 77 forks source link

Counts being passed to scvi-tools models? #695

Closed adamgayoso closed 1 year ago

adamgayoso commented 1 year ago

Take the immune dataset:

Here counts are moved to .X and the layer is deleted:

https://github.com/openproblems-bio/openproblems/blob/2d7198652875e48c63b6b6d6ab440eec27bafda0/openproblems/data/immune_cells.py#L39-L41 Here .X is made to be the log normalized expression:

https://github.com/openproblems-bio/openproblems/blob/2d7198652875e48c63b6b6d6ab440eec27bafda0/openproblems/tasks/_batch_integration/batch_integration_graph/datasets/immune.py#L25

But I'm confused because in the scib scvi wrapper it's using the counts layer, which shouldn't exist according to this code.

https://github.com/theislab/scib/blob/77ab015254754baa5ca3380bd592bcc9207241de/scib/integration.py#L212

LuckyMD commented 1 year ago

I'm actually not sure where it comes back in at the moment, but I can tell you that the counts layer still exists in the data object.

In [1]: import openproblems as op

In [2]: dat = op.data.immune_cells.load_immune()

In [3]: dat
Out[3]: 
AnnData object with n_obs × n_vars = 33506 × 12303
    obs: 'batch', 'chemistry', 'data_type', 'dpt_pseudotime', 'final_annotation', 'mt_frac', 'n_counts', 'n_genes', 'sample_ID', 'size_factors', 'species', 'study', 'tissue'
    var: 'n_cells'
    uns: 'var_names_all', '_from_cache'
    layers: 'log_normalized', 'counts'
LuckyMD commented 1 year ago

This is in the latest github release... @danielStrobl @scottgigante-immunai... any ideas why this is still there? It definitely should be...

scottgigante-immunai commented 1 year ago

.X is copied to .layers["counts"] by default in all datasets in the dataset decorator.

adamgayoso commented 1 year ago

Indeed I found this:

https://github.com/openproblems-bio/openproblems/blob/6bbcf5c3d85e5a1e793be7075870b13225a54379/openproblems/data/utils.py#L59

Though it should probably have a copy added to it explicitly incase any normalization methods operate in place

scottgigante-immunai commented 1 year ago

It's an interesting thought. I'd rather not double the memory/disk/network usage here for nothing -- what if instead we add a test to the normalization test suite ensuring that normalizations don't operate in place?

LuckyMD commented 1 year ago

Thanks @scottgigante-immunai. I recall something was done in the decorator... but I couldn't quickly find what.

LuckyMD commented 1 year ago

It's still quite strange to find seemingly contradictory lines of code like this though... deleting the counts layer explicitly, but still relying that it's there.

scottgigante-immunai commented 1 year ago

There's no real need to delete it here -- it's only repopulated if it's missing. I'll remove that line for the sake of less confusion.

adamgayoso commented 1 year ago

Also the cpm normalization adds a counts layer as a copy. Ideally you wouldn't need to look in 5 places to find what's going on :)

scottgigante-immunai commented 1 year ago

Ah, that's redundant and can be removed too :) Thanks!