openproblems-bio / openproblems

Formalizing and benchmarking open problems in single-cell genomics
MIT License

Pancreas data is dense and contains empty genes #102

Closed: scottgigante closed this issue 1 year ago

scottgigante commented 4 years ago

Minimal code sample (that we can copy&paste without having any data)

import openproblems
adata = openproblems.tasks.label_projection.datasets.pancreas_batch()
type(adata.X)
# <class 'numpy.ndarray'>
adata.X.sum(axis=0).min()
# 0.0

LuckyMD commented 4 years ago

I just changed the URL for the pancreas datasets yesterday, I think. I can add cell and gene filtering in the data loader.
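
For reference, the filtering could look something like this in the loader (a sketch, assuming scanpy is available and adata is the freshly loaded object):

import scanpy as sc

# Drop genes and cells with zero total counts right after loading
sc.pp.filter_genes(adata, min_counts=1)
sc.pp.filter_cells(adata, min_counts=1)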

LuckyMD commented 4 years ago

I filtered out the 6 genes with 0 counts from the dataset I uploaded in #104. It seems I uploaded a dense matrix, as that's what we used in the benchmarking. Would you like me to upload a sparse version, or is it sufficient to make it sparse in the data loader?

scottgigante commented 4 years ago

I think it makes sense to make the download sparse -- it decreases download time and memory use in loading quite substantially.

flo-compbio commented 3 years ago

I don't understand the reason for requiring all genes to have a non-zero sum in all datasets. Methods should definitely be able to handle datasets where some genes are not expressed.

scottgigante commented 3 years ago

There is just no reason to include these genes though, right? Plenty of methods do have issues with empty cells or empty genes, so why bother fixing every method when it's easy just to remove them?

flo-compbio commented 3 years ago

I don't know, I would argue that in most contexts where you work with data, features can be all zero without it creating an issue. Measurements of zero are also measurements. Also, depending on how you preprocess a single-cell dataset (filtering cells, etc.), individual features can become completely zero or not. It sometimes creates more problems if the feature set changes for each version of the data just because you processed it slightly differently. I think this whole removing-genes thing has become a bit of a habit in the field, simply because you sometimes start with matrices that contain so many non-coding/unexpressed genes and so many cells that they just don't fit into memory unless you filter out as many genes as possible. To me that shouldn't mean that every dataset that contains a few all-zero genes should somehow be considered "invalid".

LuckyMD commented 3 years ago

Processing should not create zeros, but rather remove them. I would also argue that a dataset that has 0-count genes or cells is not sufficiently filtered/QC'ed... unless it's part of a larger dataset collection that will be merged, in which case this is a desired state. A 0-count gene does not provide any further information for downstream processing, so why keep it? It's ignored in embeddings, which typically use HVGs, and I can't do DE analysis on it due to the lack of a defined variance. If I decide to extend my genome with a few very specific gene variants, I will have more additional 0-count genes. That doesn't mean I have any more information.

I'm frankly surprised that all of the methods we benchmarked for data integration could deal with genes with 0 counts... I would have bet ComBat would have issues with this, as 0-count genes have no definable variance.
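
To illustrate the variance point, here's a minimal sketch (not ComBat itself, just the per-gene standardization such methods rely on) of what an all-zero gene does:

import numpy as np

X = np.random.poisson(2.0, (100, 5)).astype(float)
X[:, 0] = 0  # simulate an all-zero gene

# Per-gene z-scoring, the kind of standardization ComBat does internally
mu = X.mean(axis=0)
sd = X.std(axis=0)
Z = (X - mu) / sd  # 0/0 for the empty gene
print(np.isnan(Z[:, 0]).all())
# True (plus a divide-by-zero RuntimeWarning)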

LuckyMD commented 3 years ago

@scottgigante so apparently sparsifying the expression data turns the pancreas dataset from 300MB into 850MB. It seems the sparsity of the expression matrix is quite low. Maybe it shouldn't be changed in that case?

I will double check this... might be making a mistake here... who knows.

LuckyMD commented 3 years ago

Confirmed that 18.1% of the matrix is non-zero. The other issue is that adata.layers['counts'] is float32, as there are non-integer values in there from full-length protocols that are only available as RPKM (the Xin et al. data). So I don't think this should be provided as sparse, tbh.

I would then regard this issue as solved, as filtering was added to the data loader to address the empty genes.

scottgigante commented 3 years ago

It doesn't make much sense to me that an 18% non-zero sparse matrix would be larger than a dense matrix of the same size. Example:

import numpy as np
import scipy.sparse
import pickle
import os

# Dense 5000x1000 float32 matrix with 18% of columns non-zero
X = np.zeros((5000, 1000), dtype=np.float32)
X[:, :180] = np.random.normal(0, 1, (5000, 180))
with open("dense.pkl", "wb") as f:
    pickle.dump(X, f)

os.stat("dense.pkl").st_size // 1024
# 19531

X_sparse = scipy.sparse.csr_matrix(X)
with open("sparse.pkl", "wb") as f:
    pickle.dump(X_sparse, f)

os.stat("sparse.pkl").st_size // 1024
# 7051
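
The numbers line up with a back-of-envelope estimate: dense float32 costs 4 bytes per element, while CSR with float32 data and int32 indices costs ~8 bytes per stored non-zero plus a small indptr array, so the break-even point is around 50% density:

n_elements = 5000 * 1000
n_elements * 4 // 1024              # dense: 19531 KiB
int(n_elements * 0.18) * 8 // 1024  # CSR data + indices: 7031 KiB
(5000 + 1) * 4 // 1024              # CSR indptr: ~19 KiB, giving ~7050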

Here's my attempt at sparsifying the pancreas data:

import os
import numpy as np
import scipy.sparse
import openproblems

adata = openproblems.data.pancreas.load_pancreas(test=False)
adata.write_h5ad("dense.h5ad")
os.stat("dense.h5ad").st_size // (1024**2)
# 1170

adata.X = scipy.sparse.csr_matrix(adata.X)
adata.write_h5ad("sparse.h5ad")
os.stat("sparse.h5ad").st_size // (1024**2)
# 426

# The largest column index fits in uint16, so the indices can be downcast
adata.X.indices.max()
# 18696
np.iinfo("uint16").max
# 65535
adata.X.indices = adata.X.indices.astype(np.uint16)
adata.X.data = adata.X.data.astype(np.float16)
adata.obs['n_counts'] = adata.obs['n_counts'].astype(np.float16)
adata.var['n_counts'] = adata.var['n_counts'].astype(np.float16)
adata.write_h5ad("sparse_optimized.h5ad")
os.stat("sparse_optimized.h5ad").st_size // (1024**2)
# 214

LuckyMD commented 3 years ago

Not sure what to tell you... I've spent the past hour or so trying to figure out what else it could be, but I'm just doing adata.X = scipy.sparse.csr_matrix(adata.X) and the same for the counts layer, and I'm getting something that's more than twice the size after sparsifying... All I'm doing is the above. Not sure if scipy.sparse.csr_matrix() by default converts to float64 or something like that... but otherwise I'm out of ideas, tbh.
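
(For what it's worth, a quick check suggests that's not it: csr_matrix preserves the input dtype.)

import numpy as np
import scipy.sparse

# csr_matrix keeps the dtype of the array it is built from
scipy.sparse.csr_matrix(np.zeros((2, 2), dtype=np.float32)).dtype
# dtype('float32')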

LuckyMD commented 3 years ago

Is that 1.17GB on disk for the dense matrix? The object I'm building this from is ~300MB on Figshare as well...

Maybe there's something in adata.raw that is being thrown away in openproblems that causes this? It's very confusing either way...

LuckyMD commented 3 years ago

In [29]: adata_panc = sc.read("/storage/groups/ml01/workspace/scIB/pancreas_joint/pancreas_jointnorm.h5ad")

In [37]: os.stat("pancreas_jointnorm.h5ad").st_size // 1024**2
Out[37]: 301

In [38]: from scipy import sparse

In [39]: adata_panc.X = sparse.csr_matrix(adata_panc.X)

In [40]: adata_panc.layers['counts'] = sparse.csr_matrix(adata_panc.layers['counts'])

In [41]: adata_panc.X
Out[41]:
<16382x19093 sparse matrix of type '<class 'numpy.float32'>'
    with 55644200 stored elements in Compressed Sparse Row format>

In [42]: pwd
Out[42]: '/mnt/znas/icb_zstore01/groups/ml01/workspace/scIB/pancreas_joint'

In [43]: adata_panc.write("test_sparse.h5ad")

In [45]: os.stat("test_sparse.h5ad").st_size // 1024**2
Out[45]: 851

I'm doing the same thing you are... it's just not working for me ^^. This is a slightly different object though, as I haven't done any preprocessing yet... Mine is just the object from Figshare.

scottgigante commented 3 years ago

Looks like your initial file was compressed, which is why it's small on disk but large in memory. Also, since we remove adata.layers["counts"] in the loader, we probably shouldn't include it in the initial file.

import anndata
import os
import scipy.sparse
import numpy as np

os.stat("human_pancreas_norm_complexBatch.h5ad").st_size // (1024**2)
# 301

adata = anndata.read_h5ad("human_pancreas_norm_complexBatch.h5ad")
adata.write_h5ad("dense.h5ad")
os.stat("dense.h5ad").st_size // (1024**2)
# 2388

adata.write_h5ad("dense_compressed.h5ad", compression='gzip', compression_opts=9)
os.stat("dense_compressed.h5ad").st_size // (1024**2)
# 271

# Store the raw counts sparsely as X and drop the now-redundant layer
adata.X = scipy.sparse.csr_matrix(adata.layers["counts"])
del adata.layers["counts"]
adata.write_h5ad("sparse.h5ad")
os.stat("sparse.h5ad").st_size // (1024**2)
# 426

adata.write_h5ad("sparse_compressed.h5ad", compression='gzip', compression_opts=9)
os.stat("sparse_compressed.h5ad").st_size // (1024**2)
# 136

adata.X.indices = adata.X.indices.astype(np.uint16)
adata.X.data = adata.X.data.astype(np.float16)
adata.obs['size_factors'] = adata.obs['size_factors'].astype(np.float16)
adata.write_h5ad("sparse_optimized.h5ad")
os.stat("sparse_optimized.h5ad").st_size // (1024**2)
# 214

adata.write_h5ad("sparse_optimized_compressed.h5ad", compression='gzip', compression_opts=9)
os.stat("sparse_optimized_compressed.h5ad").st_size // (1024**2)
# 117

LuckyMD commented 3 years ago

Damn... you're right... compressed writing was the default in anndata until not too long ago. The file is just a little older.
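
If you want to double-check a file, something like this should show it (a sketch; it assumes a dense X is stored as a plain HDF5 dataset, while sparse X is a group):

import h5py

with h5py.File("human_pancreas_norm_complexBatch.h5ad", "r") as f:
    x = f["X"]
    if isinstance(x, h5py.Dataset):  # dense matrix
        print(x.compression, x.compression_opts)
    else:                            # CSR/CSC stored as a group
        print({k: v.compression for k, v in x.items()})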

I can of course change the uploaded file to remove the counts layer, but then I would have to create a separate Figshare entry for this, as it would no longer be the file we are using for benchmarking. And when I get around to adding the data integration benchmarking task, we perform preprocessing for that separately from the methods... as there is quite a bit of preprocessing to do. So there I would probably just use the preprocessed data that we have on Figshare anyway. I wonder how rigid we want to be about the data always being provided as counts for each task.

scottgigante commented 3 years ago

I think long term the data for open problems will be on S3 anyway, so I imagine the storage will be separate regardless. I lean towards sticking with counts unless there's a very good reason not to, since some analyses may rely on probabilistic assumptions about the counts.

LuckyMD commented 3 years ago

For data integration we provide both adata.X and adata.layers['counts'] and have been working with that so far. Methods that require counts can get them (but note that not all datasets will have counts available, as some use full-length protocols and are typically quoted in TPM or RPKM).

As for S3... I thought the overall data loaders should still be shared if I provide a dataset that might be used across tasks? Then once I add the scIB stuff, I would change the pancreas data loader to move the prep function to the task-specific loader.

scottgigante commented 3 years ago

> For data integration we provide both adata.X and adata.layers['counts'] and have been working with that so far.

Sure, but didn't we agree that for this project, we would just provide the raw count matrix (or nearest equivalent) and leave normalization to the method developer?

> I thought the overall data loaders should still be shared if I provide a dataset that might be used across tasks?

Yep, that's right -- but starting from raw counts would be true of all tasks, no? All I mean here is that it doesn't necessarily make sense to try to keep the same Figshare URL for this project and the data integration project.

LuckyMD commented 3 years ago

> Sure, but didn't we agree that for this project, we would just provide the raw count matrix (or nearest equivalent) and leave normalization to the method developer?

We did... but I don't think this is feasible with full-length scRNA-seq data, as I wrote above. We might need to make this task-specific.

> All I mean here is that it doesn't necessarily make sense to try to keep the same Figshare URL for this project and the data integration project.

I agree that we don't have to do this. But in the end, I think we might not be able to start from raw counts all the time for all tasks, as raw counts don't always exist.

scottgigante commented 3 years ago

Sure, that makes sense. I think it is perfectly sensible to treat full-length scRNA-seq as qualitatively different from count data.

LuckyMD commented 3 years ago

So if you have an anndata object with both full-length and count data in there, you can no longer rely on our normalization methods... so you would have to provide it as already-normalized data then, no? Unless you want to add a standardized annotation to say whether this is count data or not.
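
Something like this hypothetical annotation could work (a sketch only; neither the adata.uns field nor the function is an existing openproblems convention, and adata is assumed to be a loaded dataset):

import scanpy as sc

# Hypothetical convention: tag each dataset with its archetype so task
# loaders can branch on it
adata.uns["data_archetype"] = "umi_counts"  # or e.g. "full_length_rpkm"

def normalize_for_task(adata):
    if adata.uns.get("data_archetype") == "umi_counts":
        sc.pp.normalize_total(adata, target_sum=1e4)
        sc.pp.log1p(adata)
    # full-length data quoted in TPM/RPKM is assumed pre-normalized
    return adata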

dburkhardt commented 3 years ago

This seems like an extension of the ATAC-seq conversation, where we've been assuming UMI counts, but not all "single cell" is 3' UMI counts. I think we want to be careful of letting in too many data types too quickly, but it does seem clear that we need to identify specific "data archetypes" and make it clear which one is being used in each task/dataset.

LuckyMD commented 3 years ago

I agree that this discussion sounds very similar. We could define clear "data archetypes" and control this centrally (though normalization can become very difficult if we have combinations of data archetypes), or we could make it more malleable/less defined and do this on a task level. I don't think it can be done on a dataset level, as the methods in a task should apply to most datasets... so it can't be that sometimes data is pre-normalized and sometimes it is not.

Although I was initially in favour of controlling everything, I now feel like it might be easier to do this on a task level...

scottgigante commented 3 years ago

I lean towards task-level with strong recommendations towards less preprocessing where possible?

LuckyMD commented 3 years ago

I would agree. If we all agree on this (@dburkhardt?) and that combinations of data archetypes warrant preprocessed versions of the data, then I will upload the new pancreas dataset to Figshare in the benchmarking data integration entry, and adapt the pancreas data loader to not throw away the preprocessed data, but do this in the task loader instead, to make it more adaptable for the future data integration task.