Closed scottgigante closed 1 year ago
I just changed the URL for the pancreas datasets yesterday I think. I can add the cell and gene filtering in the data loader.
filtered out the 6 genes with 0 counts from the dataset I uploaded in #104. It seems I uploaded a dense matrix, as we used that in the benchmarking. Would you like me to upload a sparse version of this or is it sufficient to make it sparse in the data loader?
I think it makes sense to make the download sparse -- it decreases download time and memory use in loading quite substantially.
I don't understand the reason for requiring all genes to have a non-zero sum in all datasets. Methods should definitely be able to handle datasets where some genes are not expressed.
There is just no reason to include these genes though, right? Plenty of methods do have issues with empty cells or empty genes, so why bother fixing every method when it's easy just to remove them?
I don't know, I would argue that in most contexts where you work with data, features can be all zero without it creating an issue. Measurements of zero are also measurements. Also, depending on how you preprocess a single-cell dataset (filter cells etc.) individuals features can become completely zero or not. It sometimes creates more problems if the feature set changes for each version of the data just because you processed it slightly differently. I think this whole removing genes thing has become a bit of a habit in the field simply because you sometimes start with matrices that contain so many non-coding/unexpressed genes and so many cells that they just don't fit into the memory unless you filter out as many genes as possible. To me that shouldn't mean that every dataset that contains a few all-zero genes should somehow be considered "invalid".
Processing should not create zeros, but rather remove them. I would also argue that a dataset that has 0 count genes or cells is not sufficiently filtered/QC'ed.. unless it's a part of a larger dataset collection that will be merged, in which case this is a desired state. A 0 count gene does not provide any further information for downstream processing, so why keep it? It's ignored in embeddings which typically use HVGs; I can't do DE analysis on it due to lack of defined variance. If I decide to extend my genome with a few very specific gene variants I will have more additional 0 count genes. That doesn't mean I have any more information.
I'm frankly surprised that all of the methods we benchmarked for data integration could deal with genes with 0 counts... I could have bet ComBat would have issue with this as 0 count genes have no definable variance.
@scottgigante so apparently sparsifying the expression data turns the pancreas dataset from 300MB into 850MB. It seems the sparsity of the expression matrix is quite low. Maybe it shouldn't be changed in that case?
I will double check this... might be making a mistake here... who knows.
Confirmed that the 18.1% of the matrix is non-zero. The other issue is that the adata.layers['counts']
are float32
as there are non-integer values in there with full-length protocols that are available as RPKM (Xin et al data). So I don't think this should be provided as sparse tbh.
I would then regard this issue as solved, as filtering was added to the data loader to address the empty genes.
It doesn't make much sense to me that a 18% non-zero sparse matrix would be larger than a dense matrix of the same size. Example:
import numpy as np
import scipy.sparse
import pickle
import os
X = np.zeros((5000, 1000), dtype=np.float32)
X[:,:180] = np.random.normal(0, 1, (5000, 180))
with open("dense.pkl", "wb") as f:
pickle.dump(X, f)
os.stat("dense.pkl").st_size // 1024
# 19531
X_sparse = scipy.sparse.csr_matrix(X)
with open("sparse.pkl", "wb") as f:
pickle.dump(X_sparse , f)
os.stat("sparse.pkl").st_size // 1024
# 7051
Here's my attempt at sparsifying the pancreas data:
import openproblems
adata = openproblems.data.pancreas.load_pancreas(test=False)
adata.write_h5ad("dense.h5ad")
os.stat("dense.h5ad").st_size // (1024**2)
# 1170
adata.X = scipy.sparse.csr_matrix(adata.X)
adata.write_h5ad("sparse.h5ad")
os.stat("sparse.h5ad").st_size // (1024**2)
# 426
adata.X.indices.max()
18696
np.iinfo("uint16").max
65535
adata.X.indices = adata.X.indices.astype(np.uint16)
adata.X.data = adata.X.data.astype(np.float16)
adata.obs['n_counts'] = adata.obs['n_counts'].astype(np.float16)
adata.var['n_counts'] = adata.var['n_counts'].astype(np.float16)
os.stat("sparse_optimized.h5ad").st_size // (1024**2)
# 214
Not sure what to tell you... but I've spent the past hour or so trying to figure out what else it could be... but I'm just doing a adata.X = scipy.sparse.csr_matrix(adata.X)
and the same for the counts layer, and I'm getting sth that's more than twice the size after sparsifying... All I'm doing is the above... Not sure if scipy.sparse.csr_matrix()
by default converts to float64
or sth like that... but otherwise I'm out of ideas tbh...
is that 1.17GB on disk for the dense matrix? The object I'm building this from is ~300MB also on Figshare...
Maybe there's something in adata.raw
that is being thrown away in openproblems
that causes this? It's very confusing either way...
In [29]: adata_panc = sc.read("/storage/groups/ml01/workspace/scIB/pancreas_joint/pancreas_jointnorm.h5ad")
In [37]: os.stat("pancreas_jointnorm.h5ad").st_size // 1024**2
Out[37]: 301
In [38]: from scipy import sparse
In [39]: adata_panc.X = sparse.csr_matrix(adata_panc.X)
In [40]: adata_panc.layers['counts'] = sparse.csr_matrix(adata_panc.layers['counts'])
In [41]: adata_panc.X
Out[41]:
<16382x19093 sparse matrix of type '<class 'numpy.float32'>'
with 55644200 stored elements in Compressed Sparse Row format>
In [42]: pwd
Out[42]: '/mnt/znas/icb_zstore01/groups/ml01/workspace/scIB/pancreas_joint'
In [43]: adata_panc.write("test_sparse.h5ad")
In [45]: os.stat("test_sparse.h5ad").st_size // 1024**2
Out[45]: 851
I'm doing the same thing you are... it's just not working for me ^^. This is a slightly different object though, as I haven't done any preprocessing yet... Mine is just the object from Figshare.
Looks like your initial file was compressed, which is why it's small on disk but large in memory. Also, since we remove adata.layers["counts"] in the loader, we probably shouldn't include it in the initial file.
import anndata
import os
import scipy.sparse
import numpy as np
os.stat("human_pancreas_norm_complexBatch.h5ad").st_size // (1024**2)
# 301
adata = anndata.read_h5ad("human_pancreas_norm_complexBatch.h5ad")
adata.write_h5ad("dense.h5ad")
os.stat("dense.h5ad").st_size // (1024**2)
# 2388
adata.write_h5ad("dense_compressed.h5ad", compression='gzip', compression_opts=9)
os.stat("dense_compressed.h5ad").st_size // (1024**2)
# 271
adata.X = scipy.sparse.csr_matrix(adata.layers["counts"])
del adata.layers["counts"]
adata.write_h5ad("sparse.h5ad")
os.stat("sparse.h5ad").st_size // (1024**2)
# 426
adata.write_h5ad("sparse_compressed.h5ad", compression='gzip', compression_opts=9)
os.stat("sparse_compressed.h5ad").st_size // (1024**2)
# 136
adata.X.indices = adata.X.indices.astype(np.uint16)
adata.X.data = adata.X.data.astype(np.float16)
adata.obs['size_factors'] = adata.obs['size_factors'].astype(np.float16)
adata.write_h5ad("sparse_optimized.h5ad")
os.stat("sparse_optimized.h5ad").st_size // (1024**2)
# 214
adata.write_h5ad("sparse_optimized_compressed.h5ad", compression='gzip', compression_opts=9)
os.stat("sparse_optimized_compressed.h5ad").st_size // (1024**2)
# 117
Damn... you're right... this was default in anndata until not too long ago. The file is just a little older.
I can of course change the uploaded file to remove the counts layer, but then I would have to create a separate Figshare for this... as it would no longer be the file we are using for benchmarking. And when I get around to adding the data integration benchmarking task, we perform preprocessing for that separate from the methods... as there is quite a bit of preprocessing to do. So there I would probably just use the preprocessed data that we have on Figshare anyway. I wonder how rigid we want to be with the data always being provided as counts for each task.
I think long term the data for open problems will be on S3 anyway so I imagine the storage will be separate anyway. I lean towards sticking with counts unless there's a very good reason not to since some analyses may use probabilistic assumptions on the counts.
For data integration we provide both adata.X
and adata.layers['counts']
and have been working with that so far. Methods that require counts, can get them (but note that not all datasets will have counts available, as some are full-length methods that are typically quoted in TPM or RPKM.
As for S3... I thought the overall data loaders should still be shared if I provide a dataset that might be used across tasks? Then once I add the scIB stuff, I would change the pancreas data loader to move the prep function to the task-specific loader.
For data integration we provide both adata.X and adata.layers['counts'] and have been working with that so far.
Sure, but didn't we agree that for this project, we would just provide the raw count matrix (or nearest equivalent) and leave normalization to the method developer?
I thought the overall data loaders should still be shared if I provide a dataset that might be used across tasks?
Yep, that's right -- but starting from raw counts would be true of all tasks, no? All I mean here is that it doesn't necessarily make sense to try to keep the same figshare URL for this project and the data integration project
Sure, but didn't we agree that for this project, we would just provide the raw count matrix (or nearest equivalent) and leave normalization to the method developer?
We did... but I don't think this is feasible with full length scRNA-seq data as I wrote above. We might need to make this task-specific.
All I mean here is that it doesn't necessarily make sense to try to keep the same figshare URL for this project and the data integration project
I agree that we don't have to do this. But in the end, I think we might not be able to start from raw counts all the time for all tasks, as raw counts don't always exist.
Sure, that makes sense. I think it is perfectly sensible to treat full-length scRNAseq as qualitatively different to count data.
So if you have an anndata object with both full length and count data in there you can no longer rely on our normalization methods... so you would have to provide it as already normalized data then, no? Unless you want to make a standardized annotations to say this is count or not.
This seems like an extension of the ATAC-seq conversation where we've been assuming UMI counts, but not all "single cell" is 3' UMI counts. I think we want to be careful of letting in too many data types too quickly, but it does seem clear that we need to identify specific "data archetypes" and make it clear which one is being used in each task / dataset
I agree that this discussion sounds very similar. We could define clear "data archetypes" and control this centrally (then normalization can become very difficult if we have combinations of data archetypes), or we make it more malleable/less defined and do this on a task level. I don't think it can be done on a dataset level as the methods in a task should apply to most datasets... so it can't be that sometimes data is pre-normalized and sometimes it is not.
Although I was initially in favour of controlling everything, I now feel like it might be easier to do this on a task level...
I lean towards task-level with strong recommendations towards less preprocessing where possible?
I would agree. If we all agree on this (@dburkhardt ?) and would agree that combinations of data archetypes warrant preprocessed versions of the data, then I will upload the new pancreas dataset to figshare in the benchmarking data integration entry, and adapt the pancreas data loader to not throw away the preprocessed data, but do this in the task loader instead to make this more adaptable for the future data integration task.
Note: Please read this guide detailing how to provide the necessary information for us to reproduce your bug.
Minimal code sample (that we can copy&paste without having any data)