linnarsson-lab / adult-human-brain

Cytograph version used for adult human-brain analysis
BSD 2-Clause "Simplified" License

Subsetting the atlas #1

Open ccruizm opened 1 year ago

ccruizm commented 1 year ago

Good day,

First of all, thank you for making this incredible resource, and for making it accessible already! I have downloaded the loom file and tried to load and convert it into AnnData using sc.read_loom. However, I have not succeeded yet. Since the function loads the whole file into memory, I am running into memory issues (the last attempt was on an HPC node with 512 GB RAM).

I am not familiar with working with loom files. I have had a look at the documentation and am trying to subset the loom file to extract the gene x cell matrix and the associated metadata, to later work with it in Scanpy or Seurat, but I am not sure I am doing it correctly.

Thus far I am trying to get the matrix first (which at this moment is still running, so I do not know what the outcome will be):

counts = ds[:, (ds.ca.ROIGroupCoarse == "Hypothalamus") | 
  (ds.ca.ROIGroupCoarse == "Amygdala")]

But I am not sure how to later subset only the metadata associated with this subset (clusters, subclusters, cell IDs, etc.). Could you please tell me what the best way to do it would be?

Thank you in advance!

slinnarsson commented 1 year ago

The file contains something like 3.3 million cells by 58,000 genes, i.e. about 2×10^11 data points. Each data point is a uint16, i.e. 2 bytes, for a grand total of about 400 GB of data. Scanpy may also convert to float32 or even float64, which would double or quadruple the size.
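
As a quick sanity check, that arithmetic can be reproduced in plain Python (the cell and gene counts are the ones quoted above):

```python
# back-of-the-envelope size of the dense matrix described above
n_cells = 3_300_000
n_genes = 58_000
bytes_per_value = 2  # uint16

dense_gb = n_cells * n_genes * bytes_per_value / 1e9
# roughly 400 GB dense; float32 would double it, float64 quadruple it
```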

Even if stored as a sparse matrix, it will be very large. You can get a sparse matrix by doing

x = ds[""].sparse()

To subset other data, just do the same as you did for the matrix:

labels = ds.ca.Clusters[(ds.ca.ROIGroupCoarse == "Hypothalamus") | (ds.ca.ROIGroupCoarse == "Amygdala")]
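
The key point is that one boolean mask can be reused for the matrix and for every per-cell attribute, so they stay aligned. A minimal self-contained sketch of that logic, with plain NumPy arrays standing in for the ds.ca fields:

```python
import numpy as np

# stand-ins for ds.ca.ROIGroupCoarse, ds.ca.Clusters and the counts matrix
roi = np.array(["Hypothalamus", "Cortex", "Amygdala", "Cortex"])
clusters = np.array([5, 9, 132, 9])
counts = np.arange(8).reshape(2, 4)  # genes x cells

# build the mask once, reuse it for the matrix and every per-cell attribute
mask = (roi == "Hypothalamus") | (roi == "Amygdala")
sub_counts = counts[:, mask]     # matrix subset
sub_clusters = clusters[mask]    # metadata stays aligned with the same cells
```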

Archanayd commented 1 year ago

Hi, thanks for the amazing resource and for making the data available to the public. I have a similar query: what is the best way to read and write 'adult_human_20221007.loom', given how big it is? In general, I would be happy to just get the count matrices and associated metadata for major cell types such as microglia, oligodendrocytes, etc. I am also not very familiar with working with loom files.

Thanks again,
Archana

slinnarsson commented 1 year ago

Docs are here: http://loompy.org

If you want to load it into memory as a dense matrix, you will need ~400 GB RAM, because that's how large the matrix is. You can load it as a sparse matrix using the code above, which should be much smaller but still large (at a guess, ~40 GB).

The best way to work with loom files this large is probably to not try to load it all in memory. The docs contain many examples of how to do that. For example, to load the expression vector for ACTB across all cells:

ds[ds.ra.Gene == "ACTB", :]

To load the expression vector for a specific cell:

ds[:, ds.ca.CellID == "AAACATACATTCTC-1"]

To get the expression matrix for all cells in cluster 132:

x = ds[:, ds.ca.Clusters == 132]

(you can filter all the metadata in the same way)

To fit a PCA to the full dataset, without loading it all in RAM:

from sklearn.decomposition import IncrementalPCA

genes = (ds.ra.Selected == 1)  # restrict to the pre-selected genes
pca = IncrementalPCA(n_components=50)
for (ix, selection, view) in ds.scan(axis=1):
    # each view holds a batch of cells; transpose to cells x genes for sklearn
    pca.partial_fit(view[genes, :].transpose())
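
The same partial_fit pattern can be tried end to end on synthetic batches (each batch here plays the role of one view[genes, :].T chunk that scan() would yield); after the loop, the fitted PCA can project further batches:

```python
import numpy as np
from sklearn.decomposition import IncrementalPCA

rng = np.random.default_rng(0)
pca = IncrementalPCA(n_components=2)

# feed batches incrementally, as scan() would yield them
for _ in range(5):
    batch = rng.normal(size=(100, 20))  # cells x genes
    pca.partial_fit(batch)

# project a new batch of cells into the fitted PCA space
embedding = pca.transform(rng.normal(size=(10, 20)))
```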

rlittman16 commented 1 year ago

I want to subset the loom file to each individual tissue, and either save it as a Scanpy h5ad or as a loom file to be read into Scanpy. Do you have a way of doing that?

slinnarsson commented 1 year ago

Check out the scan() method, which provides an automatic way of scanning through the loom file and all its attributes:

for (ix, selection, view) in ds.scan(items=(ds.ca.ROIGroupCoarse == "Hypothalamus"), axis=1):
    # 'view' now contains a batch of cells from Hypothalamus, with all the corresponding attributes
    # Append each batch of cells (and any column attributes you want) to an HDF5 file or any other target

The scan() method returns a sequence of views, which are like a virtual mini-loom file containing just a batch of cells (out of the selected items), so that you can easily scan through arbitrarily large files without loading them fully into RAM.

rlittman16 commented 1 year ago

I get the following error with the above code.

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Input In [10], in <cell line: 1>()
      2 print(ds.ca.keys())
      3 print(ds.shape)
----> 5 for (ix, selection, view) in ds.scan(items=(ds.ca.ROIGroupCoarse == "Hypothalamus"), axis=1):
      6     print(ix)
      7     print(selection)

File ~/project-rlittman/miniconda3/envs/scing/lib/python3.8/site-packages/loompy/loompy.py:629, in LoomConnection.scan(self, items, axis, layers, key, batch_size, what)
    627 if "col_attrs" in what:
    628     if selection is not None:
--> 629         ca = self.ca[ix + selection]
    630     else:
    631         ca = self.ca[ix: ix + cols_per_chunk]

File ~/project-rlittman/miniconda3/envs/scing/lib/python3.8/site-packages/loompy/attribute_manager.py:83, in AttributeManager.__getitem__(self, thing)
     81     am = AttributeManager(None, axis=self.axis)
     82     for key, val in self.items():
---> 83         am[key] = val[thing]
     84     return am
     85 elif type(thing) is tuple:
     86     # A tuple of strings giving alternative names for attributes

TypeError: 'NoneType' object is not subscriptable

Ararder commented 1 year ago

FYI, for anyone having the above issue, I was able to fix it by running:

del ds.ca['Roi']
del ds.ca['Tissues']

These attributes seem to be missing/corrupted when the file is read with loompy. Once they are removed, scan() works as expected.

Ararder commented 1 year ago
  1. I'm not sure exactly what you are asking here. Maybe post the code and some info about the error? And I have no idea why you are deleting ROIGroupCoarse and ROIGroupFine; those work perfectly for me.

  2. By reading the error message and going through the source code.

  3. I'm new to loom, so I cannot give you a good answer. The variables are there, but for some reason loompy's file reading does not work for them. With R, I was able to extract those variables just fine:

  h5f <- H5Fopen("adult_human_20221007.loom")
  roi <- h5f$col_attrs$Roi
  cellID <- h5f$col_attrs$CellID
  df = tibble(CellID = as.data.frame(cellID)[[1]], roi = as.data.frame(roi)[[1]])
  write_tsv(df, "adult_human_20221007_roi.tsv")
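
The same direct-HDF5 workaround can be done from Python with h5py, since a loom file stores per-cell metadata as datasets under the /col_attrs group; attributes that loompy chokes on can still be read raw. A sketch (a tiny stand-in file is created first so the snippet is self-contained):

```python
import h5py
import numpy as np

# create a minimal stand-in with just the col_attrs group of a loom file
with h5py.File("demo_attrs.loom", "w") as f:
    f["col_attrs/Roi"] = np.array([b"Hypothalamus", b"Amygdala"])
    f["col_attrs/CellID"] = np.array([b"C0", b"C1"])

# read the attributes back directly, bypassing loompy entirely
with h5py.File("demo_attrs.loom", "r") as f:
    roi = f["col_attrs/Roi"][:]
    cell_id = f["col_attrs/CellID"][:]
```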