mojaveazure / seurat-disk

Interfaces for HDF5-based Single Cell File Formats
https://mojaveazure.github.io/seurat-disk
GNU General Public License v3.0

Issues when subsetting and converting loom to seurat #132

Open HenriettaHolze opened 2 years ago

HenriettaHolze commented 2 years ago

Hi,

I would like to work with the recently published adult human brain atlas which contains 3M cells https://www.biorxiv.org/content/10.1101/2022.10.12.511898v1.full .

When I connect to the loom file and try to convert it to a Seurat object with SeuratDisk's as.Seurat() function, I run into memory issues (1.5T required).
Do you have any tips on how to work with such huge datasets in Seurat?

I tried using the subset function from loomR to take a random subset of cells and save it as a new loom file, which I could then convert to Seurat.

loom_obj <-
  Connect(
    filename = here(
      "adult_human_20221007.loom"
    ),
    mode = "r"
  )
# 50k cells takes 1.5h
# subset.cells <- sample.int(3369219, 50000)
subset.cells <- c(1, 2)
subset(loom_obj, m = subset.cells, filename = here(
  "adult_human_20221007_subset.loom"
))

I got the following error:

Writing new loom file to adult_human_20221007_subset.loom
Adding data for /matrix
No layers found
  |===================================================================================================================| 100%
Error in new.loom$create_group(name = group) : HDF5-API Errors:
    error #000: ../../../src/H5G.c in H5Gcreate2(): line 323: unable to create group
        class: HDF5
        major: Symbol table
        minor: Unable to initialize object

    error #001: ../../../src/H5Gint.c in H5G__create_named(): line 157: unable to create and link to group
        class: HDF5
        major: Symbol table
        minor: Unable to initialize object

    error #002: ../../../src/H5L.c in H5L_link_object(): line 1572: unable to create new link to object
        class: HDF5
        major: Links
        minor: Unable to initialize object

    error #003: ../../../src/H5L.c in H5L__create_real(): line 1813: can't insert link
        class: HDF5
        major: Links
        minor: Unable to insert object

    error #004: ../../../src/H5Gtraverse.c in H5G_traverse(): line 851: internal path traversal failed
        class: HDF5
        major: Symbol table
        minor: Object not found

    error #005: ../../../src/H5Gtraverse.c in H5G__traver

The loom object looks as follows:

Class: loom
Filename: adult_human_20221007.loom
Access type: H5F_ACC_RDONLY
Attributes: last_modified
Listing:
       name    obj_type    dataset.dims dataset.type_class
      attrs   H5I_GROUP            <NA>               <NA>
  col_attrs   H5I_GROUP            <NA>               <NA>
 col_graphs   H5I_GROUP            <NA>               <NA>
     layers   H5I_GROUP            <NA>               <NA>
     matrix H5I_DATASET 3369219 x 59480          H5T_FLOAT
  row_attrs   H5I_GROUP            <NA>               <NA>
 row_graphs   H5I_GROUP            <NA>               <NA>
> loom_obj[["attrs/LOOM_SPEC_VERSION"]][]
[1] "3.0.0"

I use SeuratDisk v0.0.0.9019 and loomR v0.2.1.9000.
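
For reference, the 1.5T figure is roughly what the full matrix would occupy if it were held densely in memory. Below is a rough estimate, assuming 8-byte doubles and the 3369219 x 59480 dimensions from the listing above. The `dense_bytes` helper is only illustrative, and whether as.Seurat() really allocates a dense matrix here is an assumption, not something confirmed in this thread.

```python
def dense_bytes(n_rows: int, n_cols: int, bytes_per_value: int = 8) -> int:
    """Bytes needed to hold an n_rows x n_cols matrix densely in memory."""
    return n_rows * n_cols * bytes_per_value


# Dimensions taken from the loom listing above (cells x genes as printed).
total = dense_bytes(3369219, 59480)
print(f"{total / 2**40:.2f} TiB")

# Same assumption for a 50,000-cell subset.
subset = dense_bytes(50000, 59480)
print(f"{subset / 2**30:.1f} GiB")
```

Under these assumptions the full matrix comes to about 1.46 TiB and a 50,000-cell subset to about 22 GiB, so even the downsampled file may need a sparse representation or a large-memory machine.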

zamlerd commented 2 years ago

I am having the same issue with the same dataset.

Hello,

I am having an issue subsetting a publicly available loom dataset (https://console.cloud.google.com/storage/browser/linnarsson-lab-human;tab=objects?authuser=0&pli=1&prefix=&forceOnObjectsSortingFiltering=false) when trying to subset out a certain cluster:

library(dplyr)
library(hdf5r)
library(loomR)
loom <- connect(filename = "~/Downloads/adult_human_20221007.loom", mode = "r+", skip.validate = TRUE)
attr.df <- loom$get.attribute.df(MARGIN = 2, col.names = "CellID", row.names = "Gene")
subset(loom, m = attr.df$Clusters == "298", filename = 'CBL.loom', chunk.size = 1000, verbose = T, overwrite = T)

I get the below error

Writing new loom file to CBL.loom
Error in H5File.open(filename, mode, file_create_pl, file_access_pl) : HDF5-API Errors:
error #000: ../../src/hdf5-1.12.1/src/H5F.c in H5Fcreate(): line 532: unable to create file
    class: HDF5
    major: File accessibility
    minor: Unable to open file

error #001: ../../src/hdf5-1.12.1/src/H5VLcallback.c in H5VL_file_create(): line 3282: file create failed
    class: HDF5
    major: Virtual Object Layer
    minor: Unable to create file

error #002: ../../src/hdf5-1.12.1/src/H5VLcallback.c in H5VL__file_create(): line 3248: file create failed
    class: HDF5
    major: Virtual Object Layer
    minor: Unable to create file

error #003: ../../src/hdf5-1.12.1/src/H5VLnative_file.c in H5VL__native_file_create(): line 63: unable to create file
    class: HDF5
    major: File accessibility
    minor: Unable to open file

error #004: ../../src/hdf5-1.12.1/src/H5Fint.c in H5F_open(): line 1858: unable to truncate a file which is already open
    class: HDF5
    major: File ac

I am able to read the loom dataset. The output of attr.df %>% colnames is:
[1] "Age" "CellCycle" "CellID" "Chemistry" "Clusters"
[6] "Donor" "DoubletFinderFlag" "DoubletFinderScore" "MT_ratio" "NGenes"
[11] "ROIGroupCoarse" "ROIGroupFine" "Roi" "SampleID" "Subclusters"
[16] "Tissue" "TotalUMI" "unspliced_ratio"

I know these to be correct from viewing the file in HDFView.

Any help is appreciated

HenriettaHolze commented 2 years ago

@zamlerd I gave up on subsetting the loom object in R and switched to loompy, which is also used by the authors of the data.

The loompy tutorial describes subsetting with loompy.new() and scan(), but that threw an error for me.

A simple downsampling worked for me this way:

import loompy
import numpy as np

input_file = "adult_human_20221007.loom"
out_file = "adult_human_20221007_downsampled.loom"
with loompy.connect(input_file) as ds:
    # getting 50k random column (cell) indices, sorted in increasing order
    ind_oi = np.random.choice(ds.shape[1], 50000, replace=False)
    ind_oi.sort()

    # initiate the output file with the first 2 selected cells
    ds_subset = ds[:, ind_oi[:2]]
    loompy.create(filename=out_file, layers=ds_subset, file_attrs=ds.attrs, col_attrs=ds.ca[ind_oi[:2]], row_attrs=ds.ra)

    # the first 2 cells are already written, so drop them from the selection
    ind_oi = ind_oi[2:]

    # connect to the output file
    with loompy.connect(out_file) as dsout:
        # subset the input file in batches and write the subset of cells to the output file
        for (ix, selection, view) in ds.scan(items=ind_oi, axis=1, batch_size=50000):
            dsout.add_columns(view.layers, col_attrs=view.ca, row_attrs=view.ra)
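
The loop above walks the input in batches of columns and keeps only the selected cells from each batch. Here is a pure-Python sketch of that idea, to show why keeping the indices sorted is convenient. It is not loompy's actual implementation, and `batched_selection` is a hypothetical helper that does not exist in loompy.

```python
from bisect import bisect_left


def batched_selection(sorted_indices, n_total, batch_size):
    """Yield (batch_start, local_positions) for every batch holding selected items.

    sorted_indices must be in increasing order; local_positions are offsets
    relative to batch_start.
    """
    for start in range(0, n_total, batch_size):
        stop = min(start + batch_size, n_total)
        lo = bisect_left(sorted_indices, start)
        hi = bisect_left(sorted_indices, stop)
        if lo < hi:  # skip batches that contain no selected index
            yield start, [i - start for i in sorted_indices[lo:hi]]


# Toy example: 10 columns, batches of 4, four selected columns.
for start, local in batched_selection([1, 4, 5, 9], n_total=10, batch_size=4):
    print(start, local)
```

With sorted indices, each batch only needs a contiguous slice of the selection, so the file can be read front to back in a single pass.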

I guess your cluster subsetting would work like this, in place of the random sampling:

ind_oi = np.where(ds.ca["Clusters"] == "298")[0]
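
One caveat on the line above: whether `Clusters` is stored as integers or as strings in this file is not confirmed in this thread, and comparing an integer array against the string "298" will not match anything. Below is a hedged sketch that compares as strings so it works either way. `select_columns` is a hypothetical helper, and the small arrays stand in for `ds.ca["Clusters"]`.

```python
import numpy as np


def select_columns(labels, target):
    """Return increasing column indices whose label equals target (compared as strings)."""
    labels = np.asarray(labels)
    return np.flatnonzero(labels.astype(str) == str(target))


# Stand-ins for ds.ca["Clusters"]; the real attribute may be int or str.
int_labels = np.array([297, 298, 12, 298])
str_labels = np.array(["297", "298", "12", "298"])

print(select_columns(int_labels, "298"))
print(select_columns(str_labels, 298))
```

The returned indices are already in increasing order, so they can be used as `ind_oi` without an extra sort.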

zamlerd commented 2 years ago

@HenriettaHolze Thank you so much!

I was attempting the same and have been banging my head against a wall.

For some reason, when trying your code I get the error below:

  ind_oi = np.where(ds.ca["Clusters"] == "298")[0]

<loompy.attribute_manager.AttributeManager object at 0x177d2c4c0>

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Input In [24], in <cell line: 3>()
      9     ds_subset = ds[:, ind_oi[:2]]
     10     print(ds.ca)
---> 11     loompy.create(filename=out_file, layers=ds[:, ind_oi[:2]], file_attrs=ds.attrs, col_attrs=ds.ca[ind_oi[:2]], row_attrs=ds.ra)
     13     ind_oi = ind_oi[2:]
     15         # connect to the output file

File ~/opt/anaconda3/lib/python3.9/site-packages/loompy/attribute_manager.py:83, in AttributeManager.__getitem__(self, thing)
     81     am = AttributeManager(None, axis=self.axis)
     82     for key, val in self.items():
---> 83         am[key] = val[thing]
     84     return am
     85 elif type(thing) is tuple:
     86     # A tuple of strings giving alternative names for attributes

TypeError: 'NoneType' object is not subscriptable

Any further hints?

I also tried it with just the random sampling, as you did, and got the same error.
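
The traceback shows `val[thing]` failing inside a loop over `self.items()`, which suggests that at least one column attribute comes back as `None`. Below is a small diagnostic sketch for finding which one. It assumes `ds.ca.items()` can be iterated the same way as in the loompy source quoted in the traceback; `none_valued_keys` is a hypothetical helper, and a plain dict stands in for `ds.ca` so the sketch runs without the loom file.

```python
def none_valued_keys(attrs):
    """Return the names of attributes whose value is None.

    attrs is anything with an items() method, for example ds.ca
    (assumption: it iterates like self.items() in the traceback).
    """
    return [key for key, val in attrs.items() if val is None]


# Stand-in for ds.ca; "Broken" mimics an attribute that failed to load.
fake_ca = {"CellID": ["a", "b"], "Clusters": [298, 12], "Broken": None}
print(none_valued_keys(fake_ca))
```

If this names an attribute on the real file, one option (untested against this dataset) is to pass `col_attrs` as a plain dict built from the remaining keys, for example `{k: ds.ca[k][idx] for k in good_keys}`, instead of indexing `ds.ca` directly.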

HenriettaHolze commented 2 years ago

I'm not entirely sure what happened there. I had issues with the encoding and had to run these lines in the terminal before starting Python or running the script:

export LC_ALL=en_US.UTF-8
export LANG=en_US.UTF-8

If that does not solve the error, maybe take it to the loompy repo https://github.com/linnarsson-lab/loompy/issues
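
To check whether the exports actually reached the Python process, something like the following can be run in the same session. This is a minimal sketch using only the standard library; `encoding_report` is an illustrative name, and that a UTF-8 default is what loompy needs here is an assumption.

```python
import locale
import os
import sys


def encoding_report():
    """Collect the locale variables and encodings the current process sees."""
    return {
        "LC_ALL": os.environ.get("LC_ALL"),
        "LANG": os.environ.get("LANG"),
        "preferred": locale.getpreferredencoding(False),
        "filesystem": sys.getfilesystemencoding(),
    }


for name, value in encoding_report().items():
    print(f"{name}: {value}")
```

If `LC_ALL` and `LANG` print as `None`, the exports were made in a different shell than the one that started Python.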

zamlerd commented 2 years ago

Thank you so much @HenriettaHolze! I am relying on you for other people's packages, haha, so sorry for that.

I tried running those lines and rebooting, and I still have the same issue.

I will take this over to the loompy repo.

Thanks again!