linnarsson-lab / loompy

Python implementation of the Loom file format - http://loompy.org
BSD 2-Clause "Simplified" License

Filtering a loom file by a column attribute - huge output loom #123

Open · mshadbolt opened 4 years ago

mshadbolt commented 4 years ago

Hi, thanks for the awesome package. I am trying to split a loom file based on a column attribute. Specifically, I have downloaded the loom file from the Fetal Maternal Interface project from the HCA (https://data.humancellatlas.org/explore/projects/f83165c5-e2ea-4d15-a5cf-33f3550bffde/expression-matrices). It is a large file with dimensions (58347, 546183) and a total size of 1.69 GB.

First, I get an np.array of the indices of all columns that have "blood" as the derived_organ_label:

import loompy
import numpy as np
with loompy.connect(loom_path) as ds:
    blood_indices = np.array([i for i, x in enumerate(ds.ca["derived_organ_label"]) if x == "blood"])

I then attempted to adapt the code in the documentation for scanning files and outputting to a new file as follows:

with loompy.new(out_file) as dsout:  # Create a new, empty loom file
    with loompy.connect(loom_path) as ds:
        for (ix, selection, view) in ds.scan(items=blood_indices, axis=1):
            dsout.add_columns(view.layers, col_attrs=view.ca, row_attrs=view.ra)

The code appears to work (it took around 2 hours), but the output loom file quickly inflates in size; it ended up being 54 GB in total when finished.

I just wanted to check:

  1. Am I doing something wrong?
  2. Is there a better way to do this that keeps the file size smaller?
  3. Is this increase in size expected?
  4. Is there a way to reduce the size of the output loom file?

Thanks

slinnarsson commented 4 years ago

Hi

Sorry for the late reply. I'm not sure why the file size would be so inflated. Does it have the expected number of columns afterwards?
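For example, you could check like this (a quick sketch, reusing the out_file and blood_indices from your snippets above):

import loompy

# The output should have one column per selected blood cell
with loompy.connect(out_file) as dsout:
    print(dsout.shape)           # expect (58347, len(blood_indices))
    print(len(blood_indices))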

You can simplify your code a little using numpy comparisons and direct attribute access:

blood_indices = ds.ca.derived_organ_label == "blood"

This makes a numpy boolean array, which can be used directly for indexing. But this probably makes no difference to your problem.
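If a downstream call expects integer indices rather than a boolean mask, you can convert with plain numpy (a minimal sketch):

import numpy as np

blood_mask = ds.ca.derived_organ_label == "blood"  # boolean mask, one entry per column
blood_indices = np.where(blood_mask)[0]            # the corresponding integer column indices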

It may be that the original file contains a large number of variable-length strings. If you are not on the latest loompy, those might get stored as fixed-length strings, which would inflate the file. Try installing the current development version of loompy: clone the repository locally, then install it in editable mode:

git clone https://github.com/linnarsson-lab/loompy.git
cd loompy
pip install -e .

(don't miss the period at the end)

slinnarsson commented 4 years ago

Hi

We've seen the same problem now, and in our case a 50 GB file ended up at 2.75 TB... so it's definitely a bug somewhere. It seems to be due to some low-level issue inside HDF5 (or h5py), which I'm not able to debug.

However, using the loompy.combine_faster() method gets around the problem and is (much) faster. See if you can use it. In your case you'd be "combining" a single file, but that's fine. You can supply a cell selection to achieve the desired output. Something like:

loompy.combine_faster([loom_path], out_file, selections=[blood_indices], key="Accession")

(you can omit the key if the input is a single file, as here, but it's important when combining multiple files, to ensure the rows end up in the same order)
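Putting it together, a minimal end-to-end sketch (assuming the current signature, where per-file selections are passed as boolean masks via the selections keyword):

import loompy
import numpy as np

# Boolean mask over columns: True for cells derived from blood
with loompy.connect(loom_path) as ds:
    blood_indices = ds.ca.derived_organ_label == "blood"

# "Combine" the single input file, keeping only the selected columns
loompy.combine_faster([loom_path], out_file, selections=[blood_indices], key="Accession")

# Sanity check: one output column per selected cell
with loompy.connect(out_file) as dsout:
    assert dsout.shape[1] == np.count_nonzero(blood_indices)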

mshadbolt commented 4 years ago

Hi, thanks for following up on this.

I will definitely try it out next time I need to perform this kind of operation.