Open mshadbolt opened 4 years ago
Hi
Sorry for the late reply. I'm not sure why the file size would be so inflated. Does it have the expected number of columns afterwards?
You can simplify your code a little using numpy comparisons and direct attribute access:
blood_indices = ds.ca.derived_organ_label == "blood"
This yields a numpy boolean array that can be used directly for indexing, though it probably makes no difference to your problem.
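As a minimal sketch of that boolean-mask approach, using a small stand-in array in place of ds.ca.derived_organ_label:

```python
import numpy as np

# Stand-in for ds.ca.derived_organ_label (a per-column string attribute).
derived_organ_label = np.array(["blood", "decidua", "blood", "placenta"])

# Elementwise comparison yields a boolean mask, one entry per column.
blood_indices = derived_organ_label == "blood"

print(blood_indices)               # [ True False  True False]
print(int(blood_indices.sum()))    # number of matching columns: 2
print(np.where(blood_indices)[0])  # integer positions: [0 2]
```

The mask can be passed anywhere loompy accepts a column selection, and np.where converts it to integer indices if those are needed instead.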
It may be that the original file contains a large number of variable-length strings. If you are not on the latest loompy, those might get stored as fixed-length strings, which would inflate the file. Try installing the current development version of loompy: clone the repository locally (git clone https://github.com/linnarsson-lab/loompy.git), then cd loompy and install it with pip install -e . (don't miss the period at the end).
Hi
We've seen the same problem now, and in our case a 50 GB file ended up 2.75 TB, so it's definitely a bug somewhere. It seems to be due to a low-level issue inside HDF5 (or h5py) that I haven't been able to debug.
However, using the loompy.combine_faster() method gets around the problem and is (much) faster. See if you could use it. In your case you'd be "combining" a single file, but that's fine. You can supply a cell selection to achieve the desired output. Something like:
loompy.combine_faster([loom_path], out_file, [blood_indices], key="Accession")
(you can omit the key argument if the input is a single file, as here, but it's important when you combine multiple files to ensure the rows end up in the same order)
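A fuller sketch of the single-file "combine" might look like this. The paths are hypothetical placeholders, and the loompy calls are guarded so the sketch only touches files when the library and input actually exist:

```python
import os
import numpy as np

try:
    import loompy  # only needed for the actual file operations
except ImportError:
    loompy = None

def blood_selection(labels):
    # Boolean mask marking columns whose derived_organ_label is "blood".
    return np.asarray(labels) == "blood"

# Hypothetical paths -- substitute your own files.
loom_path = "fetal_maternal_interface.loom"
out_file = "blood_cells.loom"

if loompy is not None and os.path.exists(loom_path):
    with loompy.connect(loom_path) as ds:
        blood_indices = blood_selection(ds.ca.derived_organ_label)
    # "Combining" a single file with a selection keeps only those columns.
    loompy.combine_faster([loom_path], out_file, [blood_indices], key="Accession")
```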
Hi, thanks for following up on this.
I will definitely try it out next time I need to perform this kind of operation.
Hi, thanks for the awesome package. I am trying to split a loom file based on a column attribute. Specifically, I have downloaded the loom file from the Fetal Maternal Interface project from the HCA (https://data.humancellatlas.org/explore/projects/f83165c5-e2ea-4d15-a5cf-33f3550bffde/expression-matrices). It is a large file with dimensions (58347, 546183) and a total size of 1.69 GB.
First I get a np.array of the indices of all columns that have blood as the derived_organ_label.
I then attempted to adapt the code in the documentation for scanning files and outputting to a new file as follows:
The code appears to work (it took around 2 hours), but the output loom file quickly inflates in size; it ended up being 54 GB in total when finished.
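For reference, the scan-and-append pattern from the loompy documentation looks roughly like the sketch below (paths and the function name are hypothetical; the import is deferred so the sketch can be defined without loompy installed):

```python
def write_subset(loom_path, out_file, col_indices):
    # Stream the selected columns into a new file in batches, following
    # the scan pattern from the loompy documentation. Hypothetical helper,
    # not the original poster's exact code.
    import loompy

    with loompy.new(out_file) as dsout:
        with loompy.connect(loom_path) as ds:
            for (ix, selection, view) in ds.scan(items=col_indices, axis=1):
                dsout.add_columns(view.layers, col_attrs=view.ca, row_attrs=view.ra)
```

Each view is a chunk of columns restricted to the selection, so memory stays bounded even for very large files.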
I just wanted to check:
Thanks