linnarsson-lab / loompy

Python implementation of the Loom file format - http://loompy.org
BSD 2-Clause "Simplified" License

Loompy file export for Scanpy #1

Closed falexwolf closed 6 years ago

falexwolf commented 7 years ago

Dear Linnarsson Lab, do you think it would make sense to build a loom file export into Scanpy? I haven't thought too much about the file format for Scanpy's AnnData class, which follows similar ideas to the loom file format. Even though, in principle, its file format achieves similar things to the loom file format - maybe I didn't get everything - we would really like to use a more widely accepted, platform- and language-independent way of storing AnnData objects. The scater/scran developers are backing an HDF5-based format for SingleCellExperiment, so we thought about adopting this as it's probably used by many people... Maybe we could have a brief Skype call to discuss this? I'm presenting Scanpy at the Broad in two weeks, and it would be nice to have a good solution for the file format by then. Cheers, Alex PS: AnnData will be implemented in a cleaner way soon.

slinnarsson commented 6 years ago

Hi

Yes, would be a great idea. I’m travelling and will not have time to Skype. To get started, did you check out the github and docs? Creating a loom file is very easy, basically just follow this recipe: http://loompy.org/loompy-docs/fullapi/loompy.html#loompy.loompy.create

You need a numpy matrix, and two dictionaries (one for row/gene attributes and one for column/cell attributes). Each key is an attribute name, and each value is a numpy 1D array of values, matching the number of rows or columns.
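A minimal sketch of that recipe (the file name and attribute names here are made up for illustration; the loompy call is guarded so the shape of the inputs is clear even without the package installed):

```python
import numpy as np

# Loom convention: genes in rows, cells in columns
matrix = np.random.randint(0, 10, size=(100, 50))

# One 1D array per attribute; each length must match the corresponding dimension
row_attrs = {"Gene": np.array(["gene_%d" % i for i in range(100)])}
col_attrs = {"CellID": np.array(["cell_%d" % j for j in range(50)])}

try:
    import loompy
    loompy.create("example.loom", matrix, row_attrs, col_attrs)
except ImportError:
    pass  # loompy not installed; the arrays above still show the required shapes
```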

Best

/Sten

-- Sten Linnarsson, PhD Professor of Molecular Systems Biology Karolinska Institutet Unit of Molecular Neurobiology Department of Medical Biochemistry and Biophysics Scheeles väg 1, 171 77 Stockholm, Sweden +46 8 52 48 75 77 (office) +46 70 399 32 06 (mobile)


falexwolf commented 6 years ago

Hi Sten,

sorry for the very late response. I was looking at the docs you linked and saw that loompy.loompy.create only accepts np.ndarray as data matrices, not pandas DataFrames... So I thought that a quick integration for Scanpy was not possible.

Now, one idea is to use the .loom format, possibly with specific naming conventions that are understood both by Scanpy and, e.g., Seurat.

Background: Since early this year, we have been using something similar to .loom to back the central data container in Scanpy, AnnData: https://scanpy.readthedocs.io/en/latest/basic_usage.html

Details on the object for the API are compiled here https://scanpy.readthedocs.io/en/latest/api/scanpy.api.AnnData.html and for its HDF5 backing here https://scanpy.readthedocs.io/en/latest/api/scanpy.api.write.html

We focused on a convenient way of handling sparse matrices (graphs), unstructured annotation, and categorical data, with a top-level API that lets the user not worry about any of these issues. Loompy's high-level API http://loompy.org/loompy-docs/fullapi/loompy.html#loompy.loompy.create does not seem to solve the issues just mentioned; instead, it demands that the user work with an extensive low-level API: http://loompy.org/loompy-docs/fullapi/loompy.html#module-loompy.loompy

So, what I currently have in mind is to let sc.write use filenames with ".loom" endings: if so, data and annotation are automatically written in a way that can be directly read into, e.g., Seurat. The first challenge is to agree on what rows and columns mean (samples or variables/features?) - in ML/statistics, rows are always samples and columns variables; for genetic data, the opposite convention is used. I'm happy with either.

What do you think?

Cheers, Alex

slinnarsson commented 6 years ago

I think it would be great to all standardize on Loom, and I just met with Rahul Satija yesterday who is working on full Loom support in Seurat.

As for integration with scanpy it looks like it will be very easy. AnnData is semantically very close to Loom. Your example of creating an AnnData object translates directly:

import numpy as np
import loompy
from scanpy.api import AnnData

data = np.array([[1, 0], [3, 0], [5, 6]])
smp = {
    'row_names': ['name1', 'name2', 'name3'],
    'sanno1': ['cat1', 'cat2', 'cat2'],
    'sanno2': [2.1, 2.2, 2.3]
}
var = {
    'vanno1': [3.1, 3.2]
}
uns = {
    'sanno1_colors': "something",
    'uns2': "something else"
}
adata = AnnData(data, smp, var, uns)
# loompy.create needs a target filename (illustrative name here)
ds = loompy.create("example.loom", data, smp, var, file_attrs=uns)

(We only support "unstructured" data in the form of strings. For cross-language compatibility, I suggest the best solution is to JSON-encode the data as needed.)
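That JSON approach can be sketched as follows (the key names are illustrative): each complex value is serialized to a string before being stored as a file attribute, and any language with a JSON parser can recover it.

```python
import json

# Hypothetical unstructured annotation with non-scalar values
uns = {
    "sanno1_colors": ["#1f77b4", "#ff7f0e"],
    "params": {"n_pcs": 50, "method": "tsne"},
}

# Encode each value as a JSON string, suitable for a plain HDF5 string attribute
file_attrs = {key: json.dumps(value) for key, value in uns.items()}

# Decoding reverses the walk and recovers the original structure
decoded = {key: json.loads(value) for key, value in file_attrs.items()}
```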

Of course, that code creates an AnnData object in memory, but a Loom file on disk. There's a difference in philosophy: Loom was designed to support out-of-memory operation, while scanpy assumes the entire dataset is loaded in RAM. But this is also an opportunity: you could consider over time adding options to perform scanpy operations out-of-RAM to support very large datasets. And vice versa, for small datasets we also often just load the whole thing in memory.

Now, the simplest option would be for you to support reading and writing Loom files. But it would be much better if you could support Loom as input to each of your functions and write the result back to the file. That way, if you do a PCA, say, you would only need to write the PCA coordinates as two attributes instead of saving the whole updated AnnData object. E.g., it would be fantastic if we could do

with loompy.connect("filename.loom") as ds:
    scanpy.api.pp.filter_cells(ds)

and this would write a new cell attribute cell_subset to the file.

This kind of support in scanpy, Seurat, possibly pagoda, and our own in-house analysis library (cytograph, not released yet) will allow complete mix and match of algorithms (I'm making up the rpy code but you get the idea):

seurat = robjects.r['seurat']
pagoda = robjects.r['pagoda']
with loompy.connect("filename.loom") as ds:
    scanpy.api.pp.filter_cells(ds)
    cytograph.multiscale_knn(ds)
    seurat.cluster(ds)
    pagoda.diff_expression(ds)

(to avoid reading the whole matrix in memory at each call, Loom should support memoization; I will open a separate issue about this)

As for conventions, we use cells in columns and genes in rows. This is usually the opposite of what scipy expects (unless you're clustering the genes, for example). In order to maximize interoperability, we also need to agree on attribute names, both for metadata and for intermediate results (e.g. the result of a PCA, or filtering, etc). I'll open a separate issue for this too.
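Concretely, a converter between the two conventions would transpose at the boundary (a sketch; the variable names are made up):

```python
import numpy as np

# AnnData/scipy orientation: cells (samples) in rows, genes (variables) in columns
X_cells_by_genes = np.arange(6).reshape(3, 2)   # 3 cells x 2 genes

# Loom orientation: genes in rows, cells in columns
X_genes_by_cells = X_cells_by_genes.T           # 2 genes x 3 cells
```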

falexwolf commented 6 years ago

Hi Sten,

great! Exactly that!

We have been planning from the very beginning to move from fully loading AnnData objects into memory to partial loading (we've written a prize-winning pipeline https://github.com/NDKoehler/DataScienceBowl2017_7th_place that does so). But yes, we have not implemented that yet (it was not really necessary, as we could run the 1.3 million cells example without it).

It's perfect if loom can help simplify the transition. The plan was that, for the user, an AnnData object still looks like it does now (a simple sliceable container of arrays, sparse matrices, and dataframes, with intuitive access via adata[], adata.X, adata.smp[], etc.), but internally it fills all its attributes from disk. That's the same thing you suggest when you say "passing a loom file object to scanpy functions", but it does not require the user at the top-level interface to deal with loom files in the background. However, yes! It would be nice to be able to pass loom files directly in order to have that interoperability between different packages and algorithms. We will think about whether to use AnnData as an interface for the loom file or to allow passing loom files directly, and which would be more convenient for the user.

As for integration with scanpy it looks like it will be very easy. AnnData is semantically very close to Loom.

Yes, that should not be too difficult. However, the minimal constructor you mention does not reflect the full complexity of the AnnData class.

For instance, we found it very efficient to use dataframes for the sample and variable annotation on the object level and structured arrays on the HDF5 level. Still, we need to transform categorical data types for efficient storage, as these are not understood by HDF5. We experimented with different solutions over months but have now settled on a combination of an integer column within a structured array and a short list of categories within the unstructured annotation; see the example linked above. (Maybe this is also one route to resolving the issue raised by @olgabot. The solution you propose - calling df.to_list() - does not treat categorical data types appropriately.)
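That encoding can be sketched with pandas (the column and key names are made up): the integer codes go into the HDF5-friendly structured array, and the short category list goes into the unstructured annotation.

```python
import pandas as pd

anno = pd.Series(["cat1", "cat2", "cat2", "cat1"], dtype="category")

# Integer codes are what get stored in the structured array on disk ...
codes = anno.cat.codes.to_numpy()
# ... while the category labels live in the unstructured annotation
uns = {"sanno1_categories": list(anno.cat.categories)}

# Round trip: codes plus categories recover the original column
restored = pd.Categorical.from_codes(codes, uns["sanno1_categories"])
```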

This is one reason why

(We only support "unstructured" data in the form of strings. For cross-language compatibility, I suggest the best solution is to JSON-encode the data as needed.)

is currently a problem. It would mean tearing apart things that belong together and using a very slow format for them. Also, we store different graph representations of the data within adata.uns, for instance the "abstracted representations" or single-cell distances vs. normalized weighted graphs. We also think that graphs do not necessarily need to be kNN graphs, but can have varying numbers of neighbors.

Why don't you want to store more complicated annotation in file_attrs and call the whole thing unstructured? I don't see why this should violate cross-language compatibility.

Regarding the conventions, I will respond in the other issue. ;)

Thank you for attacking the out-of-memory issues! Joint efforts are much more fun! :smile: Alex

slinnarsson commented 6 years ago

Why don't you want to store more complicated annotation in file_attrs and call the whole thing unstructured? I don't see why this should violate cross-language compatibility.

What I'm not getting is how the more complicated annotation would be stored, unless it's as a string (e.g. JSON). E.g., how would HDF5 support storing this kind of value:

uns= {
    'sanno1_colors': "something",
    'uns2': "something else"
}

And how would you read it into MATLAB or C/C++?

falexwolf commented 6 years ago

Sorry for disappearing for a few days... my wife and I are expecting twins and there were some complications (everything is fine again now).

Regarding your question above, simply take another look at the first example below the function definition here, in particular, the output of h5ls.

It's no problem to walk through the uns dictionary and save each key-value pair as a new dataset, either within a group or within the file. Why do you think it's a problem? Of course, if you have nested dicts this gets a bit cumbersome and probably error-prone. Currently, we only support non-nested dicts whose values have HDF5 representations. As mentioned before, this is very helpful for various reasons.
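That walk can be sketched with h5py (the file, group, and key names are illustrative): each key in uns becomes its own dataset inside a group, as long as the values have native HDF5 representations.

```python
import numpy as np
import h5py

uns = {
    "sanno1_categories": np.array([b"cat1", b"cat2"]),
    "n_pcs": 50,
}

# Writing: one dataset per key-value pair, collected under a group
with h5py.File("example_uns.h5", "w") as f:
    group = f.create_group("uns")
    for key, value in uns.items():
        group[key] = value

# Reading is the reverse walk
with h5py.File("example_uns.h5", "r") as f:
    n_pcs_back = int(f["uns/n_pcs"][()])
    cats_back = list(f["uns/sanno1_categories"][:])
```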

Regarding the "Conventions" discussion: it would be very good to hear the Seurat people's opinion on it. Rahul will visit our lab in a bit more than a week, and we wanted to discuss this among other things. Still, it would be nice to collect everything in the issue you opened. For instance, I completely agree with the single-branch view you take; I think it's too complicated to store multiple versions of the same data... We thought about making the raw data available in AnnData but tend not to do it... Depending on what comes out of the file format discussion, we might rethink this.

Specifically for the Scanpy export, only the unstructured annotation is really an issue. Without it, we cannot fully store an AnnData object as a loom file on disk; and that's a bit of a pity, as the loom format could in principle become a full replacement for AnnData's - trivial - HDF5 backing.

Anyhow, even without support for the unstructured annotation, we will make the loom export possible. I will wait for the discussion with Rahul and see how the "Conventions" issue evolves - but that should be just a matter of another two weeks.

slinnarsson commented 6 years ago

Resolved in Loompy v2.0.

We now allow global attributes to be either scalars or N-dimensional arrays of numbers or strings. Unicode is allowed and is round-tripped to HDF5 using XML entity encoding.

Example:

>>> import loompy
>>> with loompy.connect("/Users/sten/build_20171205/L1_Sympathetic.agg.loom") as ds:
...     ds.attrs.Movie = "Star wars episode 1"
...     ds.attrs.Movies = ["Star wars episode 1", "Star wars episode 2"]
...     print(ds.attrs.Movie)
...     print(ds.attrs.Movies)

Star wars episode 1
['Star wars episode 1' 'Star wars episode 2']