linnarsson-lab / loompy

Python implementation of the Loom file format - http://loompy.org
BSD 2-Clause "Simplified" License

Conventions #19

Closed slinnarsson closed 5 years ago

slinnarsson commented 6 years ago

Inviting @satijalab, @mojaveazure, @pkharchenko, @falexwolf, @gioelelm, @pl-ki, @JobLeonard to discuss:

You're all invited because of your expertise writing single-cell pipelines. I think if we could all agree on a file format and some conventions, we could make our respective code much more interoperable. At a minimum, this would require each library to support reading and writing a common file format. I propose we go with Loom. :-)

But it would be extremely wasteful to write the whole dataset to file if you only added an attribute (e.g. a cluster ID). Better instead if each library just reads and writes the necessary attributes. How great would it be if we could easily do this (and vice versa in R):

    import loompy
    import scanpy
    import cytograph           # Linnarsson-lab pipeline
    from rpy2 import robjects  # to call the R packages from Python

    seurat = robjects.r['seurat']
    pagoda = robjects.r['pagoda']
    with loompy.connect("filename.loom") as ds:
        scanpy.api.pp.filter_cells(ds)
        cytograph.multiscale_knn(ds)
        seurat.cluster(ds)
        pagoda.diff_expression(ds)

(Of course, the code is made-up, but you get the idea)

To get there, we would need to agree on some conventions. To get started, here's a proposal based on our current conventions, slightly cleaned up:

Orientation

Rows are genes, columns are cells.

Column attributes

Cluster (the cluster ID, int label from 0 to n_clusters)
CellID (string; unique, not really used, but assumed to be present)
X, Y, Z (embedding, e.g. tSNE or PCA; float; Z is optional)
Valid (0 or 1, indicating valid cells)

Row attributes

Gene (gene symbol; string, human-readable, not unique)
Accession (string, computer-readable, unique)
Selected (e.g. high-variance genes, 1 or 0)
Valid (passed some minimal criteria for being valid, 1 or 0)

Graphs on cells

KNN (the knn graph)
MKNN (the mutual KNN graph)

Graphs on genes

(none)
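Put together, a minimal sketch of writing a file that follows this proposal might look as follows (all values are toy placeholders, not real data):

    import numpy as np
    import loompy

    n_genes, n_cells = 100, 50  # toy sizes
    matrix = np.random.poisson(1.0, size=(n_genes, n_cells))  # rows are genes, columns are cells
    row_attrs = {
        "Gene": np.array(["Gene%d" % i for i in range(n_genes)]),        # human-readable, not unique
        "Accession": np.array(["ACC%07d" % i for i in range(n_genes)]),  # computer-readable, unique
        "Selected": np.ones(n_genes, dtype="uint8"),
        "Valid": np.ones(n_genes, dtype="uint8"),
    }
    col_attrs = {
        "CellID": np.array(["cell_%d" % i for i in range(n_cells)]),
        "Cluster": np.zeros(n_cells, dtype="int64"),
        "X": np.random.randn(n_cells),
        "Y": np.random.randn(n_cells),
        "Valid": np.ones(n_cells, dtype="uint8"),
    }
    loompy.create("example.loom", matrix, row_attrs, col_attrs)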

We also have conventions for layers, used e.g. for gene enrichment. For example, an aggregate Loom file would have one column per cluster instead of per cell, and would have layers such as enrichment, enrichment_q and trinarization, which give various metrics per gene and cluster. Not sure if it makes sense to standardize those, but please let us know what your pipelines might need along these lines.

Note: I'm opening the discussion here, but these are really intended to be conventions for analysis libraries and visualization tools, not Loom the file format. Loom itself is intended to be more general, so we will not impose on it e.g. that there will always be a Gene attribute. We may want to use loom for totally different omics datasets in the future.

falexwolf commented 6 years ago

Thank you, Sten! This discussion is exactly what we need, I think.

Orientation: I'd vote for rows to be cells and columns to be genes, as in the modern classics of statistics (Hastie et al., 2009) and machine learning (Murphy, 2012), in dataframes in both R and Python, and in the established machine learning and statistics packages in Python (statsmodels, scikit-learn). But I acknowledge that for genomic data the tradition is the opposite, and I'm happy to conform to that if most of you think this is the better choice.

Other annotation/attributes: The naming conventions are fine! But I'd suggest that one also allows arbitrary multi-dimensional annotation. The PCA representation of data can easily have 50 dimensions. It is not very intuitive to label all of these by X, Y, ... or X_pca1, X_pca2, .... The same goes for the PCs themselves (PC1, PC2, etc.), which are gene attributes. See what I've written below on how we solve this within AnnData.

Graphs: KNN implies that there is a fixed degree k per node and one might go beyond that in the future (btw: What is the MKNN graph?). We might also have graphs whose weights correspond to distances and graphs whose weights correspond to probabilities and similar issues. But this could be distinguished simply by providing a suffix, e.g. KNN_distances. What about 'cell_graph', 'CellGraph', 'SingleCellGraph' or 'neighborhood_graph'?

I agree that the loom file format should stay as general as possible. For my taste, even more general. It would be nice if loom could at least cope with what AnnData can store, which is otherwise very similar to loom: we allow arbitrary structured and unstructured annotation/metadata/attributes, and only distinguish inherently one-dimensional annotation (numerical or categorical), multi-dimensional annotation (something like the PCA representation of data, 'X_pca', mentioned above) and unstructured annotation. For instance, graphs (e.g., the single-cell graph: a sparse matrix whose node attributes are cell attributes) do not fit into the data/annotation/attributes block scheme, and hdf5 does not support sparse matrices. Hence graphs appear as "unstructured annotation" within AnnData - even though they are sliced (subsampling, indexing, etc.). Also, we make use of the unstructured annotation for categorical data types, for non-single-cell graphs, etc. But this is now turning into the specifics of how Scanpy/AnnData objects can be represented as loom files: https://github.com/linnarsson-lab/loompy/issues/1#issuecomment-344525924 as opposed to the inherent hdf5-format of AnnData.

I guess similar issues arise for Seurat objects, Pagoda objects and others. I'd be interested in reading about these issues so that we can all find good best practices.

Cheers, Alex

slinnarsson commented 6 years ago

Thanks Alex!

2D attributes: Yes, this makes sense and would be easy to implement.

Graphs: In Loom, graphs are stored as (row, col, weight) arrays, equivalent to the coo format from scipy.sparse. This supports arbitrary weighted, directed and undirected graphs (even multigraphs, I suppose). MKNN is a mutual KNN, i.e. a KNN with non-mutual edges removed. It doesn't have a fixed K any longer of course.
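For illustration, here is roughly how a scipy.sparse adjacency matrix maps onto that storage, and one way to compute a mutual KNN (a sketch; the random matrix is just a stand-in for a real KNN graph):

    import scipy.sparse as sparse

    knn = sparse.random(100, 100, density=0.05, format="csr")  # stand-in for a real KNN graph
    coo = knn.tocoo()
    row, col, weight = coo.row, coo.col, coo.data  # the three arrays stored in the file

    # Mutual KNN: keep an edge only if it exists in both directions
    mknn = knn.minimum(knn.transpose())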

Unstructured data: can you explain in more detail what you allow as unstructured, and how that is written and read to the HDF5 format?

pkharchenko commented 6 years ago

Thank you for starting the discussion, Sten. It would indeed be great to have an interoperable format. It would obviously take more to agree on the details, but I first wanted to ask about the global structure.

There are multiple ways to analyze a given dataset, and often we want to maintain multiple results. In pagoda2 we found it useful to have a notion of dataset "reductions". These are essentially processed views of the dataset, such as different normalizations, projections of cells onto PCs, or a high-dimensional diffusion map (similar to the multi-dimensional attributes Alex mentioned). Some reductions have functional meanings, for instance regressing out cell cycle. A dataset "view" would be a better name.

However, the key structural impact of these is that individual analysis steps, such as clustering or embeddings, can then be run on different reductions. Because of this, we also keep track of which view of the data was used to generate subsequent results, so that the dataset has multiple views, each potentially with different corresponding cluster or embedding results. Do you envision some metadata that would tie annotations and embeddings to different dataset views?

Misc questions/comments:

  1. How would we maintain info on dataset normalization? Right now we do variance normalization with a constant gene-wise factor (attribute). Other normalizations would be easier to represent as a new matrix (layer). Should we have a convention for these?

  2. How about hierarchical clustering results? Fuzzy clustering could be represented as an attribute matrix, I guess.

  3. Extensions. Would loom format allow for pipeline-specific slots?

  4. Row vs. column orientation. We use genes as columns, as column access is faster and we query for individual genes more often than individual cells. This was also an advantage in differential expression testing, as it allowed for quick parallelization of per-gene tests. Many of these considerations, however, had to do with being able to access static results from a web browser over http. Given the complexity of hdf5, this doesn't look like a realistic scenario.

  5. Graphs (also diffusion maps, etc.). Again, we're fans of sparse matrices, which make for great graph representations. Short of that, I would suggest using a table representation with each row along the lines of (e1, e2, w, [attrs]), where e1 and e2 are elements (cells or genes), w is some weight, and attrs could store optional attributes in extra columns — see the sketch after this list.
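As a concrete illustration of that edge-table idea (a sketch using a numpy structured array; the field names are made up):

    import numpy as np

    # One record per edge: endpoints, weight, plus optional extra attributes
    edge_dtype = [("e1", "i8"), ("e2", "i8"), ("w", "f8"), ("distance", "f8")]
    edges = np.zeros(2, dtype=edge_dtype)
    edges[0] = (0, 5, 0.9, 1.2)  # cell 0 -- cell 5, weight 0.9, extra "distance" attribute
    edges[1] = (0, 7, 0.4, 2.8)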

slinnarsson commented 6 years ago

Views: So let me rephrase and see if I understand - you want to create an attribute ClusterID_1 (for example) and specify that it was derived by clustering on (say) the PCA attribute (a 2D array)? In our pipeline that kind of information is implicit, not explicit, and we store it in separate log files.

But more generally this also boils down to the question of having multiple analysis results. I.e. instead of standardizing on a single Cluster attribute for cluster labels, we might want to have multiple such attributes, and subsequent analyses would need to know which one to use.

How about we make standard naming schemes in addition to standard names? For example, cluster labels could be named ClusterID by default, and any additional cluster labels would be named ClusterID_{something} (e.g. ClusterID_ByLouvain). This will require each method that uses cluster labels to take an argument that specifies which cluster label to use (with default value ClusterID). I like this idea, as it would also make it possible to adapt to any non-conventional attribute names.
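In code, the proposed convention might look like this (a hypothetical sketch; diff_expression is a made-up function, not any library's API):

    import loompy

    def diff_expression(ds: loompy.LoomConnection, cluster_attr: str = "ClusterID"):
        # callers can pass e.g. cluster_attr="ClusterID_ByLouvain" instead
        labels = ds.col_attrs[cluster_attr]
        # ... run the per-cluster tests using `labels` ...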

Orientation: is column access faster? It never occurred to me to check. The matrix is block-compressed so I would have thought it would be symmetric.

Normalization: in our own work, we always recalculate things like normalization and PCA for every step, and just store raw data and results in the Loom file. This is wasteful of course, but simplifies the pipeline. Probably the only representation of normalization that would be easy to implement across pipelines would be to store a full normalized matrix explicitly. I would vote for creating a new file in this case, so that the normalized data can be input to any algorithm, and the algorithm doesn't need to know which specific layer to use. That's also fine: 1.3 million cells from 10X is just 38 GB, so it's not terribly inefficient.

More generally, and this relates to Alex's points above, there is the question of what should be stored in the file and what should be stored only transiently in memory. Anything needed for interoperability between pipelines would have to be stored in the file. But some things may be better recomputed at each step, because they are closely tied to each pipeline.

Extensions: Yes good idea, we should make a HDF5 group for that, e.g. /extensions/pagoda etc.
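With h5py, such a group is straightforward (the path, attribute and dataset names below are illustrative, not part of any spec):

    import numpy as np
    import h5py

    with h5py.File("filename.loom", "a") as f:
        ext = f.require_group("/extensions/pagoda")     # pipeline-specific slot
        ext.attrs["pagoda_version"] = "2.0"
        ext["reduction_pca"] = np.random.randn(50, 10)  # illustrative payload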

pkharchenko commented 6 years ago

Multiple views: Sten, I pick up two distinct usage approaches in your response:

  1. a loom file should be used to store a single "branch" of results (i.e. focus on a particular normalization, specific clustering scheme, specific embedding); or
  2. a loom file could contain multiple result branches (e.g. alternative clustering results, derived from different treatments, etc.).

Which one do you think we should strive for here?

On the idea of using naming convention (e.g. ClusterID_{something}): I think this would work well for marking the method by which the clustering was determined, but is probably too relaxed for referencing the source data. If loom is storing multiple layers of transformed or normalized data matrices, these have explicit paths to them. I think it would be better if we could reference data path(s) in association with results such as clustering or embeddings. Not sure how to do this easily in hdf5.

P.S. agree on the rest.

slinnarsson commented 6 years ago

The way we use Loom files is as (1) "a single branch". Our pipeline is based on Luigi from Spotify. You define a "task" which takes one or more input files and produces a single output. You string together tasks using these dependencies, and Luigi figures out what needs to be run, and what tasks can be run in parallel. (Luigi actually doesn't fit well with Loom because it tends to always want to make a new file for every step. This is something we'd like to fix.)

I would probably vote for going with one file, one branch, but maybe with concessions for multiple analyses in specific cases where it makes sense.

slinnarsson commented 6 years ago

Data types:

Loom has so far been very restricted in the data types allowed for attributes. The spec says float64 and string (fixed-length ASCII) only. This is to maximize compatibility.

For strings, I propose to stick to ascii, but to use XML entity encoding (e.g. things like &auml;). This allows you to pass Unicode through HDF5 while retaining compatibility with MATLAB and other languages that don't speak Unicode. It means that you will need to do the equivalent of html.unescape(x) for each string upon reading.
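In Python, that round-trip could look roughly like this (a sketch; the helper names are made up):

    import html

    def encode_attr(s: str) -> bytes:
        # non-ASCII characters become numeric character references, e.g. "ä" -> "&#228;"
        return s.encode("ascii", "xmlcharrefreplace")

    def decode_attr(b: bytes) -> str:
        return html.unescape(b.decode("ascii"))

    assert decode_attr(encode_attr("Tårta på tårta")) == "Tårta på tårta"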

For numbers, I propose to be more generous, and allow 8, 16, 32 and 64-bit ints, both signed and unsigned, as well as both 32 and 64-bit floats. I propose these same types would be allowed for the matrix layers (but no strings).

We store bools as numbers (1 and 0) currently, and I propose they would be unsigned bytes. This works out of the box with both MATLAB and C/C++.

JobLeonard commented 6 years ago

I think I will stay out of most of the discussion about structure and naming conventions, since I do not know what the standards and typical use-cases within the field are.

However, it would be nice if the conventions are easy to automate in scripts. I'm pretty sure we were all already thinking along these lines, but I want to point out what this means for me.

The loom-viewer currently detects attributes that fit the conventions used in Sten's group, and adapts the interface to that for convenience:

[screenshot: the loom-viewer adapting its interface to detected attributes]

Here it detected various attributes associated with X/Y position (in this case, previous "deprecated" conventions from before they "stabilised").

So good naming conventions/schemes will result in the loom viewer being more convenient to use too!

Regarding naming schemes: how about treating _ as a special "separator" symbol that signifies a relationship? This would only apply to the "keywords" of the naming conventions, of course, since people might also just use it as a surrogate for space to ensure compatibility. (I may just be repeating what Sten suggested about naming schemes in a different formulation.)

A question is whether this relationship would be directed: {a}_{b} could mean "a, subgroup b". It could also allow for nested grouping, which could be useful for building hierarchies: Cluster_Layer1_ByLouvain (I assume one wants to group metadata by layers).

(I'll stop here before I propose something that ends up making the naming schemes Turing complete)

Orientation: note that this issue extends beyond loading data from loom files themselves: if every other package in Python uses columns for genes, that would require a transpose each time we move data from one into the other. That is both slower and more error-prone. I do not know what the actual conventions in other packages are, but if I understand @falexwolf correctly, the norm in statistics and machine learning software is genes (variables) as columns?

JobLeonard commented 6 years ago

I forgot to mention one convention used in the loom-viewer:

For both row- and column- attributes, a "fake attribute" called (original order) is introduced in the viewer, which is just a set of ascending indices from 1 to row- or column-length. The viewer is very flexible in how it displays sparklines, and (original order) is used to allow sorting data back to the original order it came in:

[animation: sorting sparklines by an attribute, then restoring the original order]

Let's call this a pseudo-attribute: it is metadata about the rows and columns, but not actually part of the loom file itself.
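In other words (a toy sketch, not actual loom-viewer code):

    import numpy as np

    n = 5000                                    # number of columns (or rows)
    original_order = np.arange(1, n + 1)        # the "(original order)" pseudo-attribute

    perm = np.argsort(np.random.randn(n))       # sort by some arbitrary attribute...
    restore = np.argsort(original_order[perm])  # ...then sort by "(original order)" to undo it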

The choice of using ( ... ) in the label stems from a time when only alphanumeric characters and underscores were allowed for attribute names. For a while we used to have another one too, I forgot the name.

We could put (original order) on a list of reserved attribute names, but perhaps it could also be useful to make it a convention to avoid keys wrapped in parentheses in general. As other people start to write tools for loompy, they might have similar needs for pseudo-attributes, and this is a simple rule to explain. It also doesn't seem likely to interfere with other symbols that platforms might use to mark automatically generated data (an _ or # prefix, that kind of thing).

falexwolf commented 6 years ago

Two quick comments:

JobLeonard commented 6 years ago

> For storing matrices in memory: C stores entries of columns contiguously. Fortran stores entries of rows contiguously. Hence, in C, accessing columns is faster than accessing rows. The other way round in Fortran.

It's the other way around; per the wikipedia article:

> Row-major order is used in C/C++/Objective-C (for C-style arrays), PL/I, Pascal, Speakeasy, SAS, and Rasdaman.

> Column-major order is used in Fortran, MATLAB, GNU Octave, S-Plus, R, Julia, Scilab, and Eigen.

As far as Numpy goes, it supports both (but defaults to C), so we're not really bound to one or the other:

> order : {‘C’, ‘F’}, optional
>
> Row-major (C-style) or column-major (Fortran-style) order.

falexwolf commented 6 years ago

Edited: sorry, yes: row-major means rows are stored contiguously and can be retrieved quickly, column-major means columns are stored contiguously

JobLeonard commented 6 years ago

> row-major means columns are stored contiguously

I'm sincerely not sure whether that is how you would express it or not, but either way we can conclude that natural language is confusing and ambiguous, that we agree on what we mean, and that we're aware of that now. I think that is what matters most. ;)

Anyway, back to the actual orientation discussion. My usual disclaimer: I do not do actual research with data. Since I do not know how the data is used in practice, I have no idea what would be the more logical orientation.

However, I do have some ideas about what matters when deciding, and I hope those considerations are useful for the discussion. I won't touch the traditions/conventions argument, which should be considered but is obviously outside of my knowledge domain.

1: Should Genes or Cells be the major order?

First thing to note: whether genes or cells should be the major order is a separate question from whether we should make genes/cells row- or column-oriented. The former is about how the data is used. The latter depends on the answer to the first question, and on whether the platforms the data is used on are row- or column-major.

The first question is obviously the more important one, and answering it requires more than just determining which data is accessed more often.

Types of data access

The kind of access matters as much as how often something is accessed: it is faster to access 64 rows in sequence in one go than to access 64 random ones. We can distinguish random access vs. whole-block access. Major order matters for both, but it matters more for random access than for block access.

From the comments made (and after a quick chat with @gioelelm) I gather that individual genes tend to be queried at random, while cells tend to be read in whole blocks or as the entire dataset (I could be way off here, please correct me if I am).

Assuming that this interpretation is correct, that would suggest that gene major order would benefit more.

Data size

Another thing to consider is relative number of cells versus genes, and size of the data set.

I'm not a biologist, but as I understand, the number of genes is more or less fixed per organism (for example ~29k for the mouse data I've seen in Sten's group). The number of cells varies, and currently most sets I have seen in the group are less than 29k. However, this number will grow rapidly in the future, and into the millions.

If we have to choose, I would argue it is more important to optimise for the larger datasets (so total cells >>> total genes), because of the relative greater impact of performance improvements. Improving performance by 10% on a small data set might save minutes at best, whereas that 10% performance increase might save hours on larger data sets.

When total cells >>> total genes, cell major order would have a significantly more negative impact on gene access, than gene major order would have on cell access. This again speaks in favour of genes as major order. This applies mainly to datasets loaded in memory though, because of the next point.

Loom files are chunked, so major- and minor order should not matter regarding disk access; I just looked at the documentation, which had this to say: "The chunks of the dataset are allocated at independent locations throughout the HDF5 file and a B-tree maps chunk N-dimensional addresses to file addresses."

The HDF5 Group does suggest making chunk shapes asymmetrical when you know the data has a preferred access pattern. In our case, we might want to make each chunk contain more cells than genes. This would reduce disk access when accessing genes, and have relatively little cost when accessing cells (again assuming cells >>> genes).
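For example, with h5py (the shapes and chunk sizes below are purely illustrative):

    import h5py

    with h5py.File("chunk_test.h5", "w") as f:
        f.create_dataset(
            "matrix",
            shape=(28000, 100000),  # genes x cells
            chunks=(16, 256),       # each chunk spans few genes but many cells
            compression="gzip",
            dtype="float32",
        )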

2: After answering the previous question, what does that mean for orientation?

Let's say the answer to the previous question is that it makes more sense to make genes the major order.

Numpy defaults to row-major order, but also supports column-major. However, every other language used in the sciences is column-major (Fortran, R, Matlab, and, though nobody here seems to use this beautiful language, Julia).

I suspect that, effectively, Fortran matters only for the libraries powering the rest (BLAS, LAPACK), so we shouldn't worry about it. Matlab I recall being used more in physics and less in biology, so probably not that big of a deal either.

The big two contenders seem to be Python (with Numpy) and R.

We could make genes column-aligned and force all of our Numpy code to use column-major order, which would make things consistent, but that would lead to terrible performance with other Python packages that expect row-major data (presumably most).

For a while I considered a silly, somewhat blasphemous idea: let the Python and R loom libraries use whatever orientation makes genes the major order (not the loom file itself, which should have a fixed orientation). This would make genes align with rows in Python, and with columns in R. It would have the convenient consequence that it aligns with the statistics traditions in R and the genomics traditions in Python (only the Python stats and machine learning packages would be left out), meaning it fits expectations for most people (as long as they stick to their own platform).

Of course, this would lead to confusion when you use both packages, and it would mean row_attributes and col_attributes switch from platform to platform, which would complicate naming conventions. Sounds problematic to me, but maybe it isn't as bad as I think.

I guess in the end the main consideration should be how the Python and R platforms are used: is there more random access, or block access, and more gene or cell access?

Let's say that Python is typically used in a fashion that favours random access, and R in a way that favours whole-dataset block access. In that case, the benefit of making genes row-aligned should be bigger in general. A similar argument holds if Python does more gene access and R does more cell access.

But I have no idea about what this is like in practice, so I'll leave it at that.

Hope this was somewhat useful, /Job

falexwolf commented 6 years ago

Right, my bad. I meant for row-major that iterating over column indices is fast as it's contiguous. But this means that the row lies contiguous in memory and can be quickly retrieved. I edited the statement above to not keep confusing people...

I think your statements above make sense; I'd say you're right, single access to genes would be more frequent than to cells. Also, I believe you that it's better to have the dimension with more entries (cells, as nr. cells >> nr. genes) be the minor order; so genes as major order; I'll make some experiments with that... ;)

However, I'd say one should make the decision based on simple, clear conventions that are the same between languages... I think both Python and R have the flexibility of interpreting arrays in memory as column- or row-major (it's just a matter of the indexing convention), and hence transpositions can be avoided.

If loom files are ordered with genes as rows, Scanpy will be able to cope with that efficiently. Nonetheless, I think Scanpy should not stand against the conventions of the whole Python ecosystem and the stat/ML books; it will continue to present variables (genes) as columns to the user and observations/samples (cells) as rows, even though the notions "columns" and "rows" are never used - instead, after a long discussion, we decided on the names "var" for variables and "smp" for samples (maybe "obs" for observations in the future) to avoid running into these convention issues. Looking into the loom file, people might then get confused as things are called "rows" and "cols", but this might be something to live with.

flying-sheep commented 6 years ago

Hi, I’m Philipp – also from Fabian Theis’ lab – and I designed AnnData with Alex.

I wrote a little piece on why we need to stop talking about “row annotation” and “column annotation”. There’s only cells and genes; rows and columns are exclusively an implementation detail.

Because of this, the vital consideration about “how to transpose our matrix” is simply about what performs better – “gene major” or “cell major”.

Now there’s two considerations:

  1. reading/writing speed of the HDF5 file and storage size. there’s a difference for sparse data.
  2. algorithm speed. I assume e.g. kNN graph creation works better with cells in the major dimension, while @pkharchenko said that they do many per-gene tests, preferring genes to be there.

I propose:

  1. We check which layout has the better size and read/write speed for a high cell number and lower gene number.
  2. We do a real-world benchmark of multiple workflows, once with genes in the contiguous memory dimension and once with cells, and look at the results.

Only then can we make an informed decision about what to choose. And only a real-world benchmark can tell us the truth: neither microbenchmarks nor theorizing over performance will necessarily be useful.

JobLeonard commented 6 years ago

> Also, I believe you that it's better to have the dimension with more entries (cells, as nr. cells >> nr. genes) be the minor order; so genes as major order; I'll make some experiments with that... ;)

Well, to my shame you took the one thing for granted where I accidentally swapped things. Or rather... it's complicated...

Let's just do the experiments ;)

> I wrote a little piece on why we need to stop talking about “row annotation” and “column annotation”. There’s only cells and genes; rows and columns are exclusively an implementation detail.

Well, this would make the per-platform suggestion more appealing. However, there was some hope that Loom might be useful in other contexts than single-cell data, and then having the labels fixed to "genes" and "cells" might be problematic. This could be fixed by some convention for labelling the major/minor axis, but that again introduces extra syntax.

Also, maybe I'm missing something, but I don't think rows and columns are "exclusively an implementation detail".

When describing an m by n matrix, m is always the rows and n always the columns. Currently in loompy, if one requests ten genes, one gets a 10 x total-cells matrix, and when requesting ten cells, loompy returns a total-genes x 10 matrix. So users will learn to expect that genes are rows.

Say we avoid that by returning a 10 x total-genes matrix in the second case. For R, being column-major, this would be total-cells/genes x 10. But even then: users will learn that the major order of whatever is requested is rows for Python, and columns for R. Either that, or we have the same problem of one of them not utilising major order again. On top of that, it adds an extra transpose that might be unnecessary.

My point is: no matter what, some association with "rows = genes/cells" or "rows = major/minor order" seems inevitable. But again, given my lack of hands-on experience with the data I might be missing something.

> reading/writing speed of the HDF5 file and storage size. there’s a difference for sparse data.

I have actually spent quite a bit of time trying to tune the HDF5 performance of loom files (see here and here), looking for the best trade-off in terms of compression size and disk access speed.

The "high level" questions regarding orientation fit here, but I'm worried diving further into these technical details will make the discussion wander off-topic. So I opened a separate issue for this low-level aspect. Shall we continue the performance considerations and testing there, while keeping the "user expectations"/traditions discussion here?

flying-sheep commented 6 years ago

If you want to be more general, we can talk about “observations” and “features”, so that’s a bit of a tangent I guess.

Loom is a data format. Its job is to contain the data with the optimal memory layout for what it’s used for. If cell/observation-major is faster and smaller, we do that, or the other way round.

Everything else is API design. We will just return an object of a class that has the dimension order we want. What I mean is that e.g. CSC and CSR matrices contain exactly the same data… but for one you access the major dimension with mat[i, :]/mat[i, ] and for the other with mat[:, i]/mat[, i], respectively.

The only problem is dense data: R has a built-in matrix class that is column-major. But I don’t know if the other languages have this strong binding. Numpy can do both via np.array(..., order='F' | 'C')
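A tiny demonstration of that point (a sketch):

    import numpy as np
    import scipy.sparse as sparse

    dense = np.arange(6).reshape(2, 3)
    csr = sparse.csr_matrix(dense)  # fast row slicing:    csr[i, :]
    csc = sparse.csc_matrix(dense)  # fast column slicing: csc[:, j]
    assert (csr.toarray() == csc.toarray()).all()  # exactly the same data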

falexwolf commented 6 years ago

@flying-sheep: please, let's do observations/variables or samples/features. The first is more common in stat, second more common in ML. My impression is the first pair causes the least confusion for bio-related users. Hence, in AnnData, we might consider transition from "samples" to "observations".

@JobLeonard: this is the most general as you can get in any data science problem, right? rows and cols, by contrast, are notions that are not specific to data and suggest that both dimensions play the same roles, as in a graph adjacency matrix, for instance

slinnarsson commented 6 years ago

But the mapping from genes/cells to observations/variables depends on what you want to do and can’t be known in advance. If you are clustering cells, the cells are the observations but if you’re inferring gene regulatory networks, the genes are the observations.

One of the goals of Loom was exactly to allow both dimensions to play the same role. This is different from almost every other API out there, which enforces a different API for each dimension. It is one of the great frustrations when trying to store omics data in databases, for example.

Transposing a matrix is fast relative to loading data from disk so let’s not worry about the performance aspects of the orientation of the file. On disk, Loom is block-compressed so rows and columns are (ought to be?) equally fast to access.

I haven’t heard any very compelling arguments for any particular orientation, so we will stick with genes in rows. This will save us many months of bugs since all our existing code is based on this convention.


JobLeonard commented 6 years ago

That still leaves the question if HDF5 works as advertised, and potential performance issues there. But issue #22 is exactly about that, so we need not further discuss that here.

slinnarsson commented 6 years ago

Picking up the thread again, I'm happy to announce Loom 2.0, an almost complete rewrite of loompy. Go to the release notes for the full list of changes. In short, v2.0 implements many of the features requested above (such as Unicode support, multidimensional attributes, and multidimensional global attributes).

It also supports a powerful new concept of in-memory views (essentially a slice through the file, including all layers, attributes, and graphs) which works great with the new scan() method for out-of-memory algorithms.

It is much more Pythonic, with a uniform API for reading/writing/deleting attributes, graphs and layers.

It is more generous (allowing almost any kind of list, tuple or array to be assigned to attributes, or any kind of sparse or dense adjacency matrix for graphs). At the same time, it normalizes everything to conform to the file spec.
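For example (a sketch against the new API; the file name is a placeholder, and it assumes a file with a Valid column attribute per the conventions above):

    import loompy

    with loompy.connect("filename.loom") as ds:
        valid = ds.ca.Valid == 1                     # uniform dot-access to column attributes
        for ix, selection, view in ds.scan(axis=1):  # out-of-memory scan along columns
            block = view[:, :]                       # a view slices all layers, attrs and graphs
            # ... run an algorithm on `block`, view.ca, view.ra ...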

Finally, I have updated the file format specification to be more specific about the allowed datatypes and shapes.

All this, and it remains fully backwards compatible with your old loom files. I have used 2.0 for a couple of months already and it's a significant step forward. Your code will be simpler, more expressive and more capable.

Note: your current code should mostly work without problems. You'll get log messages for deprecated methods and attributes, but they can be safely ignored. There are two breaking changes, both mentioned in the release notes. After fixing instances of these two issues, our current analysis pipeline runs without error and uses nearly every feature of loompy 2.0, so I would consider this a reasonably stable release. Out-of-memory manifold learning and clustering of a 500k cells by 27k genes dataset with all the bells and whistles runs just fine on my MacBook with 16 GB RAM (admittedly, it takes a few hours)!

I've updated the docs, which are now (for technical reasons) hosted at http://linnarssonlab.org/loompy.

falexwolf commented 6 years ago

Thank you for this, Sten! :) Just before Christmas, Scanpy/AnnData incorporated .loom support: https://scanpy.readthedocs.io (docs read, docs write, code read, code write). Unfortunately, still with the old loom version. We will work on making use of the possibilities of the new version, which will allow losing less information when exporting to .loom.

What might be interesting for all efforts of storing sparse matrices in hdf5: I've worked on a more basic way of interfacing sparse matrices in hdf5 files, which works a bit differently than in loom, but provides some advantages and additional functionality.

We mentioned before that we planned, from the very beginning, to move from a static to a dynamically-backed AnnData object. We did so in the past weeks. AnnData now offers both: the convenience of fancy indexing and all that pandas/numpy offer for accessing and manipulating data ("memory mode"), versus full hdf5-backing ("backed mode"), with reduced speed and fewer possibilities for accessing and manipulating data. See my thoughts on this here. Given that loom didn't satisfy all that we need for backing AnnData objects, we currently base this on a very similar hdf5-format (but with additional support for categorical data types, arbitrary unstructured annotations, and hdf5 sparse datasets), but will be happy to move away from this once loom can fully back AnnData objects.

PS: A revised version of the Scanpy draft will appear in Genome Biology soon, and we reference and acknowledge the discussion here and Sten's comments there. It really helped us rethink AnnData's hdf5-backing. PPS: I haven't been able to post this here or anywhere else earlier because my wife and I just got twins... 😄 So I will be slow in responding. Maybe @flying-sheep can help out here and there, regarding anndata and Scanpy...

slinnarsson commented 6 years ago

Congratulations on the twins! Should keep you very busy :-)

It would be great if you or @flying-sheep could have a look at loompy 2 and let me know what is missing for you to use it as backing for AnnData. I'm not sure I understand the specific requirements for categorical datatypes (we just use strings or integers typically). But I think/hope loom now supports your notion of arbitrary unstructured data (let me know if not!).

As for sparse data, I'll take a look at your work and think a little more. Storing as CSC will probably be much faster along columns (compared to loom) and much slower along rows. Can you append data to such files without loading all the data? But I'm wondering if we could have a concept of "freezing" a file, after which data can no longer be appended. Before freezing, we could use block-compressed (or COO) storage. Upon freezing, we could create both the CSC and the CSR representations, for fast access along both axes. I think that would work well, as we usually have separate phases of (1) gathering all the data into a single loom file and (2) running algorithms on that.
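A rough sketch of what freezing could mean, purely to illustrate the idea (not loompy API; names are made up):

    import h5py
    import scipy.sparse as sparse

    def freeze(coo: sparse.coo_matrix, path: str) -> None:
        # Materialize both representations so either axis is fast afterwards
        with h5py.File(path, "a") as f:
            for name, m in (("csr", coo.tocsr()), ("csc", coo.tocsc())):
                g = f.require_group("/frozen/" + name)
                g["data"] = m.data
                g["indices"] = m.indices
                g["indptr"] = m.indptr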

falexwolf commented 6 years ago

Thank you, Sten, and a happy new year to everybody!

Of course, we will look into loompy 2! We will gradually improve the interface with anndata and converge to something that is mature with clear and powerful conventions.

JobLeonard commented 6 years ago

I have made a separate issue for keeping track of real file changes, and one potential solution would be a convention for keeping track of such information inside the actual loom file (although there might be other ways). Please have a look at #26.