Efficiently storing millions of spots

tischi commented 1 year ago

@d-v-b following up on our Zoom discussion just now: would you have a recommendation for how to store the output of a spatial-omics analysis, i.e. millions of spots, where each spot has one (or in our case several) 2D or 3D coordinates, a gene name string, and maybe some additional properties such as "detection quality"? I think the requirements would be column-wise (and row-wise) chunked loading from a file system but also from "the internet" (maybe an S3 object store).
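To make the access pattern concrete, here is a toy sketch of what "column-wise and row-wise chunked loading" could mean. It is not a proposed format: the directory layout, the file names (`0.bin`, `meta.json`), and the helpers `write_column` and `read_rows` are all made up for illustration, and it only handles numeric columns using the Python standard library.

```python
import json
import tempfile
from array import array
from pathlib import Path

CHUNK_ROWS = 4  # rows per chunk; tiny so the example is easy to follow


def write_column(root: Path, name: str, values: list[float]) -> None:
    """Write one column as a series of fixed-size row chunks (one file each)."""
    col_dir = root / name
    col_dir.mkdir(parents=True)
    for start in range(0, len(values), CHUNK_ROWS):
        chunk = array("d", values[start:start + CHUNK_ROWS])
        (col_dir / f"{start // CHUNK_ROWS}.bin").write_bytes(chunk.tobytes())
    (col_dir / "meta.json").write_text(json.dumps({"rows": len(values)}))


def read_rows(root: Path, name: str, start: int, stop: int) -> list[float]:
    """Read rows [start, stop) of one column, touching only the chunks needed."""
    out: list[float] = []
    if stop <= start:
        return out
    first, last = start // CHUNK_ROWS, (stop - 1) // CHUNK_ROWS
    for idx in range(first, last + 1):
        chunk = array("d")
        chunk.frombytes((root / name / f"{idx}.bin").read_bytes())
        lo = max(start - idx * CHUNK_ROWS, 0)
        hi = min(stop - idx * CHUNK_ROWS, len(chunk))
        out.extend(chunk[lo:hi])
    return out


root = Path(tempfile.mkdtemp())
write_column(root, "x", [float(i) for i in range(10)])
write_column(root, "quality", [i / 10 for i in range(10)])

# Only chunks 1 and 2 of column "x" are read; "quality" is never opened.
x_slice = read_rows(root, "x", 5, 9)
```

On an object store each chunk file would be a separate object, so a reader would fetch only the objects covering the requested columns and row range. String columns such as gene names would need their own encoding, which this sketch leaves out.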

will-moore commented 1 year ago

That sounds like what's being proposed in #64 using AnnData. See sample preview at https://deploy-preview-20--ome-ngff-validator.netlify.app/?source=https://minio-dev.openmicroscopy.org/idr/temp_table/test_segment.zarr/tables/regions_table and in Vitessce at http://vitessce.io/#?edit=false&url=https://minio-dev.openmicroscopy.org/idr/temp_table/vitessce_anndata_config.json

That's not millions of points (we're loading everything at once) but it's zarr-based and on s3 so scaling up shouldn't be too hard.

d-v-b commented 1 year ago

I'm not an expert in that kind of data, so beyond "not in a zarr array", I can't give very authoritative advice. What I can say is that to me this sounds like a big table where one of the columns has spatial meaning (the coordinates), and I'm guessing you would probably be interested in doing spatial queries, so I would look at a database that supports those kinds of queries (a "spatial database"). Here's an embedded SQL database, and a resource with more information about this type of database. Several of these are embeddable, which means you don't need to start up a server to get the features of the database.
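As a rough illustration of the embedded-database idea, the sketch below stores a few spots in an in-memory SQLite database through Python's built-in `sqlite3` module and runs a bounding-box query. The table name, column names, and gene names are hypothetical. Note that it uses an ordinary index on `(x, y)`, not a true spatial index; a real spatial database would provide dedicated index structures for this kind of query.

```python
import sqlite3

# An embedded database: no server process, just a file (here: in memory).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE spots ("
    "id INTEGER PRIMARY KEY, x REAL, y REAL, gene TEXT, quality REAL)"
)

# Hypothetical example data: (x, y, gene name, detection quality).
spots = [
    (1.0, 2.0, "GeneA", 0.9),
    (5.5, 3.2, "GeneB", 0.7),
    (6.1, 3.9, "GeneA", 0.4),
    (40.0, 41.0, "GeneC", 0.8),
]
conn.executemany(
    "INSERT INTO spots (x, y, gene, quality) VALUES (?, ?, ?, ?)", spots
)

# Plain B-tree index on the coordinates (an assumption for this sketch,
# standing in for a proper spatial index).
conn.execute("CREATE INDEX spots_xy ON spots (x, y)")


def spots_in_box(xmin, xmax, ymin, ymax):
    """Return (gene, x, y) for all spots inside an axis-aligned box."""
    rows = conn.execute(
        "SELECT gene, x, y FROM spots "
        "WHERE x BETWEEN ? AND ? AND y BETWEEN ? AND ? ORDER BY id",
        (xmin, xmax, ymin, ymax),
    )
    return rows.fetchall()


hits = spots_in_box(5.0, 7.0, 3.0, 4.0)
```

Here `hits` contains the two spots that fall inside the box, and only the requested columns are returned.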

tischi commented 1 year ago

@will-moore to give a bit more context: @d-v-b mentioned that there may be alternative interesting options to store such data, other than zarr.

d-v-b commented 1 year ago

To be more specific, I think zarr is not a good choice for storing tabular data, given that much better solutions exist.

jkh1 commented 1 year ago

People working with LIDAR data have already developed file formats for gigantic point clouds. The standard format is .las (with .laz the compressed version). See https://www.asprs.org/divisions-committees/lidar-division/laser-las-file-format-exchange-activities. This is supported by the lidR package in R and the laspy module in Python, for example.

EDIT: Reading and writing las and laz files in R is done by the rlas package; lidR adds manipulation and visualization.

will-moore commented 1 year ago

So, if we go for some non-zarr solution, would that be in addition to the tables spec proposed at #64? E.g. use AnnData for "smaller" tables and something else for massive tables?

Currently, AnnData has good support in Python and reasonable support in JavaScript, but little/none in Java. What are the alternatives, are they "cloud friendly" and what's the language support like? Are the solutions generally "point cloud" formats (3D points), or something more generic? Apologies for all the questions, but I'm not really familiar with this field.

d-v-b commented 1 year ago

For tabular data in general, there are a lot of storage options -- see all the IO routines associated with a pandas dataframe for a sense of the scope. I'm not sure how well any of these formats scale with data size, but I would guess for enormous tabular datasets people go for databases.

More broadly, can someone explain why ngff should contain any table specification at all? From my point of view, storing tables seems completely orthogonal to the problem of storing images. I think we should focus on the latter, unless somehow there's conflict / interaction between the image spec and representations of tables.

jkh1 commented 1 year ago

@d-v-b As an example, consider super-resolution microscopy produces point clouds which can be and almost always are rendered as images. We may want to store the original point data (which contains more info) with the corresponding/derived images. Another example is that we may want to store extracted features with the corresponding segmentation masks.

@will-moore In my opinion, AnnData also doesn't have good support in R (wrapping the Python module doesn't count in my view because that's not robust). I am also not convinced by the AnnData format itself. It looks to me like a poor attempt at a relational database management system. In this regard, I agree with Davis. If we need to keep track of relations with indexed tables, we should use something like sqlite, which has much better support across the board.

Regarding point cloud data, I know las files are stored and accessed from s3 buckets but I don't have experience with this (but see also entwine). As for language support, I was going to mention the libLAS C++ library but it turns out it's being deprecated in favor of PDAL, which has a Python integration and Java bindings. There's also lastools, a collection of small command line tools. I am not suggesting to go with LAS, merely pointing out that other people/fields already deal with large point cloud data and have developed tools for them. In this context, I also forgot to mention the C++ PCL library, which has its own pcd file format. There are also older file formats developed primarily for 3D graphics, such as PLY, OBJ and VRML, that could be repurposed.

d-v-b commented 1 year ago

As an example, consider super-resolution microscopy produces point clouds which can be and almost always are rendered as images. We may want to store the original point data (which contains more info) with the corresponding/derived images. Another example is that we may want to store extracted features with the corresponding segmentation masks. - @jkh1

I completely understand wanting to associate tabular data with imaging data, but I don't understand the need to store the tables and the images in the same zarr container. The entire point of a cloud-friendly format like zarr is that you can address resources like arrays with a url or a path. So your tabular data can reference images stored in zarr via the path to those images. There's no performance advantage to storing the tables and the arrays in the same hierarchy. On the contrary, it adds a complication -- suddenly your tabular data reader has to keep up with the zarr api, which could change.

I maintain many tens of terabytes of chunked arrays on s3, and I have a tabular representation of them (in a postgres database). I have never once wanted to store the arrays and the tables in the same place. So I don't really understand this urge to bake a tabular representation into ome-ngff.

d-v-b commented 1 year ago

and if you really really want your tabular data to be stored with the zarr container... put the tabular data in the same folder / prefix as the zarr container, and use relative paths in the tabular representation to refer to zarr resources. But putting the tables inside the zarr container is adding a lot of complexity for no obvious gain.
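A minimal sketch of that "table next to the container" layout, assuming a CSV table that lives under the same prefix as the zarr container and stores paths relative to its own location. The bucket name, file names, column names, and the `resolve` helper are all hypothetical; this only shows the path arithmetic, not any actual zarr or S3 access.

```python
import csv
import io
import posixpath

# Hypothetical layout: the table sits beside the zarr container.
#   s3://bucket/experiment/features.csv
#   s3://bucket/experiment/image.zarr/labels/cells
TABLE_URL = "s3://bucket/experiment/features.csv"

# Stand-in for the table contents; a real table would be read from storage.
TABLE_TEXT = """label_id,area,labels_path
1,120.5,image.zarr/labels/cells
2,98.0,image.zarr/labels/cells
"""


def resolve(table_url: str, relative_path: str) -> str:
    """Resolve a path stored in the table against the table's own location."""
    scheme, sep, rest = table_url.partition("://")
    if not sep:  # plain filesystem path without a URL scheme
        scheme, rest = "", table_url
    base = posixpath.dirname(rest)
    joined = posixpath.normpath(posixpath.join(base, relative_path))
    return f"{scheme}{sep}{joined}"


rows = list(csv.DictReader(io.StringIO(TABLE_TEXT)))
resolved = [resolve(TABLE_URL, row["labels_path"]) for row in rows]
```

Because the references are relative, moving the whole `experiment` prefix to another bucket or to a local folder keeps them valid, and the table reader never needs to know anything about the zarr API.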

jkh1 commented 1 year ago

Yes, it's a matter of scoping, i.e. what do we want to cover with ngff? One advantage that could come with having more than images in the container is the standardisation that inclusion in the specs could bring to some commonly used image-related data.

joshmoore commented 1 year ago

Lacking the time today with 4 hours of just NGFF calls, I'm going to do something that I'd prefer not to and make some fairly off-the-cuff meta observations:

I assume this conversation isn't (currently) reaching the people you want it to, so minimally, please don't assume silence implies agreement.
I'm worried that other viewpoints (incl. those already expressed in existing issues) aren't being considered.
I personally don't think some of the language used in recent comments is necessary.

The fact is that, though having feedback and alternatives is great, this is coming very late in regard to the tables spec (and, elsewhere, the transforms spec). Some of the comments above (and elsewhere) appear flip when measured against the 12-18 months already invested.

Of course if there are blockers they need raising, but I'd urge everyone to keep in mind that the current specifications need not be the final word. Proposals for further changes especially with prototypes of the data, implementations, and specs remain very welcome.

will-moore commented 1 year ago

cc @kevinyamauchi

kevinyamauchi commented 1 year ago

Hello everyone! Thank you for the discussion. I apologize if any of my responses below are terse - I am a bit short on time at the moment.

scope of the tables spec PR

The original motivation was as @jkh1 described above - to allow users to store their extracted features/measurements with their image data. Based on positive feedback from the October 2022 community meetings, the scope of the spec is a table that annotates label images (see this comment and the PR main post). The current spec is not defining how to store large point clouds. Recognizing that there are many ways to store tabular data and the best one is dependent on the access pattern/use case, we intentionally scoped this to just "label tables" as a way to address a user need and begin testing tables in the NGFF.

should NGFF store tables?

I suppose this is the first question that needs to be answered. @d-v-b makes some good points above about the benefits of associating tabular data with the image data rather than storing it in the same container. On the other hand, as mentioned above, some people want to store tabular data with their images in the NGFF. My understanding based on the Sept 2021 NGFF spatial omics hackathon, Jan. 2022 NGFF hackathon, community meetings, and requests for comment is that tabular data is within scope for the NGFF. To be honest, it is disheartening to see us return to this fundamental question 18 months after the work started. That being said, I agree with @joshmoore that if there are major blockers, we should explore those.

table format choice for large point clouds

As mentioned above, #64 does not specify how to store large point clouds. There are many tabular formats specifically designed for performant point cloud rendering and/or analysis (some of which have already been mentioned). It is possible these will have to be explored/integrated/associated for large point clouds. However, it feels to me like this should be a separate discussion from #64, as my understanding is that #64 is not required to be the only and final table format and that we may have to consider other options for other use cases.

d-v-b commented 1 year ago

@kevinyamauchi thanks for your comments, and I understand how disheartening pushback can be after you put in a lot of work. However, despite arriving late in the game, I feel like my concerns are still valid.

Issues with storing tables in zarr

There are many file formats designed specifically for tabular data -- csv files, excel spreadsheets, feather files, parquet files, sqlite, distributed databases -- with much better cross-language support than zarr. The decision to store tables in zarr instead of a standard format instantly creates a barrier to analyze that data with conventional tabular / dataframe tooling. Parquet in particular is a binary format designed specifically for tabular data, with an emphasis on high performance and efficient memory access between different processes, and very broad language / library support. Zarr is worse than parquet on all those metrics. So choosing zarr for tables means storing tables in a format that is both less common and less performant than other options, and taking on the risk that the zarr API might somehow change in a way that inconveniences its use as a tabular storage format. This argues against storing tables in zarr, given all the other options for tabular data.

table format choice for x

Given that #64 does not address how to store point clouds, the extrapolation of the effort in #64 would be for someone to draft a "ome-ngff tables for large point clouds" proposal. And so on for the Nth variant of tabular data, e.g. for trees generated by tracking data, we might see yet another addition to the spec. And then every time one of these table formats change, we would have to also change the ome-ngff spec.

I don't think this evolution of the ome-ngff spec is sustainable or efficient. I think instead ome-ngff should be extremely specific about how images are stored, and extensible and composable with specifications of other types of data. Under this view, if communities want to use ome-ngff to store their tabular data, the onus is on those communities to define how they wish to integrate their tabular format with ome-ngff. This doesn't invalidate the work done in #64, it merely rescopes it -- As the maintainers of a bespoke tabular format, the AnnData community should be responsible for defining how that tabular format can be expressed in an ome-ngff container. And ome-ngff should only be responsible for making this possible, and notifying those communities well in advance when breaking changes are imminent. I don't see how the alternative could work.

will-moore commented 1 year ago

Let's suppose that #64 would be moved as-is from the NGFF spec to the AnnData docs, and the NGFF spec would instead simply refer to that doc and say "The AnnData community uses this specification for storing tabular data in OME-NGFF".

Advantages:

No need to bump the version of NGFF spec.
Future changes to the spec also occur independently of the NGFF spec.

Disadvantages:

Less clear to other users of OME-NGFF that the AnnData tables are supported by NGFF community and tools (more likely that such users will do their own thing, when AnnData would actually work fine). Instead of growing the community of "NGFF users who store their tables in AnnData", this could lead to more fragmented solutions that aren't supported in multiple tools.

As an OME developer, I wasn't previously familiar with AnnData, but in working on #64 and support for that spec in ome-zarr-py, napari-ome-zarr and ome-ngff-validator, I think that it can be of value to others too.

Maybe more discussion needed, but I'll post this for now - thinking about this takes time and probably needs to involve more than just those on this thread...

keller-mark commented 1 year ago

No need to bump the version of NGFF spec. - @will-moore

It is my understanding from https://github.com/ome/ngff/issues/83 that it is possible to version parts of the spec independently, which would alleviate this concern.

The decision to store tables in zarr instead of a standard format instantly creates a barrier to analyze that data with conventional tabular / dataframe tooling. - @d-v-b

I think an advantage here is storage of the Image and Table in the same format. Since the scope of this proposal is that the tabular data would be tied to the image data (e.g., derived from it), any software implementation that is reading/writing the image would already support Zarr and therefore does not need an additional dependency or much developer overhead to read/write the associated table.

taking on the risk that the zarr API might somehow change in a way that inconveniences its use as a tabular storage format. - @d-v-b

I think this risk is minimal given that the proposed implementation of tables for Zarr is not more than a convention for use of two central features: Arrays and Groups.

if communities want to use ome-ngff to store their tabular data, the onus is on those communities to define how they wish to integrate their tabular format with ome-ngff - @d-v-b

I think this implies that there are multiple distinct communities using OME-NGFF which I would argue is actually one: bioimaging. The recent OME-Zarr preprint highlights how NGFF has already enabled so many interoperable tools to be developed around the common spec.

Given that https://github.com/ome/ngff/pull/64 does not address how to store point clouds, the extrapolation of the effort in https://github.com/ome/ngff/pull/64 would be for someone to draft a "ome-ngff tables for large point clouds" proposal. And so on for the Nth variant of tabular data, e.g. for trees generated by tracking data, we might see yet another addition to the spec. And then every time one of these table formats change, we would have to also change the ome-ngff spec. - @d-v-b

While this may not be easy, if there is a clear shared use case within the bioimaging community, it would be valuable for that use case to be reflected in the NGFF spec, so that interoperable tools can be developed on top of it. I think as long as individual tool developers can continue to support a subset of the spec this is fine (e.g., a tool focused on quality control of super-resolution microscopy images would not need to support the part of the spec that defines how a tree for object tracking would be stored). As a developer of one of these tools, I would very much welcome a draft of a "ome-ngff tables for large point clouds" proposal.
