ivirshup opened this issue 1 year ago
While I see the argument in favor, I think we should keep only DaskDataFrame
for the moment, for the following reasons:
- `read_zarr()` returns immediately. Most of the time the user may not need points.
- we may anyway replace Dask DataFrame for GeoDataFrame.
- if the user does want things in memory, we could have an API like `read_zarr(load_images_in_memory=True, load_points_in_memory=True)` etc., or a function `sdata.load_element_in_memory('element_type', 'element_name')` (a sketch follows below), but we need to understand the implications of this with the io. Since the workflow with the io is still to be discussed (as we saw in this PR https://github.com/scverse/spatialdata/pull/138), I would wait until we have that clear before making the change described in this issue.

Developing this point more, I think that the best would be to have everything lazy by default, even the shapes and the table, and then load things in memory if asked explicitly by the user, or if needed by an algorithm.
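A minimal sketch of what such an eager-loading helper could look like, assuming dict-like `sdata.images` / `sdata.points` containers and the function name from the proposal above (all names are hypothetical; materialization is just dask's `.compute()`):

```python
def load_element_in_memory(sdata, element_type: str, element_name: str):
    """Hypothetical helper: materialize one lazy element of a SpatialData object.

    element_type is e.g. 'images' or 'points'; element_name is the key
    of the element inside that container.
    """
    elements = getattr(sdata, element_type)  # dict-like container
    element = elements[element_name]
    if hasattr(element, "compute"):  # dask-backed array or dataframe
        elements[element_name] = element.compute()
    return sdata
```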
@gtca, in postdata everything is lazy by default, right? If I remember correctly, if the user loads something then it persists in memory; is that correct?
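For reference, this is the Dask behaviour under discussion: the read itself is lazy, `persist()` pins the loaded partitions in memory while keeping the dask API, and `compute()` converts to pandas (the file name is a placeholder):

```python
import dask.dataframe as dd

ddf = dd.read_parquet("transcripts.parquet")  # lazy: returns immediately
ddf = ddf.persist()  # loads the partitions and keeps them pinned in memory
df = ddf.compute()   # fully materializes into an in-memory pandas DataFrame
```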
> the xenium points fit in memory but loading is slow
This doesn't fit my experience. The `_get_points` function from spatialdata-io is fairly slow, but `read_parquet` seems quite fast.
```python
%%time
transcripts_pq = pd.read_parquet(path / XeniumKeys.TRANSCRIPTS_FILE)
```
```
CPU times: user 6.37 s, sys: 1.7 s, total: 8.07 s
Wall time: 1.88 s
```

```python
%%time
transcripts = _get_points(path, specs)
```
```
CPU times: user 20.8 s, sys: 2.64 s, total: 23.4 s
Wall time: 17.6 s
```
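For comparison, a lazy read of the same file via Dask should return almost immediately, since only the parquet metadata is touched up front (a sketch reusing `path` and `XeniumKeys` from the snippet above):

```python
import dask.dataframe as dd

# Only parquet metadata is read here; row groups are pulled in lazily
transcripts_ddf = dd.read_parquet(path / XeniumKeys.TRANSCRIPTS_FILE)
```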
> we may anyway replace Dask DataFrame for GeoDataFrame

I don't think this is a realistic option unless GeoDataFrame starts supporting 3D, which is not on any roadmap as far as I can tell.
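To illustrate the limitation: GeoPandas will store a z coordinate, but its geometric operations are planar, so 3D points are not really first-class (small demo using the Shapely/GeoPandas APIs):

```python
import geopandas as gpd
from shapely.geometry import Point

gdf = gpd.GeoDataFrame(geometry=[Point(1, 1, 5)])
print(gdf.geometry.iloc[0].has_z)    # True: the z coordinate is stored...
print(gdf.distance(Point(0, 0, 0)))  # ...but distance is 2D: sqrt(2), z ignored
```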
> read_zarr(load_images_in_memory=True, load_points_in_memory=True)

This is what we're trying to do with `read_dispatched` in anndata. It could be useful to link spatialdata up with the anndata IO system at some point.
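A sketch of the `read_dispatched` hook (experimental API, anndata 0.9+): the callback intercepts every element during reading and could, for example, substitute a lazy reader for selected elements (`data.zarr` is a placeholder path):

```python
import zarr
from anndata.experimental import read_dispatched

def callback(read_func, elem_name, elem, iospec):
    # read_func(elem) performs the default read; a real callback could
    # return a dask-backed object for chosen elem_names instead.
    return read_func(elem)

adata = read_dispatched(zarr.open("data.zarr", mode="r"), callback=callback)
```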
> I think it would be useful to have an API like `read_zarr(load_images_in_memory=True, load_points_in_memory=True)`

AnnData has `.to_memory()`, as well as an equivalent internal singledispatched `to_memory` function which we could export.
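For reference, the public side of that API in AnnData (the internal singledispatched `to_memory` is not shown; `data.h5ad` is a placeholder):

```python
import anndata as ad

# Backed mode keeps X on disk; .to_memory() materializes the whole object
adata = ad.read_h5ad("data.h5ad", backed="r")
adata_mem = adata.to_memory()
```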
I see the speed improvements, but I think that 1.88 seconds is still too slow, especially with datasets that can have many more points (like MERSCOPE, see the Zulip message; in general there are 10x more points for a single dataset) or, in the future, when people will have multiple Xenium slices. I think we want the SpatialData object to load immediately (lazy loading).
The `.to_memory()` function would be great; it would be worth trying an implementation soon, so that the user can load large portions of the data before long processing computations.
For many use cases (like Xenium) points can be handled completely in memory without issue. Given that, and all the reasons why the first "best practice" in the Dask DataFrame documentation is "use pandas", I think it's worth supporting pandas dataframes.

They have matched APIs, so it's possible many functions would not need modification to support both. For functions where we require one or the other, we can use a `sklearn.utils.check_array`-style utility to do coercion (a sketch follows below).
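A minimal sketch of such a coercion utility (the name `check_dataframe` and its keyword argument are hypothetical, modeled on `sklearn.utils.check_array`):

```python
import dask.dataframe as dd
import pandas as pd

def check_dataframe(df, allow_dask: bool = True):
    """Hypothetical check_array-style coercion for points dataframes."""
    if isinstance(df, pd.DataFrame):
        return df
    if isinstance(df, dd.DataFrame):
        # Functions that can work lazily pass allow_dask=True and get the
        # dask object back; others get an in-memory pandas DataFrame.
        return df if allow_dask else df.compute()
    raise TypeError(f"Expected a pandas or dask DataFrame, got {type(df).__name__}")
```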