ivirshup opened this issue 1 year ago
While I see the argument in favor, I think we should keep only DaskDataFrame
for the moment, for the following reasons:
- `read_zarr()` returns immediately. Most of the time the user may not need points.
- we may anyway replace Dask DataFrame for GeoDataFrame.
- if the user does want things in memory, we could have an API like `read_zarr(load_images_in_memory=True, load_points_in_memory=True)` etc., or a function `sdata.load_element_in_memory('element_type', 'element_name')` (a sketch follows below), but we need to understand the implications of this with the io. Since the workflow with the io is still to be discussed (as we saw in this PR https://github.com/scverse/spatialdata/pull/138), I would wait until we have that clear before making the change described in this issue.

Developing this point more, I think that the best would be to have everything lazy by default, even the shapes and the table, and then load things in memory if asked explicitly by the user, or if needed by an algorithm.
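A minimal sketch of what such an eager-loading helper could look like, assuming dict-like `sdata.images` / `sdata.points` containers and the function name from the proposal above (all names are hypothetical; materialization is just dask's `.compute()`):

```python
def load_element_in_memory(sdata, element_type: str, element_name: str):
    """Hypothetical helper: materialize one lazy element of a SpatialData object.

    element_type is e.g. 'images' or 'points'; element_name is the key
    of the element inside that container.
    """
    elements = getattr(sdata, element_type)  # dict-like container
    element = elements[element_name]
    if hasattr(element, "compute"):  # dask-backed array or dataframe
        elements[element_name] = element.compute()
    return sdata
```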
@gtca, in postdata everything is lazy by default, right? If I remember correctly, if the user loads something then it persists in memory; is that correct?
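For reference, this is the Dask behaviour under discussion: the read itself is lazy, `persist()` pins the loaded partitions in memory while keeping the dask API, and `compute()` converts to pandas (the file name is a placeholder):

```python
import dask.dataframe as dd

ddf = dd.read_parquet("transcripts.parquet")  # lazy: returns immediately
ddf = ddf.persist()  # loads the partitions and keeps them pinned in memory
df = ddf.compute()   # fully materializes into an in-memory pandas DataFrame
```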
> the xenium points fit in memory but loading is slow
This doesn't fit my experience. The `_get_points` function from spatialdata-io is fairly slow, but `read_parquet` seems quite fast.
```python
%%time
transcripts_pq = pd.read_parquet(path / XeniumKeys.TRANSCRIPTS_FILE)
```
```
CPU times: user 6.37 s, sys: 1.7 s, total: 8.07 s
Wall time: 1.88 s
```

```python
%%time
transcripts = _get_points(path, specs)
```
```
CPU times: user 20.8 s, sys: 2.64 s, total: 23.4 s
Wall time: 17.6 s
```
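For comparison, a lazy read of the same file via Dask should return almost immediately, since only the parquet metadata is touched up front (a sketch reusing `path` and `XeniumKeys` from the snippet above):

```python
import dask.dataframe as dd

# Only parquet metadata is read here; row groups are pulled in lazily
transcripts_ddf = dd.read_parquet(path / XeniumKeys.TRANSCRIPTS_FILE)
```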
> we may anyway replace Dask DataFrame for GeoDataFrame

I don't think this is a realistic option unless GeoDataFrame starts supporting 3D, which is not on any roadmap as far as I can tell.
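To illustrate the limitation: GeoPandas will store a z coordinate, but its geometric operations are planar, so 3D points are not really first-class (small demo using the Shapely/GeoPandas APIs):

```python
import geopandas as gpd
from shapely.geometry import Point

gdf = gpd.GeoDataFrame(geometry=[Point(1, 1, 5)])
print(gdf.geometry.iloc[0].has_z)    # True: the z coordinate is stored...
print(gdf.distance(Point(0, 0, 0)))  # ...but distance is 2D: sqrt(2), z ignored
```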
> read_zarr(load_images_in_memory=True, load_points_in_memory=True)

This is what we're trying to do with `read_dispatched` in anndata. It could be useful to link spatialdata up with the anndata IO system at some point.
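A sketch of the `read_dispatched` hook (experimental API, anndata 0.9+): the callback intercepts every element during reading and could, for example, substitute a lazy reader for selected elements (`data.zarr` is a placeholder path):

```python
import zarr
from anndata.experimental import read_dispatched

def callback(read_func, elem_name, elem, iospec):
    # read_func(elem) performs the default read; a real callback could
    # return a dask-backed object for chosen elem_names instead.
    return read_func(elem)

adata = read_dispatched(zarr.open("data.zarr", mode="r"), callback=callback)
```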
> I think it would be useful to have an API like `read_zarr(load_images_in_memory=True, load_points_in_memory=True)`

AnnData has `.to_memory()`, as well as an equivalent internal singledispatched `to_memory` function which we could export.
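For reference, the public side of that API in AnnData (the internal singledispatched `to_memory` is not shown; `data.h5ad` is a placeholder):

```python
import anndata as ad

# Backed mode keeps X on disk; .to_memory() materializes the whole object
adata = ad.read_h5ad("data.h5ad", backed="r")
adata_mem = adata.to_memory()
```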
I see the speed improvements, but I think that 1.88 seconds is still too slow, especially with datasets that can have many more points (like MERSCOPE, see the Zulip message; in general there are 10x more points for a single dataset) or, in the future, when people will have multiple Xenium slices. I think we want the SpatialData object to load immediately (lazy loading).
The `.to_memory()` function would be great; it would be worth trying an implementation soon, so that the user can load large portions of the data before long processing computations.
For many use cases (like Xenium) points can be handled completely in memory without issue. Given that, and all the reasons why the first "best practice" in the Dask DataFrame documentation is "use pandas", I think it's worth supporting pandas dataframes.

They have matched APIs, so it's possible many functions would not need modification to support both. For functions where we require one or the other, we can use a `sklearn.utils.check_array`-style utility to do coercion (a sketch follows below).
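A minimal sketch of such a coercion utility (the name `check_dataframe` and its keyword argument are hypothetical, modeled on `sklearn.utils.check_array`):

```python
import dask.dataframe as dd
import pandas as pd

def check_dataframe(df, allow_dask: bool = True):
    """Hypothetical check_array-style coercion for points dataframes."""
    if isinstance(df, pd.DataFrame):
        return df
    if isinstance(df, dd.DataFrame):
        # Functions that can work lazily pass allow_dask=True and get the
        # dask object back; others get an in-memory pandas DataFrame.
        return df if allow_dask else df.compute()
    raise TypeError(f"Expected a pandas or dask DataFrame, got {type(df).__name__}")
```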