scverse / spatialdata

An open and interoperable data framework for spatial omics data
https://spatialdata.scverse.org/
BSD 3-Clause "New" or "Revised" License

Tool to migrate data between SpatialData versions #680

Open aeisenbarth opened 1 month ago

aeisenbarth commented 1 month ago

Is your feature request related to a problem? Please describe.

When the specification of SpatialData changes, existing datasets do not pick up these changes and, under some circumstances, may even become incompatible.

For the in-memory representation, the library's reader functions support reading older versions (in most cases). However, users may want to upgrade the on-disk data to the latest version. Another case is when something is not covered by backward-compatibility of readers, e.g. due to errors in the data (#655).

Describe the solution you'd like

As discussed earlier, we want a tool for migrating data in case more format changes occur in the future. I am opening this new issue to track this, since #655 was too specific and is closed.

Describe alternatives you've considered

LucaMarconato commented 1 month ago

Thanks for tracking this here. An alternative to consider is opening GitHub discussions for less common issues, as done here: https://github.com/scverse/spatialdata/discussions/657.

For more common conversion steps, like migrating from the ShapesFormatV01 (Zarr ragged-array geopandas representation) to ShapesFormatV02 (GeoParquet), a migration tool would indeed be preferable.
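To make the idea of a migration tool for common conversion steps more concrete, here is a minimal sketch of a version-keyed registry of migration steps. Everything in it is an assumption for illustration: the names `MIGRATIONS`, `register` and `migrate`, the use of plain dicts as stand-ins for elements, and the placeholder step body are hypothetical and not part of the spatialdata API. Only the format names ShapesFormatV01 and ShapesFormatV02 come from this thread.

```python
from typing import Callable, Dict, Tuple

# Hypothetical registry: (element type, source version) -> (target version, step).
Migration = Callable[[dict], dict]
MIGRATIONS: Dict[Tuple[str, str], Tuple[str, Migration]] = {}


def register(element_type: str, from_version: str, to_version: str):
    """Decorator that registers one migration step in the hypothetical registry."""
    def decorator(func: Migration) -> Migration:
        MIGRATIONS[(element_type, from_version)] = (to_version, func)
        return func
    return decorator


@register("shapes", "V01", "V02")
def shapes_v01_to_v02(element: dict) -> dict:
    # Placeholder only: a real step would rewrite the on-disk representation
    # (ragged-array Zarr to GeoParquet). Here we just relabel the encoding.
    migrated = dict(element)
    migrated["encoding"] = "geoparquet"
    return migrated


def migrate(element_type: str, version: str, element: dict) -> Tuple[str, dict]:
    """Apply registered steps one after another until no further step exists."""
    while (element_type, version) in MIGRATIONS:
        version, step = MIGRATIONS[(element_type, version)]
        element = step(element)
    return version, element
```

Chaining steps this way would let each format change ship with exactly one small function, while data that is already at the latest version passes through unchanged.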

aeisenbarth commented 1 month ago

Current format changes:

Current versioning support in SpatialData:

Types of changes in SpatialData format:

Existing tools:

Aims:

Requirements:

LucaMarconato commented 3 weeks ago

Thanks for the detailed summary of the format changes. I would proceed as follows.

I would not include these three points in a migration tool:

For the example above this would roughly be: "Open a terminal and move table into tables, or if both are present, manually open them (they are standard AnnData objects) and choose which one to keep (or merge them). In doing that, check that the metadata keys and the region_key and instance_key columns are the ones you need".
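The manual step quoted above could be sketched in Python roughly as follows. This is only an illustration under stated assumptions: the helper name `move_table_into_tables` is hypothetical, and it assumes that `table` and `tables` are directories at the top level of the on-disk store, which this thread does not spell out. It deliberately does nothing when both are present, since that case needs a manual decision.

```python
from pathlib import Path


def move_table_into_tables(store: Path) -> str:
    """Rename <store>/table to <store>/tables when that is unambiguous.

    Hypothetical helper; the directory layout is an assumption.
    """
    old = store / "table"
    new = store / "tables"
    if not old.exists():
        # Nothing that looks like the old layout.
        return "nothing to do"
    if new.exists():
        # Both are present: the user has to inspect the AnnData objects
        # and decide which one to keep (or merge them).
        return "both present: resolve manually"
    old.rename(new)
    return "moved"
```

After moving, it would still be up to the user to check the metadata keys and the region_key and instance_key columns, as described in the quoted instructions.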

I think that having a migration tool dealing with the first 3 problems would be complex and it's better to explain to the user what the problem is and how to build a solution, so that they know what happens and they can choose a solution that suits them. For instance, there is no canonical way to fix missing radii because they were negative, so I'd let the user manually choose how to address them.

What do you think about this way to proceed?

giovp commented 1 week ago

Super useful summary, @aeisenbarth. I also agree with @LucaMarconato that a separate tool is probably overkill. I think the best way would be to stick to the format version as much as possible, and reflect this in the code, without changing the API.

(?) Ensure reader can read all old data (https://github.com/scverse/spatialdata/issues/655, https://github.com/scverse/spatialdata/issues/624) that has not been migrated.

I think this ideally would be true, we should strive to make this possible imho.