Open aeisenbarth opened 1 month ago
Thanks for tracking this here. An alternative to consider is opening GitHub discussions for less common issues, as done here: https://github.com/scverse/spatialdata/discussions/657.
For more common conversion steps, like migrating from the `ShapesFormatV01` (Zarr ragged-array geopandas representation) to `ShapesFormatV02` (GeoParquet), a migration tool would indeed be preferable.
Current format changes:
- `table` to `tables` (#298)

Current versioning support in SpatialData:
`ShapesFormatV02` from `element.group.attrs["version"]`), starting with `spatialdata>=0.2.2`

Types of changes in SpatialData format:
Existing tools:
`manage.py migrate`)

Aims:
Requirements:
`\` by `\\`" must not be applied if already applied before.)

Thanks for the detailed summary of the format changes. I would proceed as follows.
I would not include these three points in a migration tool:
- Multiple tables renamed table to tables (https://github.com/scverse/spatialdata/issues/298)
- Circle radii must be finite numbers (https://github.com/scverse/spatialdata/issues/655)
- Names must match naming constraints (https://github.com/scverse/spatialdata/issues/624)
For the example above this would roughly be: "Open a terminal and move `table` into `tables`, or if both are present, manually open them (they are standard `AnnData` objects) and choose which one to keep (or merge them). In doing that, check that the metadata keys and the `region_key` and `instance_key` columns are the ones you need".
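As a rough illustration of the manual step quoted above, here is a minimal Python sketch that moves a legacy `table` directory into `tables` and refuses to act when both are present. The on-disk layout is an assumption made for illustration (a `table` directory at the store root, moved to `tables/<name>`), and `move_table_into_tables` is a hypothetical helper, not part of `spatialdata`. The sketch only moves directories; it does not create or update any Zarr group metadata, which a real store would likely also need.

```python
import shutil
from pathlib import Path


def move_table_into_tables(store: Path, name: str = "table") -> str:
    """Move a legacy ``table`` directory into ``tables/<name>``.

    Assumed layout (hypothetical): ``<store>/table`` is the old location and
    ``<store>/tables/<name>`` is the new one.
    """
    legacy = store / "table"
    current = store / "tables"

    if not legacy.exists():
        # Nothing to migrate, so running this twice is harmless.
        return "nothing to do"

    if current.exists() and any(current.iterdir()):
        # Both locations hold data: the user has to inspect the AnnData
        # objects and decide which one to keep (or merge them).
        raise RuntimeError(
            f"Both {legacy} and {current} contain data; resolve manually."
        )

    current.mkdir(exist_ok=True)
    shutil.move(str(legacy), str(current / name))
    return "moved"
```

Raising in the ambiguous case, rather than picking a winner, keeps the decision with the user, which is the point of the manual instructions.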
Zarr -> Parquet change (the only one that is reflected in the format).

I think that having a migration tool deal with the first 3 problems would be complex, and it's better to explain to users what the problem is and how to build a solution, so that they know what happens and can choose a solution that suits them. For instance, there is no canonical way to fix radii that are missing because they were negative, so I'd let the user manually choose how to address them.
What do you think about this way to proceed?
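To make the point about radii concrete, here is a small Python sketch of what "let the user choose" could look like. It is only an illustration: `repair_radii` and its `fix` callback are hypothetical names, radii are plain floats rather than a GeoDataFrame column, and "invalid" is assumed to mean non-finite (NaN or infinity).

```python
import math
from typing import Callable, List, Optional, Sequence, Tuple


def repair_radii(
    radii: Sequence[float],
    fix: Callable[[float], Optional[float]],
) -> List[Tuple[int, float]]:
    """Return (original index, radius) pairs after applying a user policy.

    ``fix`` is called for every non-finite radius. It returns a replacement
    value, or ``None`` to drop that circle entirely.
    """
    kept: List[Tuple[int, float]] = []
    for index, radius in enumerate(radii):
        if math.isfinite(radius):
            kept.append((index, radius))
            continue
        replacement = fix(radius)
        if replacement is not None:
            kept.append((index, replacement))
    return kept


radii = [1.0, float("nan"), 2.0, float("inf")]

# Policy A: drop circles with invalid radii.
dropped = repair_radii(radii, lambda r: None)

# Policy B: replace invalid radii with a value the user considers sensible.
filled = repair_radii(radii, lambda r: 0.5)
```

Both policies are defensible, which is why a tool that silently picks one would be hard to justify.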
Super useful summary @aeisenbarth, and I also agree with @LucaMarconato that a separate tool is probably overkill. I think the best way would be to stick to the format version as much as possible, and reflect this in the code, without changing the API.
(?) Ensure reader can read all old data (https://github.com/scverse/spatialdata/issues/655, https://github.com/scverse/spatialdata/issues/624) that has not been migrated.
I think this ideally would be true, we should strive to make this possible imho.
Is your feature request related to a problem? Please describe.
When the specification of SpatialData changes, existing datasets do not reflect these changes and, under some circumstances, may even become incompatible.
For the in-memory representation, the library's reader functions support reading older versions (in most cases). However, users may want to upgrade the on-disk data to the latest version. Another case is when something is not covered by backward-compatibility of readers, e.g. due to errors in the data (#655).
Describe the solution you'd like
As discussed earlier, we want a tool for migrating data if more format changes occur in the future. I am opening this new issue to track this, since #655 was too specific and is closed.
Describe alternatives you've considered
- `write` to save in the latest format: Some cases may not be covered by readers.
- `spatialdata` library: Migration is not a frequent use case and would bloat the library, especially if supporting special cases (one-time issues, erroneous data). Additionally, backward compatibility of certain features can easily be deprecated in the library while still being preserved in a separate tool.
- `spatialdata-migrate` tool
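As a sketch of how such a tool could keep migrations from being applied twice, the Python example below keys each step on the stored format version, using the backslash-escaping case as the example step. Everything here is hypothetical: the version strings, the `name` attribute, and the `MIGRATIONS` registry are made up for illustration, and a plain `dict` stands in for `element.group.attrs`.

```python
from typing import Callable, Dict, List, Tuple


def escape_backslashes(attrs: dict) -> None:
    """Example step: replace every single backslash by a double backslash."""
    attrs["name"] = attrs["name"].replace("\\", "\\\\")


# Maps the version a step starts from to (version it produces, step function).
# Versions "0.1" and "0.2" are placeholders, not real SpatialData versions.
MIGRATIONS: Dict[str, Tuple[str, Callable[[dict], None]]] = {
    "0.1": ("0.2", escape_backslashes),
}


def migrate(attrs: dict) -> List[str]:
    """Apply all pending steps in order; return the versions reached."""
    reached: List[str] = []
    while attrs.get("version") in MIGRATIONS:
        target, step = MIGRATIONS[attrs["version"]]
        step(attrs)
        # Recording the new version is what prevents a second application.
        attrs["version"] = target
        reached.append(target)
    return reached


attrs = {"version": "0.1", "name": "a\\b"}
first_run = migrate(attrs)
second_run = migrate(attrs)  # no pending steps, attrs stay unchanged
```

Because the escaping step is not idempotent by itself, the guard has to come from the stored version rather than from inspecting the data.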