spacetx / starfish

starfish: unified pipelines for image-based transcriptomics
https://spacetx-starfish.readthedocs.io/en/latest/
MIT License
223 stars 67 forks source link

SpotFindingResults Serialization to Disk (and loading) #1961

Open ctcisar opened 2 years ago

ctcisar commented 2 years ago

I have added two simple methods to save SpotFindingResults to disk and reload them. This is handled by calling the methods in physical_coord_ranges, log, and _results to save each object, and then writing a json file that identifies the indices and paths to each of these files. The loader opens this json and uses the loading method from each of said objects to recreate the original SpotFindingResults object.

This can be useful for storing, comparing, and troubleshooting intermediate results. Storing the contents of each xarray-like object as a netcdf allows for compatibility across most package version differences, unlike saving the entire SpotFindingResults to a Python pickle.

njmei commented 2 years ago

Would it be possible to not write the .json file and instead write all* the SpotFindingResults data (including log info) into 1 netcdf file?

berl commented 2 years ago

@ctcisar thanks for this! Making it easier to view intermediate results is super useful. I think @njmei may have also recently mentioned something like this being missing.
It would also be nice to have some test coverage here- something simple like https://github.com/spacetx/starfish/blob/master/starfish/core/types/test/test_decoded_spots.py that generates some data, saves it and reloads it using your new methods would be great.

ctcisar commented 2 years ago

@njmei I was thinking about doing something like that initially, but faced problems saving the SpotAttributes together. Unless a DataArray can contain items of type DataArray, and it's possible to save the whole thing at once?

@berl I wrote something like this using our own data as I was testing this, I'll write up and push a unit test proper shortly.

njmei commented 2 years ago

@ctcisar Because the SpotAttributes are pandas DataFrames (I think) under the hood, it should be possible to merge them into one DataFrame (with enough data to recreate individual SpotAttributes later) then convert to xarray and save together with coordinate data as a netcdf.

If I have some spare time this week, I'll see if can come up with an implementation that only produces/loads 1 file (probably netcdf). Starfish is already pretty heavy in terms of all the metadata files it produces, it would be nice to have a solution that doesn't produce even more metadata-like files.

ctcisar commented 2 years ago

@berl I added a testing file in the same format as the provided test, hopefully it's up to spec.

In the process of writing this, I realised that alternating (obj type Log).encode() and Log.decode() will result in each result getting nested in an additional dict, structured as {"log": [data]}. I do not know if this is intended behavior, but have fixed the loader to accommodate for this.

berl commented 2 years ago

@ctcisar thanks for adding the tests. sorry about the delays in the CI tests- I started them up and the linter fails now. can you run that locally, make fixes and push here again? The CI should run automatically now

ctcisar commented 2 years ago

@berl I was having difficulty getting flake8 to run on my system, it doesn't like the way my python environment is set up. I addressed all the things mentioned in CI on my last commit.

The error I get when I run make all locally (with the flake8 lines commented out of the Makefile) is the following:

AttributeError: 'DataArray' object has no attribute 'to_numpy'

But I'm not sure where this is coming from, because DataArray does have that. (notably, when I tried to run the xr version of assert equal, it ran into errors with non-str keys. So I used this to work around it)

ctcisar commented 2 years ago

I am at a bit of a loss for the last flake8 error. It says

starfish/core/types/test/test_saving_spots.py:9:1: I100 Import statements are in the wrong order. 'from starfish.core.types import PerImageSliceSpotResults, SpotAttributes, SpotFindingResults' should be before 'from starfish.types import Axes, Coordinates, Features'

Even though the order it is describing is the order they are imported in. I have also tried the reverse order, and that didn't make a difference.