reillytilbury / coppafish

Python version of iss software
MIT License

Notebook refactor #248

Closed paulshuker closed 2 months ago

paulshuker commented 4 months ago

Interface:

Not much will change about how the notebook is interfaced. You create a new notebook or load in an existing notebook by doing

```python
notebook = Notebook("path/to/notebook")
```

You create a new notebook page that does not already exist in the notebook (`new_page = NotebookPage("page_name")`). You add variables to the notebook page (`new_page.variable_0 = "data"`), then you add the notebook page to the notebook (`notebook += new_page`), at which point the notebook automatically saves itself to disk with the new page.
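The workflow above can be sketched with a minimal in-memory mock. The `Notebook`/`NotebookPage` class names and the three-step usage come from the description; everything else here (attribute interception, the `pages` dict, the lack of real disk I/O) is illustrative, not coppafish's implementation:

```python
# Hypothetical sketch of the described interface (no disk I/O, not the real API).
class NotebookPage:
    def __init__(self, name):
        self.name = name
        self.variables = {}

    def __setattr__(self, key, value):
        if key in ("name", "variables"):
            super().__setattr__(key, value)
        else:
            # Any other attribute assignment is treated as a page variable.
            self.variables[key] = value


class Notebook:
    def __init__(self, path):
        self.path = path
        self.pages = {}

    def __iadd__(self, page):
        # The real notebook would also save itself to disk at this point.
        self.pages[page.name] = page
        return self


notebook = Notebook("path/to/notebook")
new_page = NotebookPage("page_name")
new_page.variable_0 = "data"
notebook += new_page
```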

Structure:

The idea is that the notebook will become the only thing created in the output directory during a pipeline run, except for the pipeline.log, which records outputs and timings.

The notebook will have a new structure. It will become a directory and have the structure:

```
notebook/
├─ metadata.json
├─ page_0/
│  ├─ metadata.json
│  ├─ small_variable_0.json
│  ├─ small_variable_1.json
│  ├─ small_variable_2.json
├─ page_1/
│  ├─ metadata.json
│  ├─ large_array.zarr/
│  ├─ small_array.npz
…
```

I think a .json or a .txt file is a good candidate for storing small variables like strings and lists, because a user/developer can open these files in a text editor and see exactly what value the variable has. It is just more convenient than having to do `from coppafish import Notebook; nb = Notebook("path/to/notebook.npz"); print(nb.page_name.variable_name)` to see a simple value.
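As a concrete illustration of the one-JSON-file-per-small-variable layout shown above (file and variable names are my own examples):

```python
# Sketch: store a small page variable as JSON so it is readable in any text editor.
import json
import pathlib
import tempfile

page_dir = pathlib.Path(tempfile.mkdtemp()) / "page_0"
page_dir.mkdir(parents=True)

# Save: one JSON file per small variable inside the page directory.
(page_dir / "use_tiles.json").write_text(json.dumps([0, 1, 2]))

# Load: json.loads recovers the value; a text editor shows it directly.
value = json.loads((page_dir / "use_tiles.json").read_text())
```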

Like before, there will be a dictionary of page names ("extract", "filter", …), each one containing a dictionary of the variable names ("use_tiles", …) that the page holds. In the past this information was stored in a json file called notebook_comments.json. I am going to make this programmatic instead by putting it inside a python script.

The dictionary of variable names will contain a description for each variable (this already exists), but it will also have a datatype associated with it (this already exists, but only for documentation purposes). I will use the datatype to decide how to save each variable. For example, numpy array variables will be saved as .npz compressed files, while variables specified as "zarr" will be saved as zarr arrays, which have the advantage of being memory mapped (lazy loaded) so that large arrays (like optical flow results) do not hog all the RAM.

Unlike before, adding a variable with the wrong type, like adding a zarr array to a variable expected to be of type "numpy array", will crash. Like before, adding a variable that is not in the variable dictionary for a page will crash. Like before, there is an error if you attempt to add a notebook page to a notebook before all of its variable names are filled in.
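A sketch of what the programmatic page/variable dictionary and its type checks could look like. The page name "filter", variable name "use_tiles", and the .npz/.zarr/.json formats come from the description; the dictionary shape, helper name, and error types are illustrative assumptions, not coppafish's actual code:

```python
# Hypothetical programmatic replacement for notebook_comments.json:
# each variable carries a description and a declared type, and the
# type decides the on-disk save format.
PAGE_VARIABLES = {
    "filter": {
        "use_tiles": {"type": "list", "description": "Indices of tiles used."},
        "image": {"type": "ndarray", "description": "Filtered image data."},
    },
}

SAVE_FORMAT = {
    "list": ".json",    # small, human-readable
    "ndarray": ".npz",  # compressed numpy archive
    "zarr": ".zarr",    # memory mapped (lazy loaded) array
}


def check_variable(page_name, var_name, value):
    """Crash (raise) on unknown variables or wrong types, as described above.

    Returns the file extension the variable would be saved with.
    """
    variables = PAGE_VARIABLES[page_name]
    if var_name not in variables:
        raise NameError(f"{var_name!r} is not a variable of page {page_name!r}")
    expected = variables[var_name]["type"]
    if expected == "list" and not isinstance(value, list):
        raise TypeError(f"{var_name!r} must be a list, got {type(value).__name__}")
    return SAVE_FORMAT[expected]
```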

Config:

In alpha 0.10.4, the notebook had some awareness of the config built into it. We will keep some of this functionality. The notebook, when first created, must be given the path to the config file, and it will then store all config variables within itself. If the config file ever changes after that, a warning will be given to the user whenever a new notebook page is added to the notebook. This method requires that the config file does not move location during a pipeline run, but I do not think any user will ever do that!
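One simple way the change detection could work is to store a hash of the config file's contents when the notebook is created, and compare against it whenever a page is added. This is only a sketch of the idea; the hashing approach, function names, and warning text are my own assumptions:

```python
# Sketch: detect config changes by comparing a stored content hash.
import hashlib
import pathlib
import tempfile
import warnings


def file_hash(path):
    """Hash the file's bytes so any edit to the config is detected."""
    return hashlib.sha256(pathlib.Path(path).read_bytes()).hexdigest()


config = pathlib.Path(tempfile.mkdtemp()) / "config.ini"
config.write_text("[extract]\nuse_tiles = 0, 1\n")

# Stored inside the notebook when it is first created.
saved_hash = file_hash(config)

# The user later edits the config file mid-pipeline.
config.write_text("[extract]\nuse_tiles = 0, 1, 2\n")

# Checked whenever a new page is added to the notebook.
changed = file_hash(config) != saved_hash
if changed:
    warnings.warn("Config file has changed since the notebook was created")
```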