scratchrealm / neurobass

Neurobass is a prototype of a web application for managing and running neuroscience analyses.
Apache License 2.0

processing tools - plugin system #32

Open magland opened 11 months ago

magland commented 11 months ago

@bendichter @luiztauffer

I started to put together a plugin system for processing tools. Right now, the plugins are included directly in the neurobass repo, but later we can separate them out into their own repos, and provide an easy recipe for users to contribute their own processing tools. Each compute resource can have its own set of processing tools.

I am using pydantic (thanks Luiz for pointing me to that) to define the schema for each tool. Those schemas get automatically uploaded to the neurobass database when a compute resource starts up, and then they are viewable in the tools tab:

https://neurobass.vercel.app/project/erjytuke?tab=processing-tools

Here are the definitions of the mountainsort5 and kilosort3 processing tools:

https://github.com/scratchrealm/neurobass/blob/main/python/neurobass/processing_tools/spike_sorting/Mountainsort5ProcessingTool.py

https://github.com/scratchrealm/neurobass/blob/main/python/neurobass/processing_tools/spike_sorting/Kilosort3ProcessingTool.py

And here is the plugin that includes them:

https://github.com/scratchrealm/neurobass/blob/main/python/neurobass/processing_tools/spike_sorting/SpikeSortingPlugin.py
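
For a rough idea of the shape of these definitions without following the links, here is a minimal, hypothetical sketch; the class names, fields, and registration method below are illustrative only, not the actual code in the repo:

from pydantic import BaseModel, Field

class Mountainsort5Schema(BaseModel):
    # A couple of illustrative parameters exposed to the web app as a form.
    electrical_series_path: str = Field(..., description="Path to the ElectricalSeries object in the NWB file")
    detect_threshold: float = Field(5.5, description="Detection threshold")

class Mountainsort5ProcessingTool:
    name = "mountainsort5"
    schema = Mountainsort5Schema

    @staticmethod
    def run(params: Mountainsort5Schema):
        # Load the recording, run the sorter, and write the output here.
        ...

class SpikeSortingPlugin:
    @staticmethod
    def get_processing_tools():
        # The compute resource collects these at startup and uploads the
        # pydantic-derived schemas to the neurobass database.
        return [Mountainsort5ProcessingTool]

Because the schemas are pydantic models, the JSON schema that the web app renders as a form can presumably be generated automatically (e.g. model_json_schema() in pydantic v2, or schema() in v1).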

To create a custom processing tool, the user would define the tool as a Python class with a pydantic schema (following the examples above) and include it in a plugin.

A nice feature is that everything is defined in one place, in Python.

I haven't finished gluing this all together, so the web app is still using the old system for the actual processing.

bendichter commented 11 months ago

This looks like a good framework! I'd like to have logic that can constrain the input parameters. For example, for spike sorting the electrical_series_path should be a drop-down of all the ElectricalSeries in the file. Would it be possible to refactor this so that the parameters form could be informed by the file?

magland commented 11 months ago

> This looks like a good framework! I'd like to have logic that can constrain the input parameters. For example, for spike sorting the electrical_series_path should be a drop-down of all the ElectricalSeries in the file. Would it be possible to refactor this so that the parameters form could be informed by the file?

Pydantic has a way to constrain fields. For example, constraining an integer to be between 1 and 10:

from pydantic import BaseModel, Field

class MyModel(BaseModel):
    value: int = Field(..., ge=1, le=10)

So I plan to support that in the GUI. Of course this doesn't handle the case where we want a dropdown with names of valid paths within the NWB file. For that I was thinking of hard-coding something on the front-end (for all spike sorters) until we can figure out a better way to specify this.
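
As a point of reference for that future mechanism, here is one way the list of candidate paths could be computed on whichever side has access to the file, assuming pynwb is available; this is a sketch, not something neurobass currently does:

from pynwb import NWBHDF5IO
from pynwb.ecephys import ElectricalSeries

def list_electrical_series_paths(nwb_path: str) -> list:
    # Return the paths of all ElectricalSeries objects in /acquisition,
    # which is what a drop-down for electrical_series_path would show.
    with NWBHDF5IO(nwb_path, mode="r") as io:
        nwbfile = io.read()
        return [
            f"/acquisition/{name}"
            for name, obj in nwbfile.acquisition.items()
            if isinstance(obj, ElectricalSeries)
        ]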

bendichter commented 11 months ago

Could we have a function for determining if an NWB file is appropriate for a given analysis?

magland commented 11 months ago

> Could we have a function for determining if an NWB file is appropriate for a given analysis?

I think we should hold off on that until the need emerges more clearly. For now, tools can be tagged as "spike_sorter", which signals to the front-end that the tool can be used when the NWB file has an ElectricalSeries object in /acquisition.
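
Concretely, that tagging could look something like the following; the attribute names are hypothetical and only meant to illustrate the filtering logic:

class Kilosort3ProcessingTool:
    name = "kilosort3"
    tags = ["spike_sorter"]

def applicable_tools(tools, file_has_electrical_series: bool):
    # Front-end-style filter: only offer spike sorters when the selected
    # NWB file has an ElectricalSeries object in /acquisition.
    return [
        t for t in tools
        if "spike_sorter" not in t.tags or file_has_electrical_series
    ]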

magland commented 11 months ago

BTW note: all jobs have been deleted from the database, and processing won't work again until I emerge from this refactoring.

bendichter commented 11 months ago

OK. This is going to come up for nearly every processing tool. Image segmentation only makes sense on data that have a OnePhotonSeries or a TwoPhotonSeries. Pose estimation only makes sense on inputs that have behavioral video in an ImageSeries object. I'm fine with putting this off, though, and building the app specifically for spike sorting for now so we can build a solid user experience for a particular use case.

bendichter commented 11 months ago

Also, there are other tools that have built this sort of analysis plugin system, e.g. FlyWheel Gears. Here is an example gear: https://github.com/flywheel-io/exchange/blob/master/gears/flywheel/example-gear.json This seems like a decent system for registering different analyses with a platform, and I think it makes sense to follow the neuroimaging community where it is convenient to do so.

luiztauffer commented 11 months ago

That is looking really cool, @magland! I have a few ideas and would like to hear your opinion on them:

Tools as pip packages vs containers:

> Each compute resource can have its own set of processing tools.

Does that mean that the processing tools will need to be installed on the compute resource? That approach would be difficult to maintain and would very quickly create incompatibilities between tools, I believe.

What if the tools were built, distributed, and used as containers? That would solve a couple of problems, starting with the maintenance and compatibility issues just mentioned.

Our template repository for tools could have a GitHub Actions workflow that automatically builds and publishes the images to GHCR, for free. The NeuroBAAS job definitions would then include the tool's GHCR URL as the source to pull from. To add a new tool, all the user would have to do is paste a GitHub URL into an "Add tool" field on the Tools page, or something like that...
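
To make the container idea a bit more concrete, a job definition could carry the GHCR image reference and the compute resource could run it with Docker; the field names and helper below are hypothetical, a sketch of the proposal rather than anything implemented:

import subprocess
from pydantic import BaseModel, Field

class ContainerToolSpec(BaseModel):
    # Hypothetical spec for a container-distributed tool.
    name: str
    image: str = Field(..., description="e.g. ghcr.io/some-org/some-sorter:0.1.0")

def run_containerized_tool(spec: ContainerToolSpec, workdir: str) -> None:
    # Pull the published image and run it with the job's working directory mounted.
    subprocess.run(["docker", "pull", spec.image], check=True)
    subprocess.run(
        ["docker", "run", "--rm", "-v", f"{workdir}:/work", spec.image],
        check=True,
    )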

Define NWB data types required by a tool

As @bendichter pointed out, this would be a very important feature. With a data-centric approach, once the user selects their source data it would be easy to parse its metadata, check which data types are present, and, as a consequence, determine which of the available tools could be applied to it. This would also highlight one of the main strengths of the NWB format, its standard data types, and indirectly encourage best practices when saving experimental data.

From the tool development perspective, we already need to write arguments such as electrical_series_path, so it wouldn't be much more hassle to explicitly define the required (and, eventually, optional) NWB data types for running that tool, in a metadata file for example. I believe this would also encourage tool developers to think in a data-centric manner, at least when writing the wrappers around their original tools for use by this system.
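
A minimal sketch of what such a metadata declaration and compatibility check could look like, with hypothetical field names:

from pydantic import BaseModel, Field

class ToolDataRequirements(BaseModel):
    # Hypothetical metadata a tool could ship alongside its parameter schema.
    required_neurodata_types: list = Field(default_factory=list)
    optional_neurodata_types: list = Field(default_factory=list)

spike_sorting_requirements = ToolDataRequirements(
    required_neurodata_types=["ElectricalSeries"],
)

def tool_is_applicable(reqs: ToolDataRequirements, types_in_file: set) -> bool:
    # A file qualifies if it contains every required neurodata type.
    return all(t in types_in_file for t in reqs.required_neurodata_types)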

I believe that eventually we might also want the tools to define the model of the data they produce, but it's not so clear in my head how that would go.

Tools vs plugins

What would be the advantage of grouping tools in plugins, e.g. SpikeSortingPlugin? Is there something specific about the way they are read by the system, or shared functionality that we're leveraging, that requires them to be grouped? For example, if someone publishes a new sorting algorithm and wants to wrap it up and publish it as a new neurobaas tool, would it need to be included in the SpikeSortingPlugin and, therefore, in the central tools repository?

I think I'm imagining a system where tools can be developed and used independently of one another, although they could definitely be grouped in the frontend by tags, keywords, etc.

magland commented 11 months ago

@luiztauffer I like your idea of distributing tools as containers instead of Python packages. IMO, the package approach is much more convenient during development and testing, for creating quick wrappers around new or existing tools, which is why the current implementation is the way it is. However, I can see there is a big advantage to distributing the plugins as containers, for the reasons you listed. But I feel like there needs to be a way to create and test tools without needing to push to GitHub or prepare a container at each iteration. What I was imagining was to retain the package plugin mechanism, but then provide a recipe for converting a plugin into a container image and publishing it, so that it becomes a more official, distributable tool.

Regarding tools vs plugins, I was roughly following the pattern of other plugin/extension systems, such as VS Code. I like to avoid having something like a JSON spec that needs to be kept in sync with the implementation code and that requires following a particular format which might break over time. The alternative I am using here is to register the tools as part of a Python class. But I'm open to revising this system. It's difficult to settle on a particular design, because there are so many different ways this could be done.

I agree, it will be important to specify which NWB types are appropriate for a given tool. I guess we'll need to develop a spec for how this would be included in the metadata for the tools. I think we should be cautious about allowing an arbitrary Python function to determine this, because that's not easy to access on the front-end. Rather, it should be specified in some metadata, though that of course presents the challenge of how to define it.