Add support for task parameters / attributes

oruebel commented 2 years ago

Motivation

This issue has come up as part of discussions during the NWB User Days 2022 in relation to a use case by @gitelian.

Problem

Behavioral experiments often involve additional parameters, e.g., the size of the reward or the time delays for rewards and other actions. These parameters are often important metadata that facilitate analysis, query, and interpretation of the data. In practice, the parameters can be either static (i.e., defined before the experiment) or dynamic (typically changing on a per-trial basis).

BEADL

After talking with @Michael-Wulf, and if I understand the BEADL XML correctly, these parameters are defined in the XML of the task program as part of the <BeadlArguments>. The parameters appear to be defined as strings in the XML file, but I believe they typically take the form of numeric values, text, or more complex programmatic logic that modifies the parameters automatically between trials.

https://github.com/rly/ndx-beadl/blob/69fa67f3d0021327b41861359851969be2b4444a/docs/tutorial_nwb_userdays_2022/LightChasingTask.xml#L4-L10

The per-trial values of these arguments are then also recorded in the MATLAB file.

Suggested Change

It would be useful to support storage of task parameters as part of the Task type. The number and name of the parameters will depend on the particular task. We could parse the definition of the parameters from the XML file. I could see a few different options to describe this:
1. Option 1 would be to store these parameters as a TaskParametersTable with the columns name and value. I'm not sure what data type the value needs to be. text should work, but it would be nice if we could also represent numeric parameters and possibly other data types.
2. Option 2 would be to store the parameters as datasets in subgroup Task.task_parameter. This would have the advantage that the data type of each dataset can vary for each parameter. The disadvantage is that this potentially creates many small datasets, but I would assume that the number of parameters should be reasonably small.
3. Option 3 would be to store the parameters as attributes of the Task type. This would also allow us to express arbitrary data types and at the same time avoid creating lots of small datasets. Unfortunately, I think we cannot currently express this with the schema language, since attributes have fixed names and no neurodata_type (if I'm not mistaken), so we can't have arbitrary user-defined attributes.

I think either Option 1 or Option 2 would work. Option 1 has the disadvantage that we would be limited to text parameters but it would keep things concise in a table. Option 2 on the other hand would allow us to support arbitrary parameters but has the disadvantage that it potentially results in lots of small datasets and the user could store essentially anything they want as a parameter.

To store the per-trial values of these parameters, I believe we can just store those as user-defined columns on the Trials table. I.e., I believe we probably don't need to extend the schema to store those values, but we should update the parser for the MATLAB file to add those columns to the Trials table.

oruebel commented 2 years ago

During discussion of this issue with @mavaylon1 @rly @Michael-Wulf @oruebel we refined the discussion to the following 2 main options:

Option 1) "store the parameters as datasets in subgroup Task.task_parameter. This would have the advantage that the data type of each dataset can vary for each parameter. The disadvantage is that this potentially creates many small datasets"
Option 2) parse the Beadl xml for the argument definitions and store them (specifically the values as strings) in a Table with columns [name, value, output_type, type], which would be under Tasks. Currently, Beadl doesn't store the actual type, but rather the intended output type. To help us and to help Beadl have a more structured/rigorous definition of types, they would need to add on the actual value type of the definition. Example: <BeadlArgument name="ValveTime" expression="GetValveTime(RewardSize, ValveTime)" type="numeric" />. The value would be stored as a string and so the type would be string but the output_type would be numeric. Having this distinction would allow us to have a validator to check that the value can indeed be converted to the intended type. So in this example, even though the output_type is intended as a numeric, the validator won't check to see if it's convertible. On the other hand, if the user says the value='a', but the type=int, then the validator would catch that error because 'a' can't be converted to an int. An idea that may or may not be useful would be to adjust to_dataframe so that it converts each value to reflect its type.
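
As a rough illustration of Option 2, the sketch below parses argument definitions into rows with the columns [name, value, output_type, type] and runs the kind of validator described above. It is only a sketch under several assumptions: the XML fragment is made up apart from the ValveTime example, the value_type attribute is hypothetical (Beadl does not currently store the actual value type), and the function names and the set of supported type names are placeholders rather than part of ndx-beadl.

```python
import xml.etree.ElementTree as ET

# Made-up fragment; only the ValveTime line mirrors the example above.
# "value_type" is a HYPOTHETICAL attribute that Beadl does not define today.
EXAMPLE_XML = """
<BeadlArguments>
  <BeadlArgument name="ValveTime" expression="GetValveTime(RewardSize, ValveTime)" type="numeric" />
  <BeadlArgument name="RewardSize" expression="0.5" type="numeric" value_type="numeric" />
</BeadlArguments>
"""

# Assumed mapping from type names to Python converters.
CONVERTERS = {"int": int, "numeric": float, "string": str}


def parse_task_arguments(xml_text):
    """Return one dict per <BeadlArgument> with the proposed table columns."""
    root = ET.fromstring(xml_text.strip())
    rows = []
    for arg in root.iter("BeadlArgument"):
        rows.append({
            "name": arg.get("name"),
            "value": arg.get("expression"),     # always kept as a string
            "output_type": arg.get("type"),     # intended type of the generated values
            "type": arg.get("value_type", "string"),  # falls back to string if absent
        })
    return rows


def validate_row(row):
    """Check that the stored string can be converted to its declared type."""
    converter = CONVERTERS.get(row["type"])
    if converter is None:
        return False  # unknown type name
    try:
        converter(row["value"])
    except ValueError:
        return False
    return True


rows = parse_task_arguments(EXAMPLE_XML)
```

With this sketch, the ValveTime row passes validation because its type is string, even though its output_type is numeric, while a row with value 'a' and type int would be rejected.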

Both options would have the data stored as individual argument columns on the TrialsTable.

oruebel commented 2 years ago

parse the Beadl xml for the argument definitions and store them (specifically the values as strings)

I think the important realization here is that these are indeed not actual values but rather expressions that are used during the execution of the task program to then define/create the actual values. As such, these expressions can take many forms, from simple constants to complex functions and programs that generate the values. As described above, because of this, the type of the expression is often not the same as the type of the values an expression generates.

The second key part then is that the actual values are stored in the TrialsTable. For constant expressions this may seem redundant (since this would create columns with constant values), but it is explicit and easy to use. Also, a user can still choose which arguments to record in the TrialsTable, and so it is possible to omit constant columns in the TrialsTable and derive the value from the definition that is stored in the TaskArgumentsTable instead.
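
To make the last point concrete, here is a minimal sketch of how omitted constant columns could be rebuilt from the stored argument definitions. It assumes plain Python dicts stand in for the table rows, handles only numeric constants, and uses made-up trial values; the function name is a placeholder, not an existing API.

```python
def fill_constant_arguments(trials, task_arguments):
    """Return copies of the trial rows with omitted numeric constants added.

    Expressions that are not numeric literals (e.g. function calls) are
    skipped, since they need the task program to be evaluated.
    """
    constants = {}
    for arg in task_arguments:
        try:
            constants[arg["name"]] = float(arg["expression"])
        except ValueError:
            continue  # not a constant expression
    # Values recorded per trial take precedence over the constants.
    return [{**constants, **trial} for trial in trials]


# Made-up example data: RewardSize is constant and not recorded per trial.
task_arguments = [
    {"name": "RewardSize", "expression": "0.5", "output_type": "numeric"},
    {"name": "ValveTime", "expression": "GetValveTime(RewardSize, ValveTime)",
     "output_type": "numeric"},
]
trials = [
    {"start_time": 0.0, "stop_time": 1.0, "ValveTime": 0.12},
    {"start_time": 1.0, "stop_time": 2.0, "ValveTime": 0.15},
]
filled = fill_constant_arguments(trials, task_arguments)
```

The recorded trial rows are left untouched; the filled copies gain a RewardSize entry taken from the definition.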

The value would be stored as a string and so the type would be string but the output_type would be numeric.

Based on the realization that these are expressions (rather than actual values), the notion to store these expressions as strings along with information about the type of the expression and the output_type of the values seems appropriate. With this in mind, Option 2 seems to be a good option.

Currently, Beadl doesn't store the actual type, but rather the intended output type. To help us and to help Beadl have a more structured/rigorous definition of types, they would need to add on the actual value type of the definition.

I agree, that would be a very clear and logical approach.

in a Table with columns [name, value, output_type, type], which would be under Tasks

Based on the discussion above, I think that instead of value, the term definition or expression is probably more appropriate. I think either term is fine, and since Beadl seems to already use the term expression for this, I think we can probably just keep it consistent and use the term expression.
Consistent with this, I would change the term type to expression_type to make it explicit that the value refers to the expression column.
I think the neurodata_type of the table should be something like TaskArgumentsTable (or maybe TaskParametersTable).
Do we also need to allow for an optional output_unit (stored as a string)? In cases where there is no physical unit, the output_unit could be either an empty string or be set to the same value as output_type.

Both options would have the data stored as individual argument columns on the TrialsTable.

Here it would be useful to also have the unit attribute on each column, since many arguments will have physical units (e.g., time delays in seconds or reward amount in milliliters or grams).
Another option would be to make this an AlignedDynamicTable to make it easy to distinguish between columns that i) define arguments of the trial, ii) record outcomes of the trial, and iii) define the trials, etc. This may not be strictly necessary, but I wanted to mention it because I think it could be useful.

oruebel commented 2 years ago

Should the TaskArgumentsTable be a compound Table (i.e., row-based table with fixed columns) or a DynamicTable (i.e., column-based table with support for dynamic addition of columns)? Table has the advantage that it enforces a strict structure and requires fewer datasets but will require extension if one wants to add columns. DynamicTable requires a more complex schema and more datasets but will allow users to add columns without requiring extensions. I personally don't have a strong preference for either. The main question I think is how likely we think it is that users will need to store additional columns.

@rly @mavaylon1
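
To illustrate the trade-off, here is a plain-Python analogy (not the HDMF implementation): a frozen dataclass stands in for a compound, row-based table with fixed columns, and a dict of lists stands in for a column-based DynamicTable. The column names follow the discussion above; the RewardSize row, the output_unit value, and the add_column helper are made up for the example.

```python
from dataclasses import dataclass, fields


# Row-based: the set of columns is fixed by the type definition, so adding
# a column means changing the definition (the analogue of a schema extension).
@dataclass(frozen=True)
class TaskArgumentRow:
    name: str
    expression: str
    expression_type: str
    output_type: str


row_table = [TaskArgumentRow("RewardSize", "0.5", "string", "numeric")]

# Column-based: one list per column, so a user can add a column at runtime.
column_table = {
    "name": ["RewardSize"],
    "expression": ["0.5"],
    "expression_type": ["string"],
    "output_type": ["numeric"],
}


def add_column(table, name, values):
    """Add a column, requiring one value per existing row."""
    n_rows = len(next(iter(table.values())))
    if len(values) != n_rows:
        raise ValueError(f"expected {n_rows} values, got {len(values)}")
    table[name] = list(values)


# A user-defined column needs no change to any type definition.
add_column(column_table, "output_unit", ["mL"])
```

In the row-based form the same addition would require editing TaskArgumentRow itself, which mirrors the need for an extension when a compound table gains a column.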

oruebel commented 2 years ago

@mavaylon1 can this issue be closed?

rly / ndx-structured-behavior