nipy / nipype

Workflows and interfaces for neuroimaging packages
https://nipype.readthedocs.org/en/latest/

Simplified/alternative way to specify models #1359

Open tyarkoni opened 8 years ago

tyarkoni commented 8 years ago

It would be nice to abstract some of the model specification details into a separate class that can handle high-level selection and filtering of regressors.

For example, consider a case where there are event files for 10 different conditions in a directory. A user might want to construct several different models, with different subsets of conditions, or even different subsets of individual events. At present, this preprocessing is left entirely up to the user (at least, I think that's the case--correct me if I've missed something). That is, the user is expected to pass in either a Bunch with the exact names/onsets/durations (leaving them to pre-select and filter the inputs) or the exact set of event files. There's nothing wrong with this, but it ends up being quite verbose if one intends to fit many models, and clarity suffers.

Now consider something like this:

# Suppose there are files for subject 1 - 10, and conditions A - J (i.e., 100 files)
mf = ModelFactory('path/to/all/my/event_files*.txt')
# First model: conditions A, C, and E for all subjects
mod1 = mf.get_model(conditions=['A', 'C', 'E'])
# Second model: all event types for selected subjects
mod2 = mf.get_model(subjects=[1, 5, 8, 9])
# Third model: filter by time, selecting all events with onsets < 100s
mod3 = mf.get_model(query='onset < 100')

In each case, get_model() returns an initialized instance of class SpecifyModel that behaves normally. Nothing much changes internally, but the user gets to specify potentially complex models in just one or two lines of code, instead of having to do extensive preprocessing. This could also be implemented as a single method in modelgen.py that does it all in one shot--i.e., you pass in a list of lists, some Bunches, a pandas DataFrame, or a list of event files, and then various keyword arguments for filtering/selecting. It would return a SpecifyModel instance again.
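To make the idea concrete, here is a minimal sketch of what such a factory could look like. All names (ModelFactory, get_model, the event-dict keys) come from the proposal above and are hypothetical; for brevity this version filters a flat list of event dicts and returns it directly, where a real implementation would read event files and hand the result to SpecifyModel. The string-query syntax would likely come from pandas; a plain predicate stands in for it here.

```python
class ModelFactory(object):
    """Hypothetical sketch: collects event records, returns filtered subsets."""

    def __init__(self, events):
        # `events`: list of dicts with subject/condition/onset/duration keys.
        # A real implementation would also accept a glob of event files.
        self.events = list(events)

    def get_model(self, conditions=None, subjects=None, query=None):
        selected = self.events
        if conditions is not None:
            selected = [e for e in selected if e['condition'] in conditions]
        if subjects is not None:
            selected = [e for e in selected if e['subject'] in subjects]
        if query is not None:
            # Stand-in for a pandas-style string query: a plain predicate.
            selected = [e for e in selected if query(e)]
        # A real get_model() would build and return a SpecifyModel instance.
        return selected


events = [
    {'subject': 1, 'condition': 'A', 'onset': 10.0, 'duration': 2.0},
    {'subject': 1, 'condition': 'C', 'onset': 120.0, 'duration': 2.0},
    {'subject': 5, 'condition': 'B', 'onset': 30.0, 'duration': 2.0},
]
mf = ModelFactory(events)
mod1 = mf.get_model(conditions=['A', 'C'])             # 2 events
mod2 = mf.get_model(subjects=[5])                      # 1 event
mod3 = mf.get_model(query=lambda e: e['onset'] < 100)  # 2 events
```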

If there's agreement that something like this would be useful, I'd be happy to give it a shot--I have a bunch of code that would be pretty easy to adapt.

tyarkoni commented 8 years ago

On looking into this some more, I think a reasonable way to go about this would be to add a new interface called something like SpecifyEvents, which would typically slot in right before SpecifyModel. Right now SpecifyModel handles both (i) the conversion from discrete events to predictors in the design matrix and (ii) the extraction of events from 1/2/3-column event files. The proposal is to refactor the code so there's a clear separation between these two: SpecifyEvents would take over (ii), and SpecifyModel would keep (i).
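As a rough illustration of the split (function names here are invented, not the actual nipype interfaces): one piece parses 1/2/3-column event files into a uniform event table, and a separate piece turns that table into the names/onsets/durations structure that SpecifyModel-style code consumes.

```python
import io


def read_events(fobj, condition):
    """Step (ii): parse a 1/2/3-column event file into event dicts."""
    events = []
    for line in fobj:
        parts = line.split()
        if not parts:
            continue
        onset = float(parts[0])
        duration = float(parts[1]) if len(parts) > 1 else 0.0
        amplitude = float(parts[2]) if len(parts) > 2 else 1.0
        events.append({'condition': condition, 'onset': onset,
                       'duration': duration, 'amplitude': amplitude})
    return events


def events_to_model_info(events):
    """Step (i): group events into the names/onsets/durations structure
    that model-specification code consumes."""
    info = {}
    for e in events:
        c = info.setdefault(e['condition'], {'onsets': [], 'durations': []})
        c['onsets'].append(e['onset'])
        c['durations'].append(e['duration'])
    return info


three_col = io.StringIO(u"10.0 2.0 1.0\n30.5 2.0 1.0\n")
events = read_events(three_col, condition='A')
info = events_to_model_info(events)
```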

Thoughts?

tyarkoni commented 8 years ago

I guess to avoid breaking the API, SpecifyModel could remain untouched--there probably wouldn't be more than a few lines of duplication (e.g., to gen_info()). So this could be pretty painless. Should I work up a PR?

satra commented 8 years ago

sounds good.

mwaskom commented 8 years ago

:+1: this kind of thing is much more annoying than it should be

mwaskom commented 8 years ago

By the way, it's nice to read 3-column files with a different file per run/condition so that people transitioning from FSL etc. don't need to change their experiment code. But a while back I moved my own stuff over to a single file per experiment with run, condition, onset, duration, and value fields, and it's a lot cleaner...
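For the sake of illustration, such a single long-format file could look like this and be consumed with a few lines of stdlib code (the column names come from the comment above; the grouping key and everything else is illustrative -- pandas would make this even shorter):

```python
import csv
import io

# One experiment-wide TSV instead of one file per run/condition.
tsv = io.StringIO(
    u"run\tcondition\tonset\tduration\tvalue\n"
    u"1\tA\t10.0\t2.0\t1.0\n"
    u"1\tB\t30.0\t2.0\t1.0\n"
    u"2\tA\t12.5\t2.0\t1.0\n"
)

# Group rows by (run, condition) -- the per-run/per-condition event lists
# that the one-file-per-condition layout used to encode in filenames.
events = {}
for row in csv.DictReader(tsv, delimiter='\t'):
    key = (int(row['run']), row['condition'])
    events.setdefault(key, []).append(
        (float(row['onset']), float(row['duration']), float(row['value'])))
```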

tyarkoni commented 8 years ago

@mwaskom totally agree. I use a generic "Dataset" class for almost everything. It basically just wraps a single pandas DataFrame and adds a bunch of reading/writing utilities (e.g., for consolidating a bunch of E-Prime files, selecting and renaming columns, etc.). My plan was to adapt that here. (I hope no one objects to using pandas internally--it would be kind of a pain to do a lot of this stuff in pure Python.)

chrisgorgo commented 8 years ago

Adding more interested parties: @jbpoline @bthirion.

On my side, I would only suggest that one of the input types should be a BIDS-style events.tsv file (docs: https://docs.google.com/document/d/1HFUkAEE-pB-angVcYe6pf-fVf4sCpOHKesUvfb8Grc/edit#heading=h.daip42kp5ndz; example: https://github.com/INCF/BIDS-examples/blob/master/ds002/sub-02/func/sub-02_task-deterministicclassification_run-01_events.tsv).

tyarkoni commented 8 years ago

@chrisfilo I think a 4-column format like what @mwaskom describes is consistent with the BIDS spec, right? We can just stipulate that in addition to the mandatory onset and duration columns, columns called amplitude (or value) and condition (or type) will be detected and handled appropriately. I think this could go in modelgen.gen_info() right away, since that already checks for and handles 1/2/3 column formats.
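The column-detection idea could look roughly like this. Onset and duration are treated as required (as in BIDS), while an amplitude-like and a condition-like column are picked up when present; the specific alias lists below are illustrative, not a settled spec (BIDS itself uses trial_type for the condition column).

```python
import csv
import io


def parse_events_tsv(fobj):
    """Parse a BIDS-style events.tsv, detecting optional columns."""
    reader = csv.DictReader(fobj, delimiter='\t')
    cols = reader.fieldnames
    amp_col = next((c for c in ('amplitude', 'value') if c in cols), None)
    cond_col = next((c for c in ('condition', 'trial_type') if c in cols), None)
    events = []
    for row in reader:
        events.append({
            'onset': float(row['onset']),
            'duration': float(row['duration']),
            # Fall back to a unit amplitude / placeholder condition name.
            'amplitude': float(row[amp_col]) if amp_col else 1.0,
            'condition': row[cond_col] if cond_col else 'cond_001',
        })
    return events


tsv = io.StringIO(
    u"onset\tduration\ttrial_type\n"
    u"10.0\t2.0\tA\n"
    u"30.0\t2.0\tB\n"
)
events = parse_events_tsv(tsv)
```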

jbpoline commented 8 years ago

@tyarkoni having a standard term like "amplitude" or "weight" is reasonable; the code constructing the model can then assume that column holds the amplitude when nothing else is specified. But if we go for a file that describes the model, it should be flexible enough to point at any column of a tsv file. In https://github.com/jbpoline/bids2pype there is some very early code that goes through a tentative model-specification json file (assuming a BIDS organization) and produces nipype Bunches or other inputs (this is at an extremely early stage, please be forgiving...). Coming back to the original issue, separating how we get event timings/magnitudes from data sources from the model specification is clearly a good idea to me.
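To illustrate the "point at any column" idea: a JSON model description could name both the conditions to include and the column to read amplitudes from. The schema below is invented for this example and is not the actual bids2pype format.

```python
import json

# Hypothetical model-description snippet: which conditions to model,
# and which event-table column supplies the amplitudes.
spec = json.loads(u"""
{
  "conditions": ["A", "B"],
  "amplitude_column": "weight"
}
""")


def select_events(events, spec):
    """Keep only the conditions the model names, reading amplitudes from
    whatever column the spec designates (default weight of 1.0)."""
    col = spec.get('amplitude_column')
    out = []
    for e in events:
        if e['condition'] in spec['conditions']:
            out.append((e['condition'], e['onset'],
                        e.get(col, 1.0) if col else 1.0))
    return out


events = [
    {'condition': 'A', 'onset': 10.0, 'weight': 0.5},
    {'condition': 'C', 'onset': 20.0, 'weight': 1.0},
]
selected = select_events(events, spec)
```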

bthirion commented 8 years ago

+1 on separating information extraction about the events from model specification. +1 also for relying on pandas: it makes the code much more concise. That's basically what we do in nistats, although we're not yet BIDS-compliant.

tyarkoni commented 8 years ago

Looping @qmac into the discussion.