nipy / nipype

Workflows and interfaces for neuroimaging packages
https://nipype.readthedocs.org/en/latest/

Simplified/alternative way to specify models #1359

Open tyarkoni opened 8 years ago

tyarkoni commented 8 years ago

It would be nice to abstract some of the model specification details into a separate class that can handle high-level selection and filtering of regressors.

For example, consider a case where there are event files for 10 different conditions in a directory. A user might want to construct several different models, with different subsets of conditions, or even different subsets of individual events. At present, this preprocessing is left entirely up to the user (at least, I think that's the case--correct me if I've missed something). That is, the user is expected to pass in either a Bunch with the exact names/onsets/durations (leaving them to pre-select and filter the inputs) or the exact set of event files. There's nothing wrong with this, but it ends up being quite verbose if one intends to fit many models, and clarity suffers.

Now consider something like this:

# Suppose there are files for subject 1 - 10, and conditions A - J (i.e., 100 files)
mf = ModelFactory('path/to/all/my/event_files*.txt')
# First model: conditions A, C, and E for all subjects
mod1 = mf.get_model(conditions=['A', 'C', 'E'])
# Second model: all event types for selected subjects
mod2 = mf.get_model(subjects=[1, 5, 8, 9])
# Third model: filter by time, selecting all events with onsets < 100s
mod3 = mf.get_model(query='onset < 100')

In each case, get_model() returns an initialized instance of class SpecifyModel that behaves normally. Nothing much changes internally, but the user gets to specify potentially complex models in just one or two lines of code, instead of having to do extensive preprocessing. This could also be implemented as a single method in modelgen.py that does it all in one shot--i.e., you pass in a list of lists, some Bunches, a pandas DataFrame, or a list of event files, and then various keyword arguments for filtering/selecting. It would return a SpecifyModel instance again.
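To make the idea concrete, here is a minimal sketch of what such a factory could look like. All names (ModelFactory, get_model, the event-dict keys) come from the proposal above and are hypothetical; for brevity this version filters a flat list of event dicts and returns it directly, where a real implementation would read event files and hand the result to SpecifyModel. The string-query syntax would likely come from pandas; a plain predicate stands in for it here.

```python
class ModelFactory(object):
    """Hypothetical sketch: collects event records, returns filtered subsets."""

    def __init__(self, events):
        # `events`: list of dicts with subject/condition/onset/duration keys.
        # A real implementation would also accept a glob of event files.
        self.events = list(events)

    def get_model(self, conditions=None, subjects=None, query=None):
        selected = self.events
        if conditions is not None:
            selected = [e for e in selected if e['condition'] in conditions]
        if subjects is not None:
            selected = [e for e in selected if e['subject'] in subjects]
        if query is not None:
            # Stand-in for a pandas-style string query: a plain predicate.
            selected = [e for e in selected if query(e)]
        # A real get_model() would build and return a SpecifyModel instance.
        return selected


events = [
    {'subject': 1, 'condition': 'A', 'onset': 10.0, 'duration': 2.0},
    {'subject': 1, 'condition': 'C', 'onset': 120.0, 'duration': 2.0},
    {'subject': 5, 'condition': 'B', 'onset': 30.0, 'duration': 2.0},
]
mf = ModelFactory(events)
mod1 = mf.get_model(conditions=['A', 'C'])             # 2 events
mod2 = mf.get_model(subjects=[5])                      # 1 event
mod3 = mf.get_model(query=lambda e: e['onset'] < 100)  # 2 events
```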

If there's agreement that something like this would be useful, I'd be happy to give it a shot--I have a bunch of code that would be pretty easy to adapt.

tyarkoni commented 8 years ago

On looking into this some more, I think a reasonable way to go about this would be to add a new interface called something like SpecifyEvents, which would typically slot in right before SpecifyModel. Right now SpecifyModel handles both (i) the conversion from discrete events to predictors in the design matrix and (ii) the extraction of events from 1/2/3-column event files. The proposal is to refactor the code so there's a clear separation between these two: SpecifyEvents would take over (ii), and SpecifyModel would keep (i).
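As a rough illustration of the split (function names here are invented, not the actual nipype interfaces): one piece parses 1/2/3-column event files into a uniform event table, and a separate piece turns that table into the names/onsets/durations structure that SpecifyModel-style code consumes.

```python
import io


def read_events(fobj, condition):
    """Step (ii): parse a 1/2/3-column event file into event dicts."""
    events = []
    for line in fobj:
        parts = line.split()
        if not parts:
            continue
        onset = float(parts[0])
        duration = float(parts[1]) if len(parts) > 1 else 0.0
        amplitude = float(parts[2]) if len(parts) > 2 else 1.0
        events.append({'condition': condition, 'onset': onset,
                       'duration': duration, 'amplitude': amplitude})
    return events


def events_to_model_info(events):
    """Step (i): group events into the names/onsets/durations structure
    that model-specification code consumes."""
    info = {}
    for e in events:
        c = info.setdefault(e['condition'], {'onsets': [], 'durations': []})
        c['onsets'].append(e['onset'])
        c['durations'].append(e['duration'])
    return info


three_col = io.StringIO(u"10.0 2.0 1.0\n30.5 2.0 1.0\n")
events = read_events(three_col, condition='A')
info = events_to_model_info(events)
```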

Thoughts?

tyarkoni commented 8 years ago

I guess to avoid breaking the API, SpecifyModel could remain untouched--there probably wouldn't be more than a few lines of duplication (e.g., to gen_info()). So this could be pretty painless. Should I work up a PR?

satra commented 8 years ago

sounds good.

mwaskom commented 8 years ago

:+1: this kind of thing is much more annoying than it should be

mwaskom commented 8 years ago

By the way, it's nice to read 3-column files with a different file per run/condition so that people transitioning from FSL etc. don't need to change their experiment code. But a while back I moved my own stuff over to a single file per experiment with run, condition, onset, duration, and value fields, and it's a lot cleaner...
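For the sake of illustration, such a single long-format file could look like this and be consumed with a few lines of stdlib code (the column names come from the comment above; the grouping key and everything else is illustrative -- pandas would make this even shorter):

```python
import csv
import io

# One experiment-wide TSV instead of one file per run/condition.
tsv = io.StringIO(
    u"run\tcondition\tonset\tduration\tvalue\n"
    u"1\tA\t10.0\t2.0\t1.0\n"
    u"1\tB\t30.0\t2.0\t1.0\n"
    u"2\tA\t12.5\t2.0\t1.0\n"
)

# Group rows by (run, condition) -- the per-run/per-condition event lists
# that the one-file-per-condition layout used to encode in filenames.
events = {}
for row in csv.DictReader(tsv, delimiter='\t'):
    key = (int(row['run']), row['condition'])
    events.setdefault(key, []).append(
        (float(row['onset']), float(row['duration']), float(row['value'])))
```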

tyarkoni commented 8 years ago

@mwaskom totally agree. I use a generic "Dataset" class for almost everything. It basically just wraps a single pandas DataFrame and adds a bunch of reading/writing utilities (e.g., for consolidating a bunch of E-Prime files, selecting and renaming columns, etc.). My plan was to adapt that here. (I hope no one objects to using pandas internally--it would be kind of a pain to do a lot of this stuff in pure Python.)

chrisgorgo commented 8 years ago

Adding more interested parties: @jbpoline @bthirion.

On my side, I would only suggest that one of the input types should be a BIDS-style events.tsv file (docs: https://docs.google.com/document/d/1HFUkAEE-pB-angVcYe6pf-fVf4sCpOHKesUvfb8Grc/edit#heading=h.daip42kp5ndz; example: https://github.com/INCF/BIDS-examples/blob/master/ds002/sub-02/func/sub-02_task-deterministicclassification_run-01_events.tsv).

tyarkoni commented 8 years ago

@chrisfilo I think a 4-column format like what @mwaskom describes is consistent with the BIDS spec, right? We can just stipulate that in addition to the mandatory onset and duration columns, columns called amplitude (or value) and condition (or type) will be detected and handled appropriately. I think this could go in modelgen.gen_info() right away, since that already checks for and handles 1/2/3 column formats.
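The column-detection idea could look roughly like this. Onset and duration are treated as required (as in BIDS), while an amplitude-like and a condition-like column are picked up when present; the specific alias lists below are illustrative, not a settled spec (BIDS itself uses trial_type for the condition column).

```python
import csv
import io


def parse_events_tsv(fobj):
    """Parse a BIDS-style events.tsv, detecting optional columns."""
    reader = csv.DictReader(fobj, delimiter='\t')
    cols = reader.fieldnames
    amp_col = next((c for c in ('amplitude', 'value') if c in cols), None)
    cond_col = next((c for c in ('condition', 'trial_type') if c in cols), None)
    events = []
    for row in reader:
        events.append({
            'onset': float(row['onset']),
            'duration': float(row['duration']),
            # Fall back to a unit amplitude / placeholder condition name.
            'amplitude': float(row[amp_col]) if amp_col else 1.0,
            'condition': row[cond_col] if cond_col else 'cond_001',
        })
    return events


tsv = io.StringIO(
    u"onset\tduration\ttrial_type\n"
    u"10.0\t2.0\tA\n"
    u"30.0\t2.0\tB\n"
)
events = parse_events_tsv(tsv)
```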

jbpoline commented 8 years ago

@tyarkoni having a standard term like "amplitude" or "weight" is reasonable; the code constructing the model can then assume that column holds the amplitude when nothing else is specified. But if we go for a file that describes the model, it should be flexible enough to point at any column of a tsv file. In https://github.com/jbpoline/bids2pype there is some very early code that goes through a tentative model-specification json file (assuming a BIDS organization) and produces nipype Bunches or other inputs (this is at an extremely early stage, please be forgiving...). Coming back to the original issue, separating how we get event timings/magnitudes from data sources from the model specification is clearly a good idea to me.
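To illustrate the "point at any column" idea: a JSON model description could name both the conditions to include and the column to read amplitudes from. The schema below is invented for this example and is not the actual bids2pype format.

```python
import json

# Hypothetical model-description snippet: which conditions to model,
# and which event-table column supplies the amplitudes.
spec = json.loads(u"""
{
  "conditions": ["A", "B"],
  "amplitude_column": "weight"
}
""")


def select_events(events, spec):
    """Keep only the conditions the model names, reading amplitudes from
    whatever column the spec designates (default weight of 1.0)."""
    col = spec.get('amplitude_column')
    out = []
    for e in events:
        if e['condition'] in spec['conditions']:
            out.append((e['condition'], e['onset'],
                        e.get(col, 1.0) if col else 1.0))
    return out


events = [
    {'condition': 'A', 'onset': 10.0, 'weight': 0.5},
    {'condition': 'C', 'onset': 20.0, 'weight': 1.0},
]
selected = select_events(events, spec)
```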

bthirion commented 8 years ago

+1 on separating information extraction about the events from model specification. +1 also for relying on pandas: it makes the code much more concise. That's basically what we do in nistats, although we're not yet BIDS-compliant.

tyarkoni commented 8 years ago

Looping @qmac into the discussion.