Open tyarkoni opened 8 years ago
On looking into this some more, I think a reasonable way to go about this would be to add a new interface called something like SpecifyEvents. This would typically slot in right before SpecifyModel. Right now SpecifyModel handles both (i) the conversion from discrete events to predictors in the design matrix, and (ii) the extraction of events from 1/2/3-column event files. The proposal is to refactor the code so that there's a clear distinction between these two responsibilities: SpecifyEvents would handle the extraction of events from input files, and SpecifyModel would handle the conversion of those events to design-matrix predictors.
Thoughts?
I guess to avoid breaking the API, SpecifyModel could remain untouched--there probably wouldn't be more than a few lines of duplication (e.g., in gen_info()). So this could be pretty painless. Should I work up a PR?
sounds good.
:+1: this kind of thing is much more annoying than it should be
By the way, it's nice to read 3-column files with a different file per run/condition so that people who are transitioning from FSL etc. don't need to change their experiment code, but a while back I moved my own stuff over to just needing a single file per experiment with run, condition, onset, duration, and value fields, and it's a lot cleaner...
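To illustrate the single-file layout described above: this is a sketch (column names assumed from the comment, not an established schema) of a long-format events table, plus how the familiar FSL-style 3-column files fall out of it with a groupby:

```python
import pandas as pd

# Hypothetical long-format events table: one file per experiment,
# with run, condition, onset, duration, and value columns.
events = pd.DataFrame({
    "run":       [1, 1, 1, 2],
    "condition": ["face", "house", "face", "house"],
    "onset":     [0.0, 4.0, 12.0, 2.0],
    "duration":  [2.0, 2.0, 2.0, 2.0],
    "value":     [1.0, 1.0, 1.0, 1.0],
})

# Recovering one FSL-style 3-column (onset, duration, value) table
# per run/condition is then a one-liner:
for (run, cond), g in events.groupby(["run", "condition"]):
    three_col = g[["onset", "duration", "value"]]
    # e.g., three_col.to_csv(f"run{run}_{cond}.txt", sep="\t",
    #                        header=False, index=False)
```

The reverse direction (concatenating many 3-column files into one table) is equally mechanical, which is part of why the single-file layout is cleaner to work with.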
@mwaskom totally agree. I use a generic "Dataset" class for almost everything. It basically just wraps a single pandas DataFrame and adds a bunch of reading/writing utilities (e.g., for consolidating a bunch of E-Prime files, selecting and renaming columns, etc.). My plan was to adapt that here. (I hope no one objects to using pandas internally--it would be kind of a pain to do a lot of this stuff in pure Python.)
Adding more interested parties: @jbpoline @bthirion.
On my side, I would only suggest that one of the input types should be BIDS style events.tsv file (docs: https://docs.google.com/document/d/1HFUkAEE-pB-angVcYe6pf-fVf4sCpOHKesUvfb8Grc/edit#heading=h.daip42kp5ndz example: https://github.com/INCF/BIDS-examples/blob/master/ds002/sub-02/func/sub-02_task-deterministicclassification_run-01_events.tsv)
@chrisfilo I think a 4-column format like what @mwaskom describes is consistent with the BIDS spec, right? We can just stipulate that in addition to the mandatory onset and duration columns, columns called amplitude (or value) and condition (or type) will be detected and handled appropriately. I think this could go in modelgen.gen_info() right away, since that already checks for and handles 1/2/3 column formats.
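A sketch of the column detection being proposed (this is not nipype's actual gen_info(); the alias names like condition/trial_type and amplitude/value are assumptions from the discussion above):

```python
import io
import pandas as pd

# A minimal BIDS-style events.tsv, inlined here for illustration.
tsv = io.StringIO(
    "onset\tduration\ttrial_type\n"
    "0.0\t2.0\tface\n"
    "4.0\t2.0\thouse\n"
)
df = pd.read_csv(tsv, sep="\t")

# onset and duration are mandatory; everything else is optional.
assert {"onset", "duration"} <= set(df.columns)

# Detect the optional condition and amplitude columns by alias,
# falling back to sensible defaults when they're absent.
condition = next((c for c in ("condition", "trial_type") if c in df.columns), None)
amplitude = next((c for c in ("amplitude", "value") if c in df.columns), None)

cond_names = df[condition].unique().tolist() if condition else ["all"]
weights = df[amplitude] if amplitude else pd.Series(1.0, index=df.index)
```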
@tyarkoni having a standard term like "amplitude" or "weight" is reasonable, in which case the code constructing the model can assume that this is the amplitude when not specified. But if we go for a file that describes the model, this should be flexible enough to specify any column of a tsv file. In https://github.com/jbpoline/bids2pype there is some very early code to go through a tentative model specification json file that assumes a BIDS organization to some nipype Bunch or other inputs (this is in an extremely early stage, please be forgiving...). Coming back to the original issue, separating how we get event timing/magnitude from data sources from the model specification is clearly a good idea to me.
+1 on separating information extraction about the events from model specification. +1 also for relying on pandas: it makes the code much more concise. That's basically what we do in nistats, although we're not yet BIDS-compliant.
Looping @qmac into the discussion.
It would be nice to abstract some of the model specification details into a separate class that can handle high-level selection and filtering of regressors.
For example, consider a case where there are event files for 10 different conditions in a directory. A user might want to construct several different models, with different subsets of conditions, or even different subsets of individual events. At present, this preprocessing is left entirely up to the user (at least, I think that's the case--correct me if I've missed something). I.e., the user is expected to pass in either a Bunch of the exact names/onsets/durations (leaving them to pre-select and filter the inputs), or the exact set of event files. There's nothing wrong with this, but it can end up being quite verbose when one intends to fit many models, and it reduces clarity.
Now consider something like this:
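(The original snippet here seems to be missing; the following is a purely illustrative sketch of the kind of interface being proposed. ModelSpecifier, get_model(), and the keyword arguments are all hypothetical names, and for brevity get_model() returns the filtered events rather than a SpecifyModel instance.)

```python
import pandas as pd

class ModelSpecifier:
    """Hypothetical wrapper around all available events for an experiment."""

    def __init__(self, events):
        self.events = events  # long-format DataFrame of every event

    def get_model(self, conditions=None, query=None):
        """Select a subset of conditions and/or filter individual events."""
        ev = self.events
        if conditions is not None:
            ev = ev[ev["condition"].isin(conditions)]
        if query is not None:
            ev = ev.query(query)
        # In the actual proposal this would return an initialized
        # SpecifyModel instance rather than a DataFrame.
        return ev

events = pd.DataFrame({
    "condition": ["face", "house", "face", "tool"],
    "onset":     [0.0, 4.0, 8.0, 12.0],
    "duration":  [2.0, 2.0, 2.0, 2.0],
    "rt":        [0.41, 0.52, 0.39, 0.61],
})
spec = ModelSpecifier(events)

# Model 1: only two of the conditions.
m1 = spec.get_model(conditions=["face", "house"])
# Model 2: same conditions, but only fast-response trials.
m2 = spec.get_model(conditions=["face", "house"], query="rt < 0.5")
```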
In each case, get_model() returns an initialized instance of class SpecifyModel that behaves normally. Nothing much changes internally, but the user gets to specify potentially complex models in just one or two lines of code, instead of having to do extensive preprocessing. This could also be implemented as a single method in modelgen.py that does it all in one shot--i.e., you pass in a list of lists, some Bunches, a pandas DataFrame, or a list of event files, and then various keyword arguments for filtering/selecting. It would return a SpecifyModel instance again.
If there's agreement that something like this would be useful, I'd be happy to give it a shot--I have a bunch of code that would be pretty easy to adapt.