Closed. tsalo closed this issue 4 years ago.
It seems like the best place to do this is `Dataset.get`, since that's where data are identified and compiled. The question, I think, is whether to explicitly describe the acceptable data and transformations on a case-by-case basis in `_required_inputs`, or to automatically identify them elsewhere (e.g., in `Dataset.get` and/or a new `transformations` module) and to assume that the same rules apply to all tools.
I'm pretty strongly against any changes to data being made at the `Dataset` level that weren't explicitly and specifically invoked by the user. If a user wants to explicitly convert from t to z via a function call, fine. But anything that involves automatic detection/remediation of missing data types should IMO live only in the `Estimator` (and `MetaResult` if needed). I don't think it's the `Dataset`'s responsibility to know what transformations are possible or to predict what inputs an `Estimator` might need.
I also think we should restrict transformations to only simple, deterministic operations where there's no ambiguity. The examples in the list above seem fine, but for anything more complex than that, I think it becomes the user's job to make sure they have the necessary data.
I think that's reasonable, although I'll have to change how we're doing things now to make it work. Currently, `Dataset.get` grabs the relevant data and reduces based on data availability. When calling it, we end up with a dictionary with keys for each of the requested data types and for IDs. Each value is a list of the data/IDs.

To implement transformations in the Estimator, I think I have to disable the reduction implemented in `Dataset.get`, so that the returned dictionary has `None` when a given contrast doesn't have the requested data. It will also need to grab all transformable data for each requested data type, in order to fill in the gaps, when possible, within the Estimator. I think that means we'll also need to describe the transformations in the `_required_inputs` attribute, which will mean making that dictionary more... expressive, I guess. I'm not really sure how to do that, but maybe you could weigh in on the construction of the `_required_inputs` dictionaries once we get to that step.
What do you think of this:
```python
_required_inputs = {
    'z_maps': (
        ('image', 'z'),
        {
            'transform': 't_to_z',
            'inputs': {
                't_maps': ('image', 't'),
                'sample_sizes': {
                    'transform': 'mean',
                    'inputs': {'data': ('metadata', 'sample_size')},
                },
            },
        },
    ),
}
```
The idea would be to first build a list of all of the possible inputs. In this case those would be Z maps, T maps, and sample sizes. We then use `Dataset.get` to return a full list of all inputs, with `None` for locations associated with missing data for individual inputs.

Then, we move down the list of inputs based on priority. Z maps are better than converted T maps, etc. If an experiment has a T map and sample size available, we apply the associated transform in the dictionary (`t_to_z`) to those inputs to fill in the required input (`z_maps`).
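The fill-in pass described above could be sketched roughly like this. All names here are hypothetical, not NiMARE's actual API, and the `t_to_z` entry is a stand-in for a real conversion:

```python
# Illustrative sketch of walking a _required_inputs-style spec in
# priority order and filling gaps. The transform registry and function
# names are hypothetical; 't_to_z' here is a stand-in implementation.
TRANSFORMS = {'t_to_z': lambda t, n: t * 1.0}  # placeholder, not a real conversion

def fill_required_input(z_maps, t_maps, sample_sizes):
    """For each experiment, prefer a native Z map; otherwise apply the
    fallback transform if its inputs are present; otherwise leave None
    (the experiment would be dropped)."""
    filled = []
    for z, t, n in zip(z_maps, t_maps, sample_sizes):
        if z is not None:
            filled.append(z)
        elif t is not None and n is not None:
            filled.append(TRANSFORMS['t_to_z'](t, n))
        else:
            filled.append(None)
    return filled
```

The same loop generalizes to any required input with a transform entry in the spec: check for the native data first, then walk the `inputs` of the transform recursively.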
The logic is pretty straightforward, but I don't know if the actual structure of `_required_inputs` is particularly good or if there is a cleaner way of specifying it.
This may be better handled in PyMARE, as proposed in https://github.com/neurostuff/PyMARE/issues/12.
PyMARE will have an "effect-size" module with a general conversion function. We can inspect the PyMARE ESMA functions to get their inputs, and then provide those required inputs and all available data for each study in the NiMARE Dataset to the conversion function. That will return the required inputs for each study that has some version of them, and the final set of studies with the required data can be fed into the PyMARE ESMA function.
The inspection, conversion, and data update should all happen within nimare.Dataset.get().
My current plan, in more detail:

1. Use `inspect.getargspec` to get a list of arguments for the PyMARE Estimator.
2. Map between PyMARE argument names and NiMARE input names (e.g., `con` to `y`). Thus we have a list of NiMARE inputs required for the PyMARE Estimator.
3. Search for those inputs among NiMARE's data types (`metadata`, `images`, and `coordinates`), which are all located in separate DataFrames.
4. There can be, e.g., study-level Z-statistic maps under `images` and peak-specific Z-statistics under `coordinates`. Both will be labeled "z", which means that we need to automatically look for "z" under `images` for IBMAs and "z" under `coordinates` for CBMAs.

I'm thinking about this some more, and I'm no longer sure how much sense it makes to use PyMARE's `effectsize` module for data conversion in NiMARE:
- `compute_measure` is vectorized, but it can only vectorize operations involving identical inputs. In the NiMARE case, we'd have to loop over each study individually and try to solve for whatever quantities are available, and the solving time in sympy could start to get expensive (as opposed to just a bit of Python logic).

I think the upshot is it probably makes more sense to stick with a standalone Python function that takes as arguments (a) the data for a single Study, (b) the intended mask, and (c) the target output measure(s), and returns arrays of the desired measures if they can be computed, or `None` if they can't.
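A minimal sketch of that standalone function, where the signature, measure names, and conversion rules are all hypothetical (the mask argument is omitted for brevity, since scalars stand in for maps here):

```python
import math

def convert_study_measures(study_data, targets):
    """Hypothetical per-study converter: given a dict of available
    measures for one study and a tuple of target measure names, return
    a dict mapping each target to its value, or None when the target
    cannot be derived from what's available."""
    out = {}
    for target in targets:
        if target in study_data:
            # Already present; no conversion needed.
            out[target] = study_data[target]
        elif target == 'se' and {'sd', 'n'} <= study_data.keys():
            # Standard error from SD and sample size.
            out[target] = study_data['sd'] / math.sqrt(study_data['n'])
        elif target == 'varcope' and 'se' in study_data:
            # Squared standard error.
            out[target] = study_data['se'] ** 2
        else:
            out[target] = None
    return out
```

Because each study is handled independently, this is just a bit of Python logic per study rather than a sympy solve.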
I'm also still not sold on doing this filtering/conversion magically inside `get()`; what's the downside in doing this in two separate steps? I.e., in Step 1, you filter on whatever conceptually meaningful variables you want to; in Step 2, you explicitly call a function that takes both the `Dataset` and some constraint (either an `Estimator` instance, or a tuple of desired output metrics that must be computable), and returns a new `Dataset` that contains only usable data. I like that way of doing things because it makes the distinction between data and estimation very clear to the user, and highlights the fact that feeding a single `Dataset` to different estimators could potentially require dropping different sets of studies.
Note also that the identification of suitable studies for a given estimator can be separated from the actual data transformation. E.g., if you know that a study has `sd` and `n` images, you know that you can compute `stderror` images, but you don't have to actually apply the transformation if you don't want to. (And I think the user should be able to call these things separately—i.e., filtering a `Dataset` based on whether it can be passed to an IBMA RFX estimator should be a separate call from actually applying the conversions.)
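A minimal sketch of that separation, with hypothetical names, where the filtering step only checks derivability and never touches the data:

```python
def can_compute(available, required):
    """True if every required measure is present or derivable.
    The derivation rules here are illustrative (sd + n -> stderror,
    stderror -> varcope), not an exhaustive set."""
    derivable = set(available)
    if {'sd', 'n'} <= derivable:
        derivable.add('stderror')
    if 'stderror' in derivable:
        derivable.add('varcope')
    return set(required) <= derivable

def filter_studies(studies, required):
    """Step 1: keep only studies usable by the estimator.
    No transformation is applied; that is a separate, explicit call."""
    return {sid: measures for sid, measures in studies.items()
            if can_compute(measures, required)}
```

Different `required` tuples (i.e., different estimators) would keep different subsets of the same input, which makes the data/estimation distinction explicit.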
It might also be worth thinking through the most common use cases before we start implementing anything. In terms of the estimators PyMARE currently supports, there are really only 4 possibilities I can think of, in descending order of occurrence:
4. `SampleSizeBasedLikelihoodEstimator`.

But... (4) is only practical if you're dealing with ROIs, because estimation is iterative and can't be easily vectorized. And (3) is unlikely to be useful in practice, because without sampling variances, or at least SDs, to weight or normalize by, different studies are likely to have maps on dramatically different scales, and I'm skeptical that (3) will produce sane results in most cases.
In practice, then, the main cases we need to worry about are (1) and (2). (2) is pretty trivial: the plausible alternative inputs are z and t. (1) is only slightly more complicated; in practice, we'll mostly be dealing with conversions between variance, SD, sampling variance, and standard error, and those are easy.
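For case (2), a standard way to convert t to z goes through two-tailed p-values. A sketch, assuming the degrees of freedom are known (a production version would need to handle extreme statistics more carefully to avoid underflow in the tails):

```python
import numpy as np
from scipy import stats

def t_to_z(t_values, dof):
    """Convert t-statistics to z-statistics via two-tailed p-values.
    Assumes dof is known (e.g., n - 1 for a one-sample test)."""
    t_values = np.asarray(t_values, dtype=float)
    p = 2 * stats.t.sf(np.abs(t_values), dof)        # two-tailed p-value
    z = stats.norm.isf(p / 2) * np.sign(t_values)    # back to signed z
    return z
```

As dof grows, the t distribution approaches the normal, so the converted z approaches the original t; at low dof the z-statistic is shrunk toward zero relative to t.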
Am I missing other common scenarios? There's the 2-sample scenario, but that part at least is easy enough to use PyMARE for (as long as we can get a `y`, `sd`, and/or `n` for each group separately, I'm comfortable using PyMARE's `compute_measure` to extract `y` and `v` for the comparison).
I don't follow all the nuances with `get()`, but agree that the SymPy functionality was highly over-engineered :-) There are only a small number of mappings you'll ever need to do and we can enumerate them now easily.
On your list, I'd revise 2 to be:
This is based on this reference (the introduction, not the fancy method proposed):
Zaykin, D. V. (2011). Optimally weighted Z-test is a powerful method for combining probabilities in meta-analysis. Journal of Evolutionary Biology, 24(8), 1836–1841. https://doi.org/10.1111/j.1420-9101.2011.02297.x
I agree (3) can go (you don't need specialised meta-analysis software for modelling beta images) and (4) is very niche.
We discussed that, among the wide world of meta-analyses, we could consider more (e.g. correlation) but basically don't want to go there.
I've added a transforms module and am going to push a function that performs one of a small set of transforms on image files within a dataset soon. It will write out those files with the name format `[study_id]-[exp_id]_[resolution]_[datatype].nii.gz`.
The conversions I'm going to support for now are:
Does that work?
Hi @tsalo: I think we need sharper terms for the variance, b/c on the face of it:
To me, there is really only one (mostly) unambiguous term: Standard error. This is the square root of the variance of the estimator (of a mean).
The other pretty unambiguous term is: Population variance. The variance of a set of samples drawn from a population (possibly after adjustment for some factors/covariates, at which point it might be known as residual variance).
Unadorned "variance" is just asking for trouble. That's why I have reluctantly embraced FSL's `varcope`, as it is unambiguously the squared standard error and not the population variance.
So, if we're not going to use `varcope`, what about using `se2`... it's not elegant, but it will never be confused for a population variance. And for population variance we could just use `var`, but better would be `popvar` or `pvar`, which will then never be confused with a squared standard error.
So! ANYWAY, I agree with your limited set of transformations, but I'd just put them as:
Though... do we need the last two?
Thank you again. Sorry I'm having such a hard time with the different kinds of variance!
Okay, so we have varcope == se2 == sampling variance (which I was calling var). We also have standard error (SE), population variance (popvar), and sample variance (samplevar). I guess I should catalogue types by software.
Given the above, I'm happy to limit variance-related transforms to "SE to varcope" and "samplevar to varcope" for now, so we can drop the last two transforms in your comment.
As for terminology, I'm sure that it will solidify around what NIMADS uses once we start work on #218, but I'm fine changing to `varcope` for the current version. At least it will be easy to reference back to the FSL definition for anyone who gets confused.
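Under the `varcope` naming, the two retained transforms are both one-liners. A sketch (function names are hypothetical, not the transforms module's actual API):

```python
def se_to_varcope(se):
    """SE to varcope: varcope is the squared standard error (FSL's sense)."""
    return se ** 2

def samplevar_dataset_to_varcope(samplevar, n):
    """Sample variance of the dataset to varcope: dividing by n gives
    the sampling variance of the mean, i.e., the squared standard error."""
    return samplevar / n
```

The second function encodes exactly the `np.var(X)` vs. `np.var(X)/len(X)` distinction discussed below.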
Yes... and, to be clear, "sample variance" and "sampling variance" are equivalent and unfortunately ambiguous (to me). The "sample variance" could be the "sample variance of a dataset", e.g. like `np.var(X)`, or it could be the "sample variance of an estimator", `np.var(X)/len(X)`.
Also, for reference: NIDM has nidm:'Contrast Standard Error Map', which references STATO's standard error of a contrast estimate.
AFNI does the cute thing of only carrying around T stats and contrast estimates, reverse engineering standard error of a contrast estimate as Est/Tstat.
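That reverse-engineering step is simple. A sketch (hypothetical function name), guarding against zero t-statistics:

```python
import numpy as np

def se_from_est_and_t(est, tstat):
    """Recover the standard error of a contrast estimate from the
    estimate and its t-statistic: SE = Est / Tstat.
    Zero t-statistics map to NaN rather than raising."""
    est = np.asarray(est, dtype=float)
    tstat = np.asarray(tstat, dtype=float)
    with np.errstate(divide='ignore', invalid='ignore'):
        se = np.where(tstat != 0, est / tstat, np.nan)
    return se
```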
nistats's "sd" is clear within the context of the T function, but not necessarily in general.
> Given the above, I'm happy to limit variance-related transforms to "SE to varcope" and "samplevar to varcope" for now, so we can drop the last two transforms in your comment.
Sounds good. So you want to commit to `varcope`? I'm not fussed but want to confirm... OK by me.
Thanks!
> Yes... and, to be clear "sample variance" and "sampling variance" are equivalent and unfortunately ambiguous (to me). The "sample variance" could be the "sample variance of a dataset", e.g. like `np.var(X)` or it could be "sample variance of an estimator", `np.var(X)/len(X)`
I think I might hate statistics... or at least whoever came up with the terminology. But okay, I'll change `samplevar` to `samplevar_dataset` in the code to clarify that.
> AFNI does the cute thing of only carrying around T stats and contrast estimates, reverse engineering standard error of a contrast estimate as Est/Tstat.
Okay, that's good at least. So I should expect standard error from AFNI models, and that's already covered by NIDM-Results.
> So you want to commit to varcope? I'm not fussed but want to confirm... OK by me.
Yes, I'd like to commit to varcope, although I'm not going to push for any particular terminology in NIMADS, so it may change when you and @tyarkoni and anyone else who's contributing to NIMADS finalize the standard.
As part of `Estimator._validate_input`, we should allow transformations and combinations of data. Some transformations to include: