neurostuff / NiMARE

Coordinate- and image-based meta-analysis in Python
https://nimare.readthedocs.io
MIT License

Add Annotator Class #488

Open jdkent opened 3 years ago

jdkent commented 3 years ago

Summary

Some variables associated with a study are always the same, such as sample_size. Other variables, however, may depend on the dataset the study is part of, like results from topic modeling.

Annotations should be separate from the Dataset object to help keep Datasets immutable.
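To make the idea concrete, here is a minimal sketch of what that separation could look like. The class and attribute names are hypothetical and not part of the current NiMARE API; only Dataset.ids is assumed to exist.

import pandas as pd


class Annotation:
    """Dataset-specific labels, stored separately from the Dataset itself."""

    def __init__(self, labels):
        # Rows indexed by study ID, columns are labels (e.g., topic weights).
        self.labels = labels


class Annotator:
    """Derives an Annotation from a Dataset without modifying the Dataset."""

    def fit(self, dataset):
        # Placeholder for the real work (e.g., topic modeling on dataset texts).
        labels = pd.DataFrame(index=dataset.ids)
        return Annotation(labels)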

Additional details

Next steps

(The definition of the Annotation objects will eventually live in the neurostore API client library, so that Annotations could then be pushed to and pulled from neurostore.)

tsalo commented 3 years ago

This may be related to #248.

EDIT: To clarify: I'm not arguing against an Annotator class. I just want to take some time to think more about it before I weigh in, and there may be useful info in that old issue.

tsalo commented 3 years ago

The Annotator class would take in a Dataset and output an Annotation object (or just a DataFrame?).

I think a new class would be necessary, since some Annotators (especially GCLDA) would produce multiple outputs. In GCLDA's case, we would have arrays (or DataFrames) of (1) probability of term given topic, (2) probability of topic given study, and (3) probability of voxel given topic.
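As an illustration only (the class and attribute names below are hypothetical), a GCLDA-style Annotation would need to bundle all three arrays:

from dataclasses import dataclass

import numpy as np


@dataclass
class GCLDAAnnotation:
    p_term_g_topic: np.ndarray   # (n_terms, n_topics): probability of term given topic
    p_topic_g_study: np.ndarray  # (n_studies, n_topics): probability of topic given study
    p_voxel_g_topic: np.ndarray  # (n_voxels, n_topics): probability of voxel given topic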

The plan in #248 was to have Annotators act like Transformers; i.e., the Annotator would take in a Dataset and return a Dataset with an updated annotations attribute. Unfortunately, that suffers from the same limitation as just returning a DataFrame.
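A rough sketch of that Transformer-style pattern (hypothetical code, with a dummy stand-in for the real topic model) shows the limitation: the result still has to fit into the single Dataset.annotations DataFrame, leaving no place for topic-term or topic-voxel arrays.

import copy

import pandas as pd


class TransformerStyleAnnotator:
    def transform(self, dataset):
        # Dummy stand-in for real topic modeling: per-study topic weights.
        topic_weights = pd.DataFrame(
            0.0,
            index=dataset.annotations.index,
            columns=["LDA_topic_000", "LDA_topic_001"],
        )
        new_dataset = copy.deepcopy(dataset)  # leave the input Dataset untouched
        new_dataset.annotations = dataset.annotations.join(topic_weights)
        return new_dataset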

tsalo commented 2 years ago

With #606 I have a strong motivator to write an Annotator class, but I'm still not sure how we should incorporate topic-word (LDA & GCLDA) and topic-voxel (GCLDA) arrays into the Dataset, or what the alternative Annotation class would look like. @jdkent, any thoughts? Has the neurostore team done anything with Annotations in the API?

tsalo commented 2 years ago

Per discussion on Slack, we could just stick these distributions (whether as arrays or DataFrames) into an attribute that we don't work too hard to standardize. It will basically mean that we assume no tools except NiMARE will use these "extra" distributions.

Something like

pprint(Annotator.distributions_)
{
    "p_topic_g_token": numpy.ndarray,
    "p_topic_g_token_df": pandas.DataFrame
}
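For example (hypothetical names, just to illustrate the idea), a fitted topic-model Annotator could be used like this:

# Downstream NiMARE code grabs whatever it needs from the attribute,
# with no cross-tool standardization of these "extra" outputs.
annotator = SomeTopicAnnotator(n_topics=50)   # hypothetical class
annotator.fit(dataset)
p_topic_g_token = annotator.distributions_["p_topic_g_token"]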

WDYT?