neurostuff / NiMARE

Coordinate- and image-based meta-analysis in Python
https://nimare.readthedocs.io
MIT License

Add Annotator Class #488

Open jdkent opened 3 years ago

jdkent commented 3 years ago

Summary

Some variables associated with a study are always the same, such as sample_size. Other variables, however, may depend on the dataset the study is part of, like results from topic modeling.

Annotations should be separate from the Dataset object to help keep Datasets immutable.
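To make the idea concrete, here is a minimal sketch of what that separation could look like. The class and attribute names are hypothetical and not part of the current NiMARE API; only Dataset.ids is assumed to exist.

import pandas as pd


class Annotation:
    """Dataset-specific labels, stored separately from the Dataset itself."""

    def __init__(self, labels):
        # Rows indexed by study ID, columns are labels (e.g., topic weights).
        self.labels = labels


class Annotator:
    """Derives an Annotation from a Dataset without modifying the Dataset."""

    def fit(self, dataset):
        # Placeholder for the real work (e.g., topic modeling on dataset texts).
        labels = pd.DataFrame(index=dataset.ids)
        return Annotation(labels)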

Additional details

Next steps

(The definition of the Annotation objects will eventually live in the neurostore API client library, so that Annotations could then be pushed to and pulled from neurostore.)

tsalo commented 3 years ago

This may be related to #248.

EDIT: To clarify: I'm not arguing against an Annotator class. I just want to take some time to think more about it before I weigh in, and there may be useful info in that old issue.

tsalo commented 3 years ago

The Annotator class would take in a Dataset and output an Annotation object (or just a DataFrame?).

I think a new class would be necessary, since some Annotators (especially GCLDA) would produce multiple outputs. In GCLDA's case, we would have arrays (or DataFrames) of (1) probability of term given topic, (2) probability of topic given study, and (3) probability of voxel given topic.
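As an illustration only (the class and attribute names below are hypothetical), a GCLDA-style Annotation would need to bundle all three arrays:

from dataclasses import dataclass

import numpy as np


@dataclass
class GCLDAAnnotation:
    p_term_g_topic: np.ndarray   # (n_terms, n_topics): probability of term given topic
    p_topic_g_study: np.ndarray  # (n_studies, n_topics): probability of topic given study
    p_voxel_g_topic: np.ndarray  # (n_voxels, n_topics): probability of voxel given topic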

The plan in #248 was to have Annotators act like Transformers; i.e., the Annotator would take in a Dataset and return a Dataset with an updated annotations attribute. Unfortunately, that suffers from the same limitation as just returning a DataFrame.
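A rough sketch of that Transformer-style pattern (hypothetical code, with a dummy stand-in for the real topic model) shows the limitation: the result still has to fit into the single Dataset.annotations DataFrame, leaving no place for topic-term or topic-voxel arrays.

import copy

import pandas as pd


class TransformerStyleAnnotator:
    def transform(self, dataset):
        # Dummy stand-in for real topic modeling: per-study topic weights.
        topic_weights = pd.DataFrame(
            0.0,
            index=dataset.annotations.index,
            columns=["LDA_topic_000", "LDA_topic_001"],
        )
        new_dataset = copy.deepcopy(dataset)  # leave the input Dataset untouched
        new_dataset.annotations = dataset.annotations.join(topic_weights)
        return new_dataset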

tsalo commented 2 years ago

With #606 I have a strong motivator to write an Annotator class, but I'm still not sure how we should incorporate topic-word (LDA & GCLDA) and topic-voxel (GCLDA) arrays into the Dataset, or what the alternative Annotation class would look like. @jdkent, any thoughts? Has the neurostore team done anything with Annotations in the API?

tsalo commented 2 years ago

Per discussion on Slack, we could just stick these distributions (whether as arrays or DataFrames) into an attribute that we don't work too hard to standardize. It will basically mean that we assume no tools except NiMARE will use these "extra" distributions.

Something like

pprint(Annotator.distributions_)
{
    "p_topic_g_token": numpy.ndarray,
    "p_topic_g_token_df": pandas.DataFrame
}
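For example (hypothetical names, just to illustrate the idea), a fitted topic-model Annotator could be used like this:

# Downstream NiMARE code grabs whatever it needs from the attribute,
# with no cross-tool standardization of these "extra" outputs.
annotator = SomeTopicAnnotator(n_topics=50)   # hypothetical class
annotator.fit(dataset)
p_topic_g_token = annotator.distributions_["p_topic_g_token"]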

WDYT?