neurostuff / NiMARE

Coordinate- and image-based meta-analysis in Python
https://nimare.readthedocs.io
MIT License

Refactor storage of annotations in Datasets #617

Open tsalo opened 2 years ago

tsalo commented 2 years ago

Summary

Datasets currently store annotations as an attribute .annotations, which is a pandas DataFrame with one column for each label. Groups of annotations (e.g., Neurosynth TF-IDF values vs. LDA topics) are distinguished with prefixes in the column names. Per today's call, a better structure would be to have Dataset.annotations be a dictionary containing Annotation objects. The keys to the dictionary would be the annotation group names.
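
A minimal sketch of the proposed structure might look like this (the Annotation class and the group names are illustrative, not settled API):

```python
# Hypothetical sketch: Dataset.annotations as a dict of Annotation objects.
import pandas as pd


class Annotation:
    """One group of labels (e.g., one vocabulary or topic model)."""

    def __init__(self, name, labels):
        self.name = name      # group name, e.g., "Neurosynth_TFIDF" or "LDA200"
        self.labels = labels  # DataFrame: one row per study, one column per label


# Instead of one wide DataFrame with prefixed column names,
# Dataset.annotations would map group names to Annotation objects.
annotations = {
    "Neurosynth_TFIDF": Annotation("Neurosynth_TFIDF", pd.DataFrame()),
    "LDA200": Annotation("LDA200", pd.DataFrame()),
}
```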

Annotation objects will have their own attributes and methods (names are up for debate); several of these (e.g., other_distributions, to_filename) are discussed in the comments below.

Annotators will then operate more like Estimators than Transformers. Namely, they will ingest Datasets and return Annotation objects (as proposed by @jdkent in #488). Users will have to add the Annotations to their Datasets themselves.
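
A rough sketch of that workflow, assuming an Estimator-like transform method and reusing the toy Annotation class from above (none of these names are settled):

```python
# Hypothetical sketch of the Estimator-like Annotator workflow.
import pandas as pd


class Annotation:
    def __init__(self, name, labels):
        self.name = name
        self.labels = labels


class ToyAnnotator:
    """Placeholder Annotator: ingests a Dataset-like object, returns an Annotation."""

    def transform(self, dataset):
        # One (dummy) label value per study/analysis ID in the Dataset.
        labels = pd.DataFrame({"toy_label": 1.0}, index=list(dataset.ids))
        return Annotation("toy", labels)


# Usage under the proposal: the Annotator does not modify the Dataset in place;
# the user attaches the returned Annotation under a group name themselves.
# annotation = ToyAnnotator().transform(dset)
# dset.annotations["toy"] = annotation
```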

Next steps

  1. Create new Annotation class in #607.
  2. Refactor Dataset.annotations and associated conversion functions.
  3. Refactor Dataset search methods and decoding functions/classes.
tsalo commented 2 years ago

I started drafting an Annotation class and realized a couple of things:

  1. If we want topic-voxel arrays for GCLDA, then we need a masker object for mapping from images to arrays and vice versa. Alternatively, I guess we could keep the arrays as niimgs, but that could increase the size of the object and slow down any steps where we use those arrays.
  2. If we want topic-term arrays, then we either need to store them as DataFrames or as numpy arrays with additional attributes containing the index and column names needed to convert back to DataFrames later. (A sketch covering both points follows this list.)
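
Here is a minimal sketch of both points, using a nilearn-style masker for the image/array round trip and pandas for the topic-term weights (the synthetic data, shapes, and names are just for illustration):

```python
# Hypothetical sketch: masker round trip and topic-term storage options.
import nibabel as nib
import numpy as np
import pandas as pd
from nilearn.maskers import NiftiMasker  # nilearn.input_data in older nilearn

# Synthetic mask and "topic-voxel" image, just to make the example self-contained.
affine = np.eye(4)
mask_img = nib.Nifti1Image(np.ones((4, 4, 4), dtype=np.int8), affine)
topic_img = nib.Nifti1Image(np.random.rand(4, 4, 4), affine)

# Point 1: store arrays and keep a masker for image <-> array conversions.
masker = NiftiMasker(mask_img=mask_img).fit()
topic_voxel = masker.transform(topic_img)               # voxelwise values within the mask
recovered_img = masker.inverse_transform(topic_voxel)   # back to a niimg

# Point 2: store topic-term weights as a DataFrame, or as an array plus the
# index/column labels needed to rebuild the DataFrame later.
topic_term_df = pd.DataFrame(
    np.random.rand(2, 3),
    index=["topic_000", "topic_001"],
    columns=["pain", "memory", "emotion"],
)
topic_term_arrays = {
    "values": topic_term_df.to_numpy(),
    "index": topic_term_df.index.tolist(),
    "columns": topic_term_df.columns.tolist(),
}
```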

I started playing around with creating new classes for "image arrays" and "label arrays", but I don't know if that's better or worse in the long run than storing images and DataFrames, respectively. @jdkent do you have an opinion on this?

EDIT: In any case, I'll probably create a new class for the "other_distributions" attribute that will basically be a dictionary that only accepts certain object types as values.
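
Something like the following, using collections.UserDict with a type check (the allowed types are only a guess here):

```python
# Hypothetical sketch of a type-restricted dict for other_distributions.
from collections import UserDict

import numpy as np
import pandas as pd


class DistributionDict(UserDict):
    """Dict that only accepts certain object types as values."""

    _allowed_types = (pd.DataFrame, np.ndarray)  # assumption: arrays or DataFrames

    def __setitem__(self, key, value):
        if not isinstance(value, self._allowed_types):
            raise TypeError(
                f"Value for '{key}' must be one of {self._allowed_types}, "
                f"not {type(value)}."
            )
        super().__setitem__(key, value)


other_distributions = DistributionDict()
other_distributions["p_topic_g_word"] = np.zeros((50, 1000))  # accepted
# other_distributions["notes"] = "free text"  # would raise TypeError
```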

tsalo commented 2 years ago

Also, for the to_filename method, I realized that just saving a TSV/CSV file wouldn't include the metadata we need for topic-based annotations. Should it also save an associated JSON file with that metadata?
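
For example, something along these lines (a sketch only; the filenames and metadata fields are placeholders):

```python
# Hypothetical sketch of to_filename writing a TSV plus a JSON sidecar.
import json


def to_filename(annotation, prefix):
    """Write labels to <prefix>.tsv and topic metadata to <prefix>.json."""
    annotation.labels.to_csv(f"{prefix}.tsv", sep="\t", index_label="id")
    metadata = {
        "name": annotation.name,
        # e.g., topic-term information or other distributions needed to
        # reconstruct topic-based annotations later
        "description": "placeholder metadata for topic-based annotations",
    }
    with open(f"{prefix}.json", "w") as fo:
        json.dump(metadata, fo, indent=4, sort_keys=True)
```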

tsalo commented 2 years ago

I'm starting to worry that storing a Studyset's annotations as a single attribute, with individual analyses linked by IDs across annotations and other attributes, will be a problem for NIMADS integration. Do we instead need to nest Annotations within Analyses within Studies within Studysets? If so, then I don't think we can break the refactor into discrete PRs the way I was planning. @jdkent, WDYT?
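
For reference, the nested alternative might look roughly like this (illustrative only, not the NIMADS spec):

```python
# Hypothetical sketch of the nested structure: Studyset > Study > Analysis > Annotation.
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Annotation:
    name: str
    labels: Dict[str, float]  # e.g., {"LDA200__pain": 0.42}


@dataclass
class Analysis:
    id: str
    annotations: List[Annotation] = field(default_factory=list)


@dataclass
class Study:
    id: str
    analyses: List[Analysis] = field(default_factory=list)


@dataclass
class Studyset:
    id: str
    studies: List[Study] = field(default_factory=list)
```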