neurostuff / NiMARE

Coordinate- and image-based meta-analysis in Python
https://nimare.readthedocs.io
MIT License

Refactor storage of annotations in Datasets #617

Open tsalo opened 2 years ago

tsalo commented 2 years ago

Summary

Datasets currently store annotations as an attribute .annotations, which is a pandas DataFrame with one column for each label. Groups of annotations (e.g., Neurosynth TF-IDF values vs. LDA topics) are distinguished with prefixes in the column names. Per today's call, a better structure would be to have Dataset.annotations be a dictionary containing Annotation objects. The keys to the dictionary would be the annotation group names.
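
A minimal sketch of the proposed structure might look like this (the Annotation class and the group names are illustrative, not settled API):

```python
# Hypothetical sketch: Dataset.annotations as a dict of Annotation objects.
import pandas as pd


class Annotation:
    """One group of labels (e.g., one vocabulary or topic model)."""

    def __init__(self, name, labels):
        self.name = name      # group name, e.g., "Neurosynth_TFIDF" or "LDA200"
        self.labels = labels  # DataFrame: one row per study, one column per label


# Instead of one wide DataFrame with prefixed column names,
# Dataset.annotations would map group names to Annotation objects.
annotations = {
    "Neurosynth_TFIDF": Annotation("Neurosynth_TFIDF", pd.DataFrame()),
    "LDA200": Annotation("LDA200", pd.DataFrame()),
}
```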

Annotation objects will have their own attributes and methods (names are up for debate); several of these (e.g., other_distributions, to_filename) are discussed in the comments below.

Annotators will then operate more like Estimators than Transformers. Namely, they will ingest Datasets and return Annotation objects (as proposed by @jdkent in #488). Users will have to add the Annotations to their Datasets themselves.
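
A rough sketch of that workflow, assuming an Estimator-like transform method and reusing the toy Annotation class from above (none of these names are settled):

```python
# Hypothetical sketch of the Estimator-like Annotator workflow.
import pandas as pd


class Annotation:
    def __init__(self, name, labels):
        self.name = name
        self.labels = labels


class ToyAnnotator:
    """Placeholder Annotator: ingests a Dataset-like object, returns an Annotation."""

    def transform(self, dataset):
        # One (dummy) label value per study/analysis ID in the Dataset.
        labels = pd.DataFrame({"toy_label": 1.0}, index=list(dataset.ids))
        return Annotation("toy", labels)


# Usage under the proposal: the Annotator does not modify the Dataset in place;
# the user attaches the returned Annotation under a group name themselves.
# annotation = ToyAnnotator().transform(dset)
# dset.annotations["toy"] = annotation
```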

Next steps

  1. Create new Annotation class in #607.
  2. Refactor Dataset.annotations and associated conversion functions.
  3. Refactor Dataset search methods and decoding functions/classes.
tsalo commented 2 years ago

I started drafting an Annotation class and realized a couple of things:

  1. If we want topic-voxel arrays for GCLDA, then we need a masker object for mapping from images to arrays and vice versa. Alternatively, I guess we could keep the arrays as niimgs, but that could increase the size of the object and slow down any steps where we use those arrays.
  2. If we want topic-term arrays, then we either need to store them as DataFrames or as numpy arrays with additional attributes containing the index and column names needed to convert back to DataFrames later. (A sketch covering both points follows this list.)
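
Here is a minimal sketch of both points, using a nilearn-style masker for the image/array round trip and pandas for the topic-term weights (the synthetic data, shapes, and names are just for illustration):

```python
# Hypothetical sketch: masker round trip and topic-term storage options.
import nibabel as nib
import numpy as np
import pandas as pd
from nilearn.maskers import NiftiMasker  # nilearn.input_data in older nilearn

# Synthetic mask and "topic-voxel" image, just to make the example self-contained.
affine = np.eye(4)
mask_img = nib.Nifti1Image(np.ones((4, 4, 4), dtype=np.int8), affine)
topic_img = nib.Nifti1Image(np.random.rand(4, 4, 4), affine)

# Point 1: store arrays and keep a masker for image <-> array conversions.
masker = NiftiMasker(mask_img=mask_img).fit()
topic_voxel = masker.transform(topic_img)               # voxelwise values within the mask
recovered_img = masker.inverse_transform(topic_voxel)   # back to a niimg

# Point 2: store topic-term weights as a DataFrame, or as an array plus the
# index/column labels needed to rebuild the DataFrame later.
topic_term_df = pd.DataFrame(
    np.random.rand(2, 3),
    index=["topic_000", "topic_001"],
    columns=["pain", "memory", "emotion"],
)
topic_term_arrays = {
    "values": topic_term_df.to_numpy(),
    "index": topic_term_df.index.tolist(),
    "columns": topic_term_df.columns.tolist(),
}
```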

I started playing around with creating new classes for "image arrays" and "label arrays", but I don't know if that's better or worse in the long run than storing images and DataFrames, respectively. @jdkent do you have an opinion on this?

EDIT: In any case, I'll probably create a new class for the "other_distributions" attribute that will basically be a dictionary that only accepts certain object types as values.
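
Something like the following, using collections.UserDict with a type check (the allowed types are only a guess here):

```python
# Hypothetical sketch of a type-restricted dict for other_distributions.
from collections import UserDict

import numpy as np
import pandas as pd


class DistributionDict(UserDict):
    """Dict that only accepts certain object types as values."""

    _allowed_types = (pd.DataFrame, np.ndarray)  # assumption: arrays or DataFrames

    def __setitem__(self, key, value):
        if not isinstance(value, self._allowed_types):
            raise TypeError(
                f"Value for '{key}' must be one of {self._allowed_types}, "
                f"not {type(value)}."
            )
        super().__setitem__(key, value)


other_distributions = DistributionDict()
other_distributions["p_topic_g_word"] = np.zeros((50, 1000))  # accepted
# other_distributions["notes"] = "free text"  # would raise TypeError
```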

tsalo commented 2 years ago

Also, for the to_filename method, I realized that just saving a TSV/CSV file wouldn't include the metadata we need for topic-based annotations. Should it also save an associated JSON file with that metadata?
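
For example, something along these lines (a sketch only; the filenames and metadata fields are placeholders):

```python
# Hypothetical sketch of to_filename writing a TSV plus a JSON sidecar.
import json


def to_filename(annotation, prefix):
    """Write labels to <prefix>.tsv and topic metadata to <prefix>.json."""
    annotation.labels.to_csv(f"{prefix}.tsv", sep="\t", index_label="id")
    metadata = {
        "name": annotation.name,
        # e.g., topic-term information or other distributions needed to
        # reconstruct topic-based annotations later
        "description": "placeholder metadata for topic-based annotations",
    }
    with open(f"{prefix}.json", "w") as fo:
        json.dump(metadata, fo, indent=4, sort_keys=True)
```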

tsalo commented 2 years ago

I'm starting to worry that storing a Studyset's annotations as a single attribute, with individual analyses linked by IDs across annotations and other attributes, will be a problem for NIMADS integration. Do we instead need to nest Annotations within Analyses within Studies within Studysets? If so, then I don't think we can break the refactor into discrete PRs the way I was planning. @jdkent, WDYT?
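
For reference, the nested alternative might look roughly like this (illustrative only, not the NIMADS spec):

```python
# Hypothetical sketch of the nested structure: Studyset > Study > Analysis > Annotation.
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Annotation:
    name: str
    labels: Dict[str, float]  # e.g., {"LDA200__pain": 0.42}


@dataclass
class Analysis:
    id: str
    annotations: List[Annotation] = field(default_factory=list)


@dataclass
class Study:
    id: str
    analyses: List[Analysis] = field(default_factory=list)


@dataclass
class Studyset:
    id: str
    studies: List[Study] = field(default_factory=list)
```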