Open tsalo opened 2 years ago
I started drafting an Annotation class and realized a couple of things:
I started playing around with creating new classes for "image arrays" and "label arrays", but I don't know if that's better or worse in the long run than storing images and DataFrames, respectively. @jdkent do you have an opinion on this?
EDIT: In any case, I'll probably create a new class for the "other_distributions" attribute that will basically be a dictionary that only accepts certain object types as values.
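A rough sketch of what that type-restricted dictionary could look like. The class name `DistributionDict` and the exact set of allowed types are assumptions for illustration, not anything that exists in the codebase yet:

```python
from collections import UserDict

import numpy as np
import pandas as pd


class DistributionDict(UserDict):
    """Dictionary that only accepts arrays or DataFrames as values.

    Hypothetical class; the name and the allowed types are assumptions.
    """

    # Assumed allowed types, based on the description of other_distributions.
    allowed_types = (np.ndarray, pd.DataFrame)

    def __setitem__(self, key, value):
        # UserDict routes __init__ and update() through __setitem__,
        # so this check covers every way of adding a value.
        if not isinstance(value, self.allowed_types):
            raise TypeError(
                f"Value for '{key}' must be one of "
                f"{[t.__name__ for t in self.allowed_types]}, "
                f"not {type(value).__name__}."
            )
        super().__setitem__(key, value)
```

Subclassing `UserDict` rather than `dict` matters here: a plain `dict` subclass would skip the overridden `__setitem__` in `__init__` and `update()`.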
Also, for the `to_filename` method, I realized that just saving a TSV/CSV file wouldn't include the metadata we need for topic-based annotations. Should it also save an associated JSON file with that metadata?
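For concreteness, here is a minimal sketch of the TSV-plus-JSON option. The function name `save_annotation` and the sidecar naming (same file name with a `.json` suffix) are assumptions, loosely modeled on BIDS sidecars:

```python
import json
from pathlib import Path

import pandas as pd


def save_annotation(weights, metadata, out_file):
    """Write weights to a TSV file and metadata to a JSON file next to it.

    Hypothetical helper; the sidecar naming convention is an assumption.
    """
    out_file = Path(out_file)
    # The index (assumed to hold the analysis IDs) is written as the first column.
    weights.to_csv(out_file, sep="\t", index=True)

    # e.g. "lda.tsv" -> "lda.json"
    sidecar = out_file.with_suffix(".json")
    with open(sidecar, "w") as fo:
        json.dump(metadata, fo, indent=4)

    return out_file, sidecar
```

This only works if everything in the metadata is JSON-serializable, so array-valued distributions would still need their own files.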
I'm starting to worry that storing a Studyset's annotations as a single attribute, with individual analyses linked by IDs across annotations and other attributes, will be a problem for NIMADS integration. Do we instead need to nest Annotations within Analyses within Studies within Studysets? If so, then I don't think we can break the refactor into discrete PRs the way I was planning. @jdkent, WDYT?
Summary
Datasets currently store annotations as an attribute, `.annotations`, which is a pandas DataFrame with one column for each label. Groups of annotations (e.g., Neurosynth TF-IDF values vs. LDA topics) are distinguished with prefixes in the column names. Per today's call, a better structure would be to have `Dataset.annotations` be a dictionary containing `Annotation` objects. The keys to the dictionary would be the annotation group names.

`Annotation` objects will have the following attributes/methods (names are up for debate):

- `.study_term_weights`: A pandas DataFrame containing the annotation group's labels and weights. This is the main attribute of interest for NeuroStore.
- `.other_distributions`: A dictionary containing other distributions (e.g., topic-term weights, topic-voxel weights), whether in numpy.ndarray or pandas.DataFrame form.
- `.metadata`: BIDS-like metadata, possibly with top terms for topic models. Also software info and provenance.
- `.to_filename()`: To save the `study_term_weights` attribute as a CSV file. We discussed having JSON-based storage, but it seems like CSVs would be easier for NeuroStore to work with here.

Annotators will then operate more like Estimators than Transformers. Namely, they will ingest Datasets and return Annotation objects (as proposed by @jdkent in #488). Users will have to add the Annotations to their Datasets themselves.
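To make the proposal concrete, here is a minimal sketch of the `Annotation` container and of converting the current prefixed DataFrame into a dictionary keyed by group name. The attribute names follow the list above; the `split_by_prefix` helper and the `"__"` separator between group prefix and label are assumptions for illustration only:

```python
from dataclasses import dataclass, field

import pandas as pd


@dataclass(eq=False)
class Annotation:
    """Sketch of the proposed container; attribute names are up for debate."""

    study_term_weights: pd.DataFrame
    other_distributions: dict = field(default_factory=dict)
    metadata: dict = field(default_factory=dict)

    def to_filename(self, filename):
        # Only the study-term weights are written, as a CSV file.
        self.study_term_weights.to_csv(filename)


def split_by_prefix(annotations_df, sep="__"):
    """Split a wide DataFrame with prefixed columns into one Annotation per group.

    Hypothetical helper; the separator is an assumption.
    """
    groups = {}
    for column in annotations_df.columns:
        if sep not in column:
            # Columns without a group prefix are skipped.
            continue
        prefix, _, label = column.partition(sep)
        groups.setdefault(prefix, {})[label] = annotations_df[column]
    return {name: Annotation(pd.DataFrame(cols)) for name, cols in groups.items()}


# Toy data standing in for the current Dataset.annotations DataFrame.
wide = pd.DataFrame(
    {
        "tfidf__pain": [0.2, 0.0],
        "tfidf__memory": [0.0, 0.5],
        "lda__topic_0": [0.7, 0.3],
    },
    index=["study1-1", "study2-1"],
)
annotations = split_by_prefix(wide)  # keys are the annotation group names
```

With this layout, `annotations["lda"].study_term_weights` holds only the LDA columns, and each group can carry its own metadata and extra distributions instead of sharing one flat table.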
Next steps