implement zarr-based caching for major classes

aalok-sathe commented 2 years ago

We need reliable state caching for most classes so that results persist to disk for later analysis and reuse in pipelines. If cached results exist, they may be reused depending on a flag (e.g. overwrite_cache=False).
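A minimal sketch of the reuse-unless-overwritten behaviour described above. The helper name `load_or_compute`, its signature, and the use of pickle files are assumptions for illustration only; pickle stands in for the zarr store so that the example runs with the standard library alone.

```python
import pickle
import tempfile
from pathlib import Path


def load_or_compute(cache_dir, key, compute, overwrite_cache=False):
    """Return the cached result for `key`; recompute if missing or if overwriting.

    Hypothetical helper: pickle is a stand-in for a zarr-backed store.
    """
    path = Path(cache_dir) / f"{key}.pkl"
    if path.exists() and not overwrite_cache:
        # Cache hit: reuse what is already on disk.
        with path.open("rb") as f:
            return pickle.load(f)
    # Cache miss (or forced overwrite): compute, then persist.
    result = compute()
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("wb") as f:
        pickle.dump(result, f)
    return result


calls = []  # records how often the expensive step actually runs


def expensive():
    calls.append(1)
    return {"score": 0.5}


cache_dir = tempfile.mkdtemp()
first = load_or_compute(cache_dir, "brainscore", expensive)   # computes
second = load_or_compute(cache_dir, "brainscore", expensive)  # reuses cache
third = load_or_compute(cache_dir, "brainscore", expensive, overwrite_cache=True)  # recomputes
```

With overwrite_cache=False as the default, a pipeline that is run again only pays for steps whose results are not yet on disk.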

aalok-sathe commented 2 years ago

Proposal: make the __repr__ method of each Cacheable class uniquely identify that instance. E.g., repr(BrainScore()) should contain information about the Mapping, the Metric, and the encoders (all of this can come from calls to the respective repr methods of those objects).

below list is in the form:

[ ] Object to repr()
  - entity it depends on
[ ] BrainScore
  - Mapping
  - Metric
  - Encoder1 outputs
  - Encoder2 outputs (should we create a class EncoderOutput, for more logical dependency in cache handling?) @lipkinb @gretatuckute
[ ] Mapping
  - str algorithm
  - hparams? tbd
[ ] Metric
  - str algorithm
[x] EncoderOutput (?)
  - Encoder
  - Dataset
[ ] HFEncoder
  - str algorithm (pretrained_model_name_or_path)
  - str aggregation choices
  - ~Dataset~
[ ] BrainEncoder
  - ~Dataset~
[ ] Dataset
  - str path to the data
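The dependency list above could be sketched roughly as follows: each object's repr is built from the reprs of the things it depends on, and a hash of that string serves as the cache key. The `_identity` and `cache_key` methods, the constructor signatures, and the algorithm names "ridge", "rsa" and "pearsonr" are all hypothetical; encoder outputs are left out to keep the sketch short.

```python
import hashlib


class Cacheable:
    """Hypothetical base class: repr() names the class and its dependencies."""

    def _identity(self):
        # Subclasses return a dict of everything that determines their state.
        raise NotImplementedError

    def __repr__(self):
        parts = ", ".join(f"{k}={v!r}" for k, v in self._identity().items())
        return f"{type(self).__name__}({parts})"

    def cache_key(self):
        # Stable across runs, unlike the built-in hash() of a str.
        return hashlib.sha1(repr(self).encode("utf-8")).hexdigest()


class Mapping(Cacheable):
    def __init__(self, algorithm):
        self.algorithm = algorithm

    def _identity(self):
        return {"algorithm": self.algorithm}


class Metric(Cacheable):
    def __init__(self, algorithm):
        self.algorithm = algorithm

    def _identity(self):
        return {"algorithm": self.algorithm}


class BrainScore(Cacheable):
    def __init__(self, mapping, metric):
        self.mapping = mapping
        self.metric = metric

    def _identity(self):
        # Nested reprs come from the dependencies' own __repr__ methods.
        return {"mapping": self.mapping, "metric": self.metric}


a = BrainScore(Mapping("ridge"), Metric("pearsonr"))
b = BrainScore(Mapping("ridge"), Metric("pearsonr"))
c = BrainScore(Mapping("rsa"), Metric("pearsonr"))
```

Here a and b share a cache key because their reprs are identical, while c gets a different one because its Mapping differs.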

aalok-sathe commented 2 years ago

zarr is unable to cache xarray objects that have dtype object in them. Somehow dtype object is bleeding in from somewhere. Once that is corrected to string, this issue disappears. This issue is referenced here: https://github.com/pydata/xarray/issues/3476 It is partially solved by commits in #34
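A rough sketch of the "correct it to string" step, shown on plain NumPy arrays so it runs without xarray or zarr. The helper name `coerce_object_to_str` and the sample data are made up for illustration. The same `.astype(str)` call is available on an xarray DataArray; where in the pipeline it belongs depends on where the object dtype comes from.

```python
import numpy as np


def coerce_object_to_str(arrays):
    """Return a new dict in which object-dtype arrays are cast to str.

    Hypothetical helper; arrays of any other dtype are passed through unchanged.
    """
    fixed = {}
    for name, arr in arrays.items():
        arr = np.asarray(arr)
        fixed[name] = arr.astype(str) if arr.dtype == object else arr
    return fixed


columns = {
    "stimulus": np.array(["the", "cat", "sat"], dtype=object),  # object dtype
    "response": np.array([0.1, 0.2, 0.3]),                      # float64, left alone
}
fixed = coerce_object_to_str(columns)
```

After the cast, the "stimulus" array has a fixed-width unicode dtype rather than object.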

language-brainscore / langbrainscore

implement zarr-based caching for major classes #28