language-brainscore / langbrainscore

[Marked for deprecation. Please visit https://github.com/brain-score/language for the migrated project.] Benchmarking of Language Models using Human Neural and Behavioral experiment data
https://language-brainscore.github.io/langbrainscore/
MIT License

bug when loading dataset from cache. #35

Open benlipkin opened 2 years ago

benlipkin commented 2 years ago

I tried running examples/test_mean_froi_pereira2018_firstsessions.py.

The first run completes fine, but the second run, which loads the dataset from cache, fails with this error:

Traceback (most recent call last):
  File "/Users/benlipkin/anaconda3/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/Users/benlipkin/anaconda3/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/Users/benlipkin/Desktop/Files/MIT/github/langbrainscore/examples/test_mean_froi_pereira2018_firstsessions.py", line 150, in <module>
    main()
  File "/Users/benlipkin/Desktop/Files/MIT/github/langbrainscore/examples/test_mean_froi_pereira2018_firstsessions.py", line 136, in main
    brsc_rdg_corr.run(sample_split_coord="experiment", calc_nulls=True)
  File "/Users/benlipkin/Desktop/Files/MIT/github/langbrainscore/langbrainscore/brainscore/brainscore.py", line 216, in run
    self.score(
  File "/Users/benlipkin/Desktop/Files/MIT/github/langbrainscore/langbrainscore/brainscore/brainscore.py", line 89, in score
    y_pred, y_true = self.mapping.fit_transform(X, Y, ceiling=ceiling)
  File "/Users/benlipkin/Desktop/Files/MIT/github/langbrainscore/langbrainscore/mapping/mapping.py", line 322, in fit_transform
    for cvfoldid, (train_index, test_index) in enumerate(splits):
  File "/Users/benlipkin/Desktop/Files/MIT/github/langbrainscore/.venv/lib/python3.8/site-packages/sklearn/model_selection/_split.py", line 333, in split
    raise ValueError(
ValueError: Cannot have number of splits n_splits=5 greater than the number of samples: n_samples=0.

If I comment out this section, which loads the dataset from cache:

mpf_dataset = lbs.dataset.Dataset(
    xr.DataArray(),
    dataset_name="Pereira2018LangfROIs",
    _skip_checks=True,
)
mpf_dataset.load_cache()

and just recalculate the dataset instead:

mpf_dataset = lbs.dataset.Dataset(
    mpf_xr.isel(neuroid=mpf_xr.roi.str.contains("Lang")),
    dataset_name="Pereira2018LangfROIs",
)

then the error goes away.

Worth looking into.
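
A guard like the following could make this failure show up at load time instead of deep inside the cross-validation split. This is only a sketch: check_loaded_values is a hypothetical helper, not part of langbrainscore, and it assumes the cached values can be passed in as a plain NumPy array (for an xarray object, its .values).

```python
import numpy as np


def check_loaded_values(values, name="dataset"):
    """Raise early if an array loaded from cache is empty or entirely NaN.

    Hypothetical helper for illustration; not a langbrainscore function.
    """
    arr = np.asarray(values, dtype=float)
    if arr.size == 0:
        raise ValueError(f"{name}: loaded array is empty (shape {arr.shape})")
    if np.isnan(arr).all():
        raise ValueError(f"{name}: loaded array contains only NaN values")
    return arr


# A healthy array passes through unchanged.
good = check_loaded_values(np.ones((4, 3, 1)), name="example")

# An all-NaN array is rejected with a descriptive message.
try:
    check_loaded_values(np.full((4, 3, 1), np.nan), name="example")
    rejected = False
except ValueError:
    rejected = True
```

Calling something like this right after the cache load would turn the n_samples=0 error into a message that points at the cache.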

aalok-sathe commented 2 years ago

This looks like something is either lost when the dataset is written to cache or not recovered when load_cache reads it back.

aalok-sathe commented 2 years ago

I can reproduce this intermittently; I need to test the Dataset caching behavior to see whether everything is being recovered.
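
A round-trip check along these lines is one way to test that. It is a minimal sketch that uses np.save and np.load on an in-memory buffer as a stand-in for the real cache; the actual Dataset cache uses a different format, and roundtrip and same_values are hypothetical names introduced here.

```python
import io

import numpy as np


def roundtrip(values):
    """Write an array to an in-memory buffer and read it back.

    Stand-in for a real cache write followed by a cache load.
    """
    buffer = io.BytesIO()
    np.save(buffer, values)
    buffer.seek(0)
    return np.load(buffer)


def same_values(original, reloaded):
    """True if shapes match and all values are identical (NaN equals NaN)."""
    if original.shape != reloaded.shape:
        return False
    return bool(np.allclose(original, reloaded, rtol=0, atol=0, equal_nan=True))


rng = np.random.default_rng(0)
original = rng.normal(size=(6, 4, 1))

# A faithful round trip preserves every value.
faithful = same_values(original, roundtrip(original))

# A reload that returns only NaN is detected as a mismatch.
corrupted = same_values(original, np.full_like(original, np.nan))
```

Swapping roundtrip for the real cache write and load_cache, and running the comparison repeatedly, would show how often the values fail to come back.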

aalok-sathe commented 2 years ago

Something is going wrong while deserializing the cached Dataset xarray: all metadata is recovered, but the data values come back as NaN.

In [33]: e = xr.open_zarr('/home/aalok/.cache/langbrainscore/Dataset/(Dataset?dataset_name=pereira2018_mean_froi_Lang)/_xr_obj.xr')

In [34]: e.data
Out[34]: 
<xarray.DataArray 'data' (sampleid: 627, neuroid: 108, timeid: 1)>
array([[[nan],
        [nan],
        ...,
        [nan],
        [nan]],

       [[nan],
        [nan],
        ...,
        [nan],
        [nan]],

       ...,

       [[nan],
        [nan],
        ...,
        [nan],
        [nan]],

       [[nan],
        [nan],
        ...,
        [nan],
        [nan]]])
Coordinates:
    experiment  (sampleid) <U16 ...
  * neuroid     (neuroid) int64 0 1 2 3 4 5 6 7 ... 261 262 263 264 265 266 267
    passage     (sampleid) <U12 ...
    roi         (neuroid) <U16 ...
  * sampleid    (sampleid) int64 0 1 2 3 4 5 6 7 ... 620 621 622 623 624 625 626
    session     (neuroid) <U17 ...
    stimulus    (sampleid) <U119 ...
    subject     (neuroid) int64 ...
  * timeid      (timeid) int64 0
Attributes:
    measurement:  fmri
    modality:     text
    source:       /home/aalok/code/langbrainscore/data/Pereira_FirstSession_T...
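
A possible link between the all-NaN array above and the original ValueError, as a sketch: if some step of the pipeline drops samples that contain NaN (an assumption; the thread does not confirm where or whether this happens), every one of the 627 samples would be removed, which matches n_samples=0.

```python
import numpy as np

# Same shape as the cached array shown above: (sampleid, neuroid, timeid).
data = np.full((627, 108, 1), np.nan)

# Assumed behavior: keep only samples (first axis) with no NaN
# in any neuroid or timeid entry.
valid = ~np.isnan(data).any(axis=(1, 2))
remaining = data[valid]

print(remaining.shape)  # (0, 108, 1)
```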