scverse / mudata

Multimodal Data (.h5mu) implementation for Python
https://mudata.rtfd.io
BSD 3-Clause "New" or "Revised" License

len #26

Closed rogershijin closed 1 year ago

rogershijin commented 2 years ago

Is there a reason MuData doesn't support len?

And if not, would you consider adding a subclass that does? For example,

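from mudata import MuData

class MuDataWithLen(MuData):

    def __len__(self):
        # Return the cached length if it has already been computed.
        try:
            return self._len
        except AttributeError:
            # First call: use the smallest length across all modalities.
            self._len = min(len(mod) for mod in self.mod.values())
            return self._len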

Thanks a lot!

gtca commented 2 years ago

Hey @rogershijin, thanks for the feedback. There is no particular reason, but it raises an interesting question: should the length be n_mod or n_obs? I would expect len(object) to match object.shape[0], and currently the shape of MuData is (n_obs, n_var).

rogershijin commented 2 years ago

Thanks for getting back to me! Yeah, I think n_obs is the more intuitive choice for me as well.

ivirshup commented 2 years ago

@gtca I'm not sure it makes sense for something to have __len__ without __iter__. Also, defining __len__ implicitly defines __bool__.
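To illustrate both points in plain Python, here is a minimal sketch using a hypothetical Container class (not part of mudata or anndata). It assumes a class that defines only __len__: truth-value testing then falls back to the length, and iter() fails because neither __iter__ nor __getitem__ is defined.

```python
class Container:
    """Hypothetical toy class that defines __len__ and nothing else."""

    def __init__(self, n_obs):
        self.n_obs = n_obs

    def __len__(self):
        return self.n_obs


empty = Container(0)
full = Container(3)

# With no __bool__, truthiness is derived from __len__.
print(bool(empty))  # False
print(bool(full))   # True

# With no __iter__ (and no __getitem__), the object is sized but not iterable.
try:
    iter(full)
except TypeError as err:
    print("not iterable:", err)
```

One practical consequence is that a check such as `if obj:` would silently treat an object with zero observations as false.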

gtca commented 2 years ago

@ivirshup maybe, but see anndata.

We can add __iter__ here as well. Is there anything special with the way anndata does it? I don't see __iter__ there per se.

ivirshup commented 2 years ago

I'm not sure anndata should have len either 😅

ivirshup commented 2 years ago

I think it needs some consideration of why you want the length, and what it should be consistent with. I do think that, if it were defined, n_obs would make the most sense, but accessing n_obs or shape directly is probably the better approach.

gtca commented 2 years ago

I think it's ok to expect objects that have .shape to have their length defined as .shape[0].

Practically, for workflows in the scanpy/muon world, one should rather use .n_obs, since explicit is better than implicit.
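As a rough sketch of that convention, the hypothetical ShapedObject class below (a stand-in, not the actual MuData implementation) exposes .shape as (n_obs, n_var) and defines its length as .shape[0], so len(), .shape[0] and .n_obs all agree.

```python
class ShapedObject:
    """Hypothetical stand-in for an object whose shape is (n_obs, n_var)."""

    def __init__(self, n_obs, n_var):
        self.shape = (n_obs, n_var)

    @property
    def n_obs(self):
        # Explicit accessor for the number of observations.
        return self.shape[0]

    def __len__(self):
        # Implicit accessor: the length is the first dimension of the shape.
        return self.shape[0]


obj = ShapedObject(n_obs=100, n_var=20)
print(len(obj) == obj.shape[0] == obj.n_obs)  # True
```

Both spellings return the same number; .n_obs simply states which dimension is meant.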

rogershijin commented 2 years ago

Thanks for adding __len__! And I should've mentioned this in the issue description, but on the practical application side I was originally motivated to post this because I couldn't use a PyTorch dataloader to batch load a MuData object without __len__ being defined.
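For context, a map-style dataset for a PyTorch DataLoader is expected to implement both __getitem__ and __len__. The sketch below avoids importing torch and only mimics that protocol: ObsDataset and iter_batches are hypothetical names, and the plain list of rows stands in for the observations of a MuData object.

```python
class ObsDataset:
    """Hypothetical map-style wrapper exposing __len__ and __getitem__."""

    def __init__(self, rows):
        self.rows = list(rows)

    def __len__(self):
        return len(self.rows)

    def __getitem__(self, index):
        return self.rows[index]


def iter_batches(dataset, batch_size):
    """Yield consecutive batches; needs len(dataset) to know where to stop."""
    for start in range(0, len(dataset), batch_size):
        stop = min(start + batch_size, len(dataset))
        yield [dataset[i] for i in range(start, stop)]


batches = list(iter_batches(ObsDataset(range(5)), batch_size=2))
print(batches)  # [[0, 1], [2, 3], [4]]
```

Without __len__, the loop has no way to compute the index range, which is the same reason a batching loader cannot work on an object that does not define its length.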