theislab / zellkonverter

Conversion between scRNA-seq objects
https://theislab.github.io/zellkonverter/

Began adding an R-native reader for H5AD files. #46

Closed LTLA closed 3 years ago

LTLA commented 3 years ago

This creates an R-native reader that directly uses rhdf5 and other BioC packages to read H5AD files. The general idea is to reduce our reliance on the basilisk bindings for 95% of use cases, avoiding the various issues with Python communication, while still retaining the current basilisk infrastructure as a fallback for difficult and weird cases.

Seems to work fairly well, though some more testing is needed. Code is also still required to handle `uns`.
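The intended dispatch (try the native reader first, keep the basilisk/Python path as the fallback) can be sketched roughly as follows. This is Python pseudologic rather than the package's actual R code, and the function names are made up for illustration:

```python
import warnings

# Hypothetical reader functions, for illustration only; the real package
# implements the native reader in R on top of rhdf5, and the fallback via
# basilisk-managed Python bindings.
def read_h5ad_native(path):
    raise NotImplementedError("native reader not available in this sketch")

def read_h5ad_python(path):
    return {"source": "python", "path": path}

def read_h5ad(path, use_native=True):
    """Try the native reader first; fall back to the Python-based reader."""
    if use_native:
        try:
            return read_h5ad_native(path)
        except Exception as err:
            # Warn rather than fail, so weird files still load via Python.
            warnings.warn(f"native reader failed ({err}); falling back to Python")
    return read_h5ad_python(path)
```

The key design point is that a native-reader failure is non-fatal: the user still gets a result, just via the slower Python route.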

lazappi commented 3 years ago

I don't have a lot of experience with HDF5, to be honest, but this seems to work. Interestingly, it is able to read the DE results in the example_anndata.h5ad file, which fail to convert from Python with the current version. The format is a bit different, so it would probably cause problems for scanpy if converted back to AnnData, but it's still probably better than what we have now.

How do you think it would handle more Python-specific stuff in the file? Do we need to add any other checks?

What other things still need to be done? I think at least changing readH5AD() to use this (and fall back to the Python version if it fails)?

LTLA commented 3 years ago

> How do you think it would handle more Python-specific stuff in the file? Do we need to add any other checks?

That's an interesting question. In theory, HDF5 should be language-agnostic, so we wouldn't have to worry about Python-specific stuff as long as anndata saves things in a standard way. In practice, of course, one could dump all sorts of random crap in there (e.g., a pickled object), in which case we're screwed. I think the current strategy of wrapping reads in tryCatch() will at least ensure that most of it works, and if something fails, we can go back and special-case it later. Except for the pickles; there's no real way around those.

> What other things still need to be done? I think at least changing readH5AD() to use this (and fall back to the Python version if it fails)?

Yes, and testing. Lots of testing. Throw it at every H5AD file you can get your hands on and make sure it gives reasonable results. We should even run this as part of the longtests if those H5AD files are publicly available - see e.g., DropletUtils.

ivirshup commented 3 years ago

> In practice, of course, one could dump all sorts of random crap in there (e.g., a pickled object), in which case we're screwed.

The idea with the next version of anndata is that we have a registry of specifications. When you come across an element, you check in the registry of specifications to find the appropriate read method. If you can't find a method, a fallback is taken. By default, I think we'll go for a warning, but will have a strict mode to error.

You could do something similar? Again, this will be much easier to do with the next version of anndata. The question is whether you want to support older file versions as well.
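A minimal sketch of such a registry, in Python with made-up names (not anndata's actual API): readers are keyed by the element's encoding type, with a warn-by-default fallback and a strict mode that errors on unknown types:

```python
import warnings

# Registry mapping an element's encoding type to its read function.
# The encoding-type key and function names are illustrative only.
_READ_REGISTRY = {}

def register_read(encoding_type):
    """Decorator that registers a read method for one encoding type."""
    def decorator(fn):
        _READ_REGISTRY[encoding_type] = fn
        return fn
    return decorator

@register_read("array")
def _read_array(element):
    return list(element["values"])

@register_read("dataframe")
def _read_dataframe(element):
    return dict(element["columns"])

def read_element(element, strict=False):
    """Look up a reader by encoding type; warn (or error) on unknown types."""
    encoding = element.get("encoding-type")
    reader = _READ_REGISTRY.get(encoding)
    if reader is None:
        msg = f"no registered reader for encoding type {encoding!r}"
        if strict:
            raise KeyError(msg)
        warnings.warn(msg)
        return element  # fallback: hand back the raw element untouched
    return reader(element)
```

An unknown element (say, a pickled blob) then degrades gracefully in the default mode instead of aborting the whole read.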

> Throw it at every H5AD file you can get your hands on and make sure it gives reasonable results.

Btw, would you be interested in sharing a resource for this? E.g. have some figshare or cloud storage volume that holds use cases for backwards compatibility tests?

lazappi commented 3 years ago

> Btw, would you be interested in sharing a resource for this? E.g. have some figshare or cloud storage volume that holds use cases for backwards compatibility tests?

I am definitely up for that. I was going to ask around the lab for old/weird .h5ad files for testing but it makes sense to put them somewhere accessible.

In more general news, I think I'm done with the other minor updates. I would still like to do #47 but I'm not sure I'll have time. The longtests infrastructure should be set up now, so we can add files to that as we find them. I'll start looking into what needs to be done here, but it's probably worth merging the current master first.

LTLA commented 3 years ago

> Btw, would you be interested in sharing a resource for this? E.g. have some figshare or cloud storage volume that holds use cases for backwards compatibility tests?

Sorry for the late reply. This is a good idea and I've brought it up a few times before. I use this approach in DropletTestFiles, holding versions of CellRanger output all the way back to v2; I don't trust 10X Genomics to give me stable URLs, hence my own copies.

It would be fairly easy to host these files on ExperimentHub; R users can pull them down via a dedicated package, and Python users can access them through the EHub REST API. I brought this up as a possibility for David Fischer's curation project; he ended up working with the CZI on that, but we can recycle the idea for weird old files that mightn't have a home anywhere else.

ivirshup commented 3 years ago

> It would be fairly easy to host these files on ExperimentHub; R users can pull them down via a dedicated package, and Python users can access them through the EHub REST API.

That sounds good. I'm looking into what we can do for hosting, but if you've got something already set up, that would be great. I'm not familiar with the organization of ExperimentHub; how much structure would we need to observe there?

I think there's also a question of scope, and what constitutes "test files". For example, I've been wanting a place to host larger datasets for benchmarking.

LTLA commented 3 years ago

> I'm not familiar with the organization of ExperimentHub; how much structure would we need to observe there?

Not much. You can throw any files on there, provided they are appropriately decorated with metadata. Metadata is formatted like this and needs to be associated with a package. Ideally we would have a separate package, like H5ADTestFiles or something appropriately named. I would offer to have scRNAseq be the temporary host package for these files, but given that we're so close to the release, it's unlikely that our request to add files to scRNAseq would be processed in time, so we might as well just wait.
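For reference, ExperimentHub metadata is one CSV row per resource, associated with the host package. A sketch of what a row for an H5AD test file might look like, shown as field/value pairs for readability; the column names are recalled from the ExperimentHubData docs and all values are placeholders, so check the current template before submitting:

```
Title: example_anndata.h5ad
Description: Example H5AD file for backwards-compatibility tests
BiocVersion: 3.13
Genome: NA
SourceType: HDF5
SourceUrl: https://example.com/example_anndata.h5ad
SourceVersion: 1.0
Species: Homo sapiens
TaxonomyId: 9606
Coordinate_1_based: NA
DataProvider: zellkonverter
Maintainer: Your Name <you@example.com>
RDataClass: character
DispatchClass: FilePath
RDataPath: zellkonverter/example_anndata.h5ad
```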

In the meantime, I suppose @lazappi can just gather his ragtag bunch of files and do testing locally.

lazappi commented 3 years ago

I got the dates mixed up and didn't realise the final changes for this release are due today. I don't think there is time to test this properly, but it seems to mostly work. At the moment it's available as an option, but the Python version is still the default. We could merge it and iron out the kinks later (maybe with a note/warning somewhere about it still being experimental)? @LTLA do you think it's worth doing that, or should we just put it off until the next release when it should be more stable?

LTLA commented 3 years ago

> At the moment it's available as an option, but the Python version is still the default. We could merge it and iron out the kinks later (maybe with a note/warning somewhere about it still being experimental)?

SGTM

ivirshup commented 3 years ago

> Ideally we would have a separate package like H5ADTestFiles

I think it makes more sense to start off with something less structured. It will help to scope out the requirements of what kinds of files we collect. I also think this should probably ultimately be handled by anndata, since we're going to have more complex requirements (e.g., not just HDF5, benchmarking datasets, adding files on our release schedule).

LTLA commented 3 years ago

Whatever floats your boat is fine with me, as long as we have stable URLs to pull for zellkonverter's longtests.

(@lazappi in fact, it occurs to me that we could just associate zellkonverter itself with these files, thus avoiding the need for a whole other package. As long as we only `Suggests: ExperimentHub`, it shouldn't add to the dependency burden. I did the same for reference datasets in SingleR before splitting them off into a separate package.)
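For what it's worth, the `Suggests:` arrangement is just a standard field in the package's `DESCRIPTION` file, so packages listed there are only needed for optional functionality and tests, not installation. A minimal sketch (package names other than ExperimentHub are illustrative):

```
Suggests:
    ExperimentHub,
    testthat,
    rmarkdown
```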