grimbough / rhdf5

Package providing an interface between HDF5 and R
http://bioconductor.org/packages/rhdf5
61 stars 22 forks source link

`h5readAttribute()` does not handle attributes of type ENUM #125

Closed mtmorgan closed 1 year ago

mtmorgan commented 1 year ago

Using the same too-complicated example from here:

suppressPackageStartupMessages({
    library(cellxgenedp)
    library(zellkonverter)
})

h5ad_file <-
    files(db(overwrite = FALSE)) |>
    filter(file_id == "e2874fd9-af11-4adf-a47a-f380b7606434") |>
    files_download(dry.run = FALSE)

sce <- readH5AD(h5ad_file, use_hdf5 = TRUE, reader = "R")

results in

> sce <- readH5AD(h5ad_file, use_hdf5 = TRUE, reader = "R")
There were 28 warnings (use warnings() to see them)
> warnings()
Warning messages:
1: In H5Aread(A, ...) :
  Reading attribute data of type 'ENUM' not yet implemented. Values replaced by NA's.
...

or once the file is in hand...

> h5readAttributes(h5ad_file, "/obs/CellType") |> str()
List of 3
 $ encoding-type   : chr "categorical"
 $ encoding-version: chr "0.2.0"
 $ ordered         : num NA
Warning message:
In H5Aread(A, ...) :
  Reading attribute data of type 'ENUM' not yet implemented. Values replaced by NA's.

The problem is the ordered attribute, which is encoded as an ENUM, and the code path leading to it. https://github.com/grimbough/rhdf5/blob/bddaf01256ef2ba5c663c12589ff165565bad7f8/src/H5A.c#L328-L337

One perhaps non-rhdf5 issue is that the python 'convention' is to treat an attribute with keys 'TRUE', 'FALSE' as a logical vector; hdf5r does this too, but perhaps it doesn't belong at this level.

grimbough commented 1 year ago

The enum-attributes branch now has some experimental support for this.

> h5readAttributes(h5ad_file, "/obs/CellType") 
H5Tsize: 1
$`encoding-type`
[1] "categorical"

$`encoding-version`
[1] "0.2.0"

$ordered
[1] "FALSE"

At the moment, if it encounters an ENUM attribute it translates that into an R character vector; a bit like as.character() applied to a factor. In this case you could then apply as.logical() to get the desired behaviour.

Longer term, it might be better to define an enum class or something, so you don't lose the actual values. Would be interested in your thoughts.

mtmorgan commented 1 year ago

Thanks @grimbough ; I added a couple of comments to the commit.

I actually lean toward an enum with values 'TRUE' / 'FALSE' / (?NA) being coerced to logical; this is consistent with h5py's treatment of Boolean https://docs.h5py.org/en/stable/faq.html.

It seems like a close reading of https://support.hdfgroup.org/HDF5/doc/H5.user/DatatypesEnum.html would encourage a formal class to represent enum -- there are some subtleties that a casual user might get wrong. Not sure what the demand for this is at the moment...

grimbough commented 1 year ago

Thanks @grimbough ; I added a couple of comments to the commit.

Thanks for that. I think I've addressed them now.

I actually lean toward an enum with values 'TRUE' / 'FALSE' / (?NA) being coerced to logical; this is consistent with h5py's treatment of Boolean https://docs.h5py.org/en/stable/faq.html.

I agree. It certainly feels like h5readAttributes(), which is supposed to more "user friendly", should probably do something similar to h5py, so I've now implemented that. If the enum levels are a subset of c("FALSE", "TRUE", "NA") then the attribute values will be converted to logical before being returned.

For now the lower-level H5Aread() will return the character vector of enum labels as before, although ideally this would actually support the hypotheical H5_ENUM class.

It seems like a close reading of https://support.hdfgroup.org/HDF5/doc/H5.user/DatatypesEnum.html would encourage a formal class to represent enum -- there are some subtleties that a casual user might get wrong. Not sure what the demand for this is at the moment...

I suspect demand isn't great, although I wasn't aware of this issue either until you raised it in the hackathon document. Perhaps there are other gaps in support that no one has mentioned to me, but would be nice to fill :shrug:

mtmorgan commented 1 year ago

Hi @grimbough I'm not sure what happened to this, I can't find the branch anymore and I don't think it's been merged into 'devel'?

grimbough commented 1 year ago

Sorry @mtmorgan I renamed the branch to https://github.com/grimbough/rhdf5/tree/scverse-hackathon since I thought there might be more things that came up during the hackathon, but I ended up only interacting with the SpatialData/Zarr group for the day I was there.

mtmorgan commented 1 year ago

Thanks @grimbough ; I see

devtools::load_all()
## Error in add_classes_to_exports(ns = nsenv, package = package, exports = exports,  :
##   in package ‘rhdf5’ classes H5Enum were specified for export but not defined
grimbough commented 1 year ago

Thanks, an unintended legacy of starting that hypothetical enum class and then not getting very far. It should load for you now.

I'm not sure if I mentioned it anywhere else, but I also added support to h5read() for AnnData nullable arrays. There's an example at: https://github.com/grimbough/rhdf5/blob/e59a50ae1fb287755a1ad71a74524cbbdc022f58/tests/testthat/test_h5read.R#L160-L171

If you're actively experimenting with any these features I'll keep this scverse-hackathon branch for now, and merge into devel after you've done some testing and found the bugs.

rcannood commented 1 year ago

Hi @grimbough ! Thanks for making these changes!

I'm currently looking into getting the anndataR package submitted to Bioconductor, and to that end I'm tracking down any outstanding issues.

I just installed the scverse-hackathon branch I can confirm that the changes solve a lot of our issues!

Could these changes be merged into the devel version?

grimbough commented 1 year ago

Hi @rcannood ,

Thanks for the nudge. I'd totally lost track of whether these had already been merged. The scverse-hackathon branch has now been merged into devel and everything should be available from v 2.45.2.

Let me know if you run into any other problems, probably by opening another issue.