grimbough / rhdf5

Package providing an interface between HDF5 and R
http://bioconductor.org/packages/rhdf5
59 stars 22 forks source link

Allowing writing of logical attributes #136

Open lazappi opened 7 months ago

lazappi commented 7 months ago

Currently, it doesn't seem possible to write a logical type attribute to a group (using {rhdf5} v2.47.0):

hdf5_file <- tempfile()
rhdf5::h5createFile(hdf5_file)
rhdf5::h5createGroup(hdf5_file, "group")
h5file <- rhdf5::H5Fopen(hdf5_file)
h5obj <- rhdf5::H5Gopen(h5file, "group")
rhdf5::h5writeAttribute(TRUE, h5obj, "logical", asScalar = TRUE)
#> Error in h5createAttribute(h5obj, name, dims = dims, storage.mode = storagemode,  : 
#>  datatype logical not yet implemented. Try 'double', 'integer', or 'character'.

Would it be possible to implement this? Happy to help if you can point me in the right direction but I couldn't quite work out why it isn't working currently. Unless I am missing something?

lazappi commented 5 months ago

@grimbough Do you have an idea whether it would be possible to add this relatively soon? At the moment it is blocking a project I was hoping to submit soon.

grimbough commented 5 months ago

Hi @lazappi

I'll have a think. The way we deal with a logical dataset is a bit of an unsatisfying hack. When one is written, inside the HDF5 file they're just 1 byte signed integers, and we store an attribute indicating they came from an R logical. If we detect that attribute when reading a dataset the integers are type as "logical" once they've been read into R.

We can't pull that same trick for an attribute. Would it be sufficient to cast your R logical to an R integer before writing? Do you have an example of the output you're looking for created in python? Reading the referenced issue, maybe an ENUM is actually what you want. Are the only allowed values TRUE and FALSE, or is NA also possible?

lazappi commented 5 months ago

Ok, thanks. This is for reading/writing H5AD files from R. The anndata spec says a "boolean valued field" but I think you might be correct that it is actually written as an ENUM.

If I look at our example file written from Python in R I get:

#> rhdf5::h5readAttributes("inst/extdata/example.h5ad", "/uns/Category")
$`encoding-type`
[1] "categorical"

$`encoding-version`
[1] "0.2.0"

$ordered
[1] FALSE

But if I use h5ls on the command line I can see more:

$> h5ls -r -v inst/extdata/example.h5ad

...
/uns/Category            Group
    Attribute: encoding-type scalar
        Type:      variable-length null-terminated UTF-8 string
    Attribute: encoding-version scalar
        Type:      variable-length null-terminated UTF-8 string
    Attribute: ordered scalar
        Type:      enum native signed char {
                   FALSE            = 0
                   TRUE             = 1
               }
...

I guess I missed this because you get a logical value back in R and I'm still new to HDF5.

In that case, do you have a suggestion for how to get {rhdf5} to write an ENUM attribute that looks like this?

grimbough commented 5 months ago

Here's a first go. You'll have to use this branch for now: https://github.com/grimbough/rhdf5/tree/logical-attrs

You should then be able to create an ENUM attribute like this:

library(rhdf5)

hdf5_file <- tempfile()
rhdf5::h5createFile(hdf5_file)
rhdf5::h5createGroup(hdf5_file, "group")
h5file <- rhdf5::H5Fopen(hdf5_file)
h5obj <- rhdf5::H5Gopen(h5file, "group")

## create an enum datatype and insert our two key:value pairs
tid <- H5Tenum_create(dtype_id = "H5T_NATIVE_UCHAR")
H5Tenum_insert(tid, name = "TRUE", value = 1L)
H5Tenum_insert(tid, name = "FALSE", value = 0L)

## create the attribute using our new data type
sid <- H5Screate()
aid <- H5Acreate(h5obj = h5obj, name = "logical", dtype_id = tid, h5space = sid)

## write the value
values_to_be_written <- TRUE
H5Awrite(aid, buf = values_to_be_written)

## tidy up
h5closeAll(aid, sid, h5obj, h5file)

system2("h5ls", args = c('-r', '-v', hdf5_file))
...
/group                   Group
    Attribute: logical scalar
        Type:      enum native unsigned char {
                   TRUE             = 1
                   FALSE            = 0
               }
...

I can't promise this is the final API. There's still some thought that needs to be applied, but this is hopefully enough for you to do some testing, and I'll update here if it changes.

It would be good to wrap this up within h5writeAttribute, but I don't like that logical values will get treated differently depending on whether they're in a dataset or an attribute. However changing how logical datasets are written is likely to have more wide ranging consequences, so I'd need to do it carefully.

lazappi commented 5 months ago

Thanks! I tried this and I think it works for our use (with minimal testing). Here is a helper function I wrote based on your code in case that's useful:

write_h5ad_boolean_attribute <- function(attr_value, h5obj, attr_name) {

  ## create an enum datatype and insert our two key:value pairs
  tid <- rhdf5::H5Tenum_create(dtype_id = "H5T_NATIVE_UCHAR")
  rhdf5::H5Tenum_insert(tid, name = "TRUE", value = ifelse(attr_value, 1L, 0L))
  rhdf5::H5Tenum_insert(tid, name = "FALSE", value = ifelse(attr_value, 0L, 1L))

  ## create the attribute using our new data type
  sid <- rhdf5::H5Screate()
  aid <- rhdf5::H5Acreate(
    h5obj = h5obj, name = attr_name, dtype_id = tid, h5space = sid
  )

  ## write the value
  values_to_be_written <- attr_value
  rhdf5::H5Awrite(aid, buf = values_to_be_written)

  ## tidy up
  rhdf5::h5closeAll(aid, sid)
}

In the process I realised we need to adapt how we write all boolean values (arrays etc.), not just attributes. Will something similar work for those?

Maybe there could be an argument or something controlling if boolean values are written as an ENUM or the current way? I haven't looked into it but based on things other people have tried maybe {hdf5r} takes the ENUM approach.

grimbough commented 5 months ago

I think at the moment this will fail because of this block of code:

https://github.com/grimbough/rhdf5/blob/4b84fb00512b246d364b474b221fbe497c473ea1/src/H5D.c#L1057-L1063

That will force the write to try and use the H5T_NATIVE_INT datatype for a logical array, but we'd be sticking it into a dataset created with ENUM and I don't think that'll work. I think it might be as simple as changing that to use the datatype of the existing dataset, but I'll have to do some testing.

grimbough commented 5 months ago

Looks like I actually answered this once before https://stackoverflow.com/a/74653515/3869132 and didn't remember

grimbough commented 5 months ago

I've made some updates to the package for writing logical attributes. You should now be able to do it with h5writeAttributes and I've updated that function so it can be given a file path and dataset location within the file, rather than having to provide HDF5 handles and the tidy them yourself.

library(rhdf5)
hdf5_file <- tempfile()
rhdf5::h5createFile(hdf5_file)
rhdf5::h5createGroup(hdf5_file, "group")

values_to_be_written <-  c(NA, FALSE, TRUE, FALSE, NA)
h5writeAttribute(values_to_be_written, h5obj = hdf5_file, name = "test", h5loc = "/group")

h5readAttributes(hdf5_file, name = "/group")
#> $test
#> [1]    NA FALSE  TRUE FALSE    NA

This is currently in the devel branch here on GitHub. There some more outstanding issues I'll resolve before pushing to Bioc, but I think that API will remain static for now so feel free to work with it.

lazappi commented 4 months ago

Thanks, this should make things much easier! I finally got around to testing it. Has anything else changed in h5writeAttribute()/h5readAttribute() that would change how things are stored/read into R? I used to get a vector when I read a character attribute but now I get a 1D array (on the logical-attrs branch):

hdf5_file <- tempfile()
rhdf5::h5createFile(hdf5_file)
rhdf5::h5createGroup(hdf5_file, "group")
rhdf5::h5writeAttribute("string_attribute", h5obj = hdf5_file, name = "test_string", h5loc = "/group")
attrs <- rhdf5::h5readAttributes(hdf5_file, "/group")
class(attrs$test_string)
#> [1] "array"
dim(attrs$test_string)
#> [1] 1
is.vector(attrs$test_string)
#> [1] FALSE

I can probably worth with either way but wanted to check if this is what I should expect or something has snuck in somewhere.

grimbough commented 4 months ago

I didn't deliberately change this to return a vector. To be honest I would have expected to get back an array, that's normally the default. This annoys me, because if you write a vector you generally get back an array, but I once tried to change it and broke HDF5Array / DelayedArray by doing that, so I reverted it.

I can take a look at what might have changed, but I'm probably happy with it returning an array for now and that is consistent with most other functions.

lazappi commented 4 months ago

You can ignore my last comment (and the one I deleted). Turns out there was a bug in my code and asScalar wasn't been set correctly. When you do set asScalar = TRUE you get a vector back as I expected.

I think my original issue is now solved using the current devel version and the solutions you suggested.