Open lazappi opened 12 months ago
@grimbough Do you have an idea whether it would be possible to add this relatively soon? At the moment it is blocking a project I was hoping to submit soon.
Hi @lazappi
I'll have a think. The way we deal with a logical dataset is a bit of an unsatisfying hack. When one is written, inside the HDF5 file they're just 1 byte signed integers, and we store an attribute indicating they came from an R logical. If we detect that attribute when reading a dataset the integers are type as "logical" once they've been read into R.
We can't pull that same trick for an attribute. Would it be sufficient to cast your R logical to an R integer before writing? Do you have an example of the output you're looking for created in python? Reading the referenced issue, maybe an ENUM is actually what you want. Are the only allowed values TRUE
and FALSE
, or is NA
also possible?
Ok, thanks. This is for reading/writing H5AD files from R. The anndata spec says a "boolean valued field" but I think you might be correct that it is actually written as an ENUM
.
If I look at our example file written from Python in R I get:
#> rhdf5::h5readAttributes("inst/extdata/example.h5ad", "/uns/Category")
$`encoding-type`
[1] "categorical"
$`encoding-version`
[1] "0.2.0"
$ordered
[1] FALSE
But if I use h5ls
on the command line I can see more:
$> h5ls -r -v inst/extdata/example.h5ad
...
/uns/Category Group
Attribute: encoding-type scalar
Type: variable-length null-terminated UTF-8 string
Attribute: encoding-version scalar
Type: variable-length null-terminated UTF-8 string
Attribute: ordered scalar
Type: enum native signed char {
FALSE = 0
TRUE = 1
}
...
I guess I missed this because you get a logical value back in R and I'm still new to HDF5.
In that case, do you have a suggestion for how to get {rhdf5} to write an ENUM
attribute that looks like this?
Here's a first go. You'll have to use this branch for now: https://github.com/grimbough/rhdf5/tree/logical-attrs
You should then be able to create an ENUM
attribute like this:
library(rhdf5)
hdf5_file <- tempfile()
rhdf5::h5createFile(hdf5_file)
rhdf5::h5createGroup(hdf5_file, "group")
h5file <- rhdf5::H5Fopen(hdf5_file)
h5obj <- rhdf5::H5Gopen(h5file, "group")
## create an enum datatype and insert our two key:value pairs
tid <- H5Tenum_create(dtype_id = "H5T_NATIVE_UCHAR")
H5Tenum_insert(tid, name = "TRUE", value = 1L)
H5Tenum_insert(tid, name = "FALSE", value = 0L)
## create the attribute using our new data type
sid <- H5Screate()
aid <- H5Acreate(h5obj = h5obj, name = "logical", dtype_id = tid, h5space = sid)
## write the value
values_to_be_written <- TRUE
H5Awrite(aid, buf = values_to_be_written)
## tidy up
h5closeAll(aid, sid, h5obj, h5file)
system2("h5ls", args = c('-r', '-v', hdf5_file))
...
/group Group
Attribute: logical scalar
Type: enum native unsigned char {
TRUE = 1
FALSE = 0
}
...
I can't promise this is the final API. There's still some thought that needs to be applied, but this is hopefully enough for you to do some testing, and I'll update here if it changes.
It would be good to wrap this up within h5writeAttribute
, but I don't like that logical values will get treated differently depending on whether they're in a dataset or an attribute. However changing how logical datasets are written is likely to have more wide ranging consequences, so I'd need to do it carefully.
Thanks! I tried this and I think it works for our use (with minimal testing). Here is a helper function I wrote based on your code in case that's useful:
write_h5ad_boolean_attribute <- function(attr_value, h5obj, attr_name) {
## create an enum datatype and insert our two key:value pairs
tid <- rhdf5::H5Tenum_create(dtype_id = "H5T_NATIVE_UCHAR")
rhdf5::H5Tenum_insert(tid, name = "TRUE", value = ifelse(attr_value, 1L, 0L))
rhdf5::H5Tenum_insert(tid, name = "FALSE", value = ifelse(attr_value, 0L, 1L))
## create the attribute using our new data type
sid <- rhdf5::H5Screate()
aid <- rhdf5::H5Acreate(
h5obj = h5obj, name = attr_name, dtype_id = tid, h5space = sid
)
## write the value
values_to_be_written <- attr_value
rhdf5::H5Awrite(aid, buf = values_to_be_written)
## tidy up
rhdf5::h5closeAll(aid, sid)
}
In the process I realised we need to adapt how we write all boolean values (arrays etc.), not just attributes. Will something similar work for those?
Maybe there could be an argument or something controlling if boolean values are written as an ENUM
or the current way? I haven't looked into it but based on things other people have tried maybe {hdf5r} takes the ENUM
approach.
I think at the moment this will fail because of this block of code:
That will force the write to try and use the H5T_NATIVE_INT
datatype for a logical array, but we'd be sticking it into a dataset created with ENUM
and I don't think that'll work. I think it might be as simple as changing that to use the datatype of the existing dataset, but I'll have to do some testing.
Looks like I actually answered this once before https://stackoverflow.com/a/74653515/3869132 and didn't remember
I've made some updates to the package for writing logical attributes. You should now be able to do it with h5writeAttributes
and I've updated that function so it can be given a file path and dataset location within the file, rather than having to provide HDF5 handles and the tidy them yourself.
library(rhdf5)
hdf5_file <- tempfile()
rhdf5::h5createFile(hdf5_file)
rhdf5::h5createGroup(hdf5_file, "group")
values_to_be_written <- c(NA, FALSE, TRUE, FALSE, NA)
h5writeAttribute(values_to_be_written, h5obj = hdf5_file, name = "test", h5loc = "/group")
h5readAttributes(hdf5_file, name = "/group")
#> $test
#> [1] NA FALSE TRUE FALSE NA
This is currently in the devel branch here on GitHub. There some more outstanding issues I'll resolve before pushing to Bioc, but I think that API will remain static for now so feel free to work with it.
Thanks, this should make things much easier! I finally got around to testing it. Has anything else changed in h5writeAttribute()
/h5readAttribute()
that would change how things are stored/read into R? I used to get a vector when I read a character attribute but now I get a 1D array (on the logical-attrs
branch):
hdf5_file <- tempfile()
rhdf5::h5createFile(hdf5_file)
rhdf5::h5createGroup(hdf5_file, "group")
rhdf5::h5writeAttribute("string_attribute", h5obj = hdf5_file, name = "test_string", h5loc = "/group")
attrs <- rhdf5::h5readAttributes(hdf5_file, "/group")
class(attrs$test_string)
#> [1] "array"
dim(attrs$test_string)
#> [1] 1
is.vector(attrs$test_string)
#> [1] FALSE
I can probably worth with either way but wanted to check if this is what I should expect or something has snuck in somewhere.
I didn't deliberately change this to return a vector. To be honest I would have expected to get back an array, that's normally the default. This annoys me, because if you write a vector you generally get back an array, but I once tried to change it and broke HDF5Array / DelayedArray by doing that, so I reverted it.
I can take a look at what might have changed, but I'm probably happy with it returning an array for now and that is consistent with most other functions.
You can ignore my last comment (and the one I deleted). Turns out there was a bug in my code and asScalar
wasn't been set correctly. When you do set asScalar = TRUE
you get a vector back as I expected.
I think my original issue is now solved using the current devel version and the solutions you suggested.
Currently, it doesn't seem possible to write a logical type attribute to a group (using {rhdf5}
v2.47.0
):Would it be possible to implement this? Happy to help if you can point me in the right direction but I couldn't quite work out why it isn't working currently. Unless I am missing something?