theislab / zellkonverter

Conversion between scRNA-seq objects
https://theislab.github.io/zellkonverter/

Factors with missing levels lead to off-by-one errors #122

Open const-ae opened 1 month ago

const-ae commented 1 month ago

Hi,

I am currently trying to read an h5ad file that has a column called "condition" which is stored as a categorical (I think that means that it was saved with AnnData version < 0.7.0).

Currently the implementation in .read_dim_data takes the integers, converts them to a factor, and overrides the levels:

levels <- as.vector(rhdf5::h5read(file, file.path(path, "__categories", cat_name)))
out_cols[[cat_name]] <- factor(out_cols[[cat_name]])
levels(out_cols[[cat_name]]) <- levels

This code gives wrong results if one of the factor levels is unused:

# The conditions were: A, C, C, A, A, A, A (i.e., no B)
out_cols <- list(condition = c(0, 2, 2, 0, 0, 0, 0))

levels <- c("a", "b", "c")
out_cols[["condition"]] <- factor(out_cols[["condition"]])
levels(out_cols[["condition"]]) <- levels

# After parsing: the conditions appear as 5x A and 2x B!!
table(out_cols[["condition"]])
#> 
#> a b c 
#> 5 2 0

Created on 2024-08-19 with reprex v2.1.0

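The problem is very easy to miss because R silently accepts the new levels even though the old and new level vectors differ in length. factor() only creates levels for the values that actually occur (here 0 and 2), so assigning three names relabels them by position: 0 becomes "a", 2 becomes "b", and "c" is appended as an empty level.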
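To see where the shift comes from, it may help to inspect the intermediate factor before its levels are overwritten. The sketch below reuses the toy codes from the example above; the variable names `codes`, `f`, and `new_levels` are only illustrative and do not come from zellkonverter.

```r
# Toy codes from the example above: 0 = "a", 2 = "c", and 1 = "b" never occurs
codes <- c(0, 2, 2, 0, 0, 0, 0)

# factor() only creates levels for values that are actually present
f <- factor(codes)
levels(f)  # "0" "2": two levels, not three

# Assigning three names relabels the existing levels by position:
# "0" becomes "a", "2" becomes "b", and "c" is appended as an empty level
new_levels <- c("a", "b", "c")
levels(f) <- new_levels
as.character(f)
```

Every observation that was originally "c" now reads "b", and no warning is raised along the way.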


One way to fix the implementation would be to write:

out_cols[["condition"]] <- factor(levels[out_cols[["condition"]]+1L], levels = levels)

table(out_cols[["condition"]])
#> 
#> a b c 
#> 5 0 2

Created on 2024-08-19 with reprex v2.1.0
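A possible variant of this fix, sketched here under assumptions rather than taken from zellkonverter, passes the zero-based codes to `factor()` together with the full set of possible codes as `levels` and the category names as `labels`. The helper name `codes_to_factor` is hypothetical. The sketch also assumes that a missing value is stored as the code -1, which is how pandas encodes missing categorical values. Such a code matches no level and becomes `NA`, whereas indexing with `levels[codes + 1L]` would drop the resulting zero index and return a shorter vector.

```r
# Hypothetical helper: convert zero-based category codes to a factor
codes_to_factor <- function(codes, categories) {
  factor(
    codes,
    levels = seq_along(categories) - 1L,  # all possible codes: 0, 1, ..., n - 1
    labels = categories                   # names in the same order as the codes
  )
}

categories <- c("a", "b", "c")

# The unused level "b" keeps its slot, so "c" is not shifted
table(codes_to_factor(c(0, 2, 2, 0, 0, 0, 0), categories))

# Assumed encoding: -1 marks a missing value and becomes NA
codes_to_factor(c(0, -1, 2), categories)
```

Because the mapping is by value rather than by position, the result does not depend on which categories happen to be present in the data.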

lazappi commented 1 month ago

Hi @const-ae

Thanks for the issue. It looks like you are using the R reader? I'm a bit rusty on how this works but I think you are probably right. Any chance you would be interested in submitting a PR with this change?

const-ae commented 1 month ago

> It looks like you are using the R reader?

Yes exactly.

> Any chance you would be interested in submitting a PR with this change?

Sorry, I currently don't have the bandwidth as I am trying to wrap up some revisions before moving to London next week.