theislab / zellkonverter

Conversion between scRNA-seq objects
https://theislab.github.io/zellkonverter/
Other
144 stars 27 forks source link

Fix categorical (factor) handling in native R reader #86

Closed jackkamm closed 1 year ago

jackkamm commented 1 year ago

When I try to use the native R reader on an h5ad file I have, like so:

sce <- readH5AD("/path/to/my/anndata.h5ad", 
                reader='R',
                use_hdf5=TRUE,
                verbose=TRUE)

I get the following warning:

Warning messages:
1: In value[[3L]](cond) : setting 'colData' failed for
  '/path/to/my/anndata.h5ad':
  cannot coerce class "list" to a DataFrame
2: In value[[3L]](cond) : setting 'rowData' failed for
  '/path/to/my/anndata.h5ad':
  cannot coerce class "list" to a DataFrame

And the colData and rowData of the returned SCE are empty.

Tracking it down, it's because zellkonverter was unable to convert some Categorical (factor) columns in the obs/var. It's because AnnData 0.8 changed the way Categoricals are encoded.

This updates the native R reader so it can read factors encoded with the newer format. After applying this patch I was able to read my h5ad file.

EDIT:

Related: https://github.com/theislab/zellkonverter/issues/78

Thought I should also add why I'm using the R reader rather than the recommended python reader. It's because the python reader had errors reading X/layers matrices from my h5ad file due to unsorted indptr (seems to be the same issue as https://github.com/theislab/anndata2ri/issues/51). I was able to fix some of the errors by calling sort_indices() and resaving the AnnData, but some matrices still wouldn't load, and the ones that did were loaded as dgCMatrix instead of DelayedArray. By contrast I found the R reader successfully read all matrices from my h5ad file as DelayedArray.

LTLA commented 1 year ago

@lazappi will have more comments, but one thing I'll note is that the change must be back-compatible with the previous H5AD formats. I'm not familiar with the differences but it sounds like __categories no longer exists in the new format, in which case you could check for that in names(fields) and switch to your approach if it doesn't exist. This avoids weird interactions between old versions of the files that happen to use the new keywords.

(A better approach would be to detect the file version up-front and pass it along to each internal function, which avoids the need for guessing inside each function. However, this would involve a more dramatic set of changes.)

Also, I have to say it, but the surrounding code uses 4-space indenting, and so should you.

jackkamm commented 1 year ago

Also, I have to say it, but the surrounding code uses 4-space indenting, and so should you.

Thanks for catching that, fixed now.

but one thing I'll note is that the change must be back-compatible with the previous H5AD formats

I think the changes should be back-compatible. I did not delete the handling for the old categories format, but simply added code to handle factors encoded in the new format. Theoretically I think it could even handle an h5ad that contains a mix of factors encoded in both old & new formats (not that such a scenario would happen).

I could switch to an explicit if/else instead if preferable, perhaps by checking if the __categories key exists, or maybe there's a better way to determine which format the AnnData is in?

EDIT: I went ahead and updated the code to wrap it in an if/else that checks if the __categories key exists.

lazappi commented 1 year ago

Hi @jackkamm. Thanks for the contribution! The R reader is still a work in progress but I know some people have tried using it so this will be appreciated. I just have a couple of points:

jackkamm commented 1 year ago

There is now a written spec for the anndata 0.8 H5AD format https://anndata.readthedocs.io/en/latest/fileformat-prose.html. If you could check that everything is consistent that would be great.

Thanks for the reference -- looks like more work needs to be done to support nullable booleans and ints as per the spec. I will look into it.

It would be great to have at least one test for this (either a normal test or a long test). The simplest thing might be to write a file with the Python 0.8 writer and see if it can be read with the R reader (making sure it has the right kind of columns).

Sounds good, I've added a test, still a work-in-progress since it needs to check the nullable booleans mentioned above.

Do you think this solves https://github.com/theislab/zellkonverter/issues/78 as well or is that a slightly different issue?

I think it's basically the same issue, I was getting the same error message, and the PR fixed it for me (as in I was able to read in the h5ad). But the original PR only fixed a subset of the problem (factors/categories), a little more work needs to be done to make sure other types are also converted correctly.

lazappi commented 1 year ago

šŸ‘šŸ» It sounds like you are still working on some things which is great, just ping me when you are ready for a review

jackkamm commented 1 year ago

I've added handling for nullable ints and bools, so I believe the implementation satisfies the v0.8 spec now [1], and is ready for a review.

For the test, I found that writeH5AD() wasn't quite consistent with the spec [2], therefore I manually created a separate AnnData v0.8 object in Python [3], which I saved to the git repo.

Footnotes:

[1] The spec also mentions string handling, but I didn't test it because AnnData.write always converts strings to factors: https://github.com/scverse/anndata/blob/8e793af01a77d0e31e91a72f3988df7d6de9cdc5/anndata/_io/h5ad.py#L58

[2] writeH5AD seems to convert NA_integer_ to -2^31, instead of using a mask as in the spec.

[3] Here is the Python code I used to create krumsiek11_augmented_v0-8.h5ad: https://gist.github.com/jackkamm/3b606d15d83063ed8e5f03ae1c7ab928

lazappi commented 1 year ago

I've added handling for nullable ints and bools, so I believe the implementation satisfies the v0.8 spec now [1], and is ready for a review.

For the test, I found that writeH5AD() wasn't quite consistent with the spec [2], therefore I manually created a separate AnnData v0.8 object in Python [3], which I saved to the git repo.

I think we need to look into what is going on with the Python reader, I'm not sure about the environment business... That can probably be a separate PR though, but if you want to open an issue about it that would be helpful.

Footnotes:

[1] The spec also mentions string handling, but I didn't test it because AnnData.write always converts strings to factors: scverse/anndata@8e793af/anndata/_io/h5ad.py#L58

I don't think it always happens, but yes, most of the time it does.

[2] writeH5AD seems to convert NA_integer_ to -2^31, instead of using a mask as in the spec.

Another thing to check...

[3] Here is the Python code I used to create krumsiek11_augmented_v0-8.h5ad: gist.github.com/jackkamm/3b606d15d83063ed8e5f03ae1c7ab928

This should be added to inst/scripts as a description of the dataset. If you could add a comment with the important package versions that would be great as well.

jackkamm commented 1 year ago

Thanks for the helpful reviews :) I might be a little slow getting to it due to some other deadlines right now, but hopefully should respond & have this done later next week

jackkamm commented 1 year ago

I've revised this PR now based on the feedback. I also squashed all the commits and force pushed.

Biggest changes are:

Also in regard to this:

I don't think it [string to factor conversion] always happens, but yes, most of the time it does.

Seems like the conversion doesn't happen when the string values are all unique. I added such a column of unique strings to the test data. rhdf5 seems able to read it without any issue.

jackkamm commented 1 year ago

Another minor comment I forgot to add:

[2] writeH5AD seems to convert NAinteger to -2^31, instead of using a mask as in the spec. Another thing to check...

This might be a relatively minor issue. When I tested it before, rhdf5 and h5py both converted the -2^31 to NA/nan when reading it into R/python, so users may not notice the issue in practice.

jackkamm commented 1 year ago

Thinking about it some more, I may have been confused about the "version" argument, and maybe it was a mistake to try to use it at all.

After all the python AnnData package is able to read in older AnnData just fine without needing to specify the version. So we should too.

As it stands now, the PR won't properly read AnnData v0.7 when version=NULL, because the old-style conversion for categories only happens when we explicitly specify a version < 0.8.

I'm pretty sure if I remove the compareVersion calls, the PR should be compatible with both old and new AnnData versions automatically.

Let me know if you want me to revert the explicit version handling and I'll update the PR accordingly.

lazappi commented 1 year ago

So, in the Python reader the version is there to control which environment is used (and therefore which AnnData file version is read/written). In the R reader we aren't messing around with environments so it maybe isn't needed. It just depends whether it's easier to detect what file version has been used or ask the user to specify it.

If we were looking at a writer it would be a bit different because in that case we would want to give the user control over which file version is written.

jackkamm commented 1 year ago

I made 1 more commit, so that the version isn't passed into native R reader anymore. Instead, native R reader just tries to figure out the right thing to do based on the keywords/attributes it sees, like in the original version of this PR.

I think it's probably better this way, so the user doesn't need to explicitly specify the version.

But I can see the argument the other way also. So I leave this last change as a separate unsquashed commit -- feel free to revert it if you prefer the previous approach that explicitly passes in the version.

lazappi commented 1 year ago

Thanks for all your work on this! I suspect it might need some tweaks in the future but I have merged what we have so far.