Can't save h5mu from Scirpy processed gex+bcr+tcr data if I copy airr into obs

Ngort commented 1 year ago

Describe the bug

I can't save an h5mu file from Scirpy-processed gex+bcr+tcr data after copying the AIRR fields into obs (i.e. tdata.obs = tdata.obs.join(ir.get.airr(tdata, tdata.obsm['airr'].fields))). Unlike in #427, I am on 0.13 and still hit the bug.

TypeError: Can't implicitly convert non-string objects to strings
Above error raised while writing key 'VJ_1_germline_alignment' of <class 'h5py._hl.group.Group'> to /

(The same error is raised for many other columns, including all _call and _cigar columns.)
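A minimal sketch for listing which obs columns are likely affected, assuming the failing columns are object-dtype columns that contain missing values. The helper name `object_columns_with_missing` and the demo data frame are hypothetical and only for illustration:

```python
import pandas as pd


def object_columns_with_missing(obs: pd.DataFrame) -> list:
    """Return names of object-dtype columns that contain at least one missing value."""
    return [
        col
        for col in obs.columns
        if obs[col].dtype == object and obs[col].isna().any()
    ]


# Hypothetical stand-in for mdata['tcr'].obs after joining the AIRR fields.
obs_demo = pd.DataFrame(
    {
        "VJ_1_v_call": pd.Series(["TRAV1", None, "TRAV2"], dtype=object),
        "n_counts": [10, 20, 30],
        "sample": pd.Series(["a", "b", "c"], dtype=object),
    }
)

print(object_columns_with_missing(obs_demo))
```

On real data the same function could be called on `mdata['tcr'].obs` to see which columns would need converting before writing.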

To Reproduce

import muon as mu
import scirpy as ir

# adata, tdata and bdata are the already loaded GEX, TCR and BCR AnnData objects
mdata = mu.MuData({'gex': adata.copy(),
                   'tcr': tdata.copy(),
                   'bcr': bdata.copy()})

ir.tl.chain_qc(mdata['tcr'])
ir.pp.ir_dist(mdata['tcr'], metric="hamming", sequence='nt', n_jobs=4, cutoff=20, key_added='ir_dist_nt_hamming_global')
ir.pp.ir_dist(mdata['tcr'], metric="identity", sequence='nt', n_jobs=4)

# Clonotypes are defined on the original tdata here (mdata['tcr'] holds a copy)
ir.tl.define_clonotypes(tdata, key_added='clone_id',
                        n_jobs=4, dual_ir='all', receptor_arms='all',
                        within_group=['receptor_type', 'donor_id_global'])

# Copy all AIRR fields into obs; this is the step that makes writing fail
mdata['tcr'].obs = mdata['tcr'].obs.join(ir.get.airr(mdata['tcr'], mdata['tcr'].obsm['airr'].fields))

mdata.write(fname)

What else I've tried

Changing columns to categoricals

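import re

for mod in mdata.mod.keys():
    for col in mdata[mod].obs.columns:
        # Only matches columns such as VJ_1_v_call or VDJ_2_j_cigar
        if re.search(r'V(?:D)?J_\d_\w_(?:call|cigar)', col):
            mdata[mod].obs[col] = mdata[mod].obs[col].astype('category')
            print(mod, ':', col, sep='')

# Sync the MuData object once after all modalities have been modified
mdata.update()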

Expected behaviour

Save the file without problems

System

OS: Linux
Python version: 3.9.16
Versions of libraries involved: Muon 0.1.5, Scirpy 0.13.0, Scanpy 1.9.3


grst commented 1 year ago

The problem is that AnnData cannot deal with None values in obs. A minimal reprex is

import anndata
import pandas as pd
import numpy as np

adata = anndata.AnnData(X=None, obs=pd.DataFrame().assign(test=np.array([1, 2, None, 3])))
adata.write_h5ad("test.h5ad")

In principle, AnnData supports nullable Integers and Booleans, but not Strings (see https://github.com/scverse/anndata/issues/679, https://github.com/scverse/anndata/issues/504). However, nullable here means a pandas BooleanArray or IntegerArray, not an object dtype with Nones.
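To illustrate the distinction drawn above, here is a small pandas-only sketch contrasting an object-dtype column that holds None with the same values stored as a nullable integer array. The variable names are only for illustration:

```python
import numpy as np
import pandas as pd

# Object-dtype column holding None, as in the reprex above.
obj_col = pd.Series(np.array([1, 2, None, 3]))

# The same values stored as a pandas nullable integer array (IntegerArray).
nullable_col = pd.Series(pd.array([1, 2, None, 3], dtype="Int64"))

print(obj_col.dtype)       # object
print(nullable_col.dtype)  # Int64
```

Both columns report one missing value via `isna()`, but only the second carries a dtype that tells a writer how to encode it.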

As a workaround, the offending columns can be converted to a pandas array, e.g.

mdata['tcr'].obs["VJ_1_consensus_count"] = pd.array(mdata['tcr'].obs["VJ_1_consensus_count"].values)
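The same conversion could be applied to every object-dtype column at once. The sketch below is only an illustration: `convert_object_columns` and the demo data are hypothetical, and it relies on `pd.array` inferring a suitable extension dtype. For string columns the inferred dtype is a pandas string dtype, which, per the linked AnnData issues, may still not be writable depending on the anndata version.

```python
import pandas as pd


def convert_object_columns(obs: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of `obs` with every object-dtype column passed through pd.array."""
    out = obs.copy()
    for col in out.columns:
        if out[col].dtype == object:
            # pd.array infers a nullable extension dtype where it can,
            # e.g. Int64 for integers mixed with None.
            out[col] = pd.array(out[col].to_numpy())
    return out


# Hypothetical stand-in for mdata['tcr'].obs after joining the AIRR fields.
obs = pd.DataFrame(
    {
        "VJ_1_consensus_count": pd.Series([1, 2, None, 3], dtype=object),
        "VJ_1_v_call": pd.Series(["TRAV1", None, "TRAV2", "TRAV3"], dtype=object),
    }
)

converted = convert_object_columns(obs)
print(converted.dtypes)
```

Returning a copy keeps the original obs untouched, so the result can be inspected before assigning it back to `mdata['tcr'].obs`.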

We obviously need a better solution than this. I'll check if this should be solved on the AnnData side e.g. by an automatic conversion. Otherwise the scirpy.get.airr function could deal with that.

grst commented 3 months ago

Some progress on the anndata side: https://github.com/scverse/anndata/pull/1558

Still need to check if this can be closed now.

grst commented 3 weeks ago

> Still need to check if this can be closed now.

Unfortunately not.

Depends on https://github.com/scverse/anndata/issues/1068

scverse / scirpy

Can't save h5mu from Scirpy processed gex+bcr+tcr data if I copy airr into obs #434