theislab / scib

Benchmarking analysis of data integration tools
MIT License
294 stars 63 forks source link

runPP with large file fails at `integration_prepare` Rds file conversion. #276

Closed ilibarra closed 2 years ago

ilibarra commented 2 years ago

Running R methods through scIB with a large h5ad file fails at the earliest steps (~2 million cells). The error I get seems to be at the integration_prepare step, script runPP.py due to a parsing of the anndata object into the R environment.

[Sat Oct 23 14:07:48 2021]
Job 0: 
        Preparing adata
        wildcards: hvg=HVG.1K,prep=RDS,scaling=unscaled,scenario=scenario
        parameters: batch_donor_dataset 1000  -r -l conda run -n scIB-python python
        output: adata_pre.RDS

        conda run -n scIB-python python scripts/runPP.py -i input.h5ad -o adata_pre.RDS -b batch_donor_dataset         --hvgs 1000  -r -l

ERROR conda.cli.main_run:execute(32): Subprocess for 'conda run ['python', 'scripts/runPP.py', '-i', 'input.h5ad', '-o', 'adata_pre.RDS', '-b', 'batch_donor_dataset', '--hvgs', '1000', '-r', '-l']' command failed.  (See above for error)
~/envs/scIB-python/lib/python3.7/site-packages/rpy2/robjects/pandas2ri.py:14: FutureWarning: pandas.core.index is deprecated and will be removed in a future version.  The public classes are available in the top-level namespace.
  from pandas.core.index import Index as PandasIndex
Traceback (most recent call last):
  File "~/envs/scIB-python/lib/python3.7/site-packages/rpy2/rinterface_lib/sexp.py", line 368, in from_object
    mv = memoryview(obj)
TypeError: memoryview: a bytes-like object is required, not 'tuple'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "scripts/runPP.py", line 72, in <module>
    runPP(file, out, hvg, batch, rout, scale, seurat)
  File "scripts/runPP.py", line 43, in runPP
    scIB.preprocessing.saveSeurat(adata, outPath, batch, hvgs)
  File "~/envs/scIB-python/lib/python3.7/site-packages/scIB/preprocessing.py", line 488, in saveSeurat
    ro.globalenv['adata'] = adata
  File "~/envs/scIB-python/lib/python3.7/site-packages/rpy2/robjects/environments.py", line 32, in __setitem__
    robj = conversion.converter.py2rpy(value)
  File "~/envs/scIB-python/lib/python3.7/functools.py", line 840, in wrapper
    return dispatch(args[0].__class__)(*args, **kw)
  File "~/envs/scIB-python/lib/python3.7/site-packages/anndata2ri/py2r.py", line 55, in py2rpy_anndata
    x = {} if obj.X is None else dict(X=mat_converter.py2rpy(obj.X.T))
  File "~/envs/scIB-python/lib/python3.7/functools.py", line 840, in wrapper
    return dispatch(args[0].__class__)(*args, **kw)
  File "~/envs/scIB-python/lib/python3.7/site-packages/anndata2ri/scipy2ri/py2r.py", line 43, in wrapper
    return f(obj)
  File "~/envs/scIB-python/lib/python3.7/site-packages/anndata2ri/scipy2ri/py2r.py", line 54, in csc_to_rmat
    i=as_integer(csc.indices),
  File "~/envs/scIB-python/lib/python3.7/site-packages/rpy2/robjects/functions.py", line 192, in __call__
    .__call__(*args, **kwargs))
  File "~/envs/scIB-python/lib/python3.7/site-packages/rpy2/robjects/functions.py", line 113, in __call__
    new_args = [conversion.py2rpy(a) for a in args]
  File "~/envs/scIB-python/lib/python3.7/site-packages/rpy2/robjects/functions.py", line 113, in <listcomp>
    new_args = [conversion.py2rpy(a) for a in args]
  File "~/envs/scIB-python/lib/python3.7/functools.py", line 840, in wrapper
    return dispatch(args[0].__class__)(*args, **kw)
  File "~/envs/scIB-python/lib/python3.7/site-packages/rpy2/robjects/numpy2ri.py", line 87, in numpy2rpy
    res = _numpyarray_to_r(o, _kinds[o.dtype.kind])
  File "~/envs/scIB-python/lib/python3.7/site-packages/rpy2/robjects/numpy2ri.py", line 60, in _numpyarray_to_r
    dim = ro.vectors.IntVector(a.shape)
  File "~/envs/scIB-python/lib/python3.7/site-packages/rpy2/robjects/vectors.py", line 413, in __init__
    super().__init__(obj)
  File "~/envs/scIB-python/lib/python3.7/site-packages/rpy2/rinterface_lib/sexp.py", line 288, in __init__
    super().__init__(type(self).from_object(obj).__sexp__)
  File "~/envs/scIB-python/lib/python3.7/site-packages/rpy2/rinterface_lib/sexp.py", line 372, in from_object
    res = cls.from_iterable(obj)
  File "~/envs/scIB-python/lib/python3.7/site-packages/rpy2/rinterface_lib/conversion.py", line 28, in _
    cdata = function(*args, **kwargs)
  File "~/envs/scIB-python/lib/python3.7/site-packages/rpy2/rinterface_lib/sexp.py", line 319, in from_iterable
    r_vector)
  File "~/envs/scIB-python/lib/python3.7/site-packages/rpy2/rinterface_lib/sexp.py", line 302, in _populate_r_vector
    cls._CAST_IN)
  File "~/envs/scIB-python/lib/python3.7/site-packages/rpy2/rinterface_lib/sexp.py", line 239, in _populate_r_vector
    set_elt(r_vector, i, cast_value(v))
**OverflowError: integer 3068178618 does not fit 'int'**

From OverflowError: integer 3068178618 does not fit 'int' I suspect this could be an overflow R error that stops the conversion.

The current AnnData object is n_obs × n_vars = 1907304 × 14260. adata.X is a sparse matrix of float32. The number 3068178618 is the shape of the non-zero entries in the sparse adata.X. I removed the adata.layers['counts'] to attempt saving space but still no success. I think this is just a borderline R session memory error.

Any hints on whether this can be solved by increasing memory requirements for R, or bypassed by running externally? I am using the version from Apr 14, so it was maybe handled in recent versions. As this could be a borderline case, maybe wondering if one can solve this by only saving HVGs and only scIB pipeline variants with HVGs.

commit 284244693087c53720bbc53a3b634d636c2d040b (HEAD -> master, origin/master, origin/HEAD)
Author: LuckyMD <m.d.luecken@gmail.com>
Date:   Wed Apr 14 12:30:42 2021 +0200

Thanks for any input on this topic.

lazappi commented 2 years ago

The theoretical limit for the number of elements in a normal R matrix is 2147483647 so I'm guessing that might be the issue. There are probably ways around it with sparse matrices or disk-backed stuff but your HVG idea is probably the quickest/easiest.

ilibarra commented 2 years ago

Thanks for the input! I am now keeping only HVGs by subseting them before calling the saveSeurat function.

Now, while trying to save I am getting the following error.

invalid class “dgCMatrix” object: row indices are not sorted within columns

I think this time it is related to sorting in the indices of the anndata matrices object. I have verified that before conversion following the tip in https://github.com/theislab/anndata2ri/pull/52. Currently the conversion error is the same.

adata = sc.read(inPath)
adata.X.has_sorted_indices
adata.X.sort_indices()
adata.X.has_sorted_indices

it prints

0 # sorting not done
True # sorting done.

Before calling the saveSeurat function.

    adata_hvg = adata[:,adata.var.index.isin(hvgs)]
    print('has sorted indexes?', adata_hvg.X.has_sorted_indices)
    scIB.preprocessing.saveSeurat(adata_hvg, outPath, batch, hvgs)

it prints

1

Type of adata.X is <class 'anndata._core.views.SparseCSRView'>. Mentioning this in case this could be related to the conversion problem.

---------------------------------------------------------------------------
RRuntimeError                             Traceback (most recent call last)
<ipython-input-111-605d72dcdc60> in <module>
     26     print('has raw', adata_hvg.raw)
     27 
---> 28     scIB.preprocessing.saveSeurat(adata_hvg, outPath, batch, hvgs)
     29     # scIB.preprocessing.saveSeurat(adata, outPath, batch, hvgs)
     30 

~/miniconda3/envs/scIB-python/lib/python3.7/site-packages/scIB/preprocessing.py in saveSeurat(adata, path, batch, hvgs)
    486                 adata.layers[key].sort_indices()
    487 
--> 488     ro.globalenv['adata'] = adata
    489 
    490     ro.r('sobj = as.Seurat(adata, counts="counts", data = "X")')

~/miniconda3/envs/scIB-python/lib/python3.7/site-packages/rpy2/robjects/environments.py in __setitem__(self, item, value)
     30 
     31     def __setitem__(self, item, value) -> None:
---> 32         robj = conversion.converter.py2rpy(value)
     33         super(Environment, self).__setitem__(item, robj)
     34 

~/miniconda3/envs/scIB-python/lib/python3.7/functools.py in wrapper(*args, **kw)
    838                             '1 positional argument')
    839 
--> 840         return dispatch(args[0].__class__)(*args, **kw)
    841 
    842     funcname = getattr(func, '__name__', 'singledispatch function')

~/miniconda3/envs/scIB-python/lib/python3.7/site-packages/anndata2ri/py2r.py in py2rpy_anndata(obj)
     54         # TODO: sparse
     55         x = {} if obj.X is None else dict(X=mat_converter.py2rpy(obj.X.T))
---> 56         layers = {k: mat_converter.py2rpy(v.T) for k, v in obj.layers.items()}
     57         assays = ListVector({**x, **layers})
     58 

~/miniconda3/envs/scIB-python/lib/python3.7/site-packages/anndata2ri/py2r.py in <dictcomp>(.0)
     54         # TODO: sparse
     55         x = {} if obj.X is None else dict(X=mat_converter.py2rpy(obj.X.T))
---> 56         layers = {k: mat_converter.py2rpy(v.T) for k, v in obj.layers.items()}
     57         assays = ListVector({**x, **layers})
     58 

~/miniconda3/envs/scIB-python/lib/python3.7/functools.py in wrapper(*args, **kw)
    838                             '1 positional argument')
    839 
--> 840         return dispatch(args[0].__class__)(*args, **kw)
    841 
    842     funcname = getattr(func, '__name__', 'singledispatch function')

~/miniconda3/envs/scIB-python/lib/python3.7/site-packages/anndata2ri/scipy2ri/py2r.py in wrapper(obj)
     41 
     42         with localconverter(default_converter + numpy2ri.converter):
---> 43             return f(obj)
     44 
     45     return wrapper

~/miniconda3/envs/scIB-python/lib/python3.7/site-packages/anndata2ri/scipy2ri/py2r.py in csc_to_rmat(csc)
     55         p=as_integer(csc.indptr),
     56         x=conv_data(csc.data),
---> 57         Dim=as_integer(list(csc.shape)),
     58     )
     59 

~/miniconda3/envs/scIB-python/lib/python3.7/site-packages/rpy2/robjects/functions.py in __call__(self, *args, **kwargs)
    190                 kwargs[r_k] = v
    191         return (super(SignatureTranslatedFunction, self)
--> 192                 .__call__(*args, **kwargs))
    193 
    194 

~/miniconda3/envs/scIB-python/lib/python3.7/site-packages/rpy2/robjects/functions.py in __call__(self, *args, **kwargs)
    119             else:
    120                 new_kwargs[k] = conversion.py2rpy(v)
--> 121         res = super(Function, self).__call__(*new_args, **new_kwargs)
    122         res = conversion.rpy2py(res)
    123         return res

~/miniconda3/envs/scIB-python/lib/python3.7/site-packages/rpy2/rinterface_lib/conversion.py in _(*args, **kwargs)
     26 def _cdata_res_to_rinterface(function):
     27     def _(*args, **kwargs):
---> 28         cdata = function(*args, **kwargs)
     29         # TODO: test cdata is of the expected CType
     30         return _cdata_to_rinterface(cdata)

~/miniconda3/envs/scIB-python/lib/python3.7/site-packages/rpy2/rinterface.py in __call__(self, *args, **kwargs)
    783                     error_occured))
    784             if error_occured[0]:
--> 785                 raise embedded.RRuntimeError(_rinterface._geterrmessage())
    786         return res
    787 

RRuntimeError: Error in validObject(.Object) : 
  invalid class “dgCMatrix” object: row indices are not sorted within columns
Calls: <Anonymous> ... initialize -> callNextMethod -> .nextMethod -> validObject

Happy to clarify more if required. I hope description useful.

LuckyMD commented 2 years ago

Hey! Sorting indices is already part of the saveSeurat function: https://github.com/theislab/scib/blob/985d8155391fdfbddec024de428308b5a57ee280/scib/preprocessing.py#L485-L492

Maybe the issue is that your adata is a view? Could you run an adata = adata.copy() before converting?

ilibarra commented 2 years ago

Based on last comment, .copy() seems to be the solution, and not sorting indices.

The following snippet makes the saving of R with an fixed number of hvgs possible

adata_hvg = adata[:,adata.var.index.isin(hvgs)] # hvgs are defined in the early steps of the runPP.py script.
adata_hvg = adata_hvg.copy() # update `adata` in case a view.
scIB.preprocessing.saveSeurat(adata_hvg, outPath, batch, hvgs)

Thanks!

LuckyMD commented 2 years ago

Great! Is this also an issue in the scib-pipeline scripts, or is this your own code? If it's in the pipeline scripts, do you want to make a PR?