pinellolab / simba

SIMBA: SIngle-cell eMBedding Along with features
https://simba-bio.readthedocs.io
BSD 3-Clause "New" or "Revised" License
54 stars 7 forks source link

Problem generating the graph #16

Closed sinkahak closed 1 year ago

sinkahak commented 1 year ago

Hi,

thanks for developing this interesting tool!

I am having problems generating the peak-to-motif graph. The cell~peak graph generates just fine but I get the following info when adding list_PM:

si.tl.gen_graph(list_CP=[atac],
                list_PM=[motif],
                copy=False,
                use_top_pcs =True,
                dirname='graph0')

relation0: source: C, destination: P
#edges: 106854991
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[26], line 1
----> 1 si.tl.gen_graph(list_CP=[atac],
      2                 list_PM=[motif],
      3                 copy=False,
      4                 use_top_pcs =True,
      5                 dirname='graph0')

File ~/.conda/envs/env_simba/lib/python3.10/site-packages/simba/tools/_pbg.py:546, in gen_graph(list_CP, list_PM, list_PK, list_CG, list_CC, list_adata, prefix_C, prefix_P, prefix_M, prefix_K, prefix_G, prefix, layer, copy, dirname, add_edge_weights, use_highly_variable, use_top_pcs, use_top_pcs_CP, use_top_pcs_PM, use_top_pcs_PK)
    543 df_edges_x['source'] = df_peaks.loc[
    544     adata.obs_names[_row], 'alias'].values
    545 df_edges_x['relation'] = f'r{id_r}'
--> 546 df_edges_x['destination'] = df_motifs.loc[
    547     adata.var_names[_col], 'alias'].values
    548 if add_edge_weights:
    549     df_edges_x['weight'] = \
    550         arr_simba[_row, _col].A.flatten()

File ~/.conda/envs/env_simba/lib/python3.10/site-packages/pandas/core/frame.py:3950, in DataFrame.__setitem__(self, key, value)
   3947     self._setitem_array([key], value)
   3948 else:
   3949     # set column
-> 3950     self._set_item(key, value)

File ~/.conda/envs/env_simba/lib/python3.10/site-packages/pandas/core/frame.py:4143, in DataFrame._set_item(self, key, value)
   4133 def _set_item(self, key, value) -> None:
   4134     """
   4135     Add series to DataFrame in specified column.
   4136 
   (...)
   4141     ensure homogeneity.
   4142     """
-> 4143     value = self._sanitize_column(value)
   4145     if (
   4146         key in self.columns
   4147         and value.ndim == 1
   4148         and not is_extension_array_dtype(value)
   4149     ):
   4150         # broadcast across multiple columns if necessary
   4151         if not self.columns.is_unique or isinstance(self.columns, MultiIndex):

File ~/.conda/envs/env_simba/lib/python3.10/site-packages/pandas/core/frame.py:4870, in DataFrame._sanitize_column(self, value)
   4867     return _reindex_for_setitem(Series(value), self.index)
   4869 if is_list_like(value):
-> 4870     com.require_length_match(value, self.index)
   4871 return sanitize_array(value, self.index, copy=True, allow_2d=True)

File ~/.conda/envs/env_simba/lib/python3.10/site-packages/pandas/core/common.py:576, in require_length_match(data, index)
    572 """
    573 Check the length of data matches the length of the index.
    574 """
    575 if len(data) != len(index):
--> 576     raise ValueError(
    577         "Length of values "
    578         f"({len(data)}) "
    579         "does not match length of index "
    580         f"({len(index)})"
    581     )

ValueError: Length of values (14094436) does not match length of index (12259839)

My objects look like this prior generating the graph:

motif:
AnnData object with n_obs × n_vars = 179364 × 692
    obs: 'chr', 'start', 'end'
    var: 'top_pcs'
    uns: 'pca'
    obsm: 'X_pca'
    layers: 'simba'

atac:
AnnData object with n_obs × n_vars = 14131 × 246434
    obs: 'nCount_peaks', 'nFeature_peaks', 'dataset', 'fragments', 'orig.ident', 'FRiP', 'FRiP_density', 'blacklist_fraction', 'blacklist_density', 'nucleosome_signal', 'nucleosome_percentile', 'nucleosome_density', 'TSS.enrichment', 'TSS.percentile', 'TSS_density', 'high.tss', 'nucleosome_group', 'peaks_snn_res.0.5', 'seurat_clusters', 'pbg_id'
    var: 'chr', 'start', 'end', 'top_pcs', 'pbg_id'
    uns: 'pca'
    obsm: 'X_pca'
    layers: 'simba'

And session info:

-----
anndata             0.9.1
matplotlib_inline   0.1.6
numpy               1.24.4
pandas              2.0.3
scipy               1.11.1
session_info        1.0.0
simba               1.2
-----
Click to view modules imported as dependencies
-----
IPython             8.14.0
jupyter_client      8.3.0
jupyter_core        5.3.1
jupyterlab          4.0.2
notebook            6.5.4
-----
Python 3.10.12 | packaged by conda-forge | (main, Jun 23 2023, 22:40:32) [GCC 12.3.0]
Linux-3.10.0-1160.83.1.el7.x86_64-x86_64-with-glibc2.17
-----

I just can't wrap my brain around what could be the issue here. I tried to limit my atac object with the top PCs (14131 × 179364) but didn't get rid of the error. Do you know what could be causing this? Any help would be very much appreciated.

Best, Sini

huidongchen commented 1 year ago

Thank you for trying SIMBA! Would you mind sharing with me two subsets of these two anndata objects so that I can replicate the errors? This will make it easier for me to offer assistance.

sinkahak commented 1 year ago

Hi,

thank you for getting back to me and apologies for my delayed response!

I added you as a collaborator to my repository sc-GRN where you can find the subsets of atac and motifs anndata objects.

huidongchen commented 1 year ago

Thank you for providing the data. Unfortunately I am having trouble reading the two h5ad files. Below are the errors I encountered while attempting to read them:

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
~/anaconda3/envs/env_simba/lib/python3.9/site-packages/anndata/_io/utils.py in func_wrapper(elem, *args, **kwargs)
    176         try:
--> 177             return func(elem, *args, **kwargs)
    178         except Exception as e:

~/anaconda3/envs/env_simba/lib/python3.9/site-packages/anndata/_io/h5ad.py in read_group(group)
    526     if encoding_type:
--> 527         EncodingVersions[encoding_type].check(
    528             group.name, group.attrs["encoding-version"]

~/anaconda3/envs/env_simba/lib/python3.9/enum.py in __getitem__(cls, name)
    431     def __getitem__(cls, name):
--> 432         return cls._member_map_[name]
    433 

KeyError: 'dict'

During handling of the above exception, another exception occurred:

AnnDataReadError                          Traceback (most recent call last)
/tmp/ipykernel_49249/2258671287.py in <module>
----> 1 adata_PM = si.read_h5ad('./motifs-subset-pp-forSimba.h5ad')
      2 adata_PM

~/anaconda3/envs/env_simba/lib/python3.9/site-packages/anndata/_io/h5ad.py in read_h5ad(filename, backed, as_sparse, as_sparse_fmt, chunk_size)
    419                 d[k] = read_dataframe(f[k])
    420             else:  # Base case
--> 421                 d[k] = read_attribute(f[k])
    422 
    423         d["raw"] = _read_raw(f, as_sparse, rdasp)

~/anaconda3/envs/env_simba/lib/python3.9/functools.py in wrapper(*args, **kw)
    875                             '1 positional argument')
    876 
--> 877         return dispatch(args[0].__class__)(*args, **kw)
    878 
    879     funcname = getattr(func, '__name__', 'singledispatch function')

~/anaconda3/envs/env_simba/lib/python3.9/site-packages/anndata/_io/utils.py in func_wrapper(elem, *args, **kwargs)
    181             else:
    182                 parent = _get_parent(elem)
--> 183                 raise AnnDataReadError(
    184                     f"Above error raised while reading key {elem.name!r} of "
    185                     f"type {type(elem)} from {parent}."

AnnDataReadError: Above error raised while reading key '/layers' of type <class 'h5py._hl.group.Group'> from /.
sinkahak commented 1 year ago

Hi,

that's odd, it works fine for me. Could it be due to different versions of anndata, I am using v0.9.1 and ad.read_h5ad to read the files:

motif = ad.read_h5ad(workdir + "motifs-subset-pp-forSimba.h5ad")
atac = ad.read_h5ad(workdir + "atac-subset-pp-forSimba.h5ad")
sinkahak commented 1 year ago

I was able to solve this issue on my own after all. My motif.var_names were not unique, which was causing the error.