huidongchen / simba

SIMBA: SIngle-cell eMBedding Along with features
https://simba-bio.readthedocs.io
BSD 3-Clause "New" or "Revised" License

Error in executing si.tl.find_master_regulators: KeyError: 'SREBF1' #18

Open Tianran1998 opened 10 months ago

Tianran1998 commented 10 months ago

I am analyzing self-collected single-cell RNA-seq (scRNA-seq) and single-cell ATAC-seq (scATAC-seq) datasets, which were generated separately. I integrated the two datasets following the multimodal analysis workflow, which produced the objects adata_G, adata_M, adata_all, adata_cmp_CG, and adata_cmp_CM. I then ran the following code:


> import pandas as pd
> import simba as si
>
> # Pair each motif with every gene whose symbol appears in the '_'-split motif name
> motifs_genes = pd.DataFrame(columns=['motif', 'gene'])
> for x in adata_M.obs_names:
>     x_split = x.split('_')
>     for y in adata_G.obs_names:
>         if y in x_split:
>             motifs_genes.loc[motifs_genes.shape[0]] = [x, y]
>
> # Inspect motifs that matched more than one gene
> motifs_genes
> duplicates = motifs_genes['motif'].duplicated()
> motifs_genes[duplicates]
>
> print(motifs_genes.shape)
> motifs_genes.head()
>
> # Keep only the first gene match for each motif
> motifs_genes_no_duplicates = motifs_genes.drop_duplicates(subset=['motif'])
>
> list_tf_motif = motifs_genes_no_duplicates['motif'].tolist()
> list_tf_gene = motifs_genes_no_duplicates['gene'].tolist()
> 
> df_metrics_motif = adata_cmp_CM.var.copy()
> df_metrics_gene = adata_cmp_CG.var.copy()
> 
> df_metrics_motif.head()
> df_metrics_gene.head()
> 
> si.pl.entity_metrics(adata_cmp_CG,x='max',y='gini',
>                      show_texts=False,
>                      show_cutoff=True,
>                      show_contour=True,
>                      c='#607e95',
>                      cutoff_x=1.5,
>                      cutoff_y=0.35)
>
> len(list_tf_motif)
> len(list_tf_gene)
>
> df_MR = si.tl.find_master_regulators(adata_all,
>                                      list_tf_motif=list_tf_motif,
>                                      list_tf_gene=list_tf_gene,
>                                      cutoff_gene_max=1.5,
>                                      cutoff_gene_gini=0.35,
>                                      cutoff_motif_max=3,
>                                      cutoff_motif_gini=0.7,
>                                      metrics_gene=df_metrics_gene,
>                                      metrics_motif=df_metrics_motif
>                                     )
> 
> adata_all.obs
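
The motif-to-gene pairing at the top of the code above can also be sketched without the nested loop. This is a minimal sketch under stated assumptions, not SIMBA's own method: `pair_motifs_with_genes` is a hypothetical helper, the inputs are toy names, and matches within a motif come out in token order rather than in the order of adata_G.obs_names, so the row kept by drop_duplicates may differ from the loop version.

```python
import pandas as pd

def pair_motifs_with_genes(motif_names, gene_names):
    """Pair each motif with the genes named in its '_'-separated tokens."""
    genes = set(gene_names)  # set membership avoids rescanning every gene per motif
    rows = [(motif, token)
            for motif in motif_names
            for token in motif.split('_')
            if token in genes]
    return pd.DataFrame(rows, columns=['motif', 'gene'])

# Toy names for illustration only (hypothetical, not taken from the real objects)
toy_pairs = pair_motifs_with_genes(['MAX_MYC', 'PAX6', 'RORA(var.2)'],
                                   ['MYC', 'PAX6', 'MAX'])
print(toy_pairs)
```

In the session above, the arguments would be adata_M.obs_names and adata_G.obs_names.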

The following error occurred while running si.tl.find_master_regulators:


Traceback (most recent call last):
  File "/root/anaconda3/envs/env_simba/lib/python3.10/site-packages/pandas/core/indexes/base.py", line 3791, in get_loc
    return self._engine.get_loc(casted_key)
  File "index.pyx", line 152, in pandas._libs.index.IndexEngine.get_loc
  File "index.pyx", line 181, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/hashtable_class_helper.pxi", line 7080, in pandas._libs.hashtable.PyObjectHashTable.get_item
  File "pandas/_libs/hashtable_class_helper.pxi", line 7088, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: 'SREBF1'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/root/anaconda3/envs/env_simba/lib/python3.10/site-packages/IPython/core/interactiveshell.py", line 3553, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input-235-6901c5c76ad1>", line 1, in <module>
    df_MR = si.tl.find_master_regulators(adata_all,
  File "/root/anaconda3/envs/env_simba/lib/python3.10/site-packages/simba/tools/_post_training.py", line 618, in find_master_regulators
    df_MR.loc[i, 'rank'] = dist_MG.loc[x_motif, ].rank()[x_gene]
  File "/root/anaconda3/envs/env_simba/lib/python3.10/site-packages/pandas/core/series.py", line 1040, in __getitem__
    return self._get_value(key)
  File "/root/anaconda3/envs/env_simba/lib/python3.10/site-packages/pandas/core/series.py", line 1156, in _get_value
    loc = self.index.get_loc(label)
  File "/root/anaconda3/envs/env_simba/lib/python3.10/site-packages/pandas/core/indexes/base.py", line 3798, in get_loc
    raise KeyError(key) from err
KeyError: 'SREBF1'
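
The KeyError means that the label 'SREBF1' was looked up in a pandas index that does not contain it. A minimal sketch for listing such labels before calling the function follows. `missing_labels` is a hypothetical helper, not part of SIMBA, and which index the gene list has to match inside find_master_regulators is an assumption here, so the example uses toy data.

```python
import pandas as pd

def missing_labels(labels, index):
    """Return the labels that are absent from a pandas Index, in input order."""
    idx = pd.Index(index)
    return [label for label in labels if label not in idx]

# Toy stand-ins: in the real session these would be list_tf_gene and whichever
# index is being searched (assumed, for illustration, to hold gene names).
toy_genes = ['SREBF1', 'PAX6', 'USF1']
toy_index = pd.Index(['PAX6', 'USF1', 'YY2'])
absent = missing_labels(toy_genes, toy_index)
print(absent)
```

An empty result would indicate that every requested label is present in the index that was checked.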

Additionally, adata_PM.var_names looks strange: instead of motif identifiers, it contains what appears to be a list of gene names. I used the hg38 annotation when processing the scATAC-seq data, so I also used the hg38 reference genome in SIMBA. Could this have any impact?

adata_PM.var_names

Index([        b'FOXF2',         b'FOXD1',          b'IRF2',   b'MZF1(var.2)',
             b'MAX_MYC',         b'PPARG',          b'PAX6',          b'PBX1',
                b'RORA',   b'RORA(var.2)',
       ...
               b'TEAD1',         b'TEAD4',        b'TFAP2A', b'TFAP2C(var.2)',
              b'TWIST1',          b'USF1',          b'USF2',           b'YY2',
              b'ZNF263',          b'CREM'],
      dtype='object', length=633)
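
The names printed above carry a b'...' prefix, which means they are Python bytes objects rather than strings, and a bytes name never compares equal to the corresponding str. A minimal sketch of decoding them follows, assuming UTF-8 is the right encoding. `decode_names` is a hypothetical helper, and whether the bytes names are related to the KeyError is not established.

```python
import pandas as pd

def decode_names(names, encoding='utf-8'):
    """Decode bytes entries to str; entries that are already str are kept as-is."""
    return pd.Index([n.decode(encoding) if isinstance(n, bytes) else n
                     for n in names])

# Toy index mixing bytes and str entries (illustrative values only)
toy_names = pd.Index([b'FOXF2', b'MAX_MYC', 'PAX6'], dtype=object)
decoded = decode_names(toy_names)
print(list(decoded))
```

In the session above this could be applied as adata_PM.var_names = decode_names(adata_PM.var_names), assuming the object accepts the reassignment.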

In addition, the 'chr', 'start', and 'end' columns in adata_CP.var are derived by splitting the row names of the peak matrix output from Cell Ranger, as shown below.

chr_list, start_list, end_list = [], [], []

for var_name in adata_CP.var_names:
    parts = var_name.split('-')
    chr_list.append(parts[0])
    start_list.append(parts[1])
    end_list.append(parts[2])

len(adata_CP.var_names)
len(chr_list)
chr_df = pd.DataFrame({'chr': chr_list}, index=adata_CP.var_names)
adata_CP.var[['chr']] = chr_df

start_df = pd.DataFrame({'start': start_list}, index=adata_CP.var_names)
adata_CP.var[['start']] = start_df

end_df = pd.DataFrame({'end': end_list}, index=adata_CP.var_names)
adata_CP.var[['end']] = end_df
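
The same split can be sketched in vectorized form. This is a minimal sketch, assuming peak names of the form chr-start-end. Splitting from the right keeps contig names that themselves contain a hyphen intact, and the coordinates are converted to integers rather than left as strings. `split_peak_names` is a hypothetical helper and the peak names are toy values.

```python
import pandas as pd

def split_peak_names(peak_names):
    """Split 'chr-start-end' names into a DataFrame indexed by the peak name."""
    names = pd.Series(list(peak_names), index=list(peak_names), dtype=object)
    # Split at the last two hyphens only, so 'chrUn-x-5-9' yields chr='chrUn-x'
    parts = names.str.rsplit('-', n=2, expand=True)
    parts.columns = ['chr', 'start', 'end']
    return parts.astype({'start': int, 'end': int})

# Toy peak names for illustration only
coords = split_peak_names(['chr1-100-200', 'chrUn-x-5-9'])
print(coords)
```

In the session above, the three columns could then be assigned with adata_CP.var[['chr', 'start', 'end']] = split_peak_names(adata_CP.var_names), assuming the peak names are unique.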