icbi-lab / infercnvpy

Infer copy number variation (CNV) from scRNA-seq data. Plays nicely with Scanpy.
https://infercnvpy.readthedocs.io/en/latest/
BSD 3-Clause "New" or "Revised" License
133 stars 27 forks source link

ValueError: need at least one array to concatenate #7

Open mibo1996 opened 3 years ago

mibo1996 commented 3 years ago

Hi,

I am trying to use infercnvpy on my single-cell dataset, but when I run: cnv.tl.infercnv( adata, reference_key="leiden_1.0", reference_cat='12', )

I get the error message: ValueError: need at least one array to concatenate.

When I run: adata.var.loc[:, ["gene_ids", "chromosome", "start", "end"]].head() I can see that these columns were successfully added to adata.var

When I run: adata.obs.head() I can see that there is a column, 'leiden_1.0' with clusters numbered '0'-'16'.

I'm not sure what this error message means or what I can do differently to fix it.

Thank you

suegrimes commented 3 years ago

I am also getting the same error, running:
infercnvpy.tl.infercnv(seu_epi, reference_key='seurat_clusters', reference_cat=['0','1'], window_size=200)

ValueError: need at least one array to concatenate

seu_epi AnnData object with n_obs × n_vars = 5136 × 16784 obs: 'orig.ident', 'nCount_RNA', 'nFeature_RNA', 'percent.mito', 'nCount_SCT', 'nFeature_SCT', 'S.Score', 'G2M.Score', 'Phase', 'CC.Difference', 'SCT_snn_res.0.8', 'seurat_clusters', 'tree.ident', 'condition', 'patientID', 'pANN', 'DF', 'SCT_snn_res.1', 'SCT_snn_res.0.6' var: 'sct.detection_rate', 'sct.gmean', 'sct.variance', 'sct.residual_mean', 'sct.residual_variance', 'sct.variable', 'chromosome', 'start', 'end', 'gene_id', 'gene_name' uns: 'neighbors' obsm: 'X_pca', 'X_umap' varm: 'PCs' obsp: 'distances'

seu_epi.X array([[-0.12743472, -0.12882212, -0.06847287, ..., -0.03828714, -0.01745232, -0.03993058], [-0.10143117, -0.10215754, -0.0520843 , ..., -0.0263906 , -0.00942162, -0.02850649], [-0.1182593 , -0.11941153, -0.06270597, ..., -0.03406237, -0.01457542, -0.03586897], ..., [-0.10151398, -0.10224243, -0.05213675, ..., -0.02642813, -0.00944661, -0.02854246], [-0.06104027, -0.06079169, -0.02616703, ..., -0.00829963, 0.00237056, -0.01121682], [-0.07975152, -0.07994411, -0.03827017, ..., -0.01663157, -0.00299579, -0.01916628]])

seu_epi.obs.loc[:, ['seurat_clusters','condition']].head() seurat_clusters condition AACTGGTCATTCACTT-1 17 tumor_appendix ACGCCGATCTGAAAGA-1 3 tumor_appendix ACTGATGAGCACCGTC-1 3 tumor_appendix AGCATACTCTACTCAT-1 10 tumor_appendix CAACCTCCACAACTGT-1 7 tumor_appendix

grst commented 3 years ago

Hi,

thanks for reporting, I'll look into it. Could you please provide the full stacktrace of the error (together with the error message, the line numbers in the code where the error occurs should show)? That would be extremely helpful for debugging!

suegrimes commented 3 years ago

Hi Gregor - thanks for looking into this! Stacktrace below:

/venvs/anaconda/envs/groupenvs/infercnv-py/lib/python3.8/site-packages/anndata/c                                            ompat/__init__.py:180: FutureWarning: Moving element from .uns['neighbors']['dis                                                     tances'] to .obsp['distances'].

This is where adjacency matrices should go now.
  warn(
INFO:root:Extracted GTF attributes: ['gene_id', 'gene_name']
WARNING: GTF file misses annotation for 13 genes in adata.
WARNING: Skipped 13 genes because of duplicate identifiers in GTF file.
... storing 'orig.ident' as categorical
... storing 'Phase' as categorical
... storing 'seurat_clusters' as categorical
... storing 'condition' as categorical
... storing 'patientID' as categorical
... storing 'DF' as categorical
... storing 'chromosome' as categorical
... storing 'gene_id' as categorical
... storing 'gene_name' as categorical
WARNING: Skipped 26 genes because they don't have a genomic position annotated.
/venvs/anaconda/envs/groupenvs/infercnv-py/lib/python3.8/site-packages/numpy/lib                                                     /arraysetops.py:583: FutureWarning: elementwise comparison failed; returning sca                                                     lar instead, but in the future will perform elementwise comparison
  mask |= (ar1 == a)

Traceback (most recent call last):
  File "run_infercnv.py", line 18, in <module>
    infercnvpy.tl.infercnv(seu_epi, reference_key='seurat_clusters', reference_c                                                     at=[0,1], window_size=200)
  File "/venvs/anaconda/envs/groupenvs/infercnv-py/lib/python3.8/site-packages/i                                                     nfercnvpy/tl/_infercnv.py", line 164, in infercnv
    reference = _get_reference(tmp_adata, reference_key, reference_cat, referenc                                                     e)
  File "/venvs/anaconda/envs/groupenvs/infercnv-py/lib/python3.8/site-packages/i                                                     nfercnvpy/tl/_infercnv.py", line 305, in _get_reference
    raise ValueError(
ValueError: The following reference categories were not found in adata.obs[refer                                                     ence_key]: [0 1]
(/venvs/anaconda/envs/groupenvs/infercnv-py) sgrimes@ikura:C03_210423_inferCNV_p                                                     yTest$ python run_infercnv.py
/venvs/anaconda/envs/groupenvs/infercnv-py/lib/python3.8/site-packages/anndata/c                                                     ompat/__init__.py:180: FutureWarning: Moving element from .uns['neighbors']['dis                                                     tances'] to .obsp['distances'].

This is where adjacency matrices should go now.
  warn(
INFO:root:Extracted GTF attributes: ['gene_id', 'gene_name']
WARNING: GTF file misses annotation for 13 genes in adata.
WARNING: Skipped 13 genes because of duplicate identifiers in GTF file.
... storing 'orig.ident' as categorical
... storing 'Phase' as categorical
... storing 'seurat_clusters' as categorical
... storing 'condition' as categorical
... storing 'patientID' as categorical
... storing 'DF' as categorical
... storing 'chromosome' as categorical
... storing 'gene_id' as categorical
... storing 'gene_name' as categorical
WARNING: Skipped 26 genes because they don't have a genomic position annotated.
  0%|                                                                                                          | 0/2 [00:11<?, ?it/s]
concurrent.futures.process._RemoteTraceback:
"""
Traceback (most recent call last):
  File "/venvs/anaconda/envs/groupenvs/infercnv-py/lib/python3.8/concurrent/futures/process.py", line 239, in _process_worker
    r = call_item.fn(*call_item.args, **call_item.kwargs)
  File "/venvs/anaconda/envs/groupenvs/infercnv-py/lib/python3.8/concurrent/futures/process.py", line 198, in _process_chunk
    return [fn(*args) for args in chunk]
  File "/venvs/anaconda/envs/groupenvs/infercnv-py/lib/python3.8/concurrent/futures/process.py", line 198, in <listcomp>
    return [fn(*args) for args in chunk]
  File "/venvs/anaconda/envs/groupenvs/infercnv-py/lib/python3.8/site-packages/infercnvpy/tl/_infercnv.py", line 353, in _infercnv_chunk
    chr_pos, x_smoothed = _running_mean_by_chromosome(
  File "/venvs/anaconda/envs/groupenvs/infercnv-py/lib/python3.8/site-packages/infercnvpy/tl/_infercnv.py", line 273, in _running_mean_by_chromosome
    return chr_start_pos, np.hstack(running_means)
  File "<__array_function__ internals>", line 5, in hstack
  File "/venvs/anaconda/envs/groupenvs/infercnv-py/lib/python3.8/site-packages/numpy/core/shape_base.py", line 346, in hstack
    return _nx.concatenate(arrs, 1)
  File "<__array_function__ internals>", line 5, in concatenate
ValueError: need at least one array to concatenate
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "run_infercnv.py", line 18, in <module>
    infercnvpy.tl.infercnv(seu_epi, reference_key='seurat_clusters', reference_cat=['0','1'], window_size=200)
  File "/venvs/anaconda/envs/groupenvs/infercnv-py/lib/python3.8/site-packages/infercnvpy/tl/_infercnv.py", line 169, in infercnv
    *process_map(
  File "/venvs/anaconda/envs/groupenvs/infercnv-py/lib/python3.8/site-packages/tqdm/contrib/concurrent.py", line 130, in process_map
    return _executor_map(ProcessPoolExecutor, fn, *iterables, **tqdm_kwargs)
  File "/venvs/anaconda/envs/groupenvs/infercnv-py/lib/python3.8/site-packages/tqdm/contrib/concurrent.py", line 76, in _executor_map
    return list(tqdm_class(ex.map(fn, *iterables, **map_args), **kwargs))
  File "/venvs/anaconda/envs/groupenvs/infercnv-py/lib/python3.8/site-packages/tqdm/std.py", line 1178, in __iter__
    for obj in iterable:
  File "/venvs/anaconda/envs/groupenvs/infercnv-py/lib/python3.8/concurrent/futures/process.py", line 484, in _chain_from_iterable_of_lists
    for element in iterable:
  File "/venvs/anaconda/envs/groupenvs/infercnv-py/lib/python3.8/concurrent/futures/_base.py", line 611, in result_iterator
    yield fs.pop().result()
  File "/venvs/anaconda/envs/groupenvs/infercnv-py/lib/python3.8/concurrent/futures/_base.py", line 439, in result
    return self.__get_result()
  File "/venvs/anaconda/envs/groupenvs/infercnv-py/lib/python3.8/concurrent/futures/_base.py", line 388, in __get_result
    raise self._exception
ValueError: need at least one array to concatenate
suegrimes commented 3 years ago

In case it factors into the troubleshooting, my AnnData object started out as a Seurat 3.x object, updated to Seurat 4.0, then converted to hd5, then to AnnData.

grst commented 3 years ago

Can you please post the result of

adata.obs["chromosome"].unique()

right before running tl.infercnv?

suegrimes commented 3 years ago

I assume you mean adata.var["chromosome"].unique()? Results below. I see that there is a nan entry, probably due to the duplicate genes being removed. Is that the issue?

seu_epi.var["chromosome"].unique() array(['1', nan, '2', '3', '4', '5', '6', '7', 'X', '8', '9', '11', '10', '12', '13', '14', '15', '16', '17', '18', '20', '19', 'Y', '22', '21', 'MT', 'GL000194.1', 'GL000195.1', 'GL000219.1', 'KI270734.1', 'GL000218.1', 'KI270726.1', 'KI270711.1', 'KI270721.1'], dtype=object)

grst commented 3 years ago

Thanks! The problem is that I expected all chromosomes (as opposed to scaffolds) to start with chr.

https://github.com/icbi-lab/infercnvpy/blob/c5bd39b1cda67a16d2a448fa8b75fdaa76acdce0/infercnvpy/tl/_infercnv.py#L255-L257

What GTF file are you using? I only tried with GENCODE I think.

suegrimes commented 3 years ago

Ah, glad it's an easy explanation! I'm using the gtf file provided for cellranger by 10X genomics. I think it is ensembl. The first few lines below:

!genome-build GRCh38.p12

!genome-version GRCh38

!genome-date 2013-12

!genome-build-accession NCBI:GCA_000001405.27

!genebuild-last-updated 2018-01

1 havana gene 29554 31109 . + . gene_id "ENSG00000243485"; gene_version "5"; gene_name "MIR1302-2HG"; gene_source "havana"; gene_biotype "lincRNA"

1 havana transcript 29554 31097 . + . gene_id "ENSG00000243485"; gene_version "5"; transcript_id "ENST00000473358"; transcript_version "1"; gene_name "MIR1302-2HG"; gene_source "havana"; gene_biotype "lincRNA"; transcript_name "MIR1302-2HG-202"; transcript_source "havana"; transcript_biotype "lincRNA"; tag "basic"; transcript_support_level "5"

1 havana exon 29554 30039 . + . gene_id "ENSG00000243485"; gene_version "5"; transcript_id "ENST00000473358"; transcript_version "1"; exon_number "1"; gene_name "MIR1302-2HG"; gene_source "havana"; gene_biotype "lincRNA"; transcript_name "MIR1302-2HG-202"; transcript_source "havana"; transcript_biotype "lincRNA"; exon_id "ENSE00001947070"; exon_version "1"; tag "basic"; transcript_support_level "5"

1 havana exon 30564 30667 . + . gene_id "ENSG00000243485"; gene_version "5"; transcript_id "ENST00000473358"; transcript_version "1"; exon_number "2"; gene_name "MIR1302-2HG"; gene_source "havana"; gene_biotype "lincRNA"; transcript_name "MIR1302-2HG-202"; transcript_source "havana"; transcript_biotype "lincRNA"; exon_id "ENSE00001922571"; exon_version "1"; tag "basic"; transcript_support_level "5"

1 havana exon 30976 31097 . + . gene_id "ENSG00000243485"; gene_version "5"; transcript_id "ENST00000473358"; transcript_version "1"; exon_number "3"; gene_name "MIR1302-2HG"; gene_source "havana"; gene_biotype "lincRNA"; transcript_name "MIR1302-2HG-202"; transcript_source "havana"; transcript_biotype "lincRNA"; exon_id "ENSE00001827679"; exon_version "1"; tag "basic"; transcript_support_level "5"

1 havana transcript 30267 31109 . + . gene_id "ENSG00000243485"; gene_version "5"; transcript_id "ENST00000469289"; transcript_version "1"; gene_name "MIR1302-2HG"; gene_source "havana"; gene_biotype "lincRNA"; transcript_name "MIR1302-2HG-201"; transcript_source "havana"; transcript_biotype "lincRNA"; tag "basic"; transcript_support_level "5"

1 havana exon 30267 30667 . + . gene_id "ENSG00000243485"; gene_version "5"; transcript_id "ENST00000469289"; transcript_version "1"; exon_number "1"; gene_name "MIR1302-2HG"; gene_source "havana"; gene_biotype "lincRNA"; transcript_name "MIR1302-2HG-201"; transcript_source "havana"; transcript_biotype "lincRNA"; exon_id "ENSE00001841699"; exon_version "1"; tag "basic"; transcript_support_level "5"

grst commented 3 years ago

Ok, I'll look into improving the genomic_position_from_gtf function at some point that it supports other GTF files as well. Let's keep the issue open until then.

In the meanwhile, you could try a GENCODE file, or just modify the chromosome column in adata.var such that it has the chr prefix.

suegrimes commented 3 years ago

ok got it. I will try one of those two things. Thanks for being so responsive looking into this!

stefanpeidli commented 3 years ago

In the meantime a small and dirty trick: adata.var['chromosome'] = ['chr'+str(i) for i in adata.var['chromosome']]

Then it worked for me.