theislab / scvelo

RNA Velocity generalized through dynamical modeling
https://scvelo.org
BSD 3-Clause "New" or "Revised" License

Corrupted neighbors graph using spatial data on scvelo #632

Closed shank117 closed 3 years ago

shank117 commented 3 years ago

Hi, I downloaded the BAM file from the spaceranger pipeline here: https://support.10xgenomics.com/spatial-gene-expression/datasets/1.3.0/Visium_FFPE_Human_Breast_Cancer? (Human Breast Cancer, 1.3 genome-aligned BAM), used velocyto to create the .loom file, and then followed the scvelo example workflow (the endocrine pancreas tutorial). When I ran my dataset through scvelo, I received a corrupted neighbor graph error. I have checked that my .loom file is not corrupted, since I can still access portions of it, but the software cannot do anything with it.

Python:

import scvelo as scv
scv.logging.print_version()

scv.settings.verbosity = 3  # show errors(0), warnings(1), info(2), hints(3)
scv.settings.presenter_view = True  # set max width size for presenter view
scv.settings.set_figure_params('scvelo')  # for beautified visualization

#reading in my data
adata = scv.read('/gpfs0/home1/gddaslab/rssxm007/yard/run_spaceranger_count/velocyto/Visium_FFPE_Human_Breast_Cancer_possorted_genome_bam_5BKOT.loom', cache=True)

scv.utils.show_proportions(adata)
adata

scv.pp.filter_and_normalize(adata, min_shared_counts=20, n_top_genes=2000)
#scv.pp.moments(adata, n_pcs=30, n_neighbors=30)
scv.pp.neighbors(adata, n_pcs=30, n_neighbors=30)
# error in the line above 
scv.tl.velocity(adata)

scv.tl.velocity_graph(adata)

scv.pl.velocity_embedding_stream(adata, basis='pca')

scv.tl.recover_dynamics(adata)

scv.tl.velocity(adata, mode='dynamical')
scv.tl.velocity_graph(adata)

scv.tl.latent_time(adata)
scv.pl.scatter(adata, color='latent_time', color_map='gnuplot', size=80, colorbar=True)

top_genes = adata.var['fit_likelihood'].sort_values(ascending=False).index[:300]
scv.pl.heatmap(adata, var_names=top_genes, tkey='latent_time', n_convolve=100, col_color='Clusters')

scv.pl.scatter(adata, basis=top_genes[:10], frameon=False, ncols=5)

scv.pl.velocity_embedding_stream(adata, basis='pca', title='', smooth=.8, min_mass=4)

Output:

Running scvelo 0.2.3 (python 3.8.8) on 2021-08-13 12:44.
ERROR: XMLRPC request failed [code: -32500]
RuntimeError: PyPI's XMLRPC API is currently disabled due to unmanageable load and will be deprecated in the near future. See https://status.python.org/ for more information.
Variable names are not unique. To make them unique, call `.var_names_make_unique`.
Variable names are not unique. To make them unique, call `.var_names_make_unique`.
Variable names are not unique. To make them unique, call `.var_names_make_unique`.
Abundance of ['spliced', 'unspliced']: [0.33 0.67]
Variable names are not unique. To make them unique, call `.var_names_make_unique`.
Variable names are not unique. To make them unique, call `.var_names_make_unique`.
Filtered out 33521 genes that are detected 20 counts (shared).
Normalized count data: X, spliced, unspliced.
Skip filtering by dispersion since number of variables are less than `n_top_genes`.
Logarithmized X.
computing neighbors
    finished (0:00:03) --> added 
    'distances' and 'connectivities', weighted adjacency matrices (adata.obsp)
WARNING: The neighbor graph has an unexpected format (e.g. computed outside scvelo) 
or is corrupted (e.g. due to subsetting). Consider recomputing with `pp.neighbors`.
computing moments based on connectivities
    finished (0:00:00) --> added 
    'Ms' and 'Mu', moments of un/spliced abundances (adata.layers)
computing velocities
WARNING: You seem to have very low signal in splicing dynamics.
The correlation threshold has been reduced to -0.0556.
Please be cautious when interpreting results.
WARNING: Too few genes are selected as velocity genes. Consider setting a lower threshold for min_r2 or min_likelihood.
    finished (0:00:00) --> added 
    'velocity', velocity vectors for each individual cell (adata.layers)
WARNING: The neighbor graph has an unexpected format (e.g. computed outside scvelo) 
or is corrupted (e.g. due to subsetting). Consider recomputing with `pp.neighbors`.
Traceback (most recent call last):

  File "/home/gddaslab/rssxm007/yard/run_spaceranger_count/velocyto/spatialdata.py", line 27, in <module>
    scv.tl.velocity_graph(adata)

  File "/home/gddaslab/rssxm007/anaconda3/envs/velocyto/lib/python3.8/site-packages/scvelo/tools/velocity_graph.py", line 294, in velocity_graph
    vgraph = VelocityGraph(

  File "/home/gddaslab/rssxm007/anaconda3/envs/velocyto/lib/python3.8/site-packages/scvelo/tools/velocity_graph.py", line 102, in __init__
    raise ValueError(

ValueError: Your neighbor graph seems to be corrupted. Consider recomputing via pp.neighbors.

Versions:

scvelo 0.2.3 python 3.8.8

WeilerP commented 3 years ago

@shank117 please provide the output of scv.logging.print_versions() as asked for in the issue template.

giovp commented 3 years ago

Hi @shank117, jumping in on this as I noticed you are trying to run scvelo on FFPE tissue. It seems like the proportion of spliced/unspliced is off. Out of curiosity, did you try running this analysis on a standard (non-FFPE) Visium dataset? Thank you!

WeilerP commented 3 years ago

@shank117, also some input on your pipeline (a minimal sketch combining these points follows the list):

  1. You should call adata.var_names_make_unique as suggested by the output.
  2. You seem to filter out a lot of genes based on counts. You may want to reduce min_shared_counts.
  3. Would you mind checking that observations are not duplicated after the filtering step?
  4. You should impute your data, i.e. run scv.pp.moments.
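
A minimal sketch putting these four points together (the min_shared_counts value is only illustrative, and the loom path is the one from the original post):

import scvelo as scv

adata = scv.read('Visium_FFPE_Human_Breast_Cancer_possorted_genome_bam_5BKOT.loom', cache=True)
adata.var_names_make_unique()                                                # point 1: make gene names unique

scv.pp.filter_and_normalize(adata, min_shared_counts=10, n_top_genes=2000)   # point 2: a less aggressive count filter
print(adata.to_df().duplicated().sum())                                      # point 3: count duplicate observations after filtering
scv.pp.moments(adata, n_pcs=30, n_neighbors=30)                              # point 4: impute; neighbors are computed here if not already present
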
shank117 commented 3 years ago

@WeilerP Here is the output:

Running scvelo 0.2.3 (python 3.8.8) on 2021-08-17 11:25.
Numba: Attempted to fork from a non-main thread, the TBB library may be in an invalid state in the child process.
ERROR: XMLRPC request failed [code: -32500]
RuntimeError: PyPI's XMLRPC API is currently disabled due to unmanageable load and will be deprecated in the near future. See https://status.python.org/ for more information.

shank117 commented 3 years ago

@giovp

Hi @shank117, jumping in on this as I noticed you are trying to run scvelo on FFPE tissue. It seems like the proportion of spliced/unspliced is off. Out of curiosity, did you try running this analysis on a standard (non-FFPE) Visium dataset? Thank you!

I have not used non-FFPE datasets. I am sorry, I am new to all of this; could you tell me what the difference between FFPE and non-FFPE datasets is? I am using Visium datasets for all my analyses, and scvelo has worked for non-spatial datasets like: https://support.10xgenomics.com/single-cell-gene-expression/datasets/3.0.0/pbmc_1k_v3?

shank117 commented 3 years ago

@WeilerP

I have run this now:

scv.pp.filter_and_normalize(adata, min_shared_counts=5, n_top_genes=1000)
scv.pp.moments(adata, n_pcs=20, n_neighbors=20)
scv.pp.neighbors(adata, n_pcs=20, n_neighbors=20)

The rest is the same as above.

The output is:

Running scvelo 0.2.3 (python 3.8.8) on 2021-08-17 11:33.
Numba: Attempted to fork from a non-main thread, the TBB library may be in an invalid state in the child process.
ERROR: XMLRPC request failed [code: -32500]
RuntimeError: PyPI's XMLRPC API is currently disabled due to unmanageable load and will be deprecated in the near future. See https://status.python.org/ for more information.
Variable names are not unique. To make them unique, call `.var_names_make_unique`.
Variable names are not unique. To make them unique, call `.var_names_make_unique`.
Variable names are not unique. To make them unique, call `.var_names_make_unique`.
Variable names are not unique. To make them unique, call `.var_names_make_unique`.
Variable names are not unique. To make them unique, call `.var_names_make_unique`.
Abundance of ['spliced', 'unspliced']: [0.33 0.67]
Filtered out 33513 genes that are detected 5 counts (shared).
Normalized count data: X, spliced, unspliced.
Skip filtering by dispersion since number of variables are less than `n_top_genes`.
Logarithmized X.
computing neighbors
    finished (0:00:01) --> added 
    'distances' and 'connectivities', weighted adjacency matrices (adata.obsp)
WARNING: The neighbor graph has an unexpected format (e.g. computed outside scvelo) 
or is corrupted (e.g. due to subsetting). Consider recomputing with `pp.neighbors`.
computing moments based on connectivities
    finished (0:00:00) --> added 
    'Ms' and 'Mu', moments of un/spliced abundances (adata.layers)
computing neighbors
    finished (0:00:01) --> added 
    'distances' and 'connectivities', weighted adjacency matrices (adata.obsp)
computing velocities
WARNING: Too few genes are selected as velocity genes. Consider setting a lower threshold for min_r2 or min_likelihood.
    finished (0:00:00) --> added 
    'velocity', velocity vectors for each individual cell (adata.layers)
WARNING: The neighbor graph has an unexpected format (e.g. computed outside scvelo) 
or is corrupted (e.g. due to subsetting). Consider recomputing with `pp.neighbors`.
Traceback (most recent call last):

  File "/home/gddaslab/rssxm007/yard/run_spaceranger_count/velocyto/spatialdata.py", line 29, in <module>
    scv.tl.velocity_graph(adata)

  File "/home/gddaslab/rssxm007/anaconda3/envs/velocyto/lib/python3.8/site-packages/scvelo/tools/velocity_graph.py", line 294, in velocity_graph
    vgraph = VelocityGraph(

  File "/home/gddaslab/rssxm007/anaconda3/envs/velocyto/lib/python3.8/site-packages/scvelo/tools/velocity_graph.py", line 102, in __init__
    raise ValueError(

ValueError: Your neighbor graph seems to be corrupted. Consider recomputing via pp.neighbors.

Question: How would I check that observations are not duplicates after the filtering step? Again, I am very new to RNA analysis and scvelo. Thank you for your help.

WeilerP commented 3 years ago

@WeilerP Here is the output:

Running scvelo 0.2.3 (python 3.8.8) on 2021-08-17 11:25.
Numba: Attempted to fork from a non-main thread, the TBB library may be in an invalid state in the child process.
ERROR: XMLRPC request failed [code: -32500]
RuntimeError: PyPI's XMLRPC API is currently disabled due to unmanageable load and will be deprecated in the near future. See https://status.python.org/ for more information.

This doesn't seem to be the output of scv.logging.print_versions(). It should print a number of package versions (e.g. scanpy and anndata).

@shank117 one quick remark: You should run the neighbor calculation prior to scv.pp.moments. You can check the number of duplicate rows e.g. via

adata.to_df().duplicated().sum()

I'd first check that there are no duplicate rows both immediately after reading the data and again after filtering (the latter should be a rare event). Either way, you can drop the duplicate rows using

adata = adata[~adata.to_df().duplicated(), :]

You should also run

adata.var_names_make_unique()

prior to the analysis pipeline.
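
For reference, a rough sketch of those checks in sequence (using the parameter values from the earlier posts; the drop is only needed if the duplicate counts are non-zero):

import scvelo as scv

adata = scv.read('Visium_FFPE_Human_Breast_Cancer_possorted_genome_bam_5BKOT.loom', cache=True)
adata.var_names_make_unique()

# duplicates immediately after reading (should usually be zero)
print(adata.to_df().duplicated().sum())

scv.pp.filter_and_normalize(adata, min_shared_counts=5, n_top_genes=1000)

# duplicates introduced by filtering; drop them before building the neighbor graph
duplicated = adata.to_df().duplicated()
if duplicated.sum() > 0:
    adata = adata[~duplicated.values, :].copy()

# neighbors before moments, per the remark above
scv.pp.neighbors(adata, n_pcs=20, n_neighbors=20)
scv.pp.moments(adata, n_pcs=20, n_neighbors=20)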

shank117 commented 3 years ago

@WeilerP

Thank you for the advice. I did not find any duplicates and ran what you suggested, but unfortunately I am still getting the same error with the same output. Any other suggestions?

giovp commented 3 years ago

@shank117

I have not used non-FFPE datasets. I am sorry, I am new to all of this; could you tell me what the difference between FFPE and non-FFPE datasets is? I am using Visium datasets for all my analyses, and scvelo has worked for non-spatial datasets like: support.10xgenomics.com/single-cell-gene-expression/datasets/3.0.0/pbmc_1k_v3?

I am not entirely sure (@WeilerP knows more), but if the issue is the quality of the data itself (and, if I'm not mistaken, the spliced/unspliced ratio looks off), I'd go with non-FFPE data. FFPE data are known to have lower quality because of the fixation protocol.

WeilerP commented 3 years ago

Thank you for the advice. I did not find any duplicates and ran what you suggested, but unfortunately I am still getting the same error with the same output. Any other suggestions?

@shank117, did you check for duplicate rows both after reading the data and after calling scv.pp.filter_and_normalize? If so, can you share the AnnData object you are running your pipeline on? I'd have to take a closer look at it when I find the time.

shank117 commented 3 years ago

@WeilerP

Here is the loom file: Visium_FFPE_Human_Breast_Cancer_possorted_genome_bam_5BKOT.loom.zip

WeilerP commented 3 years ago

@shank117, as suspected, your data does contain duplicate rows (70 after reading, 575 after filtering). Removing them solves the problem.

WeilerP commented 3 years ago

This is also a duplicate of #637.

giovp commented 3 years ago

@WeilerP, sorry, but out of curiosity, how does this issue arise? Is it velocyto, or something happening with cellranger/spaceranger at the mapping step?

WeilerP commented 3 years ago

@WeilerP, sorry, but out of curiosity, how does this issue arise? Is it velocyto, or something happening with cellranger/spaceranger at the mapping step?

Not sure why this occurs in the initial data, TBH. I would guess it (the same measurements in two different cells) is either extremely rare, or the observations are actual duplicates (I did not check whether the names were duplicated, but anndata would point it out, I believe). The duplicate rows after filtering occur by chance when we have genes with few counts and the observations that become duplicates differ only in those genes.
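
As a toy illustration of that mechanism (hypothetical counts, pandas only): two cells that differ only in a low-count gene become identical once that gene is filtered out.

import pandas as pd

counts = pd.DataFrame(
    {'GeneA': [5, 5], 'GeneB': [3, 3], 'RareGene': [1, 0]},
    index=['cell_1', 'cell_2'],
)
print(counts.duplicated().sum())              # 0 -> no duplicate observations before filtering

filtered = counts.drop(columns=['RareGene'])  # gene removed, e.g. for falling below min_shared_counts
print(filtered.duplicated().sum())            # 1 -> cell_2 now duplicates cell_1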

giovp commented 3 years ago

The duplicate rows after filtering occur by chance when we have genes with few counts and the observations that become duplicates differ only in those genes.

Right, OK, so it's just a matter of chance? Do you think it is more likely in low-quality data (where, e.g., you'd have a much sparser matrix and so it would be easier to get duplicated observations)?

WeilerP commented 3 years ago

Right, OK, so it's just a matter of chance? Do you think it is more likely in low-quality data (where, e.g., you'd have a much sparser matrix and so it would be easier to get duplicated observations)?

Yes, I would assume that this phenomenon comes up more often if data quality is low.