velocyto-team / velocyto.R

RNA velocity estimation in R
http://velocyto.org
177 stars 215 forks source link

Inconsistency Given Different Gene Sets #90

Open whitleyo opened 5 years ago

whitleyo commented 5 years ago

I've attempted to assess the effects of tweaking parameters and genes selected in velocyto.R, and for my particular dataset, it seems that in the ballpark of the thresholds used for filter.genes.by.cluster.expression in the vignettes shown for velocyto.R (minimum avg cluster expression of 0.1 to 2.0 for spliced genes, 0.05 to 0.2 for unspliced genes), RNA velocity calculations are rather sensitive to which genes are sampled. I should note that fit quantile was kept constant at 0.02 and that common genes between spliced and unspliced were subsampled to assess effects of variation of genes in input. Data is ~550 cells from a human tissue sample. Note that cells were filtered for those with at least 200 counts detected, spliced genes were filtered for those with expression in at least 30 cells with at least 40 counts across all cells, and unspliced genes were filtered for those with at least 25 counts and expression in at least 20 cells.

I measured the spearman correlation between velocity vectors for the same cells but from different runs, and did this for all cells, for all runs, for all 4 combinations of min.max.cluster.average parameter for spliced and unspliced genes.

image

image

Looking at runs using identical parameters but different subsets of cells, inter-run reproducibility for identical cells does not seem to improve

image

image

image

image

Here are a few examples of discordant velocity results plotted on the same TSNE image

image

image

There is some apparent structure to the data. The cells have been scored with AUCell using signatures I've derived based off of differential expression in bulk RNA-seq data

image

image

Just to show that different seeds are resulting in different subsets of genes, and that different thresholds of average cluster expression result in different sized sets of genes used for velocyto, I've included a heatmap of the genesets used in each run and the overlap between them. Cells indicated geneset size or number of overalapping genes (if off diagonal), and color indicates jaccard index. rownames and column names indicate seed used and run params.

image

Session Info:

## R version 3.5.2 (2018-12-20)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 16.04.5 LTS
## 
## Matrix products: default
## BLAS: /usr/lib/libblas/libblas.so.3.6.0
## LAPACK: /usr/lib/lapack/liblapack.so.3.6.0
## 
## locale:
##  [1] LC_CTYPE=en_CA.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_CA.UTF-8        LC_COLLATE=en_CA.UTF-8    
##  [5] LC_MONETARY=en_CA.UTF-8    LC_MESSAGES=en_CA.UTF-8   
##  [7] LC_PAPER=en_CA.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_CA.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
##  [1] grid      splines   parallel  stats4    stats     graphics  grDevices
##  [8] utils     datasets  methods   base     
## 
## other attached packages:
##  [1] mclust_5.4.4                dbscan_1.1-3               
##  [3] circlize_0.4.6              ComplexHeatmap_1.20.0      
##  [5] SNFtool_2.3.0               scater_1.10.1              
##  [7] scran_1.10.2                AUCell_1.4.1               
##  [9] R.devices_2.16.0            monocle_2.10.1             
## [11] DDRTree_0.1.5               irlba_2.3.3                
## [13] VGAM_1.1-1                  pagoda2_0.1.0              
## [15] igraph_1.2.4.1              velocyto.R_0.6             
## [17] DESeq2_1.22.2               Matrix_1.2-15              
## [19] reshape2_1.4.3              ggplot2_3.2.0              
## [21] su2cproj_0.1.016            doParallel_1.0.14          
## [23] iterators_1.0.10            foreach_1.4.4              
## [25] SingleCellExperiment_1.4.1  SummarizedExperiment_1.12.0
## [27] DelayedArray_0.8.0          BiocParallel_1.16.6        
## [29] matrixStats_0.54.0          Biobase_2.42.0             
## [31] GenomicRanges_1.34.0        GenomeInfoDb_1.18.2        
## [33] IRanges_2.16.0              S4Vectors_0.20.1           
## [35] BiocGenerics_0.28.0        
## 
## loaded via a namespace (and not attached):
##   [1] backports_1.1.4          Hmisc_4.2-0             
##   [3] plyr_1.8.4               lazyeval_0.2.2          
##   [5] heatmap.plus_1.3         GSEABase_1.44.0         
##   [7] densityClust_0.3         fastICA_1.2-1           
##   [9] urltools_1.7.3           ExPosition_2.8.23       
##  [11] digest_0.6.19            htmltools_0.3.6         
##  [13] viridis_0.5.1            alluvial_0.1-2          
##  [15] magrittr_1.5             checkmate_1.9.3         
##  [17] memoise_1.1.0            cluster_2.0.7-1         
##  [19] limma_3.38.3             annotate_1.60.1         
##  [21] R.utils_2.9.0            docopt_0.6.1            
##  [23] colorspace_1.4-1         blob_1.1.1              
##  [25] ggrepel_0.8.1            xfun_0.8                
##  [27] dplyr_0.8.2              sparsesvd_0.1-4         
##  [29] crayon_1.3.4             RCurl_1.95-4.12         
##  [31] graph_1.60.0             genefilter_1.64.0       
##  [33] brew_1.0-6               survival_2.43-3         
##  [35] glue_1.3.1               gtable_0.3.0            
##  [37] zlibbioc_1.28.0          XVector_0.22.0          
##  [39] GetoptLong_0.1.7         Rhdf5lib_1.4.3          
##  [41] Rook_1.1-1               shape_1.4.4             
##  [43] HDF5Array_1.10.1         scales_1.0.0            
##  [45] pheatmap_1.0.12          DBI_1.0.0               
##  [47] edgeR_3.24.3             Rcpp_1.0.1              
##  [49] viridisLite_0.3.0        xtable_1.8-4            
##  [51] htmlTable_1.13.1         foreign_0.8-71          
##  [53] bit_1.1-14               Formula_1.2-3           
##  [55] htmlwidgets_1.3          FNN_1.1.3               
##  [57] RColorBrewer_1.1-2       acepack_1.4.1           
##  [59] pkgconfig_2.0.2          XML_3.98-1.19           
##  [61] R.methodsS3_1.7.1        nnet_7.3-12             
##  [63] locfit_1.5-9.1           dynamicTreeCut_1.63-1   
##  [65] tidyselect_0.2.5         labeling_0.3            
##  [67] rlang_0.4.0              later_0.8.0             
##  [69] AnnotationDbi_1.44.0     munsell_0.5.0           
##  [71] tools_3.5.2              RSQLite_2.1.1           
##  [73] evaluate_0.14            stringr_1.4.0           
##  [75] yaml_2.2.0               knitr_1.23              
##  [77] bit64_0.9-7              purrr_0.3.2             
##  [79] RANN_2.6.1               nlme_3.1-137            
##  [81] mime_0.7                 slam_0.1-45             
##  [83] R.oo_1.22.0              prettyGraphs_2.1.6      
##  [85] hdf5r_1.2.0              compiler_3.5.2          
##  [87] rstudioapi_0.10          beeswarm_0.2.3          
##  [89] statmod_1.4.32           tibble_2.1.3            
##  [91] geneplotter_1.60.0       stringi_1.4.3           
##  [93] lattice_0.20-38          HSMMSingleCell_1.2.0    
##  [95] pillar_1.4.2             GlobalOptions_0.1.0     
##  [97] combinat_0.0-8           triebeard_0.3.0         
##  [99] BiocNeighbors_1.0.0      data.table_1.12.2       
## [101] bitops_1.0-6             httpuv_1.5.1            
## [103] R6_2.4.0                 latticeExtra_0.6-28     
## [105] pcaMethods_1.74.0        promises_1.0.1          
## [107] gridExtra_2.3            vipor_0.4.5             
## [109] codetools_0.2-15         MASS_7.3-51.1           
## [111] assertthat_0.2.1         rhdf5_2.26.2            
## [113] rjson_0.2.20             withr_2.1.2             
## [115] qlcMatrix_0.9.7          GenomeInfoDbData_1.2.0  
## [117] mgcv_1.8-26              rpart_4.1-13            
## [119] rmarkdown_1.13           DelayedMatrixStats_1.4.0
## [121] Rtsne_0.15               shiny_1.3.2             
## [123] base64enc_0.1-3          ggbeeswarm_0.6.0
pkharchenko commented 5 years ago

Doesn’t look like it’s picking up any real signal. Are these cells actually moving rapidly enough in a particular direction? Best, -peter.

On Jul 12, 2019, at 17:02, whitleyo notifications@github.com wrote:

I've attempted to assess the effects of tweaking parameters and genes selected in velocyto.R, and for my particular dataset, it seems that in the ballpark of the thresholds used for filter.genes.by.cluster.expression in the vignettes shown for velocyto.R (minimum avg cluster expression of 0.1 to 2.0 for spliced genes, 0.05 to 0.2 for unspliced genes), RNA velocity calculations are rather sensitive to which genes are sampled. I should note that fit quantile was kept constant at 0.02 and that common genes between spliced and unspliced were subsampled to assess effects of variation of genes in input. Data is ~550 cells from a human tissue sample. Note that cells were filtered for those with at least 200 counts detected, spliced genes were filtered for those with expression in at least 30 cells with at least 40 counts across all cells, and unspliced genes were filtered for those with at least 25 counts and expression in at least 20 cells.

I measured the spearman correlation between velocity vectors for the same cells but from different runs, and did this for all cells, for all runs, for all 4 combinations of min.max.cluster.average parameter for spliced and unspliced genes.

Looking at runs using identical parameters but different subsets of cells, inter-run reproducibility for identical cells does not seem to improve

Here are a few examples of discordant velocity results plotted on the same TSNE

There is some apparent structure to the data. The cells have been scored with AUCell using signatures I've derived based off of differential expression in bulk RNA-seq data

Just to show that different seeds are resulting in different subsets of genes, and that different thresholds of average cluster expression result in different sized sets of genes used for velocyto, I've included a heatmap of the genesets used in each run and the overlap between them. Cells indicated geneset size or number of overalapping genes (if off diagonal), and color indicates jaccard index. rownames and column names indicate seed used and run params.

Session Info:

R version 3.5.2 (2018-12-20)

Platform: x86_64-pc-linux-gnu (64-bit)

Running under: Ubuntu 16.04.5 LTS

Matrix products: default

BLAS: /usr/lib/libblas/libblas.so.3.6.0

LAPACK: /usr/lib/lapack/liblapack.so.3.6.0

locale:

[1] LC_CTYPE=en_CA.UTF-8 LC_NUMERIC=C

[3] LC_TIME=en_CA.UTF-8 LC_COLLATE=en_CA.UTF-8

[5] LC_MONETARY=en_CA.UTF-8 LC_MESSAGES=en_CA.UTF-8

[7] LC_PAPER=en_CA.UTF-8 LC_NAME=C

[9] LC_ADDRESS=C LC_TELEPHONE=C

[11] LC_MEASUREMENT=en_CA.UTF-8 LC_IDENTIFICATION=C

attached base packages:

[1] grid splines parallel stats4 stats graphics grDevices

[8] utils datasets methods base

other attached packages:

[1] mclust_5.4.4 dbscan_1.1-3

[3] circlize_0.4.6 ComplexHeatmap_1.20.0

[5] SNFtool_2.3.0 scater_1.10.1

[7] scran_1.10.2 AUCell_1.4.1

[9] R.devices_2.16.0 monocle_2.10.1

[11] DDRTree_0.1.5 irlba_2.3.3

[13] VGAM_1.1-1 pagoda2_0.1.0

[15] igraph_1.2.4.1 velocyto.R_0.6

[17] DESeq2_1.22.2 Matrix_1.2-15

[19] reshape2_1.4.3 ggplot2_3.2.0

[21] su2cproj_0.1.016 doParallel_1.0.14

[23] iterators_1.0.10 foreach_1.4.4

[25] SingleCellExperiment_1.4.1 SummarizedExperiment_1.12.0

[27] DelayedArray_0.8.0 BiocParallel_1.16.6

[29] matrixStats_0.54.0 Biobase_2.42.0

[31] GenomicRanges_1.34.0 GenomeInfoDb_1.18.2

[33] IRanges_2.16.0 S4Vectors_0.20.1

[35] BiocGenerics_0.28.0

loaded via a namespace (and not attached):

[1] backports_1.1.4 Hmisc_4.2-0

[3] plyr_1.8.4 lazyeval_0.2.2

[5] heatmap.plus_1.3 GSEABase_1.44.0

[7] densityClust_0.3 fastICA_1.2-1

[9] urltools_1.7.3 ExPosition_2.8.23

[11] digest_0.6.19 htmltools_0.3.6

[13] viridis_0.5.1 alluvial_0.1-2

[15] magrittr_1.5 checkmate_1.9.3

[17] memoise_1.1.0 cluster_2.0.7-1

[19] limma_3.38.3 annotate_1.60.1

[21] R.utils_2.9.0 docopt_0.6.1

[23] colorspace_1.4-1 blob_1.1.1

[25] ggrepel_0.8.1 xfun_0.8

[27] dplyr_0.8.2 sparsesvd_0.1-4

[29] crayon_1.3.4 RCurl_1.95-4.12

[31] graph_1.60.0 genefilter_1.64.0

[33] brew_1.0-6 survival_2.43-3

[35] glue_1.3.1 gtable_0.3.0

[37] zlibbioc_1.28.0 XVector_0.22.0

[39] GetoptLong_0.1.7 Rhdf5lib_1.4.3

[41] Rook_1.1-1 shape_1.4.4

[43] HDF5Array_1.10.1 scales_1.0.0

[45] pheatmap_1.0.12 DBI_1.0.0

[47] edgeR_3.24.3 Rcpp_1.0.1

[49] viridisLite_0.3.0 xtable_1.8-4

[51] htmlTable_1.13.1 foreign_0.8-71

[53] bit_1.1-14 Formula_1.2-3

[55] htmlwidgets_1.3 FNN_1.1.3

[57] RColorBrewer_1.1-2 acepack_1.4.1

[59] pkgconfig_2.0.2 XML_3.98-1.19

[61] R.methodsS3_1.7.1 nnet_7.3-12

[63] locfit_1.5-9.1 dynamicTreeCut_1.63-1

[65] tidyselect_0.2.5 labeling_0.3

[67] rlang_0.4.0 later_0.8.0

[69] AnnotationDbi_1.44.0 munsell_0.5.0

[71] tools_3.5.2 RSQLite_2.1.1

[73] evaluate_0.14 stringr_1.4.0

[75] yaml_2.2.0 knitr_1.23

[77] bit64_0.9-7 purrr_0.3.2

[79] RANN_2.6.1 nlme_3.1-137

[81] mime_0.7 slam_0.1-45

[83] R.oo_1.22.0 prettyGraphs_2.1.6

[85] hdf5r_1.2.0 compiler_3.5.2

[87] rstudioapi_0.10 beeswarm_0.2.3

[89] statmod_1.4.32 tibble_2.1.3

[91] geneplotter_1.60.0 stringi_1.4.3

[93] lattice_0.20-38 HSMMSingleCell_1.2.0

[95] pillar_1.4.2 GlobalOptions_0.1.0

[97] combinat_0.0-8 triebeard_0.3.0

[99] BiocNeighbors_1.0.0 data.table_1.12.2

[101] bitops_1.0-6 httpuv_1.5.1

[103] R6_2.4.0 latticeExtra_0.6-28

[105] pcaMethods_1.74.0 promises_1.0.1

[107] gridExtra_2.3 vipor_0.4.5

[109] codetools_0.2-15 MASS_7.3-51.1

[111] assertthat_0.2.1 rhdf5_2.26.2

[113] rjson_0.2.20 withr_2.1.2

[115] qlcMatrix_0.9.7 GenomeInfoDbData_1.2.0

[117] mgcv_1.8-26 rpart_4.1-13

[119] rmarkdown_1.13 DelayedMatrixStats_1.4.0

[121] Rtsne_0.15 shiny_1.3.2

[123] base64enc_0.1-3 ggbeeswarm_0.6.0

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub, or mute the thread.

whitleyo commented 5 years ago

The tumour we're looking at is a glioblastoma. I've done a little searching around for any info on transcriptional change, but I can't really be sure what the expected 'rate of change' is.

Looking through a paper on hypoxia and GBM (Li et al. 2009), it appears that once you stick either GSC or non-GSC GBM cells in culture for about 12 hours to a day, you can get hypoxia induced transcriptional changes in the space of hours (https://www.ncbi.nlm.nih.gov/pubmed/19477429). Korber et al.(2018) estimated that the cell division rate was about once in 10 days (lower bound) (https://www.sciencedirect.com/science/article/pii/S1535610819301023?via%3Dihub), while Lan and colleagues (2017) (https://www.nature.com/articles/nature23666#s1), for their model of mouse xenograft growth dynamics, estimated division rate to be about 0.02 to < 1.5 per day depending on whether a cell was a stem or progenitor cell. While division rate may not directly correspond to transcriptional activity, I'd imagine that quiescent cells don't move very quickly in transcriptional space.

We are capable of getting larger arrows for the data I showed if we set minimum average expression per cell in at least 1 cluster to 2.0 for spliced genes and 0.05 for unspliced genes, but again, these are inconsistent, as shown in one of the correlation plots above.