kharchenkolab / numbat

Haplotype-aware CNV analysis from single-cell RNA-seq
https://kharchenkolab.github.io/numbat/

long vectors not supported yet #32

Closed josegarciamanteiga closed 2 years ago

josegarciamanteiga commented 2 years ago

Hi, Thanks for the package. Spectacular results with single cell RNASeq in tumors. I'd like to publish the identification of CAFs as normal cells in my tumors using it and it works smoothly in single datasets from 10X but I tried to mix three samples and got this error:

..../....
Retesting CNVs.. Retesting CNVs.. Retesting CNVs.. Retesting CNVs.. Retesting CNVs..
Finishing.. Finishing.. Finishing.. Finishing.. Finishing..
Error in vec_slice(x_out, x_slicer) : long vectors not supported yet: ../../src/include/Rinlinedfuns.h:535

Error: Tibble columns must have compatible sizes.

I used pileup_and_phase.R without problems on a bam merged from the cellranger bams where I substituted the "-1" at the end of the barcodes to avoid collisions after using cellranger aggr to generate barcodes. The error is thrown by run_numbat run with 64GB and 12 cores. Thanks for the help Jose

evanbiederstedt commented 2 years ago

Hi @josegarciamanteiga

The error is actually from R itself: https://github.com/wch/r-source/blob/trunk/src/include/Rinlinedfuns.h

This used to be a more common error in R before version...3 maybe?

There's possibly something we could do to fix this. We'll investigate.

For context: https://stackoverflow.com/questions/24335692/large-matrices-in-r-long-vectors-not-supported-yet https://support.bioconductor.org/p/118016/
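For reference, a "long vector" in R is one with more than 2^31 - 1 elements, and some C-level routines still reject them with this message. Below is a minimal sketch for checking whether an object of a given shape would cross that limit. The helper name `would_be_long` and the dimensions are made up for illustration and are not part of numbat.

```r
# Largest length an ordinary (non-long) R vector can have: 2^31 - 1
long_vector_limit <- .Machine$integer.max

# Hypothetical helper: TRUE if an n_rows x n_cols object would need a long vector.
# as.numeric() avoids integer overflow when multiplying large dimensions.
would_be_long <- function(n_rows, n_cols) {
  as.numeric(n_rows) * as.numeric(n_cols) > long_vector_limit
}

small_case <- would_be_long(50000, 20000)   # 1e9 elements, below the limit
large_case <- would_be_long(500000, 20000)  # 1e10 elements, above the limit
```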

Best, Evan

teng-gao commented 2 years ago

Hi @josegarciamanteiga,

Thanks for the issue! Are the three samples from the same individual (so that they have the same germline SNP profile)? If so, there's no need to merge the BAMs manually; you can supply multiple BAMs and barcode files to pileup_and_phase.R, and it will produce a consensus VCF for the individual and the allele data frames for each sample. More details here: https://kharchenkolab.github.io/numbat/articles/numbat.html#preparing-data

Best, Teng

josegarciamanteiga commented 2 years ago

Dear Teng, Thanks for the reply! Two out of three are indeed from the same individual. I have now used them to run pileup_and_phase.R as you advised, and it produced the data without errors. But now, with run_numbat.R, how should I supply the two gene x UMI matrices and the allele data tables? My goal is to have the posteriors and all the numbat output take both samples into account, so that I can load them onto a Seurat/Pagoda scRNA-seq object that contains an integration of both datasets.

As for the 'long vectors error', it is strange since it is running with R 4.0.3, here the sessionInfo() for further details:

library(numbat)
sessionInfo()

R version 4.0.3 (2020-10-10)
Platform: x86_64-conda-linux-gnu (64-bit)
Running under: CentOS Linux 7 (Core)

Matrix products: default
BLAS: /home/garciamanteiga.jose/.conda/envs/numbat/lib/libblas.so.3.8.0
LAPACK: /home/garciamanteiga.jose/.conda/envs/numbat/lib/liblapack.so.3.8.0

locale:
[1] LC_CTYPE=en_US.utf-8 LC_NUMERIC=C
[3] LC_TIME=en_US.utf-8 LC_COLLATE=en_US.utf-8
[5] LC_MONETARY=en_US.utf-8 LC_MESSAGES=en_US.utf-8
[7] LC_PAPER=en_US.utf-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.utf-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats graphics grDevices utils datasets methods base

other attached packages:
[1] numbat_0.1.0

loaded via a namespace (and not attached):
[1] treeio_1.14.4 tidyselect_1.1.1 purrr_0.3.4
[4] graphlayouts_0.8.0 lattice_0.20-45 ggfun_0.0.5
[7] colorspace_2.0-3 vctrs_0.3.8 generics_0.1.2
[10] viridisLite_0.4.0 utf8_1.2.2 gridGraphics_0.5-1
[13] rlang_0.4.12 pillar_1.7.0 glue_1.6.2
[16] DBI_1.1.2 tweenr_1.0.2 rvcheck_0.1.8
[19] lifecycle_1.0.1 stringr_1.4.0 munsell_0.5.0
[22] gtable_0.3.0 parallel_4.0.3 fansi_1.0.2
[25] tidygraph_1.2.0 Rcpp_1.0.7 scales_1.1.1
[28] BiocManager_1.30.16 jsonlite_1.8.0 farver_2.1.0
[31] gridExtra_2.3 ggforce_0.3.3 ggplot2_3.3.2
[34] digest_0.6.29 aplot_0.1.2 stringi_1.7.6
[37] dplyr_1.0.7 ggrepel_0.9.1 polyclip_1.10-0
[40] grid_4.0.3 ggtree_2.4.2 tools_4.0.3
[43] yulab.utils_0.0.4 logger_0.2.2 magrittr_2.0.2
[46] lazyeval_0.2.2 patchwork_1.1.1 tibble_3.1.6
[49] ggraph_2.0.5 crayon_1.5.0 ape_5.6-2
[52] tidyr_1.1.2 pkgconfig_2.0.3 tidytree_0.3.9
[55] MASS_7.3-55 ellipsis_0.3.2 data.table_1.14.2
[58] ggplotify_0.1.0 extraDistr_1.9.1 assertthat_0.2.1
[61] viridis_0.6.2 R6_2.5.1 igraph_1.2.11
[64] nlme_3.1-155 compiler_4.0.3

teng-gao commented 2 years ago

Hi @josegarciamanteiga,

The error occurred because the allele data contained genotypes from more than one individual. Only data from the same individual should be provided to pileup_and_phase.R and run_numbat. If you have two samples from the same individual, you can concatenate the gene count matrices (e.g. cbind) and allele dataframes (e.g. rbind) as input to run_numbat. If the third sample belongs to a separate individual, I would run it separately. If you want to plot the single-cell posteriors in an integrated expression embedding from different samples/individuals, you can combine the posterior dataframes (e.g. nb$joint_post, nb$clone_post) after reading in the results for each individual separately. For more info on the output, please see this tutorial: https://kharchenkolab.github.io/numbat/articles/visualization.html#single-cell-cnv-calls
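A minimal sketch of the cbind/rbind step for two samples from the same individual, using toy data in base R. The gene names, barcodes, SNP IDs and the reduced set of allele columns (`cell`, `snp_id`, `AD`, `DP`) are made up for illustration; real allele data frames carry more columns. The sketch assumes both count matrices share the same gene rows in the same order and that barcodes were already made unique per sample.

```r
# Toy gene x cell count matrices for two samples from the SAME individual
genes <- c("GENE1", "GENE2", "GENE3")
count_mat_s1 <- matrix(c(1, 0, 3, 2, 5, 1), nrow = 3,
                       dimnames = list(genes, c("s1_AAAC", "s1_AAAG")))
count_mat_s2 <- matrix(c(0, 2, 4, 1, 1, 0), nrow = 3,
                       dimnames = list(genes, c("s2_AAAC", "s2_AAAT")))

# cbind pairs rows by position, so the gene order must match exactly
stopifnot(identical(rownames(count_mat_s1), rownames(count_mat_s2)))
count_mat <- cbind(count_mat_s1, count_mat_s2)

# Toy allele data frames (hypothetical, reduced column set)
df_allele_s1 <- data.frame(cell = c("s1_AAAC", "s1_AAAG"),
                           snp_id = c("1_100_A_G", "1_200_C_T"),
                           AD = c(1, 0), DP = c(2, 1))
df_allele_s2 <- data.frame(cell = c("s2_AAAC", "s2_AAAT"),
                           snp_id = c("1_100_A_G", "2_300_G_A"),
                           AD = c(0, 3), DP = c(1, 3))

# rbind stacks the per-sample allele rows into one table
df_allele <- rbind(df_allele_s1, df_allele_s2)

# Sanity checks: barcodes unique, and every allele row maps to a counted cell
stopifnot(!anyDuplicated(colnames(count_mat)))
stopifnot(all(df_allele$cell %in% colnames(count_mat)))
```

The combined `count_mat` and `df_allele` would then be passed to run_numbat in place of the single-sample inputs.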

Thanks, Teng

josegarciamanteiga commented 2 years ago

Ok, thanks for the info. I suspected something like cbind/rbind would be the solution. Showing the posteriors of different samples combined on an integrated object was my first working hypothesis, but I was not sure I could interpret the results well, since the normal-vs-tumor posteriors are computed within each sample. My reason for analyzing the samples together was that one sample had very few normal cells, and I wondered whether running phasing and numbat on everything at once could help. I now see that this is not possible, because each individual's germline profile in normal cells is key. I will instead visualize the single-dataset calls (for samples from different individuals) on an integrated object. Thanks again for a great package and help! Jose
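A minimal sketch of that plan in base R: stack the per-individual posterior tables and join them onto the cell coordinates of an integrated embedding. The data frames below are toy stand-ins for nb$clone_post from two separate runs and for an embedding exported from the integrated object; the barcodes, values and column names (`p_cnv`, `individual`, `UMAP_1`, `UMAP_2`) are assumptions for illustration.

```r
# Toy posterior tables, one per individual, each from its own numbat run
clone_post_p1 <- data.frame(cell = c("p1_AAAC", "p1_AAAG"), p_cnv = c(0.98, 0.02))
clone_post_p2 <- data.frame(cell = c("p2_AAAC", "p2_AAAT"), p_cnv = c(0.10, 0.95))

# Record the origin, since posteriors are only comparable within an individual
clone_post_p1$individual <- "patient1"
clone_post_p2$individual <- "patient2"
clone_post_all <- rbind(clone_post_p1, clone_post_p2)

# Toy integrated embedding; the last cell has no numbat call
embedding <- data.frame(
  cell = c("p1_AAAC", "p1_AAAG", "p2_AAAC", "p2_AAAT", "p2_AAGG"),
  UMAP_1 = c(0.1, 0.4, -1.2, -0.9, 2.0),
  UMAP_2 = c(1.0, 0.8, 0.3, 0.5, -1.1)
)

# Left join: keep every embedded cell; cells without a call get NA
plot_df <- merge(embedding, clone_post_all, by = "cell", all.x = TRUE)
```

Barcodes must match between the embedding and the posterior tables, so any per-sample prefix or suffix added during integration has to be applied to the numbat barcodes as well.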


Jose M. Garcia Manteiga PhD Computational Biologist Center for Translational Genomics and BioInformatics Dibit2-Basilica, 4A3 San Raffaele Scientific Institute Via Olgettina 58, 20132 Milano (MI), Italy

