kharchenkolab / numbat

Haplotype-aware CNV analysis from single-cell RNA-seq
https://kharchenkolab.github.io/numbat/
Other
156 stars 22 forks source link

comprehension for 'Using prephased SNP profiles' #158

Open hanjun98 opened 4 months ago

hanjun98 commented 4 months ago

Hi, first, Thank you for awesome tool to create new insight! I felt person who made numbat is so meticulous.

I have a question to 'Using prephased SNP profiles'

''' Using DNA-derived genotype information is another way to improve SNP density and phasing. If you have SNP calls from DNA genotyping (e.g. WGS/WES), you can first perform phasing on the DNA-derived VCF. Then run cellsnp-lite on scRNA-seq BAMs against the DNA-derived VCF to generate allele counts (only include heterozygous SNPs). Finally, merge the phased GT fields (from phased DNA-derived VCF) with the obtained allele counts to produce an allele dataframe in the format of df_allele (see section Preparing data). '''

  1. If you have SNP calls from DNA genotyping (e.g. WGS/WES), you can first perform phasing on the DNA-derived VCF. Then run cellsnp-lite on scRNA-seq BAMs against the DNA-derived VCF to generate allele counts (only include heterozygous SNPs). --> I already have VCF(derived from WES) so I performed phasing on the DNA-derived VCF like this

`pileup_R='/mnt/gmi-l1/_90.User_Data/dhthxkr/codes/MDS/sc-seq/numbat/pileup_and_phase.R' gmap='/mnt/gmi-l1/_90.User_Data/dhthxkr/tools/Eagle_v2.4.1/tables/genetic_map_hg38_withX.txt.gz'

snpvcf='/mnt/gmi-l1/_90.User_Data/dhthxkr/ref/numbat/genome1K.phase3.SNP_AF5e2.chr1toX.hg38.vcf'

paneldir='/mnt/gmi-l1/_90.User_Data/dhthxkr/ref/numbat/1000G_hg38' eagle='/mnt/gmi-l1/_90.User_Data/dhthxkr/tools/Eagle_v2.4.1/eagle'

sample=$1 WES=$2

mkdir -p ${sample}/Numbat_WES

Rscript ${pileup_R} \ --label ${sample} \ --samples ${sample} \ --bams ${sample}_cellranger/outs/possorted_genome_bam.bam \ #single cell cellranger ouput bam --barcodes ${sample}/leiden_v2.4/matrix_files/barcodes.tsv.gz \ #single cell barcodes --outdir ${sample}/Numbat_WES \ --gmap ${gmap} \ --eagle ${eagle} \ --snpvcf ${sample}/WES/${WES}.vcf.gz \ #WES derived VCF(contains germlines and somtics, only filter='PASS') --paneldir ${paneldir} \ --ncores 5 &`

so I can get a '${sample}/Numbat_WES/MDSC02_allele_counts.tsv.gz'
  1. Finally, merge the phased GT fields (from phased DNA-derived VCF) with the obtained allele counts to produce an allele dataframe in the format of df_allele. --> what's mean 'merge the phased GT fields'? Actually, I inferred that merging allele_counts.tsv(Pileup output with option(--snpvcf ${sample}/WES/${WES}.vcf.gz)) with allele_counts.tsv(Pileup output with option(--snpvcf genome1K.phase3.SNP_AF5e2.chr1toX.hg38.vcf))

so, I make a merged_allele_counts.tsv(from option(--snpvcf ${sample}/WES/${WES}.vcf.gz) allele_counts.tsv and option(--snpvcf genome1K.phase3.SNP_AF5e2.chr1toX.hg38.vcf)) allele_counts.tsv

And then run 'run_numbat' with merged_allele_counts.tsv like this

`#!/usr/bin/env Rscript args = commandArgs(trailingOnly=TRUE) sample <- args[1]

library(glue) library(stringr) library(data.table) library(dplyr) library(vcfR) library(Matrix) library(numbat) library(pagoda2) library(ggplot2) library(ggtree) library(tidygraph) library(patchwork)

df_allele=fread(paste0('/mnt/gmi-l1/_90.User_Data/dhthxkr/MDS/',sample,'/Numbat_WES_v2/',sample,'_merged_allele_counts.tsv.gz'), header = T) count_mat = readMM(paste0('/mnt/gmi-l1/_90.User_Data/dhthxkr/MDS/',sample,'/QC_output/matrix_files_v2.4/matrix.mtx.gz')) cells = fread(paste0('/mnt/gmi-l1/_90.User_Data/dhthxkr/MDS/',sample,'/QC_output/matrix_files_v2.4/barcodes.tsv.gz'), header = F)$V1 genes = fread(paste0('/mnt/gmi-l1/_90.User_Data/dhthxkr/MDS/',sample,'/QC_output/matrix_files_v2.4/features.tsv.gz'), header = F)$V2 colnames(count_mat) = cells rownames(count_mat) = genes count_mat = as.matrix(count_mat) count_mat = rowsum(count_mat, rownames(count_mat)) count_mat = as(count_mat, "dgCMatrix")

out = run_numbat( count_mat, # gene x cell integer UMI count matrix ref_hca, # reference expression profile, a gene x cell type normalized expression level matrix df_allele, # allele dataframe generated by pileup_and_phase script genome = "hg38", t = 1e-5, ncores = 20, plot = TRUE, out_dir = paste0('/mnt/gmi-l1/_90.User_Data/dhthxkr/MDS/',sample,'/Numbat_WES_v2') )`

But I got this error "Warning message: In asMethod(object) : sparse->dense coercion: allocating vector of size 2.4 GiB Error in check_allele_df(df_allele) : Inconsistent SNP genotypes; Are cells from two different individuals mixed together? Calls: run_numbat -> check_allele_df Execution halted"

So, I don't know when it wrong?

Thank you for reading! I'll be waiting for your answer!