kharchenkolab / numbat

Haplotype-aware CNV analysis from single-cell RNA-seq
https://kharchenkolab.github.io/numbat/

About the phasing options #143

Open SirKuikka opened 11 months ago

SirKuikka commented 11 months ago

Hi,

I have a few questions related to the phasing options.

  1. Why is the default SNP panel limited to these variants: "7.4M SNPs with minor allele frequency (MAF) > 0.05:"? If I have tumor data with rare mutations, doesn't the limited SNP panel make it more difficult to detect allele-specific CNVs that are related to cancer? I mean why not use something like the COSMIC mutation panel?

  2. If I have my own DNA-derived genotype information in a VCF file (from WES), and I run /numbat/inst/bin/pileup_and_phase.R, what should the paneldir parameter be? I assume that --snpvcf should be the VCF file from WES.

    --snpvcf /data/genome1K.phase3.SNP_AF5e2.chr1toX.hg38.vcf \
    --paneldir /data/1000G_hg38 \

Or do I have to generate the df_allele file manually as explained in the documentation:

"Using DNA-derived genotype information is another way to improve SNP density and phasing. If you have SNP calls from DNA genotyping (e.g. WGS/WES), you can first perform phasing on the DNA-derived VCF. Then run cellsnp-lite on scRNA-seq BAMs against the DNA-derived VCF to generate allele counts (only include heterozygous SNPs). Finally, merge the phased GT fields (from phased DNA-derived VCF) with the obtained allele counts to produce an allele dataframe in the format of df_allele (see section Preparing data)."

If I have to do it manually without the /numbat/inst/bin/pileup_and_phase.R function, how do I do the "phasing"?

teng-gao commented 10 months ago

> Why is the default SNP panel limited to these variants: "7.4M SNPs with minor allele frequency (MAF) > 0.05:"? If I have tumor data with rare mutations, doesn't the limited SNP panel make it more difficult to detect allele-specific CNVs that are related to cancer? I mean why not use something like the COSMIC mutation panel?

The SNPs used for CNV detection are common germline variants, not somatic mutations.
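To illustrate why heterozygous germline SNPs are the useful markers here, below is a minimal Python sketch. It is not Numbat's code, and the function names and toy counts are made up for illustration. It pools alt-allele counts (AD) and total depth (DP) over phased heterozygous SNPs in one segment. In a copy-neutral region the haplotype fraction sits near 0.5, while loss of one haplotype pushes it towards 0 or 1. Without phasing, the skew partly cancels out across SNPs.

```python
def haplotype_fraction(snps):
    """Fraction of reads on haplotype 1, pooled over phased het SNPs.

    snps: iterable of (GT, AD, DP), where AD is the alt-allele read count
    and DP the total read count (hypothetical toy input).
    """
    h1 = total = 0
    for gt, ad, dp in snps:
        if gt == "1|0":      # alt allele sits on haplotype 1
            h1 += ad
        elif gt == "0|1":    # ref allele sits on haplotype 1
            h1 += dp - ad
        else:                # homozygous or unphased: not informative
            continue
        total += dp
    return h1 / total if total else None


def naive_alt_fraction(snps):
    """Pooled alt-allele fraction that ignores phasing."""
    total = sum(dp for _, _, dp in snps)
    return sum(ad for _, ad, _ in snps) / total if total else None


balanced = [("1|0", 5, 10), ("0|1", 5, 10), ("1|0", 4, 8)]
one_haplotype_lost = [("1|0", 9, 10), ("0|1", 1, 10), ("1|0", 8, 8)]

print(haplotype_fraction(balanced))                       # 0.5
print(round(haplotype_fraction(one_haplotype_lost), 3))   # 0.929
print(round(naive_alt_fraction(one_haplotype_lost), 3))   # 0.643
```

Somatic mutations are too sparse and too variable between cells to provide this kind of dense, genome-wide signal, which is why a panel of common germline SNPs is used rather than something like COSMIC.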

> If I have my own DNA-derived genotype information in a VCF file (from WES), and I run /numbat/inst/bin/pileup_and_phase.R, what should the paneldir parameter be?

If your VCF contains germline SNPs called from the matched normal and tumor, you can phase it directly using eagle2 with the same reference panel. Then perform the cellsnp-lite pileup and build the df_allele. You can look at the pileup_and_phase script to see how this workflow is implemented.

In the future, to facilitate this manual process, we can probably add a parameter to pileup_and_phase that skips the genotyping step and accepts all provided SNPs as is.
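For the manual route, the final merge step could look roughly like the Python sketch below. This is not taken from pileup_and_phase.R. It assumes the DNA-derived VCF has already been phased (e.g. with eagle2), that the cellsnp-lite counts have already been flattened to one record per cell and SNP (the `counts` structure and both function names are hypothetical), and that contig names match between the two inputs. The output column names are an assumption based on the df_allele format; columns such as cM and gene are left out here, so check the Preparing data section for the full required set.

```python
def read_phased_hets(vcf_lines):
    """Return {(CHROM, POS, REF, ALT): GT} for phased het SNPs (first sample only)."""
    phased = {}
    for line in vcf_lines:
        if line.startswith("#") or not line.strip():
            continue                                  # skip headers and blank lines
        f = line.rstrip("\n").split("\t")
        gt_index = f[8].split(":").index("GT")        # position of GT within FORMAT
        gt = f[9].split(":")[gt_index]
        if gt in ("0|1", "1|0"):                      # keep phased heterozygous calls
            phased[(f[0], int(f[1]), f[3], f[4])] = gt
    return phased


def build_allele_rows(phased, counts):
    """Attach the phased GT to per-cell allele counts, dropping uninformative SNPs."""
    rows = []
    for c in counts:
        gt = phased.get((c["CHROM"], c["POS"], c["REF"], c["ALT"]))
        if gt is None or c["DP"] == 0:
            continue                                  # not a phased het, or no coverage
        row = dict(c)
        # snp_id layout is an assumption (CHROM_POS_REF_ALT)
        row["snp_id"] = "{}_{}_{}_{}".format(c["CHROM"], c["POS"], c["REF"], c["ALT"])
        row["GT"] = gt
        rows.append(row)
    return rows


# Toy inputs, made up for illustration
vcf = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tpatient1",
    "1\t1000\t.\tA\tG\t.\tPASS\t.\tGT\t0|1",
    "1\t2000\t.\tC\tT\t.\tPASS\t.\tGT\t1|1",
    "1\t3000\t.\tG\tA\t.\tPASS\t.\tGT\t1|0",
]
counts = [
    {"cell": "AAAC-1", "CHROM": "1", "POS": 1000, "REF": "A", "ALT": "G", "AD": 1, "DP": 2},
    {"cell": "AAAC-1", "CHROM": "1", "POS": 2000, "REF": "C", "ALT": "T", "AD": 3, "DP": 3},
    {"cell": "TTTG-1", "CHROM": "1", "POS": 3000, "REF": "G", "ALT": "A", "AD": 0, "DP": 1},
]

rows = build_allele_rows(read_phased_hets(vcf), counts)
for r in rows:
    print(r["cell"], r["snp_id"], r["AD"], r["DP"], r["GT"])
```

The homozygous SNP at position 2000 is dropped, which matches the "only include heterozygous SNPs" note in the documentation. The resulting rows can then be written out as a table and loaded in R.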