parklab / scan-snv

Single cell somatic genotyper
10 stars 5 forks source link

Errors in running scansnv #3

Open qian0001 opened 4 years ago

qian0001 commented 4 years ago

I got the foolowing error messages for about 1/3 of my single cell wgs after running latest scansnv as of today (12/6/19). Any ideas? For those cells with errors in the log, some produced file somatic_genotypes.rda while the others didn't. Over 90% of steps were done for those problem cells. Thanks for assistance.

slurm-47955.out:Error in 0:dp : NA/NaN argument slurm-47955.out:Error in rule scansnv_genotype_spikein_scatter: slurm-47955.out:CalledProcessError in line 778 of /home/xxxxx/miniconda3/envs/scansnv/lib/scansnv/Snakefile:

jluquette commented 4 years ago

@qian0001 Would you mind posting the entire slurm-47955.out file? Or is that the entire thing?

jluquette commented 4 years ago

@qian0001 Are you still experiencing this issue?

qian0001 commented 4 years ago

Thanks for getting back. I am trying to run on a different cluster system for 100+ cells  today. Give me a few days and I'll report back if the same problem happens there.

On Tuesday, December 17, 2019, 5:20:42 PM EST, jluquette <notifications@github.com> wrote:  

@qian0001 Are you still experiencing this issue?

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub, or unsubscribe.

qian0001 commented 4 years ago

Hi, I divided the jobs into individual chromosomes instead of all at once through a bed, and ran the jobs to another cluster and was able to finish all the jobs except one. I reduced some of the parameters from default to half as below then that failed one also completed successfully: --abmodel-chunks 10 --abmodel-samples-per-chunk 500 --abmodel-hsnp-chunk-size 50 --hsnp-spikein-replicates 50

I have some questions on the somatic variant types.   Out of 80,000+ somatic variants through scansnv from 100+ cells, 40%+ of them are homozygous (1/1 in the cell), such as chr     pos     refnt   altnt   bulk    singlecell10      102654490       T       C       0/0     1/1 10      110067127       C       A       0/0     1/1 10      116194634       C       A       0/0     0/1 10      128850598       G       A       0/0     1/1 10      132004258       G       A       0/0     1/1 10      134311110       G       A       0/0     1/1 10      1895892 G       A       0/0     1/1 10      37720217        G       A       0/0     1/1 10      3981066 C       T       0/0     1/1 10      55803203        G       A       0/0     0/1 10      57078512        G       T       0/0     0/1 10      82310254        G       A       0/0     0/1 10      83307891        G       A       0/0     1/1 10      85372996        C       A       0/0     1/1 10      85498193        G       A       0/0     1/1 10      91781082        C       T       0/0     1/1

But according to various publications, the number of homozygous somatic variants are so small that they can be omitted . Usually sSNV occurs on one copy of chromsome.  To have it occurs at the exact the same position on both copies of chromosomes is a rare event. Did I do anything wrong in the running the ScanSNV? The command I use looks like: scansnv --ref /data/human_g1k_v37_decoy.fasta --dbsnp dbsnp_138.b37.vcf --shapeit-panel 1000GP_Phase3 --output-dir OUTPUT/GT_102_2/1 --bam GT102 GT102.bam --bam GT_102_2 GT_102_2.bam --sc-sample GT_102_2 --bulk-sample GT102 --regions 1:1-249250621 --abmodel-chunks 20 --abmodel-samples-per-chunk 1000 --abmodel-hsnp-chunk-size 100 --hsnp-spikein-replicates 100 --joblimit 2 --resume The bam files are aligned to the above reference based on GATK guide line. Thank you for help. Happy new year. Yong

On Tuesday, December 17, 2019, 5:20:42 PM EST, jluquette <notifications@github.com> wrote:  

@qian0001 Are you still experiencing this issue?

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub, or unsubscribe.

jluquette commented 4 years ago

Hi @qian0001,

Happy to hear that your run worked. A note about your parameter changes: I highly recommend that you do not decrease --abmodel-hsnp-chunk-size below the default value. We have found that sizes smaller than 100 can cause significant changes in the AB model fit.

Thank you for your question about the final output. All of the mutations called by SCAN-SNV are heterozygous. The columns you've shown here are actually just the raw genotype calls produced by GATK in the first step of SCAN-SNV's pipeline. These genotype calls are unused by SCAN-SNV and you should completely ignore them. The way to interpret the output is: for each "PASS" mutation in the data frame, there is a heterozygous somatic mutation at the indicated location.

Sorry for the confusion--this is clearly a case of lacking documentation. Please let me know if you have any other questions!

qian0001 commented 4 years ago

Hi, OK. The job that failed with default parameters was rerun with --abmodel-hsnp-chunk-size 50 and 200 (half and twice of default values).  Both re-run worked and sSNV numbers different by 1.

Now my new question is which files I should look to find the number of allheterozygous variants of a single cell from various output files of ScanSNV, soI can scale the number of sSNVs from covered region to the entire genome? I see a number of vcf files in the output (mmq1.vcf,mmq60.vcf in the scansnv folder). Not sure if I can use either of them.  

Thank you.

 

Yong

On Thursday, January 2, 2020, 11:36:44 AM EST, jluquette <notifications@github.com> wrote:  

Hi @qian0001,

Happy to hear that your run worked. A note about your parameter changes: I highly recommend that you do not decrease --abmodel-hsnp-chunk-size below the default value. We have found that sizes smaller than 100 can cause significant changes in the AB model fit.

Thank you for your question about the final output. All of the mutations called by SCAN-SNV are heterozygous. The columns you've shown here are actually just the raw genotype calls produced by GATK in the first step of SCAN-SNV's pipeline. These genotype calls are unused by SCAN-SNV and you should completely ignore them. The way to interpret the output is: for each "PASS" mutation in the data frame, there is a heterozygous somatic mutation at the indicated location.

Sorry for the confusion--this is clearly a case of lacking documentation. Please let me know if you have any other questions!

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub, or unsubscribe.

jluquette commented 4 years ago

The called variants are in the R file

/path_to_your_output/scansnv/[single_cell_sample_name]/somatic_genotypes.rda

If you load this into R, the total number of heterozygous somatic mutations is simply:

R> load("/path_to_your_output/scansnv/[single_cell_sample_name]/somatic_genotypes.rda")
R> sum(somatic$pass)

If you want a rough estimate of the sensitivity of your calls, you can also look at the hSNP spikein analysis. In this analysis, a small number of hSNPs are withheld from the training set and then analyzed as if they were somatic mutation candidates.

R> load("/path_to_your_output/scansnv/[single_cell_sample_name]/hsnp_spikein_genotypes.rda")
R> mean(spikein$pass)

Typical values for sensitivity are 30-50%.

I should also mention that SCAN-SNV is only designed for high coverage, whole genome experiments (not exomes or WGS with mean coverage <15-20x). 100 such cells is a very large data set if it is indeed high depth and whole genome.

qian0001 commented 4 years ago

Thank you. There are large chunks of genome that are of no coverage or very little coverage of reads. I wanted to estimate the total number of sSNV.  So I need to know the total number of variants, not just somatic ones, for the single cells.  By comparing the total numbers of variants for bulk and single cells I can estimate how many somatic variants I missed in the uncovered or low-covered regions.

So I need to know which files contained all the single cell variants. I believe mmq1.vcf or mmq60.vcf in the scansnv subfolder contains those. I want to confirm with you before I can proceed. My single cell sequences are 30X on average. Yong

On Thursday, January 9, 2020, 11:57:05 AM EST, jluquette <notifications@github.com> wrote:  

The called variants are in the R file

/path_to_your_output/scansnv/[single_cell_sample_name]/somatic_genotypes.rda

If you load this into R, the total number of heterozygous somatic mutations is simply: R> load("/path_to_your_output/scansnv/[single_cell_sample_name]/somatic_genotypes.rda") R> sum(somatic$pass)

If you want a rough estimate of the sensitivity of your calls, you can also look at the hSNP spikein analysis. In this analysis, a small number of hSNPs are withheld from the training set and then analyzed as if they were somatic mutation candidates. R> load("/path_to_your_output/scansnv/[single_cell_sample_name]/hsnp_spikein_genotypes.rda") R> mean(spikein$pass)

Typical values for sensitivity are 30-50%.

I should also mention that SCAN-SNV is only designed for high coverage, whole genome experiments (not exomes or WGS with mean coverage <15-20x). 100 such cells is a very large data set if it is indeed high depth and whole genome.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub, or unsubscribe.

jluquette commented 4 years ago

I see. SCAN-SNV does not attempt to call non-somatic mutations except for the small number of hSNPs used in the spike-in analysis I described above. If you would like to do an ad-hoc analysis, the relevant files are:

gatk/hc_raw.mmq60.vcf - the raw output of joint GATK HaplotypeCaller on all samples (includes indels) scansnv/mmq60.vcf - the above VCF filtered down to biallelic SNPs that were also callable in bulk. callable in bulk just means not no-call ("./." genotype), so this includes sites where bulk is called homozygous reference ("0/0" genotype).

Just to be clear: these files are just GATK output. They are the input to SCAN-SNV but are not representative of SCAN-SNV mutation calls.

If your ultimate goal is to estimate the total number of somatic mutations, then I recommend using the hSNP spikein sensitivity method from the previous comment. This method is rough because hSNPs are not uniformly distributed over the genome (and especially not the phasable ones that SCAN-SNV uses), but is usually a pretty good guess. What fraction of your single cell genomes is usually at very low read depth? If it is reasonably low (ideally <10-20% but maybe even 50%), then I'd recommend trying the hSNP spikein sensitivity before an ad hoc analysis.

qian0001 commented 4 years ago

Thank you very much!Low read coverage areas (<20X) in my single cells are quite high. Mostly over 50% of the genomes are of low coverage.But  I'll look into the spikein option. Yong

On Thursday, January 9, 2020, 1:41:43 PM EST, jluquette <notifications@github.com> wrote:  

I see. SCAN-SNV does not attempt to call non-somatic mutations except for the small number of hSNPs used in the spike-in analysis I described above. If you would like to do an ad-hoc analysis, the relevant files are:

gatk/hc_raw.mmq60.vcf - the raw output of joint GATK HaplotypeCaller on all samples (includes indels) scansnv/mmq60.vcf - the above VCF filtered down to biallelic SNPs that were also callable in bulk. callable in bulk just means not no-call ("./." genotype), so this includes sites where bulk is called homozygous reference ("0/0" genotype).

If your ultimate goal is to estimate the total number of somatic mutations, then I recommend using the hSNP spikein sensitivity method from the previous comment. This method is rough because hSNPs are not uniformly distributed over the genome (and especially not the phasable ones that SCAN-SNV uses), but is usually a pretty good guess. What fraction of your single cell genomes is usually at very low read depth? If it is reasonably low (ideally <10-20% but maybe even 50%), then I'd recommend trying the hSNP spikein sensitivity before an ad hoc analysis.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub, or unsubscribe.