lima1 / PureCN

Copy number calling and variant classification using targeted short read sequencing
https://bioconductor.org/packages/devel/bioc/html/PureCN.html
Artistic License 2.0
128 stars 32 forks source link

corrected input vcf for PureCN #86

Closed qindan2008 closed 5 years ago

qindan2008 commented 5 years ago

Hello,

The input vcf for PureCN should contain both germline and somatic variants. Does this mean both the germline and somatic vcf from variant caller should be combined together into a single one ? Or, just flag the variants recored in germline databases in the somatic vcf ? Or, something else?

I just want to evaluate tumor purity and ploidy. I got vcf files with varscan2. Then, the high confidence germline and somatic vcf was combined together into a single vcf file, and annotated with dbSNP, marked the INFO with "DB" flag to indicate the variant was recored in the dbSNP database.

Run PureCN with the following command:

Rscript PureCN.R --out Tumor1/Tumor1 \ --normal PureCN_Coverage/Normal1.sort.dedup_coverage_loess.txt \ --tumor PureCN_Coverage/Tumor1.sort.dedup_coverage_loess.txt \ --vcf Tumor1_Normal1.varscan2.combinded.germline.somatic.vcf \ --parallel \ --cores 6 \ --sampleid Patient1 \ --intervals baits_hg19_intervals.txt \ --segfile Tumor1.segments \ --genome hg19

Both the coverage files and intervalFile was generated from PureCN. The segment file was generated from CNVkit. The above command completed without errors.

Is this OK ?

lima1 commented 5 years ago

Hi,

PureCN can calculate cancer cell fractions (CCFs) of somatic mutations. If you are not interested in sub-clonality of somatic mutations, you can only provide germline calls. For callers other than Mutect1, we don't do a lot of artifact filtering (except for low coverage, poor read count support and poor base quality), so you want to do that yourself. Let me know if you run into problems, possible that Varscan2 VCFs don't work out-of-the-box.

If you have a pool of normals, merge the normals in a single multi-sample VCF and follow the Quick vignette. This will remove variants not in dbSNP, but present in the PoN. It will also automatically clean up variants in poor quality regions.

If you want to use CNVkit, follow the example in the Quick vignette, no need to recalculate coverages again with PureCN.

Best, Markus

lima1 commented 5 years ago

--parallel will use the default BiocParallel backend, whereas --cores picks a standard BiocParallel backend with 6 cores. So use either --parallel and define a default backend in your ~/.Rprofile or use --cores

qindan2008 commented 5 years ago

Yes, I just modified Varscan2 output VCFs to fit PureCN. Thank you very much.

vedellpt commented 5 years ago

I have some questions that are closely related to this post. First, I will comment that I have been trying to get PureCN to run but all variants of my input vcf which is probably not of the desired format are being filtered. In general, I would like to know the answers to the first questions of qindan2008:

Does this mean both the germline and somatic vcf from variant caller should be combined together into a single one ? Or, just flag the variants recored in germline databases in the somatic vcf ? Or, something else?

More particularly, I noticed that the example vcfs in the R extdata folder example_vcf.vcf.gz and example.vcf.gz appear to Mutect (1) output, presumably version 1.1.7. Is an unfiltered Mutect version 1.1.7 vcf sufficient to use as a PureCN input vcf? Would this file contain enough of the germline variants? If so, that is good to know and I will try it. Or, should I separate apply a germline or multisample caller like haplotype caller or unified genotyper and then concatenate the resulting vcf with the vcf from the somatic caller (e.g. Mutect)? If so, what fields need to be present in both vcfs. (From reviewing the default vcf filtering functions, I think FA is important to have). Also, if so, what germline caller do you recommend and do you have any suggestions on how to accomplish the vcf concatenation in a way that results in a vcf that is properly formatted for PureCN?

Thanks for your help.

lima1 commented 5 years ago

Hi,

yes, Mutect 1.1.7 reliably calls germline variants. PureCN will remove all flagged variants when you provide the Mutect stats file, except for variants flagged being germline.

We rarely have matched normals and need the somatic variants, so I have unfortunately little experience with germline callers.

I'm currently working on getting full support for GATK4 in PureCN, this should make it in the next stable release.

If you are concerned about the quality of the germline calls, I would probably run HC on a panel of normals, filter stringently for quality and mapping bias (when average allelic fractions of heteozygous variants != 0.5), and then simply genotype those high quality sites. Especially if you have WES data and don't mind losing those <100 or so private SNPs.

VCF needs depths in an AD FORMAT field. It also needs a SOMATIC INFO flag, but it might add that for you. It currently also needs either a DB INFO flag for membership in a germline database, or alternatively population allele frequencies in a POP_AP INFO field.

vedellpt commented 5 years ago

Hi,

Thanks for your quick and detailed reply. I will first run Mutect 1.1.7 and try to use the resulting vcfs as the input to PureCN

Thanks!

vedellpt commented 5 years ago

Hi,

Thanks for your quick and detailed reply. I will first run Mutect 1.1.7 and try to use the resulting vcfs as the input to PureCN.

Thanks!

From: M. Riester [mailto:notifications@github.com] Sent: Friday, August 09, 2019 1:35 PM To: lima1/PureCN Cc: Vedell, Peter T. (Pete); Comment Subject: [EXTERNAL] Re: [lima1/PureCN] corrected input vcf for PureCN (#86)

Hi,

yes, Mutect 1.1.7 reliably calls germline variants. PureCN will remove all flagged variants when you provide the Mutect stats file, except for variants flagged being germline.

We rarely have matched normals and need the somatic variants, so I have unfortunately little experience with germline callers.

I'm currently working on getting full support for GATK4 in PureCN, this should make it in the next stable release.

If you are concerned about the quality of the germline calls, I would probably run HC on a panel of normals, filter stringently for quality and mapping bias (when average allelic fractions of heteozygous variants != 0.5), and then simply genotype those high quality sites. Especially if you have WES data and don't mind loosing those <100 or so private SNPs.

VCF needs depths in an AD FORMAT field. It also needs a SOMATIC INFO flag, but it might add that for you. It currently also needs either a DB INFO flag for membership in a germline database, or alternatively population allele frequencies in a POP_AP INFO field.

— You are receiving this because you commented. Reply to this email directly, view it on GitHubhttps://github.com/lima1/PureCN/issues/86?email_source=notifications&email_token=ACYGL343LCCA5WMUOSDWKOLQDW2DRA5CNFSM4HN6F7N2YY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOD37OKJQ#issuecomment-520021286, or mute the threadhttps://github.com/notifications/unsubscribe-auth/ACYGL34NKFRW2FAW3JZNGNTQDW2DRANCNFSM4HN6F7NQ.

lima1 commented 5 years ago

In case you are trying, let me know when you find a tumor/normal solution that works better than 1.1.7. But it worked well for us.

vedellpt commented 5 years ago

OK. For this particular data set that I’m currently working with, I think that generating the Mutect 1.1.7 vcfs make sense. I have another data set where I already have vcfs from applying the GATK 4.1.1.0 Mutect2 Somatic SNV/INDEL best practices workflow. I may try to use that. I will plan to let you know how it goes. Thanks.

From: M. Riester [mailto:notifications@github.com] Sent: Friday, August 09, 2019 1:47 PM To: lima1/PureCN Cc: Vedell, Peter T. (Pete); Comment Subject: [EXTERNAL] Re: [lima1/PureCN] corrected input vcf for PureCN (#86)

In case you are trying, let me know when you find a tumor/normal solution that works better than 1.1.7. But it worked well for us.

— You are receiving this because you commented. Reply to this email directly, view it on GitHubhttps://github.com/lima1/PureCN/issues/86?email_source=notifications&email_token=ACYGL3Y5AFI62BWJQZLTJD3QDW3Q7A5CNFSM4HN6F7N2YY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOD37PHHA#issuecomment-520024988, or mute the threadhttps://github.com/notifications/unsubscribe-auth/ACYGL35YZZKKNVKZSX7NI6LQDW3Q7ANCNFSM4HN6F7NQ.

vedellpt commented 5 years ago

For the test sample I am using, I called somatic variants using Mutect and used that as the input vcf for PureCN and got past the "all variants filtered" error. Thanks again for your help with that. I got a different error.

Error in colnames<-(*tmp*, value = c(paste("SOMATIC.M", test.num.copy, : attempt to set 'colnames' on an object with less than two dimensions

I didn't yet find a post that helped me with this error. I would be glad for any suggestions about two to resolve this error. Below I show the command arguments that I used and the complete log file content:

Command:

tmp_runAbsoluteCN <- runAbsoluteCN(

  • log.ratio = tmpLog2ratio_exonoid_medianLogRatio_ARbalancedadjusted,
  • vcf.file = tmpVcfCombinedFile,
  • genome="hg19",
  • sex = "F",
  • fun.segmentation = segmentationCBS,
  • args.segmentation = list(undo.SD=1),
  • fun.focal = findFocal,
  • args.focal = list(),
  • sampleid = smplNm,
  • min.ploidy = 1,
  • max.ploidy = 6,
  • test.num.copy = 0:7,
  • test.purity = seq(0.15, 0.95, by = 0.01),
  • prior.K = 0.999,
  • prior.contamination = 0.01,
  • max.candidate.solutions = 20,
  • candidates = NULL,
  • min.coverage = 15,
  • max.coverage.vcf = 2000,
  • max.non.clonal = 0.2,
  • max.homozygous.loss = c(0.05, 1e+07),
  • non.clonal.M = 1/3,
  • max.mapping.bias = 0.8,
  • max.pon = 3,
  • iterations = 30,
  • min.variants.segment = 5,
  • log.ratio.calibration = 0.1,
  • smooth.log.ratio = TRUE,
  • model.homozygous = FALSE,
  • error = 0.001,
  • interval.file = PureCN_interval.file,
  • max.dropout = c(0.95, 1.1),
  • min.logr.sdev = 0.15,
  • max.logr.sdev = 0.6,
  • max.segments = 300,
  • min.gof = 0.8,
  • plot.cnv = TRUE,
  • DB.info.flag = "DB",
  • POPAF.info.field = "POP_AF",
  • min.pop.af = 0.001,
  • model = c("betabin"),
  • post.optimize = FALSE,
  • speedup.heuristics = 2,
  • log.file = paste(sep="",smplNm,".log"),
  • verbose = TRUE
  • )

Log file content:

[user4243@srvr07 R]$ cat Sample001_Tumor.log INFO [2019-08-11 02:49:57] ------------------------------------------------------------ INFO [2019-08-11 02:49:57] PureCN 1.12.2 INFO [2019-08-11 02:49:57] ------------------------------------------------------------ INFO [2019-08-11 02:49:57] Arguments: -vcf.file /data/055.CNV/output/Sample001_Tumor_mutect.vcf -genome hg19 -sex F -args.segmentation 1 -args.focal -sampleid Sample001_Tumor -min.ploidy 1 -max.ploidy 6 -prior.K 0.999 -prior.contamination 0.01 -max.candidate.solutions 20 -candidates -min.coverage 15 -max.coverage.vcf 2000 -max.non.clonal 0.2 -max.homozygous.loss 0.05,1e+07 -non.clonal.M 0.333333333333333 -max.mapping.bias 0.8 -max.pon 3 -iterations 30 -min.variants.segment 5 -log.ratio.calibration 0.1 -smooth.log.ratio TRUE -model.homozygous FALSE -error 0.001 -interval.file /data/055.CNV/output/intervals.txt -max.dropout 0.95,1.1 -min.logr.sdev 0.15 -max.logr.sdev 0.6 -max.segments 300 -min.gof 0.8 -plot.cnv TRUE -DB.info.flag DB -POPAF.info.field POP_AF -min.pop.af 0.001 -model betabin -post.optimize FALSE -speedup.heuristics 2 -log.file Sample001_Tumor.log -verbose TRUE -log.ratio -fun.segmentation -fun.focal -test.num.copy -test.purity INFO [2019-08-11 02:49:57] Loading coverage files... WARN [2019-08-11 02:50:01] Target intervals were not sorted. WARN [2019-08-11 02:50:06] tumor.coverage.file and interval.file do not align. INFO [2019-08-11 02:51:46] Removing 332 low/high GC targets. WARN [2019-08-11 02:51:46] No normalDB provided. Provide one for better results. INFO [2019-08-11 02:51:46] Using 171565 intervals (171183 on-target, 382 off-target). INFO [2019-08-11 02:51:46] Ratio of mean on-target vs. off-target read counts: NaN INFO [2019-08-11 02:51:46] Mean off-target bin size: 566 INFO [2019-08-11 02:51:47] AT/GC dropout: 1.00 (tumor), 1.00 (normal). INFO [2019-08-11 02:51:47] Loading VCF... INFO [2019-08-11 02:51:48] Found 74518 variants in VCF file. INFO [2019-08-11 02:51:49] 62705 (84.1%) variants annotated as likely germline (DB INFO flag). INFO [2019-08-11 02:51:49] s_Sample001_Tumor is tumor in VCF file. INFO [2019-08-11 02:51:50] 396 homozygous and 1154 heterozygous variants on chrX. INFO [2019-08-11 02:51:50] Sex from VCF: F (Fisher's p-value: < 0.0001, odds-ratio: 0.76). INFO [2019-08-11 02:51:51] Removing 42439 non heterozygous (in matched normal) germline SNPs. INFO [2019-08-11 02:51:54] Initial testing for significant sample cross-contamination: unlikely INFO [2019-08-11 02:51:55] Removing 471 variants with AF < 0.030 or AF >= 1.000 or less than 5 supporting reads or depth < 15. INFO [2019-08-11 02:51:55] Removing 68 low quality variants with BQ < 25. INFO [2019-08-11 02:51:55] Total size of targeted genomic region: 0.00Mb (0.00Mb with 50bp padding). INFO [2019-08-11 02:51:56] 0.0% of targets contain variants. INFO [2019-08-11 02:51:56] Removing 31540 variants outside intervals. INFO [2019-08-11 02:51:56] Found SOMATIC annotation in VCF. INFO [2019-08-11 02:51:56] Setting somatic prior probabilities for somatic variants to 0.999000 or to 0.000100 otherwise. INFO [2019-08-11 02:51:56] Found SOMATIC annotation in VCF. Setting mapping bias to NaN. INFO [2019-08-11 02:51:56] Sample sex: F INFO [2019-08-11 02:51:56] Segmenting data... INFO [2019-08-11 02:51:57] Loading pre-computed boundaries for DNAcopy... INFO [2019-08-11 02:51:57] Setting undo.SD parameter to 1.000000. INFO [2019-08-11 02:52:40] Setting prune.hclust.h parameter to 0.100000. INFO [2019-08-11 02:52:41] Found 158 segments with median size of 1.26Mb. INFO [2019-08-11 02:52:41] Using 0 variants. INFO [2019-08-11 02:52:50] Mean standard deviation of log-ratios: 0.16 INFO [2019-08-11 02:52:50] 2D-grid search of purity and ploidy... INFO [2019-08-11 03:02:55] Local optima: 0.38/2, 0.58/2.6, 0.19/2, 0.38/1.6, 0.21/2.2, 0.58/3.2, 0.42/2.4, 0.82/3.6, 0.27/2.8, 0.15/2.6, 0.29/3.2, 0.52/2, 0.23/1.8, 0.48/4.4, 0.45/3.8, 0.52/3.6, 0.38/2.8, 0.31/3.6, 0.58/4.4, 0.65/5.2, 0.92/2 INFO [2019-08-11 03:02:55] Testing local optimum 1/21 at purity 0.38 and total ploidy 2.00... INFO [2019-08-11 03:03:52] Fitting variants for purity 0.39, tumor ploidy 2.03 and contamination 0.01. [user4243@srvr07 R]$

lima1 commented 5 years ago

There seems to be something very wrong with your interval file, not sure what, never seen this before. Unfortunately this means you need to start from scratch.

Can you update to latest stable; if you don't want to update Bioconductor, simply download from Github with BiocManager::install("lima1/PureCN", ref="RELEASE_3_9")?

Then I would highly recommend following the Quick vignette step by step.

vedellpt commented 5 years ago

OK. Yes, I see what you mean as the line

INFO [2019-08-11 02:51:56] 0.0% of targets contain variants.

does not make sense. OK. yes, I will download the latest version and work with the Quick vignette. Thanks.

vedellpt commented 5 years ago

I found the problem. In my interval file, I was coding the on_target column as a binary indicator variable, taking on values of 1 and 0. But, from running PureCN vignette code, I observed that this column needs to have logical values (TRUE and FALSE).

lima1 commented 5 years ago

Interesting, thanks. I'll add a check for that.

I would still highly recommend to follow the Quick vignette, especially for the interval file. It does a lot of checks and optimizations of your input BED file.

vedellpt commented 5 years ago

You’re welcome. Thanks for your reply. Yes, I see the Quick vignette workflow has some good steps for the interval file. I have been planning to run PureCN at least twice for the same data set. The first run would be using the log2 ratios and interval file from another workflow that I have already run and the second run would be employing the complete PureCN workflow as outlined in the Quick vignette. But, perhaps I could do a 3rd run using log2 ratios from the other workflow and the interval file as created using the Quick vignette or just make this the first run. Anyway, I am glad to report that, for one sample of this data set, I did get a successful completion of a call to runAbsoluteCN so I am glad to see that. Thanks for your help.

From: M. Riester [mailto:notifications@github.com] Sent: Monday, August 12, 2019 7:55 AM To: lima1/PureCN Cc: Vedell, Peter T. (Pete); Comment Subject: [EXTERNAL] Re: [lima1/PureCN] corrected input vcf for PureCN (#86)

Interesting, thanks. I'll add a check for that.

I would still highly recommend to follow the Quick vignette, especially for the interval file. It does a lot of checks and optimizations of your input BED file.

— You are receiving this because you commented. Reply to this email directly, view it on GitHubhttps://github.com/lima1/PureCN/issues/86?email_source=notifications&email_token=ACYGL34TCBMXVM6U2BD7PALQEFMS3A5CNFSM4HN6F7N2YY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOD4COAFI#issuecomment-520413205, or mute the threadhttps://github.com/notifications/unsubscribe-auth/ACYGL35QXQMHD2L3PQVCKD3QEFMS3ANCNFSM4HN6F7NQ.

lima1 commented 5 years ago

Yes, the Quick workflow should give you close to ideal results, it's very well tested.

One thing I recommend is using my patched version of the PSCBS package:

if (!requireNamespace("BiocManager", quietly=TRUE)) install.packages("BiocManager") BiocManager::install("lima1/PSCBS", ref="add_dnacopy_weighting")

Use this in PureCN.R with --funsegmentation PSCBS. (Currently not the default because it needs the patch).

Also, if you have WES data, off-target reads might not add much. The NormalDB.R script plots noise vs segment size. If you see a higher noise in off-target, you might want to repeat without.

lima1 commented 5 years ago

Especially since in your posted log-file, the off-target regions look weird (small and you have few of them).

vedellpt commented 5 years ago

Thanks for the additional information and recommendations and the patch, I appreciate it.

On Aug 12, 2019, at 7:15 PM, M. Riester notifications@github.com<mailto:notifications@github.com> wrote:

Especially since in your posted log-file, the off-target regions look weird (small and you have few of them).

— You are receiving this because you commented. Reply to this email directly, view it on GitHubhttps://github.com/lima1/PureCN/issues/86?email_source=notifications&email_token=ACYGL33UUB6GYQH5MB5FRY3QEH4KTA5CNFSM4HN6F7N2YY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOD4EFU6Y#issuecomment-520641147, or mute the threadhttps://github.com/notifications/unsubscribe-auth/ACYGL3ZKAAC57TJSCEAUV2LQEH4KTANCNFSM4HN6F7NQ.

vedellpt commented 5 years ago

Hi,

Thanks again for your help. I upgraded to PureCN 1.14.2 and installed PSCBS from your source. I found a parameter set for IntervalFile.R that filters intervals due to poor mappability but didn't divide them. This way, it was convenient to use the log2 ratios that I already have. I made a modified version of PureCN.R that accepts the log2 ratios ( and also prior.contamination since I have contamination estimates for these samples). I applied this script to the test set that I used previously. It looks like the processing almost completed without error. I did get an error though.

Here is the command that I submitted followed by the error file and the log file:

Command: :::::::::::::::::::::::::

[user4243@srvr07 086]$ Rscript ${PURECN}/PureCN__1.14.2_modified.R \

--sampleid=${smplNm} \ --logratio=${logratio_filepath} \ --vcf=${mutectvcf_filepath} \ --rds=${smplNm}_PureCN_run001.rds \ --sex=F \ --genome=hg19 \ --intervals=${INTERVALS_FILEPATH} \ --statsfile=${mutectstats_filepath} \ --minaf=${MINAF} \ --funsegmentation=PSCBS \ --minpurity=${MINPURITY} \ --maxpurity=${MAXPURITY} \ --maxcopynumber=${MAXCOPYNUMBER} \ --priorcontamination=${PRIORCONTAMINATION} \ --outvcf \ --out=${OUT} \ --seed=${SEEDVALU} \

${smplNm}.out \ 2> ${smplNm}.err &

Error File: :::::::::::::::::::::

Error in UseMethod("callLOH") : no applicable method for 'callLOH' applied to an object of class "list" Calls: write.csv ... cbind -> standardGeneric -> eval -> eval -> eval -> callLOH Execution halted

Log File: ::::::::::::::::::

/home/biotools/rpackages/R-3.5.2-latest /home/biotools/r/R-3.5.2/lib64/R/library [1] "/data/086/output" INFO [2019-08-15 19:14:19] Loading PureCN 1.14.2... INFO [2019-08-15 19:14:20] ------------------------------------------------------------ INFO [2019-08-15 19:14:20] PureCN 1.14.2 INFO [2019-08-15 19:14:20] ------------------------------------------------------------ INFO [2019-08-15 19:14:20] Arguments: -vcf.file /data/085.Mutect/output/Sample001_Tumor_mutect.vcf -genome hg19 -sex F -args.setPriorVcf 6 -args.setMappingBiasVcf NU LL -args.segmentation NULL,NULL,0.005,NULL -sampleid Sample001_Tumor -min.ploidy 1 -max.ploidy 6 -prior.contamination 0.01 -max.non.clonal 0.2 -max.homozygous.loss 0.05,1e+07 -log.ratio.calibration 0.1 -model.homozygous FALSE -error 0.0 01 -interval.file /data/055.CNV/output/PureCN_run001/reference_files/intervals_PureCN_formatted__ParameterSet004.txt -max.segments 300 -plot.cnv TRUE -DB.info.flag DB -POPAF.info.field POP_AF -model beta -post.optimize FALSE -BPPARAM -log.file /data/086/output/Sample001_Tumor.log -log.ratio -args.filterVcf -fun.seg mentation -test.num.copy -test.purity -speedup.heuristics INFO [2019-08-15 19:14:20] Loading coverage files... INFO [2019-08-15 19:15:17] Removing 331 low/high GC targets. WARN [2019-08-15 19:15:17] No normalDB provided. Provide one for better results. INFO [2019-08-15 19:15:17] Using 169470 intervals (169470 on-target, 0 off-target). INFO [2019-08-15 19:15:17] No off-target intervals. If this is hybrid-capture data, consider adding them. INFO [2019-08-15 19:15:20] AT/GC dropout: 1.00 (tumor), 1.00 (normal). INFO [2019-08-15 19:15:20] Loading VCF... INFO [2019-08-15 19:15:22] Found 74518 variants in VCF file. INFO [2019-08-15 19:15:23] 62705 (84.1%) variants annotated as likely germline (DB INFO flag). INFO [2019-08-15 19:15:26] s_Sample001_Tumor is tumor in VCF file. INFO [2019-08-15 19:15:27] 396 homozygous and 1154 heterozygous variants on chrX. INFO [2019-08-15 19:15:27] Sex from VCF: F (Fisher's p-value: < 0.0001, odds-ratio: 0.76). INFO [2019-08-15 19:15:33] Removing 14377 MuTect calls due to blacklisted failure reasons. INFO [2019-08-15 19:15:36] Removing 28879 non heterozygous (in matched normal) germline SNPs. INFO [2019-08-15 19:15:40] Initial testing for significant sample cross-contamination: unlikely INFO [2019-08-15 19:15:41] Removing 162 variants with AF < 0.030 or AF >= 1.000 or less than 5 supporting reads or depth < 15. INFO [2019-08-15 19:15:41] Removing 5 low quality variants with BQ < 25. INFO [2019-08-15 19:15:41] Total size of targeted genomic region: 96.40Mb (110.22Mb with 50bp padding). INFO [2019-08-15 19:15:42] 13.0% of targets contain variants. INFO [2019-08-15 19:15:42] Removing 3833 variants outside intervals. INFO [2019-08-15 19:15:42] Found SOMATIC annotation in VCF. INFO [2019-08-15 19:15:42] Setting somatic prior probabilities for somatic variants to 0.999000 or to 0.000100 otherwise. INFO [2019-08-15 19:15:43] Found SOMATIC annotation in VCF. Setting mapping bias to 0.969. INFO [2019-08-15 19:15:43] Excluding 134 novel or poor quality variants from segmentation. INFO [2019-08-15 19:15:43] Sample sex: F INFO [2019-08-15 19:15:43] Segmenting data... INFO [2019-08-15 19:15:45] Using unweighted PSCBS. INFO [2019-08-15 19:15:45] Setting undo.SD parameter to 0.750000. INFO [2019-08-15 19:17:25] Setting undo.SD parameter to 1.125000. INFO [2019-08-15 19:18:59] Setting prune.hclust.h parameter to 0.100000. INFO [2019-08-15 19:19:02] Found 391 segments with median size of 0.93Mb. INFO [2019-08-15 19:19:02] Using 27262 variants. INFO [2019-08-15 19:19:03] Mean standard deviation of log-ratios: 0.17 INFO [2019-08-15 19:19:03] 2D-grid search of purity and ploidy... INFO [2019-08-15 19:29:13] Local optima: 0.38/2, 0.58/2.6, 0.2/2, 0.38/1.6, 0.38/2.4, 0.58/3.2, 0.2/2.2, 0.1/2.2, 0.26/2.8, 0.92/3.8, 0.3/2.6, 0.2/2.8, 0.52/2, 0.24/1.8, 0.45/3.8, 0.48/4.4, 0.35/3.8, 0.45/3.4, 0.58/4.4, 0.65/5.2, 0.92/2 INFO [2019-08-15 19:29:13] Testing local optimum 1/21 at purity 0.38 and total ploidy 2.00... INFO [2019-08-15 19:30:40] Testing local optimum 2/21 at purity 0.58 and total ploidy 2.60... INFO [2019-08-15 19:31:47] Testing local optimum 3/21 at purity 0.20 and total ploidy 2.00... INFO [2019-08-15 19:32:00] Recalibrating log-ratios... INFO [2019-08-15 19:32:00] Testing local optimum 3/21 at purity 0.20 and total ploidy 2.00... INFO [2019-08-15 19:32:13] Recalibrating log-ratios... INFO [2019-08-15 19:32:13] Testing local optimum 3/21 at purity 0.20 and total ploidy 2.00... INFO [2019-08-15 19:32:27] Recalibrating log-ratios... INFO [2019-08-15 19:32:27] Testing local optimum 3/21 at purity 0.20 and total ploidy 2.00... INFO [2019-08-15 19:32:41] Testing local optimum 4/21 at purity 0.38 and total ploidy 1.60... INFO [2019-08-15 19:32:56] Recalibrating log-ratios... INFO [2019-08-15 19:32:56] Testing local optimum 4/21 at purity 0.38 and total ploidy 1.60... INFO [2019-08-15 19:33:09] Recalibrating log-ratios... INFO [2019-08-15 19:33:09] Testing local optimum 4/21 at purity 0.38 and total ploidy 1.60... INFO [2019-08-15 19:33:23] Recalibrating log-ratios... INFO [2019-08-15 19:33:23] Testing local optimum 4/21 at purity 0.38 and total ploidy 1.60... INFO [2019-08-15 19:33:38] Testing local optimum 5/21 at purity 0.38 and total ploidy 2.40... INFO [2019-08-15 19:34:59] Testing local optimum 6/21 at purity 0.58 and total ploidy 3.20... INFO [2019-08-15 19:36:16] Testing local optimum 7/21 at purity 0.20 and total ploidy 2.20... INFO [2019-08-15 19:38:00] Testing local optimum 8/21 at purity 0.10 and total ploidy 2.20... INFO [2019-08-15 19:39:48] Testing local optimum 9/21 at purity 0.26 and total ploidy 2.80... INFO [2019-08-15 19:41:12] Testing local optimum 10/21 at purity 0.92 and total ploidy 3.80... INFO [2019-08-15 19:42:46] Testing local optimum 11/21 at purity 0.30 and total ploidy 2.60... INFO [2019-08-15 19:44:40] Testing local optimum 12/21 at purity 0.20 and total ploidy 2.80... INFO [2019-08-15 19:44:54] Recalibrating log-ratios... INFO [2019-08-15 19:44:54] Testing local optimum 12/21 at purity 0.20 and total ploidy 2.80... INFO [2019-08-15 19:45:08] Recalibrating log-ratios... INFO [2019-08-15 19:45:08] Testing local optimum 12/21 at purity 0.20 and total ploidy 2.80... INFO [2019-08-15 19:45:25] Recalibrating log-ratios... INFO [2019-08-15 19:45:25] Testing local optimum 12/21 at purity 0.20 and total ploidy 2.80... INFO [2019-08-15 19:45:40] Testing local optimum 13/21 at purity 0.52 and total ploidy 2.00... INFO [2019-08-15 19:47:00] Testing local optimum 14/21 at purity 0.24 and total ploidy 1.80... INFO [2019-08-15 19:48:50] Testing local optimum 15/21 at purity 0.45 and total ploidy 3.80... INFO [2019-08-15 19:49:04] Recalibrating log-ratios... INFO [2019-08-15 19:49:04] Testing local optimum 15/21 at purity 0.45 and total ploidy 3.80... INFO [2019-08-15 19:49:19] Recalibrating log-ratios... INFO [2019-08-15 19:49:19] Testing local optimum 15/21 at purity 0.45 and total ploidy 3.80... INFO [2019-08-15 19:49:34] Recalibrating log-ratios... INFO [2019-08-15 19:49:34] Testing local optimum 15/21 at purity 0.45 and total ploidy 3.80... INFO [2019-08-15 19:51:34] Testing local optimum 16/21 at purity 0.48 and total ploidy 4.40... INFO [2019-08-15 19:51:50] Recalibrating log-ratios... INFO [2019-08-15 19:51:50] Testing local optimum 16/21 at purity 0.48 and total ploidy 4.40... INFO [2019-08-15 19:52:04] Recalibrating log-ratios... INFO [2019-08-15 19:52:04] Testing local optimum 16/21 at purity 0.48 and total ploidy 4.40... INFO [2019-08-15 19:52:20] Recalibrating log-ratios... INFO [2019-08-15 19:52:20] Testing local optimum 16/21 at purity 0.48 and total ploidy 4.40... INFO [2019-08-15 19:52:36] Testing local optimum 17/21 at purity 0.35 and total ploidy 3.80... INFO [2019-08-15 19:52:49] Recalibrating log-ratios... INFO [2019-08-15 19:52:49] Testing local optimum 17/21 at purity 0.35 and total ploidy 3.80... INFO [2019-08-15 19:53:04] Recalibrating log-ratios... INFO [2019-08-15 19:53:04] Testing local optimum 17/21 at purity 0.35 and total ploidy 3.80... INFO [2019-08-15 19:55:36] Recalibrating log-ratios... INFO [2019-08-15 19:55:36] Testing local optimum 17/21 at purity 0.35 and total ploidy 3.80... INFO [2019-08-15 19:58:09] Testing local optimum 18/21 at purity 0.45 and total ploidy 3.40... INFO [2019-08-15 19:59:34] Testing local optimum 19/21 at purity 0.58 and total ploidy 4.40... INFO [2019-08-15 19:59:48] Recalibrating log-ratios... INFO [2019-08-15 19:59:48] Testing local optimum 19/21 at purity 0.58 and total ploidy 4.40... INFO [2019-08-15 20:01:27] Recalibrating log-ratios... INFO [2019-08-15 20:01:27] Testing local optimum 19/21 at purity 0.58 and total ploidy 4.40... INFO [2019-08-15 20:03:28] Recalibrating log-ratios... INFO [2019-08-15 20:03:28] Testing local optimum 19/21 at purity 0.58 and total ploidy 4.40... INFO [2019-08-15 20:06:10] Testing local optimum 20/21 at purity 0.65 and total ploidy 5.20... INFO [2019-08-15 20:06:24] Recalibrating log-ratios... INFO [2019-08-15 20:06:24] Testing local optimum 20/21 at purity 0.65 and total ploidy 5.20... INFO [2019-08-15 20:08:22] Recalibrating log-ratios... INFO [2019-08-15 20:08:22] Testing local optimum 20/21 at purity 0.65 and total ploidy 5.20... INFO [2019-08-15 20:10:25] Recalibrating log-ratios... INFO [2019-08-15 20:10:25] Testing local optimum 20/21 at purity 0.65 and total ploidy 5.20... INFO [2019-08-15 20:12:15] Testing local optimum 21/21 at purity 0.92 and total ploidy 2.00... INFO [2019-08-15 20:13:16] Skipping 3 solutions that converged to the same optima. INFO [2019-08-15 20:13:16] Skipping 4 solutions exceeding max.non.clonal (0.20). INFO [2019-08-15 20:13:16] Fitting variants for local optimum 1/21... INFO [2019-08-15 20:13:23] Fitting variants for purity 0.39, tumor ploidy 2.06 and contamination 0.01. INFO [2019-08-15 20:17:26] Optimized purity: 0.39 INFO [2019-08-15 20:17:26] Fitting variants for local optimum 2/21... INFO [2019-08-15 20:17:35] Fitting variants for purity 0.53, tumor ploidy 3.04 and contamination 0.01. INFO [2019-08-15 20:21:40] Optimized purity: 0.53 INFO [2019-08-15 20:21:40] Fitting variants for local optimum 5/21... INFO [2019-08-15 20:21:48] Fitting variants for purity 0.44, tumor ploidy 3.07 and contamination 0.01. INFO [2019-08-15 20:25:38] Optimized purity: 0.44 INFO [2019-08-15 20:25:38] Fitting variants for local optimum 6/21... INFO [2019-08-15 20:25:46] Fitting variants for purity 0.60, tumor ploidy 4.07 and contamination 0.01. INFO [2019-08-15 20:29:53] Optimized purity: 0.60 INFO [2019-08-15 20:29:53] Fitting variants for local optimum 7/21... INFO [2019-08-15 20:30:01] Fitting variants for purity 0.22, tumor ploidy 3.11 and contamination 0.01. INFO [2019-08-15 20:33:50] Optimized purity: 0.22 INFO [2019-08-15 20:33:50] Fitting variants for local optimum 8/21... INFO [2019-08-15 20:33:58] Fitting variants for purity 0.12, tumor ploidy 5.01 and contamination 0.01. INFO [2019-08-15 20:38:06] Optimized purity: 0.12 INFO [2019-08-15 20:38:06] Fitting variants for local optimum 9/21... INFO [2019-08-15 20:38:14] Fitting variants for purity 0.26, tumor ploidy 5.12 and contamination 0.01. INFO [2019-08-15 20:42:03] Optimized purity: 0.26 INFO [2019-08-15 20:42:03] Fitting variants for local optimum 10/21... INFO [2019-08-15 20:42:11] Fitting variants for purity 0.95, tumor ploidy 4.02 and contamination 0.01. INFO [2019-08-15 20:46:13] Optimized purity: 0.95 INFO [2019-08-15 20:46:13] Fitting variants for local optimum 11/21... INFO [2019-08-15 20:46:21] Fitting variants for purity 0.29, tumor ploidy 4.08 and contamination 0.01. INFO [2019-08-15 20:50:07] Optimized purity: 0.29 INFO [2019-08-15 20:50:07] Fitting variants for local optimum 14/21... INFO [2019-08-15 20:50:14] Fitting variants for purity 0.28, tumor ploidy 1.17 and contamination 0.01. INFO [2019-08-15 20:54:32] Optimized purity: 0.28 INFO [2019-08-15 20:54:32] Fitting variants for local optimum 15/21... INFO [2019-08-15 20:54:40] Fitting variants for purity 0.61, tumor ploidy 1.07 and contamination 0.01. INFO [2019-08-15 20:58:58] Optimized purity: 0.61 INFO [2019-08-15 20:58:58] Fitting variants for local optimum 18/21... INFO [2019-08-15 20:59:06] Fitting variants for purity 0.46, tumor ploidy 5.12 and contamination 0.01. INFO [2019-08-15 21:03:14] Optimized purity: 0.46 INFO [2019-08-15 21:03:14] Fitting variants for local optimum 19/21... INFO [2019-08-15 21:03:22] Fitting variants for purity 0.15, tumor ploidy 4.11 and contamination 0.01. INFO [2019-08-15 21:07:07] Optimized purity: 0.15 INFO [2019-08-15 21:07:07] Fitting variants for local optimum 21/21... INFO [2019-08-15 21:07:15] Fitting variants for purity 0.72, tumor ploidy 2.09 and contamination 0.01. INFO [2019-08-15 21:10:58] Optimized purity: 0.72 INFO [2019-08-15 21:10:58] Done. INFO [2019-08-15 21:10:58] ------------------------------------------------------------ null device 1 null device 1

lima1 commented 5 years ago

Never seen this before. Can you do

ret <- readRDS("Sample001_Tumor.rds") callLOH(ret)

?

Btw, I added a function readLogRatioFile in PureCN 1.15.4 that currently only supports GATK4. If you are using a fairly widely used workflow, happy to add it. It similarly supports such files via --logratiofile in PureCN.R.

PureCN should be robust to contamination rate, but don't attempt to run it on terrible contamination rates >7% or something like that.

vedellpt commented 5 years ago

Hello. Thanks for your reply. I ran these commands two ways. First, when loading PureCN and then the patched PSCBS. And, then, loading PSCBS and then PureCN. In the former case, I got the error message. In the latter case, the callLOH call was successful. It would seem then that the error is specific to the PSCBS callLOH function. Here are the interactive command for each case:

Case 1 :::::::::::::::

suppressPackageStartupMessages(library(PureCN)) suppressPackageStartupMessages(library(PSCBS)) setwd("output"); ret <- readRDS("Sample001_Tumor.rds") callLOH(ret)

Error in UseMethod("callLOH") : no applicable method for 'callLOH' applied to an object of class "list"`

Case 2 :::::::::::::::

suppressPackageStartupMessages(library(PSCBS)) suppressPackageStartupMessages(library(PureCN)) setwd("output"); ret <- readRDS("Sample001_Tumor.rds") tst <- callLOH(ret) str(tst)

'data.frame': 371 obs. of 13 variables: $ chr : chr "chr1" "chr1" "chr1" "chr1" ... $ start : num 879317 1643226 1663266 16376724 16378158 ... $ end : num 1643226 1663266 16376724 16378158 32665428 ... $ arm : chr "p" "p" "p" "p" ... $ C : num 1 2 1 1 1 2 3 2 2 2 ... $ M : int 0 0 0 NA 0 1 NA 1 0 1 ... $ type : chr "LOH" "COPY-NEUTRAL LOH" "LOH" NA ... $ seg.mean : num -0.4186 0.0351 -0.3124 -0.3512 -0.3124 ... $ num.mark : int 244 3 1504 4 1950 292 18 4078 24 1442 ... $ num.snps : int 67 1 270 0 289 41 0 573 2 203 ... $ M.flagged : logi FALSE TRUE FALSE NA FALSE FALSE ... $ maf.expected: num 0.379 0.305 0.379 NA 0.379 ... $ maf.observed: num 0.374 0.211 0.378 NA 0.377 ...

Thanks for letting me know about the function readLogRatioFile in PureCN 1.15.4. It is good that it supports GATK4. I can contact you separately about the workflow that I am currently using.

Thanks for letting me know about the range of robustness of PureCN with respect to contamination. I recall seeing that approximate range in the default parameters for filterVcfBasic. Almost all of the samples I am working with fall into that range. I will not expect useable results for the few outside that range.

vedellpt commented 5 years ago

I have been assuming that for the patch to have its effect, PSCBS would need to be loaded after PureCN. So, in my revision to PureCN.R, I loaded the patched PSCBS after PureCN. Might the patch possibly be effective even if PureCN was loaded after PSCBS? If so, then, could I just reverse the order of loading and get past the error without disabling the patch? I guess this would be unlikely, but I thought I would check. Thanks.

lima1 commented 5 years ago

Ah, yes, PSCBS has a different callLOH function. You don't need to load PSCBS, PureCN will do it automatically.

vedellpt commented 5 years ago

OK. That's great. Then, I think I will be able to get past this error. I will give it a try. Thanks.

lima1 commented 5 years ago

And the patched PSCBS is providing support for a target_weight_file. Not critical at all, but helps to automatically skip germline and poor quality regions in the segmentation.

vedellpt commented 5 years ago

OK. That's good to know. I can see where that would be useful. Thanks.

vedellpt commented 5 years ago

I got the test case to run successfully. I am glad to be running the latest version, to have been able to use IntervalFile.R for generating the interval file, to be using the PureCN.R script, and and have some additional insight into PureCN. Thanks for your help.

lima1 commented 5 years ago

You're welcome. Let me know the outcome of your benchmarking if you can (especially in case you think you can beat the default Quick vignette workflow + PSCBS).