lima1 / PureCN

Copy number calling and variant classification using targeted short read sequencing
https://bioconductor.org/packages/devel/bioc/html/PureCN.html
Artistic License 2.0
125 stars 32 forks source link

Error using PureCN.R #20

Closed billnjcn111 closed 6 years ago

billnjcn111 commented 6 years ago

... Removing 259 variants outside intervals. INFO [2018-02-25 11:14:01] Setting somatic prior probabilities for dbSNP hits to 0.000500 or to 0.500000 otherwise. Error: logical subscript contains NAs Execution halted

lima1 commented 6 years ago

Not sure where this happens. Can you by any chance share example input files and reproducible code by email?

If not, and if you use the command line interface, can you post your PureCN.R command line, if you use runAbsoluteCN, the complete list of arguments?

billnjcn111 commented 6 years ago

Do you have a place where I can share all the input files? because a few lines of input file may not reproduce the issue. Thanks A lot

On Sun, Feb 25, 2018 at 11:30 AM, M. Riester notifications@github.com wrote:

Not sure where this happens. Can you by any chance share example input files and reproducible code by email?

If not, and if you use the command line interface, can you post your PureCN.R command line, if you use runAbsoluteCN, the complete list of arguments?

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/lima1/PureCN/issues/20#issuecomment-368323118, or mute the thread https://github.com/notifications/unsubscribe-auth/AXw6DLpAt4pUS8WspTJiGgfC9msTK-Zfks5tYYqfgaJpZM4SSWvK .

lima1 commented 6 years ago

Can you share via Dropbox?

Does this error happen in a very minimal test run with just the tumor coverage file, a single normal coverage file, the VCF and the interval file? Or do you use CNVkit as input? The it should be small enough for email (markus.riester at gmail com)

billnjcn111 commented 6 years ago

OK, let me test it using the above minimal option and report you back. I don't use the CNVkit and I used Mutect2 vcf. Hold on...

On Sun, Feb 25, 2018 at 11:45 AM, M. Riester notifications@github.com wrote:

Can you share via Dropbox?

Does this error happen in a very minimal test run with just the tumor coverage file, a single normal coverage file, the VCF and the interval file? Or do you use CNVkit as input? The it should be small enough for email (markus.riester at gmail com)

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/lima1/PureCN/issues/20#issuecomment-368324253, or mute the thread https://github.com/notifications/unsubscribe-auth/AXw6DEBEtUK3AXx26dKdw_RgdtIuk-s5ks5tYY4ngaJpZM4SSWvK .

billnjcn111 commented 6 years ago

Well, it seems that with that minimal options there is at least no such error. Now it is doing the grid search which takes time. So initially I had a few more options like mapping bias file (from mutect2), and the target weight file as well as --outvcf --postoptimize --seed 123 Which one caused the trouble? here is a few lines of my mapping bias rds: GRanges object with 301373 ranges and 2 metadata columns: seqnames ranges strand | bias

| 1:13289_CCT/C 1 [13289, 13291] * | NaN 1:13494_A/G 1 [13494, 13494] * | NaN 1:14907_A/G 1 [14907, 14907] * | NaN 1:14930_A/G 1 [14930, 14930] * | NaN 1:14933_G/A 1 [14933, 14933] * | NaN On Sun, Feb 25, 2018 at 11:47 AM, billnjcn wrote: > OK, let me test it using the above minimal option and report you back. I > don't use the CNVkit and I used Mutect2 vcf. > Hold on... > > On Sun, Feb 25, 2018 at 11:45 AM, M. Riester > wrote: > >> Can you share via Dropbox? >> >> Does this error happen in a very minimal test run with just the tumor >> coverage file, a single normal coverage file, the VCF and the interval >> file? Or do you use CNVkit as input? The it should be small enough for >> email (markus.riester at gmail com) >> >> — >> You are receiving this because you authored the thread. >> Reply to this email directly, view it on GitHub >> , or mute >> the thread >> >> . >> > >
lima1 commented 6 years ago

Great, that helps at lot. So the original example without the mapping bias rds should work, right?

If you can share a truncated version of your normal panel VCF (header+those 5 lines), the one that was used to generate the rds, I'll check why this does not work with M2 normals.

billnjcn111 commented 6 years ago

tumor_sample=TCGA-DQ-5629-10A-01D-1870-08

CHROM POS ID REF ALT QUAL FILTER INFO

1 13289 . CCT C . . . 1 13494 . A G . . . 1 14907 . A G . . . 1 14930 . A G . . . 1 14933 . G A . . . 1 14948 . G A . . . 1 14976 . G A . . .

On Sun, Feb 25, 2018 at 12:07 PM, M. Riester notifications@github.com wrote:

Great, that helps at lot. So the original example without the mapping bias rds should work, right?

If you can share a truncated version of your normal panel VCF (header+those 5 lines), the one that was used to generate the rds, I'll check why this does not work with M2 normals.

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/lima1/PureCN/issues/20#issuecomment-368325778, or mute the thread https://github.com/notifications/unsubscribe-auth/AXw6DJndAKDOEulsMv22wLBQ819HHv52ks5tYZM0gaJpZM4SSWvK .

billnjcn111 commented 6 years ago

Header is very long, here is the initial ones:

fileformat=VCFv4.2

FORMAT=<ID=AD,Number=R,Type=Integer,Description="Allelic depths for the

ref and alt alleles in the order listed">

FORMAT=<ID=AF,Number=A,Type=Float,Description="Allele fractions of

alternate alleles in the tumor">

FORMAT=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth

(reads with MQ=255 or with bad mates are filtered)">

FORMAT=<ID=F1R2,Number=R,Type=Integer,Description="Count of reads in F1R2

pair orientation supporting each allele">

FORMAT=<ID=F2R1,Number=R,Type=Integer,Description="Count of reads in F2R1

pair orientation supporting each allele">

FORMAT=

FORMAT=

FORMAT=

FORMAT=

FORMAT=

FORMAT=<ID=MPOS,Number=A,Type=Float,Description="median distance from end

of read">

FORMAT=<ID=PGT,Number=1,Type=String,Description="Physical phasing

haplotype information, describing how the alternate alleles are phased in relation to one another">

FORMAT=<ID=PID,Number=1,Type=String,Description="Physical phasing ID

information, where each unique ID within a given sample (but not across samples) connects records within a phasing group">

FORMAT=<ID=PL,Number=G,Type=Integer,Description="Normalized, Phred-scaled

likelihoods for genotypes as defined in the VCF specification">

FORMAT=<ID=SA_MAP_AF,Number=3,Type=Float,Description="MAP estimates of

allele fraction given z">

FORMAT=<ID=SA_POST_PROB,Number=3,Type=Float,Description="posterior

probabilities of the presence of strand artifact">

On Sun, Feb 25, 2018 at 12:14 PM, billnjcn billnjcn@gmail.com wrote:

tumor_sample=TCGA-DQ-5629-10A-01D-1870-08

CHROM POS ID REF ALT QUAL FILTER INFO

1 13289 . CCT C . . . 1 13494 . A G . . . 1 14907 . A G . . . 1 14930 . A G . . . 1 14933 . G A . . . 1 14948 . G A . . . 1 14976 . G A . . .

On Sun, Feb 25, 2018 at 12:07 PM, M. Riester notifications@github.com wrote:

Great, that helps at lot. So the original example without the mapping bias rds should work, right?

If you can share a truncated version of your normal panel VCF (header+those 5 lines), the one that was used to generate the rds, I'll check why this does not work with M2 normals.

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/lima1/PureCN/issues/20#issuecomment-368325778, or mute the thread https://github.com/notifications/unsubscribe-auth/AXw6DJndAKDOEulsMv22wLBQ819HHv52ks5tYZM0gaJpZM4SSWvK .

billnjcn111 commented 6 years ago

It does not have BQ instead using MBQ, I am not sure how you filtered the low quality variants. but that is another minor thing. Thanks

On Sun, Feb 25, 2018 at 12:15 PM, billnjcn billnjcn@gmail.com wrote:

Header is very long, here is the initial ones:

fileformat=VCFv4.2

FORMAT=<ID=AD,Number=R,Type=Integer,Description="Allelic depths for the

ref and alt alleles in the order listed">

FORMAT=<ID=AF,Number=A,Type=Float,Description="Allele fractions of

alternate alleles in the tumor">

FORMAT=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth

(reads with MQ=255 or with bad mates are filtered)">

FORMAT=<ID=F1R2,Number=R,Type=Integer,Description="Count of reads in

F1R2 pair orientation supporting each allele">

FORMAT=<ID=F2R1,Number=R,Type=Integer,Description="Count of reads in

F2R1 pair orientation supporting each allele">

FORMAT=

FORMAT=

FORMAT=

FORMAT=<ID=MFRL,Number=R,Type=Float,Description="median fragment

length">

FORMAT=

FORMAT=<ID=MPOS,Number=A,Type=Float,Description="median distance from

end of read">

FORMAT=<ID=PGT,Number=1,Type=String,Description="Physical phasing

haplotype information, describing how the alternate alleles are phased in relation to one another">

FORMAT=<ID=PID,Number=1,Type=String,Description="Physical phasing ID

information, where each unique ID within a given sample (but not across samples) connects records within a phasing group">

FORMAT=<ID=PL,Number=G,Type=Integer,Description="Normalized,

Phred-scaled likelihoods for genotypes as defined in the VCF specification">

FORMAT=<ID=SA_MAP_AF,Number=3,Type=Float,Description="MAP estimates of

allele fraction given z">

FORMAT=<ID=SA_POST_PROB,Number=3,Type=Float,Description="posterior

probabilities of the presence of strand artifact">

On Sun, Feb 25, 2018 at 12:14 PM, billnjcn billnjcn@gmail.com wrote:

tumor_sample=TCGA-DQ-5629-10A-01D-1870-08

CHROM POS ID REF ALT QUAL FILTER INFO

1 13289 . CCT C . . . 1 13494 . A G . . . 1 14907 . A G . . . 1 14930 . A G . . . 1 14933 . G A . . . 1 14948 . G A . . . 1 14976 . G A . . .

On Sun, Feb 25, 2018 at 12:07 PM, M. Riester notifications@github.com wrote:

Great, that helps at lot. So the original example without the mapping bias rds should work, right?

If you can share a truncated version of your normal panel VCF (header+those 5 lines), the one that was used to generate the rds, I'll check why this does not work with M2 normals.

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/lima1/PureCN/issues/20#issuecomment-368325778, or mute the thread https://github.com/notifications/unsubscribe-auth/AXw6DJndAKDOEulsMv22wLBQ819HHv52ks5tYZM0gaJpZM4SSWvK .

billnjcn111 commented 6 years ago

You are right, as I said, it passed that place where that error occurred. I am using the Rscript on the command line. It is doing grid search w/o the mapping bias file

On Sun, Feb 25, 2018 at 12:07 PM, M. Riester notifications@github.com wrote:

Great, that helps at lot. So the original example without the mapping bias rds should work, right?

If you can share a truncated version of your normal panel VCF (header+those 5 lines), the one that was used to generate the rds, I'll check why this does not work with M2 normals.

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/lima1/PureCN/issues/20#issuecomment-368325778, or mute the thread https://github.com/notifications/unsubscribe-auth/AXw6DJndAKDOEulsMv22wLBQ819HHv52ks5tYZM0gaJpZM4SSWvK .

lima1 commented 6 years ago

Would you mind sharing the truncated VCF (again header + first couple of variant rows) as attachment here or by email? So that I can reproduce?

billnjcn111 commented 6 years ago

Also do you think by adding the mapping bias will raise the accuracy of the predication? I mainly try to find a method to compare to FM 's SGZ algorithm. Thanks

On Sun, Feb 25, 2018 at 12:18 PM, billnjcn billnjcn@gmail.com wrote:

You are right, as I said, it passed that place where that error occurred. I am using the Rscript on the command line. It is doing grid search w/o the mapping bias file

On Sun, Feb 25, 2018 at 12:07 PM, M. Riester notifications@github.com wrote:

Great, that helps at lot. So the original example without the mapping bias rds should work, right?

If you can share a truncated version of your normal panel VCF (header+those 5 lines), the one that was used to generate the rds, I'll check why this does not work with M2 normals.

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/lima1/PureCN/issues/20#issuecomment-368325778, or mute the thread https://github.com/notifications/unsubscribe-auth/AXw6DJndAKDOEulsMv22wLBQ819HHv52ks5tYZM0gaJpZM4SSWvK .

lima1 commented 6 years ago

Ah, got it. The VCF does not contain read counts.

lima1 commented 6 years ago

I'm not sure what's the best way of generating the normal panel VCF in GATK4, but it needs to contain all read counts. There is apparently an undocumented MergeVcfs (https://gatkforums.broadinstitute.org/gatk/discussion/10328/combinevariants-in-gatk4).

Yes, it helps quite a bit, especially in flagging variants from regions with high mapping bias where the allele frequencies are much lower than expected. Without fixing this, you'll get a lot of false somatic calls.

billnjcn111 commented 6 years ago

Great, is there an easy fix or I can just leave it. Also as I asked did you see improvement of accuracy of prediction by adding that feature? thx

On Sun, Feb 25, 2018 at 12:21 PM, M. Riester notifications@github.com wrote:

Ah, got it. The VCF does not contain read counts.

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/lima1/PureCN/issues/20#issuecomment-368326893, or mute the thread https://github.com/notifications/unsubscribe-auth/AXw6DHvt3xq00I4oaqr1-cjtqNaCQUJOks5tYZafgaJpZM4SSWvK .

lima1 commented 6 years ago

So just to make this clear: You'll need to run Mutect or Mutect2 on the normals in tumor-only mode. Get as many normals as possible. Then merge the normal VCFs into a single VCF that contains REF and ALT read counts in an AD format field.

CombineVariants did this in GATK3. For GATK4, let me know how it goes with MergeVcfs.

billnjcn111 commented 6 years ago

thx

On Sun, Feb 25, 2018 at 12:39 PM, M. Riester notifications@github.com wrote:

So just to make this clear: You'll need to run Mutect or Mutect2 on the normals in tumor-only mode. Get as many normals as possible. Then merge the normal VCFs into a single VCF that contains REF and ALT read counts in an AD format field.

CombineVariants did this in GATK3. For GATK4, let me know how it goes with MergeVcfs.

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/lima1/PureCN/issues/20#issuecomment-368328249, or mute the thread https://github.com/notifications/unsubscribe-auth/AXw6DN5TQsIm8oqMvpHyo9vabaFKy78Oks5tYZq7gaJpZM4SSWvK .

billnjcn111 commented 6 years ago

Another error using the minimal options: ... INFO [2018-02-25 12:57:24] Optimized contamination rate: 0.071 INFO [2018-02-25 12:57:24] Done. INFO [2018-02-25 12:57:24]

Warning message: In .bcfHeaderAsSimpleList(header) : duplicate keys in header will be forced to unique rownames null device 1 null device 1 Error in (function (classes, fdef, mtable) : unable to find an inherited method for function ‘seqnames’ for signature ‘"NULL"’ Calls: write.csv ... .getArmLocations -> match -> seqnames -> Execution halted

On Sun, Feb 25, 2018 at 12:45 PM, billnjcn billnjcn@gmail.com wrote:

thx

On Sun, Feb 25, 2018 at 12:39 PM, M. Riester notifications@github.com wrote:

So just to make this clear: You'll need to run Mutect or Mutect2 on the normals in tumor-only mode. Get as many normals as possible. Then merge the normal VCFs into a single VCF that contains REF and ALT read counts in an AD format field.

CombineVariants did this in GATK3. For GATK4, let me know how it goes with MergeVcfs.

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/lima1/PureCN/issues/20#issuecomment-368328249, or mute the thread https://github.com/notifications/unsubscribe-auth/AXw6DN5TQsIm8oqMvpHyo9vabaFKy78Oks5tYZq7gaJpZM4SSWvK .

lima1 commented 6 years ago

Looks like it finished the main steps and generated an output RDS file. Can you send all the successful output by mail?

billnjcn111 commented 6 years ago

.log, _segmentation.pdf, .rds, .csv, .pdf, _local_optima.pdf, _dnacopy.seg, _genes.csv, _variants.csv

On Sun, Feb 25, 2018 at 1:33 PM, M. Riester notifications@github.com wrote:

Looks like it finished the main steps and generated an output RDS file. Can you send all the successful output by mail?

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/lima1/PureCN/issues/20#issuecomment-368332532, or mute the thread https://github.com/notifications/unsubscribe-auth/AXw6DJZTrFK3xq6DaJBL8uHQpXoUmcTeks5tYaeCgaJpZM4SSWvK .

billnjcn111 commented 6 years ago

Do you need the files? the rds is too big. Let me see if I can use drop box

On Sun, Feb 25, 2018 at 1:43 PM, billnjcn billnjcn@gmail.com wrote:

.log, _segmentation.pdf, .rds, .csv, .pdf, _localoptima.pdf, dnacopy.seg, _genes.csv, _variants.csv

On Sun, Feb 25, 2018 at 1:33 PM, M. Riester notifications@github.com wrote:

Looks like it finished the main steps and generated an output RDS file. Can you send all the successful output by mail?

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/lima1/PureCN/issues/20#issuecomment-368332532, or mute the thread https://github.com/notifications/unsubscribe-auth/AXw6DJZTrFK3xq6DaJBL8uHQpXoUmcTeks5tYaeCgaJpZM4SSWvK .

lima1 commented 6 years ago

Can you try in R:

library(PureCN)
x <- readCurationFile("Sampleid.rds", report.best.only=TRUE)
loh <- callLOH(x); # this probably crashes
saveRDS(x, file="Sampleid_bestonly.rds")

This should dramatically reduce the file size of the RDS (I don't need the others), enough to share by mail.

billnjcn111 commented 6 years ago

Yes, you are right, it is the loh step that caused the crash. and the best.only res is still 100 M.

On Sun, Feb 25, 2018 at 2:18 PM, M. Riester notifications@github.com wrote:

Can you try in R:

library(PureCN) x <- readCurationFile("Sampleid.rds", report.best.only=TRUE) loh <- callLOH(x); # this probably crashes saveRDS(x, file="Sampleid_bestonly.rds")

This should dramatically reduce the file size of the RDS (I don't need the others), enough to share by mail.

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/lima1/PureCN/issues/20#issuecomment-368336340, or mute the thread https://github.com/notifications/unsubscribe-auth/AXw6DIjbuV-JeD1R9kVn8qIxNZtk-C9Xks5tYbIdgaJpZM4SSWvK .

lima1 commented 6 years ago

If you can't share by Dropbox, can you help me understand where the crash happens:

x <- readCurationFile("Sampleid.rds", report.best.only=TRUE) debug(PureCN:::.getArmLocations) PureCN:::.getArmLocations(x)

Keep pressing ENTER until it crashes, but please after the line "centromeres <- .getCentromeres(res)", enter "centromeres" so that I can see if there was an issue getting the centromere locations.

lima1 commented 6 years ago

What --genome did you specify? My guess is not one of "hg19" or "hg38", right? It shouldn't crash, obviously, but with unknown genome version, it currently does not know the centromere positions.

billnjcn111 commented 6 years ago

Browse[2]> debug: chromCoords <- chromCoords[as.integer(match(seqnames(centromeres), rownames(chromCoords))), ] Browse[2]> Error in (function (classes, fdef, mtable) : unable to find an inherited method for function ‘seqnames’ for signature ‘"NULL"’

On Sun, Feb 25, 2018 at 4:45 PM, M. Riester notifications@github.com wrote:

If you can't share by Dropbox, can you help me understand where the crash happens:

x <- readCurationFile("Sampleid.rds", report.best.only=TRUE) debug(PureCN:::.getArmLocations) PureCN:::.getArmLocations(x)

Keep pressing ENTER until it crashes, but please after the line "centromeres <- .getCentromeres(res)", enter "centromeres" so that I can see if there was an issue getting the centromere locations.

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/lima1/PureCN/issues/20#issuecomment-368347755, or mute the thread https://github.com/notifications/unsubscribe-auth/AXw6DC29lAkEJuVsGV0jIsVwDtsIHMCZks5tYdSBgaJpZM4SSWvK .

lima1 commented 6 years ago

Try re-running with --genome hg19 or --genome hg38 (NOT GRCh38) if you used something else for --genome. I'll add a check for that to give a more helpful error message.

billnjcn111 commented 6 years ago

I used b37 because all other files are b37

On Sun, Feb 25, 2018 at 4:51 PM, M. Riester notifications@github.com wrote:

What --genome did you specify? My guess is not one of "hg19" or "hg38", right? It shouldn't crash, obviously, but with unknown genome version, it currently does not know the centromere positions.

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/lima1/PureCN/issues/20#issuecomment-368348231, or mute the thread https://github.com/notifications/unsubscribe-auth/AXw6DLZnC1Qt9PumO_AgocO0QCyKy62oks5tYdXkgaJpZM4SSWvK .

billnjcn111 commented 6 years ago

I can make another file ripping off car from your hg19 of centromere. Which file shall I change?

On Sun, Feb 25, 2018 at 4:58 PM, billnjcn billnjcn@gmail.com wrote:

I used b37 because all other files are b37

On Sun, Feb 25, 2018 at 4:51 PM, M. Riester notifications@github.com wrote:

What --genome did you specify? My guess is not one of "hg19" or "hg38", right? It shouldn't crash, obviously, but with unknown genome version, it currently does not know the centromere positions.

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/lima1/PureCN/issues/20#issuecomment-368348231, or mute the thread https://github.com/notifications/unsubscribe-auth/AXw6DLZnC1Qt9PumO_AgocO0QCyKy62oks5tYdXkgaJpZM4SSWvK .

lima1 commented 6 years ago

Should be safe to use "hg19". It is used in IntervalFile.R to annotate gene symbols and in PureCN.R to map the centromere positions. At some point I'll add genome aliases to that b37 and GRCh37 work.

Otherwise if you want to keep b37, you need to manually change PureCN.R and add

data(centromeres) and then add centromeres=centromeres$hg19 to the runAbsoluteCN call.

billnjcn111 commented 6 years ago

Do you mean the combined VCF has one sample or needs different sample names? I got 16 normal samples only, when combine them, there are 2 ways, one is to list all samples unique and so that all 16 samples will be in each row of vcf. the other is that regardless of sample names, treat them as same sample and just combine all the variants. Which one does it need?

On Sun, Feb 25, 2018 at 12:39 PM, M. Riester notifications@github.com wrote:

So just to make this clear: You'll need to run Mutect or Mutect2 on the normals in tumor-only mode. Get as many normals as possible. Then merge the normal VCFs into a single VCF that contains REF and ALT read counts in an AD format field.

CombineVariants did this in GATK3. For GATK4, let me know how it goes with MergeVcfs.

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/lima1/PureCN/issues/20#issuecomment-368328249, or mute the thread https://github.com/notifications/unsubscribe-auth/AXw6DN5TQsIm8oqMvpHyo9vabaFKy78Oks5tYZq7gaJpZM4SSWvK .

lima1 commented 6 years ago

All.unique, we need the read counts from all 16 samples. You can limit the VCF to contain only variants present in at least 2-3 samples. In GATK3 CombineVariants, that's the --minN argument.

What the mapping bias function does is to use collect the alt and ref reads counts from the heterozygous samples only. The sum of alt and ref should be roughly equal, if not, PureCN will either adjust accordingly or ignore the variant if there is a large difference.

billnjcn111 commented 6 years ago

Riester, When you compare your pureCN TMB with FMI TMB, was it tumor only sample? also did you filtering Mutect results using all the Mutect filters, such as t_lod, str_contraction etc??? Did you also remove all common germline variants in all the public data bases, such as 1000G, ExAct, gnomAD etc before /after using pureCN? I am looking for a algorithm to work on Tumor only samples to get FMI like TMB values. Do you have any suggestion when I do this comparison? Thanks

On Sun, Feb 25, 2018 at 5:22 PM, M. Riester notifications@github.com wrote:

All.unique, we need the read counts from all 16 samples. You can limit the VCF to contain only variants present in at least 2-3 samples. In GATK3 CombineVariants, that's the --minN argument.

What the mapping bias function does is to use collect the alt and ref reads counts from the heterozygous samples only. The sum of alt and ref should be roughly equal, if not, PureCN will either adjust accordingly or ignore the variant if there is a large difference.

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/lima1/PureCN/issues/20#issuecomment-368350621, or mute the thread https://github.com/notifications/unsubscribe-auth/AXw6DF95g4sAN0TVvgdgk2AP8UkcyRP8ks5tYd0NgaJpZM4SSWvK .

lima1 commented 6 years ago

Sure, the callMutationBurden function is written for tumor-only and should get you accurate mutations/megabase if you follow the instructions. If you use the recommended Mutect 1.1.7, GATK3 CallableLoci and provide a PON VCF, then you get a fairly well tested pipeline.

If not, then it's still largely up to you to do a proper artifact filtering (extremely important!) and testing. The default (available for all VCFs) --error argument will fairly aggressively remove reads with limited support which should remove most sequencing errors. There is also basic Mutect2 filtering as you know, but it's currently experimental.

By default, callMutationBurden is excluding everything that is annotated as DB (dbSNP) and not rescued by COSMIC (Cosmic.CNT info flag). You can create your own VCF info flag that summarizes all the databases you want and then make PureCN.R use this over DB (--dbinfoflag).

billnjcn111 commented 6 years ago

So the key is the PON VCF. The problem for me is that there is not enough normal VCFs available there. So based on your experience, how many normal vcfs you tried with best results? thx

On Mon, Feb 26, 2018 at 11:31 AM, M. Riester notifications@github.com wrote:

Sure, the callMutationBurden function is written for tumor-only and should get you accurate mutations/megabase if you follow the instructions. If you use the recommended Mutect 1.1.7, GATK3 CallableLoci and provide a PON VCF, then you get a fairly well tested pipeline.

If not, then it's still largely up to you to do a proper artifact filtering (extremely important!) and testing. The default (available for all VCFs) --error argument will fairly aggressively remove reads with limited support which should remove most sequencing errors. There is also basic Mutect2 filtering as you know, but it's currently experimental.

By default, callMutationBurden is excluding everything that is annotated as DB (dbSNP) and not rescued by COSMIC (Cosmic.CNT info flag). You can create your own VCF info flag that summarizes all the databases you want and then make PureCN.R use this over DB (--dbinfoflag).

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/lima1/PureCN/issues/20#issuecomment-368561843, or mute the thread https://github.com/notifications/unsubscribe-auth/AXw6DPwiJR3alsNPCHZCZgj2HWibIiYpks5tYtxfgaJpZM4SSWvK .

lima1 commented 6 years ago

With 16 you should be fine. Filter variants in simple repeats as done in the vignettes and restrict the mutation burden calculation to coding sequences only (there is a currently undocumented script FilterCallableLoci.R in inst/extdata which takes a BED file as input, usually the PASS regions from CallableLoci, and only keeps overlapping CDS). This will ignore most of the difficult regions anyways.

billnjcn111 commented 6 years ago

Thanks!

On Mon, Feb 26, 2018 at 3:26 PM, M. Riester notifications@github.com wrote:

With 16 you should be fine. Filter variants in simple repeats as done in the vignettes and restrict the mutation burden calculation to coding sequences only (there is a currently undocumented script FilterCallableLoci.R in inst/extdata which takes a BED file as input, usually the PASS regions from CallableLoci, and only keeps overlapping CDS). This will ignore most of the difficult regions anyways.

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/lima1/PureCN/issues/20#issuecomment-368638322, or mute the thread https://github.com/notifications/unsubscribe-auth/AXw6DMqnGCJyyfjxOYwOTM8bdFZ8hN4jks5tYxNpgaJpZM4SSWvK .

billnjcn111 commented 6 years ago

Riester, The recommended Mutect 1.1.7 does not call indels? right?

On Mon, Feb 26, 2018 at 11:31 AM, M. Riester notifications@github.com wrote:

Sure, the callMutationBurden function is written for tumor-only and should get you accurate mutations/megabase if you follow the instructions. If you use the recommended Mutect 1.1.7, GATK3 CallableLoci and provide a PON VCF, then you get a fairly well tested pipeline.

If not, then it's still largely up to you to do a proper artifact filtering (extremely important!) and testing. The default (available for all VCFs) --error argument will fairly aggressively remove reads with limited support which should remove most sequencing errors. There is also basic Mutect2 filtering as you know, but it's currently experimental.

By default, callMutationBurden is excluding everything that is annotated as DB (dbSNP) and not rescued by COSMIC (Cosmic.CNT info flag). You can create your own VCF info flag that summarizes all the databases you want and then make PureCN.R use this over DB (--dbinfoflag).

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/lima1/PureCN/issues/20#issuecomment-368561843, or mute the thread https://github.com/notifications/unsubscribe-auth/AXw6DPwiJR3alsNPCHZCZgj2HWibIiYpks5tYtxfgaJpZM4SSWvK .

lima1 commented 6 years ago

Correct. But it's easy and very fast to run. So I would use it in your benchmarks to compare against M2 if you really need indels. If you see a shift in TMB, you know that the filtering needs work. Feel free to open an issue whenever you see problems. Also note that the default in callMutationBurden ignores indels.

Looks like there is not yet a CallableLoci replacement in GATK4 ( https://gatkforums.broadinstitute.org/gatk/discussion/11178/callableloci-replacement-in-gatk4). I would definitely recommend running GATK3 CallableLoci with --minDepth 15 or whatever you use as min coverage in Mutect.

billnjcn111 commented 6 years ago

There will be a slight shift since the indwells are missing, right?

On Thu, Mar 1, 2018 at 1:39 PM, M. Riester notifications@github.com wrote:

Correct. But it's easy and very fast to run. So I would use it in your benchmarks to compare against M2 if you really need indels. If you see a shift in TMB, you know that the filtering needs work. Feel free to open an issue whenever you see problems. Also note that the default in callMutationBurden ignores indels.

Looks like there is not yet a CallableLoci replacement in GATK4 ( https://gatkforums.broadinstitute.org/gatk/discussion/11178/callableloci- replacement-in-gatk4). I would definitely recommend running GATK3 CallableLoci with --minDepth 15 or whatever you use as min coverage in Mutect.

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/lima1/PureCN/issues/20#issuecomment-369688231, or mute the thread https://github.com/notifications/unsubscribe-auth/AXw6DIKdBvEUeu512GNen_zEJJAXv2c5ks5taEBKgaJpZM4SSWvK .

lima1 commented 6 years ago

Yes, but like I said, by default, indels are not counted. I added a flag to the Dx.R script to count them (--keepindels).

billnjcn111 commented 6 years ago

There appears to be a new fatal error:

On Fri, Mar 2, 2018 at 9:50 PM, M. Riester notifications@github.com wrote:

Yes, but like I said, by default, indels are not counted. I added a flag to the Dx.R script to count them (--keepindels).

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/lima1/PureCN/issues/20#issuecomment-370110800, or mute the thread https://github.com/notifications/unsubscribe-auth/AXw6DPS2ZYZ9GT7-YeIEWH5GPf4u9k3Tks5tagT4gaJpZM4SSWvK .

billnjcn111 commented 6 years ago

All 46 samples are the same:

INFO [2018-03-08 09:44:42] PureCN 1.9.28 INFO [2018-03-08 09:44:42]

..... Cannot find valid purity/ploidy solution. This happens when input

FATAL [2018-03-08 09:51:23] segmentations are garbage, most likely due to a catastrophic sample QC

FATAL [2018-03-08 09:51:23] failure. Re-check standard QC metrics for this sample.

FATAL [2018-03-08 09:51:23]

FATAL [2018-03-08 09:51:23] This is most likely a user error due to invalid input data or

FATAL [2018-03-08 09:51:23] parameters (PureCN 1.9.28).

On Thu, Mar 8, 2018 at 10:13 AM, billnjcn billnjcn@gmail.com wrote:

There appears to be a new fatal error:

On Fri, Mar 2, 2018 at 9:50 PM, M. Riester notifications@github.com wrote:

Yes, but like I said, by default, indels are not counted. I added a flag to the Dx.R script to count them (--keepindels).

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/lima1/PureCN/issues/20#issuecomment-370110800, or mute the thread https://github.com/notifications/unsubscribe-auth/AXw6DPS2ZYZ9GT7-YeIEWH5GPf4u9k3Tks5tagT4gaJpZM4SSWvK .

lima1 commented 6 years ago

I would need again the whole output of the log file and the command lines used to generate the input.

When all samples show this, then there is likely an issue with the setup.

billnjcn111 commented 6 years ago

command line argument: INFO [2018-03-08 09:13:41] Arguments: -tumor.coverage.file 71-D_recal_coverage_loess.txt -seg.file -vcf.file 71-D_filtered_dbsnp.vcf.gz -genome hg19 -sex ? -args.setMappingBiasVcf NULL -args.segmentation target_weights_SS_v6_hg19.txt,0.005,NULL -sampleid 71-D -min.ploidy 1 -max.ploidy 6 -max.non.clonal 0.2 -log.ratio.calibration 0.1 -model.homozygous FALSE -error 0.001 -interval.file baits_hg19_V6_intervals.txt -gc.gene.file -max.segments 300 -plot.cnv TRUE -DB.info.flag DB -model beta -post.optimize FALSE -log.file 71-D.log -normal.coverage.file -normalDB -args.filterVcf -fun.segmentation -test.num.copy -test.purity -speedup.heuristics

On Thu, Mar 8, 2018 at 11:09 AM, M. Riester notifications@github.com wrote:

I would need again the whole output of the log file and the command lines used to generate the input.

When all samples show this, then there is likely an issue with the setup.

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/lima1/PureCN/issues/20#issuecomment-371534605, or mute the thread https://github.com/notifications/unsubscribe-auth/AXw6DAoMm_0Y1Uv13JNAoKrKKIyjhovrks5tcVeigaJpZM4SSWvK .

billnjcn111 commented 6 years ago

INFO [2018-03-08 09:13:41] Loading coverage files... INFO [2018-03-08 09:13:48] Mean target coverages: 88X (tumor) 76X (normal). INFO [2018-03-08 09:13:50] Mean coverages: chrX: 72.95, chrY: 1.41, chr1-22: 82.92. INFO [2018-03-08 09:13:50] Mean coverages: chrX: 82.62, chrY: 0.64, chr1-22: 71.27. INFO [2018-03-08 09:14:16] Removing 35593 intervals with missing log.ratio. INFO [2018-03-08 09:14:16] Removing 9 low/high GC targets. INFO [2018-03-08 09:14:21] Removing 3762 targets excluded in normalDB. INFO [2018-03-08 09:14:21] Removing 265 targets with low total coverage in normal (< 150.00 reads). INFO [2018-03-08 09:14:21] normalDB provided. Setting minimum coverage for segmentation to 0.0015X. INFO [2018-03-08 09:14:21] Removing 9 low coverage (< 0.0015X) targets. INFO [2018-03-08 09:14:22] Using 247792 intervals (247792 on-target, 0 off-target). INFO [2018-03-08 09:14:22] No off-target intervals. If this is hybrid-capture data, consider adding them. INFO [2018-03-08 09:14:23] AT/GC dropout: 1.07 (tumor), 1.08 (normal). INFO [2018-03-08 09:14:23] Loading VCF... INFO [2018-03-08 09:14:23] Found 12980 variants in VCF file. INFO [2018-03-08 09:14:24] Removing 428 triallelic sites. WARN [2018-03-08 09:14:24] vcf.file has no DB info field for dbSNP membership. Guessing it based on ID. INFO [2018-03-08 09:14:24] 7504 (59.8%) variants annotated as likely germline (DB INFO flag). INFO [2018-03-08 09:14:24] 71-D is tumor in VCF file. INFO [2018-03-08 09:14:24] 39 homozygous and 104 heterozygous variants on chrX. INFO [2018-03-08 09:14:24] Sex from VCF: F (Fisher's p-value: 0.924, odds-ratio: 1.03). INFO [2018-03-08 09:14:24] Detected MuTect2 VCF. INFO [2018-03-08 09:14:24] Removing 6392 MuTect2 calls due to blacklisted failure reasons. INFO [2018-03-08 09:14:25] Initial testing for significant sample cross-contamination: maybe INFO [2018-03-08 09:14:25] Removing 2233 variants with AF < 0.030 or AF >= 0.970 or less than 4 supporting reads or depth < 15. INFO [2018-03-08 09:14:25] Removing 15 low quality variants with BQ < 25. INFO [2018-03-08 09:14:25] Total size of targeted genomic region: 55.83Mb (75.03Mb with 50bp padding). INFO [2018-03-08 09:14:25] 1.4% of targets contain variants. INFO [2018-03-08 09:14:25] Removing 403 variants outside intervals. INFO [2018-03-08 09:14:25] Setting somatic prior probabilities for dbSNP hits to 0.000500 or to 0.500000 otherwise. INFO [2018-03-08 09:14:25] VCF does not contain somatic status. For best results, consider providing normal.panel.vcf.file when matched normals are not available. INFO [2018-03-08 09:14:25] Sample sex: F INFO [2018-03-08 09:14:25] Segmenting data... INFO [2018-03-08 09:14:26] Target weights found, will use weighted CBS. INFO [2018-03-08 09:14:26] Loading pre-computed boundaries for DNAcopy... INFO [2018-03-08 09:14:26] Setting undo.SD parameter to 1.250000. INFO [2018-03-08 09:14:39] Setting prune.hclust.h parameter to 0.150000. INFO [2018-03-08 09:14:39] Found 99 segments with median size of 1.64Mb. INFO [2018-03-08 09:14:39] Removing 2 variants outside segments. INFO [2018-03-08 09:14:39] Using 3507 variants. INFO [2018-03-08 09:14:40] Mean standard deviation of log-ratios: 1.13 INFO [2018-03-08 09:14:40] 2D-grid search of purity and ploidy... FATAL [2018-03-08 09:19:40] Cannot find valid purity/ploidy solution. This happens when input

FATAL [2018-03-08 09:19:40] segmentations are garbage, most likely due to a catastrophic sample QC

FATAL [2018-03-08 09:19:40] failure. Re-check standard QC metrics for this sample.

FATAL [2018-03-08 09:19:40]

FATAL [2018-03-08 09:19:40] This is most likely a user error due to invalid input data or

FATAL [2018-03-08 09:19:40] parameters (PureCN 1.9.28).

On Thu, Mar 8, 2018 at 11:16 AM, billnjcn billnjcn@gmail.com wrote:

command line argument: INFO [2018-03-08 09:13:41] Arguments: -tumor.coverage.file 71-D_recal_coverage_loess.txt -seg.file -vcf.file 71-D_filtered_dbsnp.vcf.gz -genome hg19 -sex ? -args.setMappingBiasVcf NULL -args.segmentation target_weights_SS_v6_hg19.txt,0.005,NULL -sampleid 71-D -min.ploidy 1 -max.ploidy 6 -max.non.clonal 0.2 -log.ratio.calibration 0.1 -model.homozygous FALSE -error 0.001 -interval.file baits_hg19_V6_intervals.txt -gc.gene.file -max.segments 300 -plot.cnv TRUE -DB.info.flag DB -model beta -post.optimize FALSE -log.file 71-D.log -normal.coverage.file -normalDB -args.filterVcf -fun.segmentation -test.num.copy -test.purity -speedup.heuristics

On Thu, Mar 8, 2018 at 11:09 AM, M. Riester notifications@github.com wrote:

I would need again the whole output of the log file and the command lines used to generate the input.

When all samples show this, then there is likely an issue with the setup.

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/lima1/PureCN/issues/20#issuecomment-371534605, or mute the thread https://github.com/notifications/unsubscribe-auth/AXw6DAoMm_0Y1Uv13JNAoKrKKIyjhovrks5tcVeigaJpZM4SSWvK .

lima1 commented 6 years ago

Your log-ratios are extremely noisy (1.13), good data is below 0.4 with coverage that low.

Are you sure you are using the correct baits file? You have almost 40k intervals without any coverage. Did NormalDB.R give you a warning that you used the wrong baits file? If then, try to re-run IntervalFile.R and provide the low_coverage_targets.bed file via --exclude. You have a large GC bias, but could be because of the wrong baits file.

You also have only a small fraction of intervals overlapping with variants (1.4%). This should be around 10%. Make sure to run Mutect with the same interval file and add 50-75bp padding (this will at least double the number of SNPs).

And then, make sure to generate a PON VCF for a production setting.

lima1 commented 6 years ago

Also, is there a reason you dropped --offtarget?

billnjcn111 commented 6 years ago

Thanks, the bait file may be the exact reason why it failed!

On Thu, Mar 8, 2018 at 1:55 PM, M. Riester notifications@github.com wrote:

Also, is there a reason you dropped --offtarget?

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/lima1/PureCN/issues/20#issuecomment-371586879, or mute the thread https://github.com/notifications/unsubscribe-auth/AXw6DHhF8d27cUYzvasaFoeiz__jeKx7ks5tcX6vgaJpZM4SSWvK .