Question about solutions interpretation and PON

tettamanzif commented 1 year ago

Hello!

I am analysing WES data (expected coverage 200x) derived from archival FFPE samples. Some samples have a matched control (peritumoral tissue), some are tumor-only. A group of blood samples from unrelated healthy individuals was also sequenced in order to build a PON for artifact filtering. I was running PureCN with PSCBS segmentation and following the best practices (but for some samples the quality is not great, as you can imagine).

When estimating purity and ploidy, I tried to look at the solutions considering the information I have from the pathology (which I suppose is overestimated in many cases). Given that it is the first time I am analysing these data, I would kindly ask for information about how to interpret some solutions. This would help a lot to get some experience on what to look for!

I am attaching three examples (all flagged because of poor GOF) for which I am in doubt about the solution to select. You can see on the left the VAF density plot, in the middle the maximum likelihood solution, and on the right another, less likely solution about which I have doubts. Could you tell me whether, in your view, the best solution in each case is also the one best supported by the graphs?

In general, my main doubts are about:

• BAF plots in which BAF appears to be at 0.5 for extensive regions of the genome (Image 1 and 2). Are there any considerations I can make to determine whether the tetraploid state or the diploid state is more appropriate? Is a double peak in the VAF plot a sign of being tetraploid?
• BAF plots that are quite noisy (Image 3 and 4). If the signal is not clear, would you keep the selected solution or choose a more conservative (lower ploidy) state?
• In case of conflicting results between the pathology purity estimate and the best solution provided by tools such as PureCN, which considerations would you make? Are there any considerations based on VAF that could be made to guide the solution curation?

Regarding technical details, as confirmation I wanted to know if it is fine to build a PON using normal blood samples that were sequenced at lower coverage (60x) than the FFPE tumor samples (200x). Moreover, since I also have 6 FFPE paired normal samples, I created a PON using both normal bloods and normal tissues to possibly improve the filtering of FFPE artifacts. The algorithm worked and I obtained lower tumor/normal noise ratios compared with the PON built only with blood normal samples. Do you think it is fine to proceed like this?

Thank you very much for the time and suggestions you may dedicate! Have a nice day!

Francesca

[Attached images: CNA_Image1_sample_18_T_1_MA, CNA_Image2_sample_14_T_1_MA, CNA_Image3_sample_23_T_1_MA, CNA_Image4_sample_17_T_1_MA]

lima1 commented 1 year ago

Hi Francesca,

For the PoN, sounds fine to me, but it's easy to benchmark (as you did). We have been doing cfDNA exclusively for a few years now, so my experience with FFPE/FF mixing is limited. I usually hit a plateau with noise reduction at about 50 or so samples from a couple of different batches. More important is testing normalization with and without explicit GC normalization and, in WES, with and without off-target reads. Make sure to maximize the number of SNPs by including 50bp padding in the variant calling. With WES, you should get 20,000 heterozygous SNPs at least.
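The 50 bp padding mentioned above amounts to extending every target interval on both sides before variant calling. A minimal Python sketch of that idea follows, assuming 0-based, half-open BED-style coordinates. The function name `pad_intervals` and the example coordinates are hypothetical; in a GATK pipeline the same effect is normally obtained with the `--interval-padding` argument rather than custom code.

```python
def pad_intervals(intervals, padding=50):
    """Extend each (chrom, start, end) interval by `padding` bp and merge overlaps.

    Assumes 0-based, half-open coordinates (BED style).
    """
    padded = sorted(
        (chrom, max(0, start - padding), end + padding)
        for chrom, start, end in intervals
    )
    merged = []
    for chrom, start, end in padded:
        if merged and merged[-1][0] == chrom and start <= merged[-1][2]:
            # Overlaps (or touches) the previous interval: extend it.
            prev = merged[-1]
            merged[-1] = (chrom, prev[1], max(prev[2], end))
        else:
            merged.append((chrom, start, end))
    return merged


# Made-up target coordinates for illustration only.
targets = [("chr1", 100, 200), ("chr1", 260, 300), ("chr2", 20, 80)]
padded_targets = pad_intervals(targets, padding=50)
```

Merging after padding avoids counting the same flanking bases twice when neighbouring exons are closer together than twice the padding.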

For the first one, I think the high ploidy is correct. I'm first looking at segments with balanced SNPs. Here, most are 2/2 (4 copies in total), which is usually a red flag, but there seems to be a convincing 1/1 at 19p that's likely too large for a homozygous (0/0) loss. Also a large number of 2,3,4 and 5 copies. In wrong high ploidy solutions, you usually don't have such an even distribution of states.

For the second, something went wrong here. Looks diploid, but weird that you only have a low number of SNPs.

3 is low purity and noisy, difficult to tell. The mean coverages are different for tumor and normal. Likely the PoN was not used, but a matched normal maybe?

4 probably right.

Feel free to post a log file in case you are unsure.

Markus

tettamanzif commented 1 year ago

Hi Markus,

thank you very much for the advice and the super fast reply!

From your hints I understand that the interpretation of the data can be rather complex 😊 I am not sure why a genome that is mostly in a 2+2 state is a red flag. Isn’t that expected in case of WGD? Or do you just mean that a WGD event is unlikely? Regarding the “large number of 2,3,4 and 5 copies”, may I ask why you expect an even distribution of these states? Or do you have any example available of a wrong high ploidy solution?

As far as sample no. 3 is concerned, I am attaching the log file. Even though there is a matched normal, I used only the PoN. I did not manage to use coverage information from the off-target regions (I got an error, but unfortunately I don’t have the log here to share).

Regarding SNPs, I have called them applying a 100-bp interval padding. As for sample no. 2, when using the filtered output of Mutect2 directly, most of the SNPs are removed because very few are marked with only the germline flag in the column checked (the AS_FilterStatus variable, if I remember right), while most of them are marked with a combination of flags. The best I could obtain was by keeping only the variants that I also called in the normal tissue through HaplotypeCaller, used for other analyses, and skipping this VCF filtering step in PureCN (also annotating with dbSNP and including the DB info flag). However, as you can see, lots of them are still removed for several reasons. If you have any suggestions on how to improve this aspect, please let me know.

Many thanks again! Have a nice weekend,

Francesca

sample_23_T_2_MA_THC17.log

lima1 commented 1 year ago

Hi Francesca, do you use the --genotype-germline-sites flag when you have matched normal samples in Mutect2? Last time I tested it, probably 2 years back by now, I just closely followed their best practices with that flag and it produced comparable results to Mutect1. I didn't bother with mixing HaplotypeCaller variants into it.

If you follow the GATK4 Mutect2 best practices and use the af_only_gnomad VCF for annotation, you don't need a DB flag.

For WGD, yes, you would expect many 2+2, but you usually see a lot of random gains and losses. So if you only see 2+2 and just a small number of gains and losses, it's probably a wrong ploidy and the 2+2 are likely 1+1. Takes a bit of experience in looking at clearly diploid and clearly WGD cases.
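One way to make this heuristic concrete is to tabulate how much of the segmented genome sits at each total copy number. The sketch below is a hypothetical Python helper and not part of PureCN; the segment tuples and the 0.8 cutoff are illustrative assumptions, not validated thresholds. It also looks at total copy number only, so it cannot tell 2+2 from 3+1.

```python
from collections import defaultdict


def copy_state_fractions(segments):
    """Return the fraction of segmented genome at each total copy number.

    `segments` is an iterable of (start, end, total_copies) tuples.
    """
    length_by_state = defaultdict(int)
    for start, end, copies in segments:
        length_by_state[copies] += end - start
    total = sum(length_by_state.values())
    return {state: length / total for state, length in sorted(length_by_state.items())}


def dominated_by_four_copies(fractions, cutoff=0.8):
    """Flag solutions where almost everything is at 4 copies.

    The cutoff is an arbitrary illustration, not a PureCN default.
    """
    return fractions.get(4, 0.0) >= cutoff


# Hypothetical high-ploidy solution with only a few gains and losses.
suspicious = copy_state_fractions([(0, 900, 4), (900, 950, 3), (950, 1000, 5)])
# Hypothetical high-ploidy solution with many gains and losses.
plausible = copy_state_fractions(
    [(0, 400, 4), (400, 600, 3), (600, 750, 2), (750, 900, 5), (900, 1000, 6)]
)
```

A flagged solution is only a prompt to compare against the lower-ploidy alternative by eye; as noted above, that judgement takes experience with clearly diploid and clearly WGD cases.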

tettamanzif commented 1 year ago

Dear Markus,

Many thanks for the clarifications!

Regarding your question, yes, I added --genotype-germline-sites as well as --genotype-pon-sites, but unfortunately in my case many SNPs flagged with the germline tag (together with other Mutect2 tags) were filtered out. So far this is the best I could do.

May I ask you an unrelated additional question? I have in my small cohort a group of samples with a matched normal and a group of tumor-only samples. After extensive filtering (during and downstream of variant calling), I end up with a TMB value that is much higher in tumor-only samples than in matched samples, so I suppose I have several rare germline variants called as somatic by Mutect2 and retained by my filtering process. Therefore I wanted to take advantage of PureCN's function to classify variants. I see that the vignette suggests filtering on somatic posterior probability (GERMLINE.HOMOZYGOUS) > 0.8 and germline posterior probability < 0.2. Is the germline posterior probability meant to include all of GERMLINE.CONTHIGH, GERMLINE.CONTLOW and GERMLINE.HOMOZYGOUS? Do you have any other useful suggestions about this issue?

Lastly, is there a function in PureCN to adjust the observed log2ratio for purity and ploidy of the sample?

Many thanks again for the great support. Have a nice day!

Francesca

lima1 commented 1 year ago

Which other Mutect2 tags resulted in the filtering? Which version of GATK are you using? That should not happen. Can you try a recent version if yours is old?

The TMB function in PureCN should work really well (see our second manuscript). I think the difference is again due to the upstream filtering in tumor-only vs matched. Matched can remove a small number of artifacts that a PoN does not catch, but it should not be substantial. Most private germline variants should be filtered out, even at fairly high tumor purity. The way it's implemented is like dithering, where we are fine with a few random mistakes but assume they cancel each other out (germline misclassified as somatic at a similar rate as somatic misclassified as germline).

See here for a discussion about getting log2ratio if you need them.
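For orientation, the usual mixture model behind such an adjustment can be written down in a few lines. This is a generic Python sketch under stated assumptions (a diploid normal, a single tumor clone, and a log2 ratio already normalized to the sample average); it is not taken from PureCN or from the linked discussion, and the function names are hypothetical.

```python
import math


def tumor_copy_number(log2_ratio, purity, ploidy, normal_copies=2):
    """Estimate tumor total copy number from an observed log2 ratio.

    Model: ratio = (purity*C + (1-purity)*normal)
                   / (purity*ploidy + (1-purity)*normal)
    """
    if not 0 < purity <= 1:
        raise ValueError("purity must be in (0, 1]")
    ratio = 2.0 ** log2_ratio
    normal_part = (1.0 - purity) * normal_copies
    average_copies = purity * ploidy + normal_part
    copies = (ratio * average_copies - normal_part) / purity
    return max(copies, 0.0)  # noise can push the estimate below zero


def adjusted_log2_ratio(log2_ratio, purity, ploidy):
    """Log2 ratio the segment would show in a pure tumor, relative to tumor ploidy."""
    copies = tumor_copy_number(log2_ratio, purity, ploidy)
    return math.log2(copies / ploidy) if copies > 0 else float("-inf")


# Example: purity 0.5, ploidy 2, single-copy loss (C = 1).
observed = math.log2((0.5 * 1 + 0.5 * 2) / (0.5 * 2 + 0.5 * 2))  # log2(0.75)
estimated = tumor_copy_number(observed, purity=0.5, ploidy=2.0)
adjusted = adjusted_log2_ratio(observed, purity=0.5, ploidy=2.0)
```

At low purity the division by `purity` amplifies noise in the observed ratio, so adjusted values from noisy FFPE samples should be read with caution.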

tettamanzif commented 1 year ago

Hello Markus,

I am using mutect2 from gatk/4.2.2.0.

I am attaching the combinations of flags reported in the INFO column for one vcf (somatic calls on exome data applying 100 base padding and specifying the two options --genotype-germline-sites as well as --genotype-pon-sites). Of 149130 total rows, 83853 contain the flag germline, and 1406 are tagged only with the germline flag. So most of them are excluded at the step of filterVcfMuTect2. If I use the vcf as it is, I end up with < 1% of targets containing variants and about 100-200 variants used at the very end by the algorithm.
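A tally like the one described above can be reproduced with a short script, which also shows which co-occurring flags are responsible for the losses. The following is a minimal Python sketch, assuming the flags of each record are available as a semicolon-separated string in the VCF FILTER column; the function name and the example records are made up for illustration.

```python
from collections import Counter


def tally_filter_flags(vcf_lines, flag="germline"):
    """Count records carrying `flag`, and records carrying only `flag`.

    Reads the FILTER column (7th field) of tab-separated VCF records.
    """
    combos = Counter()
    for line in vcf_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header and blank lines
        fields = line.rstrip("\n").split("\t")
        combos[frozenset(fields[6].split(";"))] += 1
    with_flag = sum(n for combo, n in combos.items() if flag in combo)
    only_flag = combos[frozenset([flag])]
    return {"total": sum(combos.values()), "with_flag": with_flag, "only_flag": only_flag}


# Made-up records for illustration only.
example = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "chr1\t100\t.\tA\tG\t.\tgermline\t.",
    "chr1\t200\t.\tC\tT\t.\tgermline;base_qual\t.",
    "chr1\t300\t.\tG\tA\t.\tPASS\t.",
]
counts = tally_filter_flags(example)
```

For a real file, the lines could come from an opened (or gzip-opened) VCF handle instead of the in-memory list used here.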

For now the function is returning high TMB values reflecting the high number of variants (the excluded private germline variants reported in the output are not that many). I will try to find a more stringent upstream filtering strategy to decrease the difference between the matched and tumor-only samples.

Many thanks. Have a nice day,

Francesca

example_info_flags_mutect2.txt

lima1 commented 1 year ago

Can you post all the GATK commands you ran? Hard to diagnose from the distance, but something went wrong. I can double check on my test WES samples. Might take a few days though.

Mutect 1 is super fast and still pretty decent for SNVs, so you might want to give it a try for PureCN until I get a chance to look at recent GATK.

tettamanzif commented 1 year ago

Dear Markus,

that would be really nice of you. I am attaching all the GATK commands I ran in my pipeline.

Many thanks in advance. Kind regards,

Francesca

gatk_commands.txt

lima1 commented 1 year ago

Hi Francesca, I think you might also have been hit by #320. I added a --min-base-quality flag to PureCN.R that you can try setting to 20. Hope that helps.

BiocManager::install("lima1/PureCN", ref = "issue_320")

(if the above install fails because issue_320 does not exist anymore when you run it, try without the ref argument).

tettamanzif commented 1 year ago

Hello Markus,

thank you for pointing me to the discussion in issue #320.

I have tried your modified version of PureCN on one of my samples and indeed it can rescue several germline SNPs that were excluded before. In the example below (first solution with the version I had on the left vs. the version you modified on the right), PureCN ends up using 10 times more variants (from around 1,800 to 18,000) to call CNVs. Many of them are still excluded for other reasons, but the results improved a lot and appear similar to others I have seen posted in other discussions. What do you think?

Many thanks for the great support!! Have a nice day,

Francesca

PureCN_issue_320

lima1 commented 1 year ago

Yes, still not a great sample, but at least looks like PureCN did an OK job here. Feel free to open new issues when you run into problems again.

lima1 / PureCN

Question about solutions interpretation and PON #310