hartwigmedical / hmftools

Various algorithms for analysing genomics data
GNU General Public License v3.0

Running unmatched analysis #137

Closed DominikGlodzik closed 3 years ago

DominikGlodzik commented 3 years ago

Dear Team,

This is an impressive suite of tools - thank you for your hard work.

I would like to run Purple in an unmatched fashion: tumor, but without a matched normal. I see the tumor_only mode is documented for Amber, Cobalt and Purple. As far as I understand, this workflow does not use unmatched normal samples, which might be beneficial for correcting any coverage unevenness.

Is it possible to adjust the workflow on my side to use a panel of unmatched normal samples? If I wanted to run a tumor genome against a normal from another patient, what would be sensible settings? I imagine Amber would have to be run in a tumor_only fashion, as running it against a normal sample from another individual might confuse the inference of heterozygous alleles.

Please let me know if such an unmatched analysis is possible.

Best wishes Dominik

p-priestley commented 3 years ago

Hi Dominik,

We find that Amber, Cobalt and Purple give very similar results in general with or without matched normal. Using another normal as reference is unlikely to help for the following reasons:

COBALT: in lieu of supplying a matched normal, we provide an equivalent bed file which is calculated from 100 samples and should give better results than using a normal from another patient.

AMBER: will work well except for very high purity tumors (since if purity is close to 100% we cannot easily differentiate between LOH of heterozygous germline points and homozygous germline points). Using an unmatched normal will not help with this.

PURPLE: the main impact of tumor_only is that we cannot use somatic point mutations in our fitting (since it is too risky that we might fit germline variants). More details here: https://github.com/hartwigmedical/hmftools/blob/master/purity-ploidy-estimator/README.md#tumor-only-mode
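
To illustrate the AMBER point about high purity, here is a minimal sketch of the standard tumor/normal mixture arithmetic for the B-allele frequency (BAF) at a germline heterozygous site. It is not AMBER's code; the function name `expected_baf` and the chosen purities are assumptions made for illustration only.

```python
def expected_baf(purity: float, tumor_b: int, tumor_a: int) -> float:
    """Expected BAF at a germline heterozygous site in a tumor/normal mixture.

    Normal cells carry one copy of each allele. Tumor cells carry
    tumor_b copies of the B allele and tumor_a copies of the A allele.
    """
    b_copies = purity * tumor_b + (1.0 - purity) * 1
    total_copies = purity * (tumor_b + tumor_a) + (1.0 - purity) * 2
    return b_copies / total_copies


for purity in (0.3, 0.6, 0.9, 0.99):
    # Copy-neutral LOH: the tumor keeps two copies of B and loses A.
    print(f"purity={purity:.2f}  LOH BAF={expected_baf(purity, 2, 0):.3f}")
```

As purity approaches 1, the BAF of a heterozygous site under LOH approaches 1, which is what a homozygous germline site looks like. Without a matched normal the two cases are hard to tell apart, and a normal from a different individual has different germline genotypes, so it cannot resolve the ambiguity.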

Peter

DominikGlodzik commented 3 years ago

Hello Peter,

Thank you for an exhaustive response.

Re COBALT, could you point me to which .bed file this is?

My query comes from needing to analyze unmatched FFPE whole genome samples, and I realize this is probably the hardest setting possible. I do have a couple of unmatched FFPE normal samples. I could make an FFPE-derived COBALT .bed file and check if it helps the inference.

Best wishes Dominik

p-priestley commented 3 years ago

There is a link to the file in the tumor only section on the cobalt page (https://github.com/hartwigmedical/hmftools/tree/master/count-bam-lines#tumor-only-mode)

I would advise to try using this bed file first.

DominikGlodzik commented 3 years ago

Thank you very much Peter for the explanation.

Could you confirm if I understand the algorithm right?

The tumor coverage per bin is normalized with respect to other regions in the tumor sample. The only ways in which the matched normal is used are to calculate the germline diploid regions and to infer gender. The ratio of tumor to normal coverage in each bin is not actually used, even in the tumor-normal analysis.

If that's the case, I see why the matched normal does not help very much, other than in very pure tumor samples.

Thank you once again Dominik

p-priestley commented 3 years ago

Your understanding is basically correct. The coverage of the normal is also used in the smoothing step (where we try to eliminate copy number segments that are likely to be just noise), but the impact is not that great and is not likely to be helped by using an unmatched reference.

DominikGlodzik commented 3 years ago

Thank you for the clarification, once again.

In the absence of the matched normal, can the smoothing step be tweaked to account for a noisier sample? I cannot see a relevant optional argument.

Could I somehow use the noise estimates from unmatched normals to tweak the smoothing step of tumor-only samples?

Best wishes Dominik

p-priestley commented 3 years ago

There is one optional parameter that could help with smoothing: min_diploid_tumor_ratio_count

If you raise this number then PURPLE will smooth larger noisy regions. Details here:

https://github.com/hartwigmedical/hmftools/tree/master/purity-ploidy-estimator#optional-smoothing-arguments

DominikGlodzik commented 3 years ago

Hello Peter,

thank you very much. I now understand my situation. Hoping I would get one more pointer from you.

I ran the unmatched (-tumor_only) analysis as recommended, and also experimented with the smoothing parameter. My samples, which are FFPE and particularly noisy, end up with a segmentation that is still very complex. My experience with other tools showed me that having a pool of unmatched normal samples does help to eliminate much of the noise. I understand that PURPLE does not use the ratio of coverage between a tumor and matched/unmatched normal samples, and that my use case is not one many users may experience.

I wonder if you could give me pointers on how I could run Purple and at the same time normalize bin coverage against unmatched normals. My idea would be to first normalize the "tumorReadCount" column of ".cobalt.ratio.tsv" against read counts for the same bin in unmatched normal samples (I have already done this; the normalized counts are integers and I preserve the median). The rest of the Purple workflow would stay unchanged, with further GC correction as default. Does my idea make sense to you, given your understanding of how the tool works?
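
A minimal sketch of the normalisation idea described above, under stated assumptions: read counts are held as plain Python lists with one entry per bin, each normal is first scaled to a median of 1 so that sequencing depth cancels out, and bins with no panel coverage are returned as `None`. The function name `normalise_against_panel` is hypothetical and is not part of COBALT or PURPLE; reading and writing the actual ".cobalt.ratio.tsv" file is not shown.

```python
from statistics import median


def normalise_against_panel(tumor_counts, panel_counts):
    """Divide each tumor bin by the panel median for that bin, then rescale.

    tumor_counts: list of read counts, one per bin.
    panel_counts: list of lists, one list of per-bin counts per normal.
    Returns integer counts whose median matches the input median;
    bins without usable panel coverage are returned as None.
    """
    n_bins = len(tumor_counts)

    # Scale every normal to a median of 1 so depth differences cancel out.
    scaled = []
    for sample in panel_counts:
        sample_median = median(c for c in sample if c > 0)
        scaled.append([c / sample_median for c in sample])

    # Per-bin reference level: the median across the scaled normals.
    panel_median = [median(s[i] for s in scaled) for i in range(n_bins)]

    ratios = [t / p if p > 0 else None
              for t, p in zip(tumor_counts, panel_median)]

    # Rescale so that the median of the output equals the original median.
    factor = median(tumor_counts) / median(r for r in ratios if r is not None)
    return [round(r * factor) if r is not None else None for r in ratios]


tumor = [100, 200, 100, 50]
panel = [[10, 20, 10, 10], [20, 40, 20, 20]]
print(normalise_against_panel(tumor, panel))
```

In this toy input the second bin is twice as deep in both normals, so the matching excess in the tumor is treated as a recurrent artefact and removed, while the genuine drop in the last bin is preserved.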

Looking at the code, I cannot figure out how to take such a normalized ".cobalt.ratio.tsv" and process it to obtain the ".cobalt.ratio.pcf" file required by Purple. I see two R calls that probably generate the segmentation:

17:25:07 - Executing R script via command: Rscript /tmp/script2607497030507191043.R /mnt/disks/workdisk/hartwig/GER2006079EP_D1//cobalt_unmatched//GER2006079EP_D1.cobalt.ratio.tsv tumorGCRatio /mnt/disks/workdisk/hartwig/GER2006079EP_D1//cobalt_unmatched//GER2006079EP_D1.cobalt.ratio.pcf

17:25:07 - Executing R script via command: Rscript /tmp/script8112142140589310061.R /mnt/disks/workdisk/hartwig/GER2006079EP_D1//cobalt_unmatched//GER2006079EP_D1.cobalt.ratio.tsv referenceGCDiploidRatio /mnt/disks/workdisk/hartwig/GER2006079EP_D1//cobalt_unmatched//DIPLOID.cobalt.ratio.pcf

The corresponding call in the Java code is:

int result = RExecutor.executeFromClasspath("r/ratioSegmentation.R", ratioFile, column, pcfFile);

Could you point me to the R code that performs the segmentation? I cannot find it in the repository at the moment. I imagine I can run the R script on the file with normalized counts, and obtain a segmentation that I could process by Purple in unmatched analysis, as before.

I really appreciate your work and your help and explanations. Dominik

p-priestley commented 3 years ago

Would you mind sharing with me the purple qc plots for one of the samples so I can give specific advice? If not comfortable, can also send directly to me at: p.priestley@hartwigmedicalfoundation.nl This will let me put into context how much noise you are seeing relative to other samples I have seen.

Do I understand correctly that you see recurrent patterns of noise across many samples and would like to try to normalise that out using other normal samples? We chose not to normalise to the normal directly, as we found that, in our cohort at least, blood samples tended to have more GC bias than tumor samples and the normalisation tended to produce more noise. A possibly better alternative would be to blacklist the highest noise regions. You could do this by altering the GCprofile file that is provided as input, based on your observations from unmatched normals. I can provide further details if required.
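
One way the blacklisting suggestion could be prototyped is sketched below. It only identifies candidate bins from a panel of unmatched normals. How those bins are then marked in the GCprofile file is not shown, since that depends on the file layout. The function name `noisy_bins`, the use of the median absolute deviation, and the 0.25 threshold are all assumptions for illustration, not PURPLE or COBALT behaviour.

```python
from statistics import median


def noisy_bins(panel_counts, max_deviation=0.25):
    """Return indices of bins that look unreliable across a panel of normals.

    panel_counts: list of lists, one list of per-bin read counts per normal.
    A bin is flagged when its depth-scaled coverage is either far from the
    expected level of 1 or varies strongly between normals. This assumes
    diploid autosomal bins; sex chromosomes would need separate handling.
    """
    # Scale each normal to a median of 1 so samples are comparable.
    scaled = []
    for sample in panel_counts:
        sample_median = median(c for c in sample if c > 0)
        scaled.append([c / sample_median for c in sample])

    flagged = []
    for i in range(len(scaled[0])):
        values = [s[i] for s in scaled]
        centre = median(values)
        # Median absolute deviation as a robust measure of spread.
        spread = median(abs(v - centre) for v in values)
        if abs(centre - 1.0) > max_deviation or spread > max_deviation:
            flagged.append(i)
    return flagged


panel = [[10, 10, 30, 10], [20, 20, 20, 20], [10, 10, 2, 10]]
print(noisy_bins(panel))
```

Here only the third bin is flagged, because its scaled coverage ranges from 0.2 to 3 across the three normals while the other bins are stable.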

The segmentation uses the R copynumber package. I have not looked at this in the last 2 years, so I am not really on top of the details at the moment.

DominikGlodzik commented 3 years ago

Thank you very much Peter.

Please see examples of QC plots for a sample that appears noisier.

[Five screenshots of PURPLE QC plots, taken 2020-11-24]

Please let me know what you think.

These are FFPE samples, and coverage is biased by other factors, beyond local GC%, like local read length. So yes, in a sense, systematic errors.

It would help me if I could see the R script that is used for segmentation within Cobalt. I could modify it to see if normal-adjusted coverage results in smoother segmentation. I see that the R script is called from Java as follows: int result = RExecutor.executeFromClasspath("r/ratioSegmentation.R", ratioFile, column, pcfFile); but do not understand how to find r/ratioSegmentation.R or where to locate it in the repository.

Many thanks Dominik

jonbaber commented 3 years ago

The segmentation R code is available here.

DominikGlodzik commented 3 years ago

Thank you very much - I will give it a go and let you know if my idea works.

DominikGlodzik commented 3 years ago

Thank you again for your help.

As a reminder, my analysis involves FFPE tissue samples without a matched normal sample. I realize this is not a setting many people find themselves in.

I modified the segmentation code to consider coverage in bins and normalized it by coverage in a corresponding bin in a panel of unmatched FFPE normal samples, followed by GC normalization. I also exclude further bins that are particularly noisy among the unmatched normals.
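
The GC normalization step mentioned above can be sketched generically as a median-per-GC-bucket correction. This is not Dominik's modified script, which is not shown in the thread, and it is not necessarily identical to COBALT's own implementation. The function name `gc_normalise` and the one-percent bucket width are assumptions.

```python
from collections import defaultdict
from statistics import median


def gc_normalise(counts, gc_content):
    """Divide each bin by the median count of bins with the same GC percentage.

    counts: per-bin read counts (for example after panel normalisation).
    gc_content: per-bin GC fraction between 0 and 1, same length as counts.
    Returns per-bin ratios; bins whose GC bucket has a zero median give None.
    """
    # Group bins into buckets of one GC percentage point each.
    buckets = defaultdict(list)
    for count, gc in zip(counts, gc_content):
        buckets[round(gc * 100)].append(count)
    bucket_median = {bucket: median(values) for bucket, values in buckets.items()}

    ratios = []
    for count, gc in zip(counts, gc_content):
        reference = bucket_median[round(gc * 100)]
        ratios.append(count / reference if reference > 0 else None)
    return ratios


counts = [100, 110, 200, 220]
gc = [0.40, 0.40, 0.60, 0.60]
print(gc_normalise(counts, gc))
```

In the toy input, bins at 60% GC are twice as deep as bins at 40% GC. After the correction both groups are centred on 1, so the remaining differences reflect within-bucket variation only.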

In my case, this does clean up the profiles to some extent. Please see an example below.

Default tumor-only analysis: [image: Unknown]

The same sample after my modification: [image: Unknown-1]

Thanks for your help! Purple is a wonderfully developed tool.

p-priestley commented 3 years ago

Thanks for the feedback. Will keep this in mind if we further develop the tumor only option.

teng-gao commented 2 years ago

A follow-up question on this issue of over-segmentation: I tried to change the min_diploid_tumor_ratio_count parameter, but it seems that I get back the same number of segments as before. Am I passing the argument incorrectly?

java -jar ~/hartwig/purple.jar \
        -tumor_only \
        -tumor $sample \
        -amber ~/external/WASHU/$sample/amber \
        -cobalt ~/external/WASHU/$sample/cobalt \
        -gc_profile ~/hartwig/GC_profile.1000bp.38.cnp \
        -ref_genome ~/ref/hg38.fa \
        -ref_genome_version V38 \
        -threads 16 \
        -output_dir ~/external/WASHU/$sample/purple \
        -min_diploid_tumor_ratio_count 1000

p-priestley commented 2 years ago

Hi Teng, sorry, this parameter only works when SVs are provided and is meant to reduce residual GC noise when we are confident that most of the genuine CN breakpoints will be mapped to concordant SV breakpoints. I updated the PURPLE readme to reflect this.

I think it would be dangerous to use this smoothing without SV breakpoints.

lipikakalson commented 3 months ago

Hi @p-priestley @DominikGlodzik, I am currently in a similar situation, analysing FFPE tumor samples that are particularly noisy. @p-priestley, I was wondering if the changes discussed above (normalisation against unmatched normals, or blacklisting of the noisiest regions) have been implemented in the latest version of the tool? @DominikGlodzik, if it's not too much trouble, could you provide some guidance on the modifications you made to the segmentation file? Alternatively, if you have a modified version of the file that you could share, it would be greatly appreciated.

Thank you in advance!

Kind regards, Lipika

p-priestley commented 2 months ago

Hi Lipika - we have not made any improvements around this topic as FFPE has not been a key focus for us, although it is becoming increasingly relevant to us.

The above improvement looks quite impressive, and I would be interested in pursuing the blacklist idea if Dominik can provide more details.