parklab / MosaicForecast

A mosaic detecting software based on phasing and random forest
MIT License

Final result filtering #18

Closed gevro closed 3 years ago

gevro commented 3 years ago

Hi, for the final .predictions file, do you have recommendations on how to filter for mosaic variants (which filters and thresholds to use, etc.)?

Also, how many mosaic variants do you expect on average for a normal blood sample? I'm getting ~150, which seems like a lot. Thanks.

douym commented 3 years ago

Hi @gevro, typically I would use the "prediction" column and select the sites predicted as mosaic. As for how many sites to expect per blood sample, that depends: what is your read depth, and what is the age of the individual? Are these variants mostly low-AF variants, and what types of substitutions are they? I ask about age because there could be clonal hematopoiesis, especially in older individuals.
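
As a rough illustration of the selection described above, here is a minimal Python sketch. It assumes the .predictions file is tab-delimited with a header row and that mosaic calls carry the label `mosaic` in the "prediction" column; the `id` and `AF` column names and the rows themselves are made up for the example and should be checked against the real file.

```python
import csv
import io

def select_mosaic(lines, label="mosaic"):
    """Return the rows whose 'prediction' column equals `label`.

    `lines` is any iterable of text lines, e.g. an open file handle.
    Assumes a tab-delimited table with a header row (an assumption,
    not something confirmed in this thread).
    """
    reader = csv.DictReader(lines, delimiter="\t")
    return [row for row in reader if row["prediction"] == label]

# Tiny in-memory stand-in for a .predictions file (made-up rows).
example = io.StringIO(
    "id\tAF\tprediction\n"
    "sample~chr1~100~T~G\t0.10\tmosaic\n"
    "sample~chr1~200~A~C\t0.48\thet\n"
    "sample~chr2~300~C~T\t0.07\tmosaic\n"
)
mosaic_rows = select_mosaic(example)
print(len(mosaic_rows), "rows predicted as mosaic")
```

For a real file, the same function could be called with `open(path)` in place of the in-memory example.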

gevro commented 3 years ago

Coverage is 40x, and the individual is 30 years old. All AFs are between 5% and 25%. 90% of the SNVs are either T>G or A>C (in roughly equal proportions).

Note that skin from the same person at 60x gives only 15 mutations.

Both are PCR-free libraries.

Does this make sense?

douym commented 3 years ago

Hi @gevro ,

Given the coverage, I would think the skin number is more reasonable, but if the two samples were sequenced on the same platform at a similar time, it's possible that something interesting is happening in the blood sample. Have you checked the 96-type signature profile, and does it look like any published COSMIC signature that is likely to be an artefact? If not, I would suggest picking some sites for validation.

One more question: how were the samples prepared? Are they FFPE samples? Chemical changes can happen during library preparation.

Best,

Yanmei

gevro commented 3 years ago

Hi, they are frozen samples, not FFPE.

Regarding mutation signature context: the MF final output has a 'context' column, but it looks like it might not be reverse complemented correctly. Here are the T>G and A>C variants in the sample. You can see that the context doesn't always match the variant. For example, the first line is T>G in context GAC, which I am guessing should actually be written as GTC. I can fix this manually, but it might be helpful to have another column in which the context is reverse complemented to match the standard COSMIC signature contexts.

sample~chr1~12526321~T~G    GAC
sample~chr1~109902767~T~G   ATG
sample~chr1~113957274~A~C   CAG
sample~chr1~163770040~A~C   TTG
sample~chr1~171208476~A~C   ATG
sample~chr1~206873024~T~G   ATG
sample~chr2~6814044~A~C GTG
sample~chr2~18660400~A~C    GTG
sample~chr2~26512911~A~C    GTG
sample~chr2~46356959~T~G    ATG
sample~chr2~54243644~A~C    GTG
sample~chr2~90073801~T~G    TTG
sample~chr2~96552875~T~G    GTG
sample~chr2~119516033~A~C   GTG
sample~chr2~120865745~A~C   TTG
sample~chr2~183499680~T~G   AAG
sample~chr2~198055911~A~C   ATG
sample~chr2~200039499~T~G   GTG
sample~chr2~220416809~A~C   ATG
sample~chr2~227361710~A~C   CAG
sample~chr2~233902687~T~G   GTG
sample~chr3~11653930~A~C    AAG
sample~chr3~13572181~A~C    GTG
sample~chr3~36857684~T~G    TTG
sample~chr3~45600576~T~G    GTT
sample~chr3~101844732~T~G   GTG
sample~chr3~106106153~T~G   CAT
sample~chr3~110803540~A~C   ATA
sample~chr3~138472336~A~C   ATG
sample~chr3~187139113~T~G   TTG
sample~chr4~1865397~A~C ATG
sample~chr4~31769221~A~C    ATG
sample~chr4~54706065~A~C    GAC
sample~chr4~61580710~A~C    ATG
sample~chr4~146268637~A~C   TTG
sample~chr4~146525132~A~C   TAC
sample~chr4~148938220~T~G   TTG
sample~chr4~159404340~A~C   TAT
sample~chr4~160786763~T~G   GTG
sample~chr4~185336367~A~C   ATG
sample~chr5~40336272~A~C    GTG
sample~chr5~43026995~A~C    AAA
sample~chr5~52664264~T~G    TTG
sample~chr5~112353302~T~G   GTT
sample~chr5~159845892~A~C   TTG
sample~chr5~161335262~T~G   GTG
sample~chr6~8559670~A~C GTG
sample~chr6~19089049~A~C    ATG
sample~chr6~33689263~A~C    ATG
sample~chr6~41094407~T~G    ATG
sample~chr6~51706183~A~C    ATG
sample~chr6~62054457~A~C    TAA
sample~chr6~72003103~A~C    AAG
sample~chr6~72616190~T~G    GTG
sample~chr6~95431520~T~G    TTG
sample~chr6~130353772~A~C   GTG
sample~chr7~47625048~A~C    GTG
sample~chr7~141102907~T~G   ATG
sample~chr8~32348206~A~C    ATG
sample~chr8~32471910~T~G    ATG
sample~chr8~81956949~A~C    TTT
sample~chr8~127288437~A~C   GTG
sample~chr8~128068489~T~G   GTT
sample~chr8~137328297~A~C   GTG
sample~chr9~35581148~T~G    TTT
sample~chr9~89588808~T~G    TTG
sample~chr9~111519982~T~G   ATT
sample~chr9~130635158~T~G   GTG
sample~chr9~134208675~T~G   ATT
sample~chr10~34237100~T~G   GTG
sample~chr10~55835644~A~C   ATG
sample~chr10~62049258~A~C   ATG
sample~chr10~69863766~A~C   TAT
sample~chr10~70917094~A~C   GTG
sample~chr10~72061220~A~C   ATG
sample~chr10~79018393~A~C   CTC
sample~chr10~96449725~T~G   TAG
sample~chr10~105249920~T~G  ATG
sample~chr10~111876461~T~G  ATG
sample~chr10~124866598~A~C  ATG
sample~chr11~6722574~T~G    ATG
sample~chr11~8092960~A~C    CAT
sample~chr11~17730264~A~C   ATG
sample~chr11~29076475~T~G   GTG
sample~chr11~39318232~T~G   ATG
sample~chr11~46329780~A~C   CAC
sample~chr11~65167886~A~C   TAG
sample~chr11~109363703~T~G  ATG
sample~chr11~115664069~T~G  ATG
sample~chr11~132136298~A~C  GTG
sample~chr12~3564695~T~G    TTG
sample~chr12~49002854~T~G   GTG
sample~chr12~53959430~A~C   CAA
sample~chr12~60682771~A~C   ATG
sample~chr12~76522166~A~C   ATG
sample~chr12~85883488~A~C   TAC
sample~chr12~104972440~T~G  ATG
sample~chr12~109178764~T~G  TTT
sample~chr12~115636321~T~G  ATG
sample~chr12~131118105~A~C  GTG
sample~chr13~69622254~T~G   CAA
sample~chr13~71490536~T~G   ATG
sample~chr13~105961692~A~C  ATG
sample~chr14~57890027~A~C   GTG
sample~chr15~40499611~A~C   TAC
sample~chr15~53606021~T~G   ATG
sample~chr15~64747357~T~G   CAG
sample~chr15~68867588~A~C   GTG
sample~chr15~70246053~T~G   TTG
sample~chr15~93562724~T~G   ATG
sample~chr16~3753384~A~C    TTG
sample~chr16~58122980~A~C   ATG
sample~chr16~80812689~A~C   TTT
sample~chr16~85554566~T~G   GTG
sample~chr16~87197112~T~G   ATG
sample~chr17~18989525~A~C   GTG
sample~chr17~45002638~A~C   ATT
sample~chr17~54062520~T~G   GTG
sample~chr17~74802292~T~G   TAG
sample~chr18~6208991~A~C    GAT
sample~chr18~28585148~A~C   ATT
sample~chr18~28913765~T~G   ATG
sample~chr18~45313179~T~G   ATG
sample~chr19~1264257~A~C    CAG
sample~chr19~16689791~A~C   CTG
sample~chr19~33260526~T~G   ACA
sample~chr19~34846723~A~C   ATG
sample~chr19~47226007~T~G   ATG
sample~chr20~39824231~A~C   GTG
sample~chr21~37906394~T~G   GTG
sample~chr21~39166790~T~G   GTG
sample~chr22~25150713~A~C   ATG
sample~chr22~25973864~A~C   ATG
sample~chr22~28319417~A~C   TAG
sample~chr22~44068269~T~G   ATG
sample~chr22~49522039~T~G   ATG
sample~chrX~14875752~T~G    CAT
sample~chrX~17129347~A~C    ATG

Here is the data after fixing the mutation contexts (i.e., reverse complementing when necessary):

sample chr1 12526321 T G GTC
sample  chr1    109902767   T   G   ATG
sample chr1 113957274 T G CTG
sample chr1 163770040 T G TTG
sample chr1 171208476 T G ATG
sample  chr1    206873024   T   G   ATG
sample chr2 6814044 T G GTG
sample chr2 18660400 T G GTG
sample chr2 26512911 T G GTG
sample  chr2    46356959    T   G   ATG
sample chr2 54243644 T G GTG
sample  chr2    90073801    T   G   TTG
sample  chr2    96552875    T   G   GTG
sample chr2 119516033 T G GTG
sample chr2 120865745 T G TTG
sample chr2 183499680 T G CTT
sample chr2 198055911 T G ATG
sample  chr2    200039499   T   G   GTG
sample chr2 220416809 T G ATG
sample chr2 227361710 T G CTG
sample  chr2    233902687   T   G   GTG
sample chr3 11653930 T G CTT
sample chr3 13572181 T G GTG
sample  chr3    36857684    T   G   TTG
sample  chr3    45600576    T   G   GTT
sample  chr3    101844732   T   G   GTG
sample chr3 106106153 T G ATG
sample chr3 110803540 T G ATA
sample chr3 138472336 T G ATG
sample  chr3    187139113   T   G   TTG
sample chr4 1865397 T G ATG
sample chr4 31769221 T G ATG
sample chr4 54706065 T G GTC
sample chr4 61580710 T G ATG
sample chr4 146268637 T G TTG
sample chr4 146525132 T G GTA
sample  chr4    148938220   T   G   TTG
sample chr4 159404340 T G ATA
sample  chr4    160786763   T   G   GTG
sample chr4 185336367 T G ATG
sample chr5 40336272 T G GTG
sample chr5 43026995 T G TTT
sample  chr5    52664264    T   G   TTG
sample  chr5    112353302   T   G   GTT
sample chr5 159845892 T G TTG
sample  chr5    161335262   T   G   GTG
sample chr6 8559670 T G GTG
sample chr6 19089049 T G ATG
sample chr6 33689263 T G ATG
sample  chr6    41094407    T   G   ATG
sample chr6 51706183 T G ATG
sample chr6 62054457 T G TTA
sample chr6 72003103 T G CTT
sample  chr6    72616190    T   G   GTG
sample  chr6    95431520    T   G   TTG
sample chr6 130353772 T G GTG
sample chr7 47625048 T G GTG
sample  chr7    141102907   T   G   ATG
sample chr8 32348206 T G ATG
sample  chr8    32471910    T   G   ATG
sample chr8 81956949 T G TTT
sample chr8 127288437 T G GTG
sample  chr8    128068489   T   G   GTT
sample chr8 137328297 T G GTG
sample  chr9    35581148    T   G   TTT
sample  chr9    89588808    T   G   TTG
sample  chr9    111519982   T   G   ATT
sample  chr9    130635158   T   G   GTG
sample  chr9    134208675   T   G   ATT
sample  chr10   34237100    T   G   GTG
sample chr10 55835644 T G ATG
sample chr10 62049258 T G ATG
sample chr10 69863766 T G ATA
sample chr10 70917094 T G GTG
sample chr10 72061220 T G ATG
sample chr10 79018393 T G CTC
sample chr10 96449725 T G CTA
sample  chr10   105249920   T   G   ATG
sample  chr10   111876461   T   G   ATG
sample chr10 124866598 T G ATG
sample  chr11   6722574 T   G   ATG
sample chr11 8092960 T G ATG
sample chr11 17730264 T G ATG
sample  chr11   29076475    T   G   GTG
sample  chr11   39318232    T   G   ATG
sample chr11 46329780 T G GTG
sample chr11 65167886 T G CTA
sample  chr11   109363703   T   G   ATG
sample  chr11   115664069   T   G   ATG
sample chr11 132136298 T G GTG
sample  chr12   3564695 T   G   TTG
sample  chr12   49002854    T   G   GTG
sample chr12 53959430 T G TTG
sample chr12 60682771 T G ATG
sample chr12 76522166 T G ATG
sample chr12 85883488 T G GTA
sample  chr12   104972440   T   G   ATG
sample  chr12   109178764   T   G   TTT
sample  chr12   115636321   T   G   ATG
sample chr12 131118105 T G GTG
sample chr13 69622254 T G TTG
sample  chr13   71490536    T   G   ATG
sample chr13 105961692 T G ATG
sample chr14 57890027 T G GTG
sample chr15 40499611 T G GTA
sample  chr15   53606021    T   G   ATG
sample chr15 64747357 T G CTG
sample chr15 68867588 T G GTG
sample  chr15   70246053    T   G   TTG
sample  chr15   93562724    T   G   ATG
sample chr16 3753384 T G TTG
sample chr16 58122980 T G ATG
sample chr16 80812689 T G TTT
sample  chr16   85554566    T   G   GTG
sample  chr16   87197112    T   G   ATG
sample chr17 18989525 T G GTG
sample chr17 45002638 T G ATT
sample  chr17   54062520    T   G   GTG
sample chr17 74802292 T G CTA
sample chr18 6208991 T G ATC
sample chr18 28585148 T G ATT
sample  chr18   28913765    T   G   ATG
sample  chr18   45313179    T   G   ATG
sample chr19 1264257 T G CTG
sample chr19 16689791 T G CTG
sample chr19 33260526 T G TGT
sample chr19 34846723 T G ATG
sample  chr19   47226007    T   G   ATG
sample chr20 39824231 T G GTG
sample  chr21   37906394    T   G   GTG
sample  chr21   39166790    T   G   GTG
sample chr22 25150713 T G ATG
sample chr22 25973864 T G ATG
sample chr22 28319417 T G CTA
sample  chr22   44068269    T   G   ATG
sample  chr22   49522039    T   G   ATG
sample chrX 14875752 T G ATG
sample chrX 17129347 T G ATG

Here is the resulting mutation context summary, all of which are T>G changes:

      3 ATA
      1 ATC
     51 ATG
      4 ATT
      4 CTA
      1 CTC
      5 CTG
      3 CTT
      3 GTA
      2 GTC
     35 GTG
      3 GTT
      1 TGT
      1 TTA
     16 TTG
      5 TTT

The most abundant ones are ATG, GTG, and TTG. It does not match any cancer signatures, but it perhaps matches (though not perfectly) the SBS55 sequencing artifact signature: https://cancer.sanger.ac.uk/cosmic/signatures/SBS/SBS55.tt

Have you seen this in MF results before? Sequencing was PCR-free on a NovaSeq.

Note, this might be related to what is described in this paper: https://genomemedicine.biomedcentral.com/articles/10.1186/s13073-019-0695-x "Thymine to Guanine transversion artifacts in Guanine-rich context"

It looks like the Firevat tool may help filter these, but perhaps the RF model of MF should be able to detect sequencing artifacts too? But this is just a hypothesis. Maybe these are real.
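
The manual context fix and tally described in this comment can be sketched in Python as follows. This is only an illustration of the usual pyrimidine-centered convention (reverse complement a trinucleotide when its center base is A or G); the function names are hypothetical, and the input list is a handful of contexts copied from the first table above.

```python
from collections import Counter

# Translation table for complementing DNA bases.
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(COMPLEMENT)[::-1]

def to_pyrimidine_context(context):
    """Reverse complement a trinucleotide if its center base is a purine."""
    if context[1] in "AG":
        return reverse_complement(context)
    return context

# A few contexts from the posted table (GAC, CAG, AAG, CAT need flipping).
contexts = ["GAC", "ATG", "CAG", "TTG", "AAG", "CAT"]
normalized = [to_pyrimidine_context(c) for c in contexts]
counts = Counter(normalized)

for ctx, n in sorted(counts.items()):
    print(n, ctx)
```

Note that a context whose center base is already a pyrimidine, such as ACA, is returned unchanged by this sketch.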

gevro commented 3 years ago

UPDATE: It looks like there is a bug in Mutect2. It is never emitting the strand_bias filter. And all the above variants would be filtered by the strand_bias filter. I have written to the GATK team about this.

douym commented 3 years ago

Hi @gevro ,

As for the context, it is not always reverse complemented because I choose the context most frequently observed in the sequenced reads; it is not the reference-genome context (please refer to the code).

If you want to extract the 3-nucleotide context for signature analysis, you can easily do that with bedtools getfasta.

Best,

Yanmei
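
One way to prepare input for the bedtools getfasta suggestion above is to turn each variant into a 3-base BED interval. The Python sketch below assumes the ID layout seen in this thread (`sample~chrom~pos~ref~alt`) and that `pos` is 1-based; the function name is hypothetical.

```python
def variant_to_bed(variant_id):
    """Convert 'sample~chrom~pos~ref~alt' into a BED line spanning the
    base before, at, and after the variant position.

    Assumes `pos` is 1-based, which is an assumption of this sketch.
    """
    _sample, chrom, pos, _ref, _alt = variant_id.split("~")
    pos = int(pos)
    # BED is 0-based and half-open: 1-based [pos-1, pos+1] -> [pos-2, pos+1)
    return f"{chrom}\t{pos - 2}\t{pos + 1}"

ids = ["sample~chr1~12526321~T~G", "sample~chr1~109902767~T~G"]
bed_lines = [variant_to_bed(v) for v in ids]
print("\n".join(bed_lines))
```

The resulting lines could be written to a file and passed to `bedtools getfasta` with `-fi` (the reference FASTA) and `-bed` (the interval file) to retrieve the reference trinucleotides.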

douym commented 3 years ago

Hi @gevro ,

Thanks for the update! Did you use the "sb_p" or "sb_read12_p" of MosaicForecast to filter these variants?

Best,

Yanmei

gevro commented 3 years ago

I will filter strand bias upstream right after Mutect2. But I'm curious - why doesn't MosaicForecast use that to filter automatically as part of the RF model? Or is there something about my sample that caused it to not be filtered?

douym commented 3 years ago

Hi @gevro ,

Strand bias is one of the features in the RF model. MosaicForecast does not apply filters to variants step by step; instead, it learns non-linear combinations of features by training a machine-learning model. Are the values in the sb_p column significant for these variants?

Thanks,

Yanmei

gevro commented 3 years ago

Here are the sb_p values for the T>G and A>C mosaic variants. Are these significant? If so, should they have been filtered out, or should I manually filter them out using sb_p with some threshold?

Note: I have noticed that NovaSeq data often, but not always, contains a large number of these types of artifacts. Perhaps the model you provide is not optimized for this kind of data, so sb_p is not weighted heavily enough in the model? If so, I can try manual filtering.

sample~chr1~12526321~T~G    0.0421455938697318
sample~chr1~109902767~T~G   0.00462523719165084
sample~chr1~113957274~A~C   0.00869565217391303
sample~chr1~163770040~A~C   0.272727272727272
sample~chr1~171208476~A~C   0.0227236612108693
sample~chr1~206873024~T~G   0.0744847014196859
sample~chr2~6814044~A~C 0.267778198812681
sample~chr2~18660400~A~C    0.0188034188034189
sample~chr2~26512911~A~C    0.0226281628720653
sample~chr2~46356959~T~G    0.0580009534403304
sample~chr2~54243644~A~C    0.1029374201788
sample~chr2~90073801~T~G    0.0308123249299719
sample~chr2~96552875~T~G    0.113064713064713
sample~chr2~119516033~A~C   0.105226841469726
sample~chr2~120865745~A~C   0.00638792102206738
sample~chr2~183499680~T~G   0.102639296187683
sample~chr2~198055911~A~C   0.00344462109167993
sample~chr2~200039499~T~G   0.00451890989988876
sample~chr2~220416809~A~C   0.0407407407407407
sample~chr2~227361710~A~C   0.0168582375478927
sample~chr2~233902687~T~G   0.0184042875922742
sample~chr3~11653930~A~C    0.030103995621237
sample~chr3~13572181~A~C    0.00945378151260503
sample~chr3~36857684~T~G    0.00831600831600832
sample~chr3~45600576~T~G    0.249999999999999
sample~chr3~101844732~T~G   0.0138941518251863
sample~chr3~106106153~T~G   0.091733870967742
sample~chr3~110803540~A~C   0.22
sample~chr3~138472336~A~C   0.0443548387096775
sample~chr3~187139113~T~G   0.12375533428165
sample~chr4~1865397~A~C 0.010673624288425
sample~chr4~31769221~A~C    0.0525030525030525
sample~chr4~54706065~A~C    0.0977777777777777
sample~chr4~61580710~A~C    0.0275569687334394
sample~chr4~146268637~A~C   0.044477028347996
sample~chr4~146525132~A~C   0.534274193548388
sample~chr4~148938220~T~G   0.112935595694216
sample~chr4~159404340~A~C   0.0634615384615383
sample~chr4~160786763~T~G   0.0184042875922742
sample~chr4~185336367~A~C   0.082758620689655
sample~chr5~40336272~A~C    0.0149124026696329
sample~chr5~43026995~A~C    1
sample~chr5~52664264~T~G    0.0260142957817376
sample~chr5~112353302~T~G   0.49217638691323
sample~chr5~159845892~A~C   0.104143492769744
sample~chr5~161335262~T~G   0.0104878436357858
sample~chr6~8559670~A~C 0.00757455373695641
sample~chr6~19089049~A~C    0.0182256767622621
sample~chr6~33689263~A~C    0.113377926421405
sample~chr6~41094407~T~G    0.0169934640522876
sample~chr6~51706183~A~C    0.113064713064713
sample~chr6~62054457~A~C    0.174358974358974
sample~chr6~72003103~A~C    0.0308972073677956
sample~chr6~72616190~T~G    0.00981047937569677
sample~chr6~95431520~T~G    0.0637254901960785
sample~chr6~130353772~A~C   0.128549711158407
sample~chr7~47625048~A~C    0.0391304347826087
sample~chr7~141102907~T~G   0.0567037625861157
sample~chr8~32348206~A~C    0.00925047438330168
sample~chr8~32471910~T~G    0.267778198812681
sample~chr8~81956949~A~C    0.501976284584979
sample~chr8~127288437~A~C   0.0130211182842762
sample~chr8~128068489~T~G   0.0956521739130436
sample~chr8~137328297~A~C   0.0190615835777126
sample~chr9~35581148~T~G    0.487179487179487
sample~chr9~89588808~T~G    0.118177000529942
sample~chr9~111519982~T~G   0.0478424015009378
sample~chr9~130635158~T~G   0.0314075034438052
sample~chr9~134208675~T~G   1
sample~chr10~34237100~T~G   0.267778198812681
sample~chr10~55835644~A~C   0.00796568627450983
sample~chr10~62049258~A~C   0.0331103678929767
sample~chr10~69863766~A~C   0.213903743315508
sample~chr10~70917094~A~C   0.0506566604127579
sample~chr10~72061220~A~C   0.00150654156204572
sample~chr10~79018393~A~C   0.0282051282051282
sample~chr10~96449725~T~G   0.0372670807453416
sample~chr10~105249920~T~G  0.113148289618878
sample~chr10~111876461~T~G  0.267692307692308
sample~chr10~124866598~A~C  0.0534139711677998
sample~chr11~6722574~T~G    0.0421455938697319
sample~chr11~8092960~A~C    0.22
sample~chr11~17730264~A~C   0.0580009534403304
sample~chr11~29076475~T~G   0.0107475896949581
sample~chr11~39318232~T~G   0.00642843637738704
sample~chr11~46329780~A~C   0.224137931034483
sample~chr11~65167886~A~C   0.0902255639097742
sample~chr11~109363703~T~G  0.213903743315508
sample~chr11~115664069~T~G  0.00108777508158313
sample~chr11~132136298~A~C  0.00256957621758381
sample~chr12~3564695~T~G    0.138089125252154
sample~chr12~49002854~T~G   0.0667155425219941
sample~chr12~53959430~A~C   0.00542269187986651
sample~chr12~60682771~A~C   0.279821627647715
sample~chr12~76522166~A~C   0.0310559006211181
sample~chr12~85883488~A~C   0.146358543417367
sample~chr12~104972440~T~G  0.0201960201960202
sample~chr12~109178764~T~G  0.231527093596059
sample~chr12~115636321~T~G  0.0275438317707505
sample~chr12~131118105~A~C  0.0658199790569089
sample~chr13~69622254~T~G   0.48
sample~chr13~71490536~T~G   0.0222897669706181
sample~chr13~105961692~A~C  0.0184042875922742
sample~chr14~57890027~A~C   0.050224313382208
sample~chr15~40499611~A~C   0.00384357559626985
sample~chr15~53606021~T~G   0.0149124026696329
sample~chr15~64747357~T~G   0.0170697012802276
sample~chr15~68867588~A~C   0.113011040193203
sample~chr15~70246053~T~G   0.276604539762434
sample~chr15~93562724~T~G   0.00188898575995351
sample~chr16~3753384~A~C    0.034920634920635
sample~chr16~58122980~A~C   0.263817663817664
sample~chr16~80812689~A~C   0.370370370370371
sample~chr16~85554566~T~G   1
sample~chr16~87197112~T~G   0.0237154150197629
sample~chr17~18989525~A~C   0.0527119938884643
sample~chr17~45002638~A~C   0.104143492769744
sample~chr17~54062520~T~G   0.00333704115684092
sample~chr17~74802292~T~G   0.0206677265500795
sample~chr18~6208991~A~C    0.050224313382208
sample~chr18~28585148~A~C   0.29575642573745
sample~chr18~28913765~T~G   0.0596293949952489
sample~chr18~45313179~T~G   0.0061669829222011
sample~chr19~1264257~A~C    0.487179487179488
sample~chr19~16689791~A~C   0.00350389321468297
sample~chr19~33260526~T~G   0.126056580990167
sample~chr19~34846723~A~C   0.0034138655462185
sample~chr19~47226007~T~G   0.28
sample~chr20~39824231~A~C   0.00284629981024669
sample~chr21~37906394~T~G   0.113377926421405
sample~chr21~39166790~T~G   0.106732348111658
sample~chr22~25150713~A~C   0.0168582375478927
sample~chr22~25973864~A~C   0.00532350532350534
sample~chr22~28319417~A~C   0.043381535038932
sample~chr22~44068269~T~G   0.0265793343457016
sample~chr22~49522039~T~G   0.0111529858749272
sample~chrX~14875752~T~G    0.24025974025974
sample~chrX~17129347~A~C    0.00583982202447163
douym commented 3 years ago

Here are the sb_p values for the T>G and A>C mosaic variants. Are these significant? If so, should they have been filtered out or should I manually filter out using sb_p with some threshold?

Note: I have noticed that Novaseq data, often but not always, has a large number of these types of artifacts. Perhaps the model you provide is not optimized for this kind of data so sb_p is not weighted enough in the model? If so, I can try manual filtering.


Hi @gevro, yes, most of them seem very significant. Thanks for your valuable opinion! I agree that if NovaSeq data has this kind of bias, you could remove these variants.
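
The manual filter discussed above could be sketched roughly as follows. This is only an illustration, not part of MosaicForecast: the helper names (`parse_sb_p`, `split_by_strand_bias`), the 0.05 cutoff, and the choice to apply the filter only to T>G and A>C calls are all assumptions. The input format is the `sample~chrom~pos~ref~alt` ID followed by `sb_p`, as in the list posted in this thread.

```python
from typing import Iterable, List, Set, Tuple

Record = Tuple[str, float]  # (variant ID, sb_p)


def parse_sb_p(lines: Iterable[str]) -> List[Record]:
    """Parse 'sample~chrom~pos~ref~alt <whitespace> sb_p' lines."""
    records = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        variant_id, value = line.split()
        records.append((variant_id, float(value)))
    return records


def split_by_strand_bias(
    records: Iterable[Record],
    threshold: float = 0.05,  # assumed cutoff, not a recommended value
    substitutions: Set[Tuple[str, str]] = frozenset({("T", "G"), ("A", "C")}),
) -> Tuple[List[Record], List[Record]]:
    """Return (kept, removed); a call is removed when it is one of the
    suspect substitution types and its sb_p is below the threshold."""
    kept, removed = [], []
    for variant_id, sb_p in records:
        # rsplit keeps sample names that themselves contain '~' intact
        _sample, _chrom, _pos, ref, alt = variant_id.rsplit("~", 4)
        if (ref, alt) in substitutions and sb_p < threshold:
            removed.append((variant_id, sb_p))
        else:
            kept.append((variant_id, sb_p))
    return kept, removed


# A few rows taken from the list in this thread
example = """
sample~chr11~115664069~T~G  0.00108777508158313
sample~chr12~3564695~T~G    0.138089125252154
sample~chr13~69622254~T~G   0.48
sample~chr15~40499611~A~C   0.00384357559626985
sample~chr16~85554566~T~G   1
""".splitlines()

kept, removed = split_by_strand_bias(parse_sb_p(example))
print(len(kept), "kept;", len(removed), "removed")  # 3 kept; 2 removed
```

The cutoff is best chosen by looking at the sb_p distribution of your own data, and ideally checked against a few validated sites, since a fixed value may not transfer between sequencing runs.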

Thanks so much again!

Best wishes,

Yanmei