Input bam and output results

byeongill commented 1 year ago

Hi, I have tumor-only WGS data from nodulous and tumorous tissue. Your program is very interesting because filtering germline mutations in such samples has seemed very difficult, almost impossible. I have some questions.

Should the BAM be processed or raw? (MarkDuplicates, base recalibration, indel realignment)
Should I use the Mutect2 VCF from before running FilterMutectCalls?
I used a preprocessed BAM and a filtered VCF, and the results for my data do not show a SOMATIC or GERMLINE indication in the INFO column. Should I classify variants as SOMATIC manually, using the cnn-score?

Thanks

sergeyvilov commented 1 year ago

Hi,

DeepSom does not need any special preprocessing of BAM files, but such preprocessing may be required by variant calling tools such as Mutect2 (see Data pre-processing for variant discovery). In our experience, base recalibration is extremely time-consuming but does not noticeably change the caller's output.
If you prepare training data, Mutect2 and FilterMutectCalls are usually required to get somatic variants. For subsequent inference on unlabelled data, one needs only the raw output of Mutect2 in tumor-only mode (plus the corresponding BAM files).
To obtain training data, one needs to call all variants with Mutect2 in tumor-only mode (OUTPUT1). Then, one runs Mutect2 again, this time with a matched normal and followed by FilterMutectCalls, to get true somatic mutations (OUTPUT2); see Somatic short variant discovery (SNVs + Indels). Finally, one assigns a SOMATIC tag to those variants in OUTPUT1 that are also in OUTPUT2.
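The labelling step in the last point could be sketched roughly as follows. This is only an illustration, not DeepSom's own code: the function names are made up, variants are matched on exact (CHROM, POS, REF, ALT), only OUTPUT2 records with FILTER equal to PASS are treated as somatic, and SOMATIC is written as a flag in the INFO column. Multi-allelic records and differently normalized indels would need extra handling.

```python
def variant_key(line):
    """Return (CHROM, POS, REF, ALT) for one tab-separated VCF data line."""
    chrom, pos, _id, ref, alt = line.rstrip("\n").split("\t")[:5]
    return (chrom, pos, ref, alt)


def load_somatic_keys(output2_lines, pass_only=True):
    """Collect keys of OUTPUT2 variants (assumption: PASS means true somatic)."""
    keys = set()
    for line in output2_lines:
        if line.startswith("#") or not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        if pass_only and fields[6] != "PASS":
            continue
        keys.add(variant_key(line))
    return keys


def tag_somatic(output1_lines, somatic_keys):
    """Yield OUTPUT1 lines, adding a SOMATIC flag to INFO for shared variants."""
    for line in output1_lines:
        if line.startswith("#CHROM"):
            # Declare the flag in the header so the output stays a valid VCF.
            yield '##INFO=<ID=SOMATIC,Number=0,Type=Flag,Description="Also in OUTPUT2">\n'
            yield line
            continue
        if line.startswith("#") or not line.strip():
            yield line
            continue
        fields = line.rstrip("\n").split("\t")
        if variant_key(line) in somatic_keys:
            info = fields[7]
            fields[7] = "SOMATIC" if info in (".", "") else info + ";SOMATIC"
        yield "\t".join(fields) + "\n"
```

Both functions take any iterable of lines, so they work on open file handles for uncompressed VCFs (for .vcf.gz, gzip.open(path, "rt") would be needed).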

We're currently working on the DeepSom publication; a link to the paper will be added as soon as it's published.

byeongill commented 1 year ago

Hi,

I understand that it doesn't matter whether the BAM is preprocessed or raw. I have sorted and indexed the preprocessed BAM, but not the raw one, so I want to use the preprocessed BAM.
I have a few matched samples and variants from Mutect2 tumor-normal calling (5 samples for tissue type A, 4 samples for tissue type B, 2 samples for tissue type C). I think the number of samples is too small for training. I did not prepare training data and instead used https://github.com/sergeyvilov/DeepSom/blob/main/cnn/models/gnomAD_thr_0/LINC-JP_gnomAD_thr_0_epoch_20_weights_model. Do you recommend training a personalized model?
If there is no SOMATIC tag, how do I classify somatic variants in "final_predictions.vcf"? I think a high cnn-score represents a putative somatic mutation, is that right?

sergeyvilov commented 1 year ago

Your number of samples indeed looks too small for training. LINC-JP is a liver cancer dataset, so the model trained on it is the better choice if your study also concerns liver cancer. In any case, I would recommend calling somatic variants on your 11 samples with the GATK pipeline and then evaluating the pre-trained DeepSom LINC-JP (or any other) model on these calls. This will give you an idea of how well DeepSom will classify variants of unknown somatic status in your case.
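One possible way to do such an evaluation, assuming you have already collected a (cnn_score, is_somatic) pair for every variant in the 11 matched samples, is to compute a ROC AUC and the precision and recall at a chosen cut-off. The sketch below is plain Python and not part of DeepSom; the function names and the cut-off value are made up for illustration.

```python
def roc_auc(pairs):
    """ROC AUC from (score, is_somatic) pairs via the rank-sum formula.

    Equals the probability that a random somatic variant gets a higher
    score than a random non-somatic one (ties count as one half).
    """
    ranked = sorted(pairs, key=lambda p: p[0])
    n_pos = sum(1 for _, y in ranked if y)
    n_neg = len(ranked) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one somatic and one non-somatic variant")
    rank_sum = 0.0
    i = 0
    while i < len(ranked):
        # Find the block of tied scores ranked[i:j] and give it the average rank.
        j = i
        while j < len(ranked) and ranked[j][0] == ranked[i][0]:
            j += 1
        avg_rank = (i + 1 + j) / 2.0  # mean of the 1-based ranks i+1 .. j
        rank_sum += avg_rank * sum(1 for _, y in ranked[i:j] if y)
        i = j
    return (rank_sum - n_pos * (n_pos + 1) / 2.0) / (n_pos * n_neg)


def precision_recall(pairs, threshold):
    """Precision and recall when calling score >= threshold somatic."""
    tp = sum(1 for s, y in pairs if y and s >= threshold)
    fp = sum(1 for s, y in pairs if not y and s >= threshold)
    fn = sum(1 for s, y in pairs if y and s < threshold)
    precision = tp / (tp + fp) if tp + fp else float("nan")
    recall = tp / (tp + fn) if tp + fn else float("nan")
    return precision, recall
```

An AUC close to 1 on your own matched samples would suggest the pre-trained model transfers well to your tissue types. Trying precision_recall at several cut-offs can help pick a threshold for the samples without a matched normal.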

"final_predictions.vcf" is DeepSom's output; you have no control over its contents. If the SOMATIC tag wasn't in the initial VCF file, it will not appear in "final_predictions.vcf". Indeed, DeepSom assigns the cnn_score: the higher the score, the higher the probability that the variant is somatic.
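If you want a list of putative somatic variants, you could threshold the score yourself. Here is a minimal sketch, assuming the score is stored in the INFO column as cnn_score=<number> (please check this against your own output file); the function names are made up, and the cut-off is a parameter you have to choose, not a value recommended by DeepSom.

```python
def parse_cnn_score(info):
    """Return the cnn_score from a VCF INFO string, or None if it is absent."""
    for entry in info.split(";"):
        key, _, value = entry.partition("=")
        if key == "cnn_score" and value:
            return float(value)
    return None


def select_putative_somatic(vcf_lines, threshold):
    """Yield header lines plus the variants with cnn_score >= threshold."""
    for line in vcf_lines:
        if line.startswith("#"):
            yield line
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 8:
            continue  # skip blank or malformed lines
        score = parse_cnn_score(fields[7])
        if score is not None and score >= threshold:
            yield line
```

For example, opening "final_predictions.vcf" for reading and writing the lines from select_putative_somatic(handle, threshold) to a new file keeps the header and only the high-scoring variants.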

sergeyvilov / DeepSom
