luntergroup / octopus

Bayesian haplotype-based mutation calling
MIT License

LBQ filter on all somatic variant in UMI cancer paired data #210

Open X-Mialhe opened 3 years ago

X-Mialhe commented 3 years ago

Describe the bug

When I analyse paired UMI cancer data with the options below, every somatic variant is flagged with at least the LBQ filter. This is odd, because the same analysis run with default filtering or with the random forest gives some PASS somatic variants. I have provided BAM files covering a variant that I am sure is present, along with the full somatic VCF. Also, in my random forest run, the great majority of my variants have RFGQ_all ~3, which is quite low, so I don't think these models are appropriate for my data, unless scores like this are normal for this type of filter?

Feel free to tell me if I am not clear enough!

Thanks, Xavier Mialhe

Version

$ octopus --version
octopus v0.7.4 (develop 815f4f05)

Command

Command line to install octopus:

singularity build octopus.def 

Command line to run octopus:

singularity exec --bind '/data' /data/software/octopus/octopus.sif octopus \
    -R /data/Genome_data/Homo_sapiens/Hg19/Sequence/Genome/Hg19_concatenate.fa \
    -I exemple_TP53_sample_1F.bam exemple_TP53_sample_1S.bam --normal-sample 1S.normal \
    --threads 4 -B 12G \
    -P 2 \
    -o ./exemple_TP53_1F.vcf \
    --regions-file /data/210811_A00924_0218_AHGNL2DRXY_Goze/Unaligned/Goze_C_1/S3128717_Covered_tabSeparated.bed \
    --sequence-error-model PCR.NOVASEQ \
    --allow-octopus-duplicates \
    --downsample-above 1000 --downsample-target 1000 \
    --somatic-filter-expression "GQ < 100 | MQ < 30 | SB > 0.2 | SD[.25] > 0.1 | BQ < 40 | DP < 100 | MF > 0.1 | AD < 5 | CC > 1.1 | REB > 0.2" \
    --annotations AF AD \
    --bamout ./1F.realigned.bam \
    --bamout-type FULL \
    > ./1F.test_UMI.log 2>&1

dancooke commented 3 years ago

Thanks for the report and data. This turns out to be quite trivial: the term BQ < 40 in --somatic-filter-expression is what triggers the LBQ filter. Adding BQ to --annotations shows that the annotation is being set appropriately:

chr17   7578190 .   T   C   10210.9 LBQ AC=1;AN=5;DP=1106;MP=21.04;MQ=60;NS=2;PP=10006.2;SOMATIC    GT:GQ:DP:MQ:PS:PQ:HSS:HPC:MAP_HF:HF_CR:AD:AF:BQ:FT  1|0|0:3076:605:60:7578115:10:1,0,0:891.819,92,92:0.83,0.085,0.085:0.81,0.85,0.072,0.1,0.072,0.1:149,689:0.178,0.822:37,37:LBQ   0|0:3076:501:60:7578115:10:0,0:441.643,439.357:0.5,0.5:0.47,0.53,0.47,0.53:720,.:1,.:37,.:PASS

It's been a while since I put together the UMI config, but BQ < 40 seems unreasonably restrictive; I think it's probably a mistake as I didn't add a note as with the other modified metrics (the default is BQ < 20).
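
For anyone wanting to check this on their own calls, here is a rough Python sketch that pulls the BQ values out of the tumour sample column of the record above and compares them with the two thresholds mentioned (40 and the default 20). The helper names are made up for illustration, and the rule "any reported BQ value below the threshold triggers the filter" is an assumption about how the expression behaves, not a description of Octopus internals.

```python
# FORMAT and tumour sample columns copied from the chr17:7578190 record above.
FORMAT = "GT:GQ:DP:MQ:PS:PQ:HSS:HPC:MAP_HF:HF_CR:AD:AF:BQ:FT"
TUMOUR = (
    "1|0|0:3076:605:60:7578115:10:1,0,0:891.819,92,92:0.83,0.085,0.085:"
    "0.81,0.85,0.072,0.1,0.072,0.1:149,689:0.178,0.822:37,37:LBQ"
)


def sample_fields(fmt, sample):
    """Pair each FORMAT key with the matching value in a sample column."""
    return dict(zip(fmt.split(":"), sample.split(":")))


def bq_values(fmt, sample):
    """Return the reported BQ values as integers, skipping missing ('.') entries."""
    raw = sample_fields(fmt, sample)["BQ"]
    return [int(v) for v in raw.split(",") if v != "."]


def fails_bq(fmt, sample, threshold):
    """Assumed rule (hypothetical): fail if any reported BQ is below the threshold."""
    return any(v < threshold for v in bq_values(fmt, sample))


if __name__ == "__main__":
    print("BQ values:", bq_values(FORMAT, TUMOUR))
    print("fails BQ < 40:", fails_bq(FORMAT, TUMOUR, 40))
    print("fails BQ < 20:", fails_bq(FORMAT, TUMOUR, 20))
```

With BQ reported as 37, the BQ < 40 term fires while BQ < 20 would not, which is consistent with the LBQ flag on this call.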

X-Mialhe commented 3 years ago

Thanks a lot, it was very simple indeed... Do you have any recommendations on what type of filtering I should use for this kind of data? I'm not very used to dealing with UMIs. Do you think the forest files you provide are effective, or should I keep trying to adjust the thresholds?