sigven / pcgr

Personal Cancer Genome Reporter (PCGR)
https://sigven.github.io/pcgr
MIT License
254 stars 48 forks source link

control_dp_min is not properly transfered #136

Closed gudeqing closed 3 months ago

gudeqing commented 3 years ago

image

sigven commented 3 years ago

Hi @gudeqing, Thanks a lot for noticing this! Indeed a bug, pushed a fix to pcgr.py now.

best, Sigve

gudeqing commented 3 years ago

Hi @sigve, will you check if the following arguments work as expected ? --tumor_dp_min 5 \ --tumor_af_min 0.05 \ --tumor_dp_tag TDP \ --tumor_af_tag TAF \ --control_dp_tag NDP \ --control_af_tag NAF

best wishes !

sigven commented 3 years ago

Sure thing, if you have an example VCF that you think adheres to these parameters, I could do such a check. I would also need the assembly version (grch37 or 38)

gudeqing commented 3 years ago

Hi, sigven, Nice to have it work properly now.
By the way, here is a small script that helps to copy "DP/AF" in FORMAT to INFO field, However, I still hope by default, pcgr will extract info from FORMAT, and I believe this will cover 90% real cases of somatic variant calling. Simply, you can also allow users to specify the tag location, such as --tumor_dp_tag FOMAT/DP --tumor_af_tag INFO/AF

from pysam import VariantFile

def reformat_vcf(vcf_file, out):
    with VariantFile(vcf_file) as fr:
        header = fr.header
        header.info.add('TDP', number=1, type='Integer', description='Tumor sample depth')
        header.info.add('NDP', number=1, type='Integer', description='Normal sample depth')
        header.info.add('TAF', number=1, type='Float', description='Tumor sample AF')
        header.info.add('NAF', number=1, type='Float', description='Normal sample AF')

        with VariantFile(out, 'w', header=header) as fw:
            for record in fr:
                record.info['TDP'] = record.samples[1]['DP']
                record.info['NDP'] = record.samples[0]['DP']
                record.info['TAF'] = record.samples[1]['AF'][0]
                record.info['NAF'] = record.samples[0]['AF'][0]
                fw.write(record)
sigven commented 3 years ago

Hi @gudeqing,

Cool! Very nice small script for this purpose. Although I agree that retrieving this information directly from the FORMAT column would by far be most convenient for most people, the fact that there is no concensus (among callers) on how to format sample-specific variant data in the FORMAT column makes it inevitably fairly challenging to parse this information robustly. This has at least been my experience over the years, I have seen callers with poor consistency in their way of formatting. E.g. how can one be sure that record.samples[1] denotes the tumor sample? And how do you ensure that the record.samples[1]['AF'][0] is a single value and not a comma-delimited value (for multiallelic calls)? So IMO there are some potential pitfalls here, but if you can demonstrate that a robust parsing scheme for this across the most common somatic callers, I am happy to have a look at it for integration into PCGR :-)

gudeqing commented 3 years ago

Hi, @sigven ,

First, pcgr requires the user to provide tumor sample id, thus we could use this ID to differentiate tumor sample from normal sample. Second, in most times, we can guess which sample is tumor sample by comparing AF between two samples.

Here is a small script that first splits multi-allelic records, autodetects tumor sample, and reformats INFO field. I have tested this script with 28 vcf files, tumor sample is 100% correctly guessed.

from pysam import VariantFile
import os

def reformat_vcf(vcf_file, out, reference, tumor_sample=None):
    """
    [split and left-normalization] -> [re-calculate AF] -> [bgzip and tabix]
    :param vcf_file:
    :param out:
    :param reference:
    :param tumor_sample: tumor sample name or position, if the second sample is tumor, then it is 1 else 0
    :return:
    """
    os.system(f'bcftools norm -f {reference} -m -both {vcf_file} -o tmp_.vcf')
    with VariantFile('tmp_.vcf') as fr:
        header = fr.header
        header.info.add('TDP', number=1, type='Integer', description='Tumor sample depth')
        header.info.add('NDP', number=1, type='Integer', description='Normal sample depth')
        header.info.add('TAF', number=1, type='Float', description='Tumor sample AF')
        header.info.add('NAF', number=1, type='Float', description='Normal sample AF')
        samples = list(header.samples)

        if tumor_sample is not None:
            if tumor_sample not in [0, 1, '0', '1']:
                if tumor_sample in samples:
                    tumor_idx = samples.index(tumor_sample)
                    normal_idx = 1 - tumor_idx
                else:
                    raise Exception(f'{tumor_sample} is not in samples {samples} recorded in vcf')
            else:
                tumor_idx = int(tumor_sample)
                normal_idx = 1 - tumor_idx
        else:
            tumor_idx = guess_tumor_idx(vcf_file)
            normal_idx = 1 - tumor_idx

        with VariantFile(out, 'w', header=header) as fw:
            for record in fr:
                record.info['TDP'] = record.samples[tumor_idx]['DP']
                record.info['NDP'] = record.samples[normal_idx]['DP']
                # re-calculate AF since caller like sentieon may report AF that is not consistent with AD info
                record.info['TAF'] = round(record.samples[tumor_idx]['AD'][1]/record.samples[tumor_idx]['DP'], 3)
                record.info['NAF'] = round(record.samples[normal_idx]['AD'][1]/record.samples[normal_idx]['DP'], 3)
                fw.write(record)

    os.remove('tmp_.vcf')
    os.system(f'bgzip {out}')
    os.system(f'tabix {out}.gz')

def guess_tumor_idx(vcf_file):
    tumor_is_first = 0
    tumor_is_second = 0

    with VariantFile(vcf_file) as fr:
        samples = list(fr.header.samples)
        formats = list(fr.header.formats)
        if 'AF' not in formats:
            raise Exception('No AF in format info to detect tumor sample')
        for record in fr:
            if record.samples[0]['AF'][0] > record.samples[1]['AF'][0]:
                tumor_is_first += 1
            else:
                tumor_is_second += 1
    tumor_idx = tumor_is_second >= tumor_is_first
    print(f'we guess tumor sample is {samples[tumor_idx]} ')
    return tumor_idx

if __name__ == '__main__':
    from xcmds import xcmds
    xcmds.xcmds(locals())
sigven commented 2 years ago

Hi @gudeqing, I will follow up now on the great contribution you have done here. I am in the process of updating PCGR, and will be looking into how your work could be incorporated, either fully, or as a utility that can be offered to the users.