hsinnan75 / MapCaller

MapCaller – An efficient and versatile approach for short-read alignment and variant detection in high-throughput sequenced genomes
MIT License
30 stars 5 forks source link

Suggestion: VCF Output Formatting for RTG #19

Closed holtjma closed 4 years ago

holtjma commented 4 years ago

Hello, I'm currently testing MapCaller and using RTG vcfeval to evaluate it. Unfortunately, it doesn't work out-of-the-box due to using an unexpected VCF format. Here is the exact error message I'm receiving:

Error: Invalid VCF header. VCF header missing Number declaration on line:##INFO=<ID=TYPE,Type=String,Description="The type of allele, either SUBSTITUTE, INSERT, DELETE, or BND.">

I suspect (by viewing the VCF) that there will be other issues as well such as the location of the GT field (it's under INFO instead of a sample) that will also cause evaluation tools like RTG to fail.

hsinnan75 commented 4 years ago

Could you please show me how you modified the headers of MapCaller's VCF for evaluation? Thank you!

holtjma commented 4 years ago

Yep, here the python function I threw together to make a VCF file that worked (there are details that would probably need correcting in production). I should note I didn't bother making a GQ field, but that's also something that would generally be good to have for the calls.

def createRTG_VCF(vcfFN, sampleName, outFN):
    '''
    This will read a MapCaller VCF and reformat for RTG. Also strips out BND calls for now.
    @param vcfFN - the filename of the "vcf" made by MapCaller
    @param sampleName - the sample name to use
    @param outFN - the output filename for a reformatted VCF
    '''
    i = 0
    fpi = open(vcfFN, 'r')
    fpo = open(outFN, 'w+')

    for l in fpi:
        if i % 100000 == 0:
            print('Processing line %d...' % (i, ))
        i += 1

        if (l.startswith('##fileformat') or 
            l.startswith('##reference') or
            l.startswith('##source') or 
            l.startswith('##CommandLine') or
            l.startswith('##Contig') or
            l.startswith('##FILTER')):
            fpo.write(l)
        elif l.startswith('##'):
            #print('IGNORE: '+l.rstrip())
            pass
        elif l.startswith('#'):
            #add in all the extra headers here
            fpo.write('##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">\n')
            fpo.write('##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Depth">\n')
            fpo.write('##FORMAT=<ID=AD,Number=.,Type=Integer,Description="Allele Depth">\n')
            #CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  SL362490
            fpo.write('#'+'\t'.join(['CHROM', 'POS', 'ID', 'REF', 'ALT', 'QUAL', 'FILTER', 'INFO', 'FORMAT', sampleName])+'\n')
        else:
            pieces = l.rstrip().split('\t')
            chrom = pieces[0]
            pos = pieces[1]
            varID = pieces[2]
            ref = pieces[3]
            alt = pieces[4]
            qual = pieces[5]
            filterVal = pieces[6]
            origInfo = pieces[7]
            infoDict = {}
            for subStr in origInfo.split(';'):
                label, value = subStr.split('=')
                infoDict[label] = value

            if infoDict['TYPE'] == 'BND':
                continue

            #now the things we fill in
            infoField = '.'
            formatField = 'GT:DP:AD'
            newGT = infoDict.get('GT', './.').replace('|', '/')
            newDP = infoDict['DP']
            newAD = str(int(infoDict['DP']) - int(infoDict['AD']))+','+infoDict['AD']
            sampleField = ':'.join([newGT, newDP, newAD])

            fpo.write('\t'.join([chrom, pos, varID, ref, alt, qual, filterVal, infoField, formatField, sampleField])+'\n')

    fpo.close()
    fpi.close()