samtools / bcftools

This is the official development repository for BCFtools. See installation instructions and other documentation here http://samtools.github.io/bcftools/howtos/install.html
http://samtools.github.io/bcftools/
Other
663 stars 240 forks source link

Inverse consensus VCF #2096

Closed mbhall88 closed 7 months ago

mbhall88 commented 7 months ago

Hi,

I have applied the variants from a VCF to a reference with bcftools consensus and was wondering if you know of a way to get the "inverse VCF" - i.e. the VCF that would return the consensus sequence back to the original reference? The REF and ALT would be swapped, but the POS will obviously change.

It would nice to have this as an optional output to bcftools consensus if that is within scope?

pd3 commented 7 months ago

What is the motivation for the request?

There is currently no plan to support that, but it should not be difficult to write a script that takes the chain file produced with -c and the fasta file and converts it to a VCF. If it's well written, I am happy to include it among bcftools/misc scripts.

mbhall88 commented 7 months ago

The motivation is that I have applied some variants to a reference and then I want to call variants with respect to the other genome and need the "truth" VCF - i.e. the variants we expect to find.

It ended up being a reasonably easy script to write, as you say. Though I didn't use (or know about) the chain file. Is there somewhere that documents the chain file and its format?

pd3 commented 7 months ago

The format is documented fore example here https://genome.ucsc.edu/goldenPath/help/chain.html

mbhall88 commented 7 months ago

The script I came up with is essentially this python function

from collections import defaultdict

def inverse_vcf(invcf: str, outvcf: str) -> None:
    """This function takes a VCF that was used to generate a consensus sequence with bcftools
    consensus and provides the inverse - i.e., the VCF that would turn the consensus
    sequence back into the original reference sequence.
    """
    with open(outvcf, "w") as f_out, gzip.open(invcf, "rt") as f_in:
        offset_counter = defaultdict(int)
        for line in f_in:
            if line.startswith("#"):
                print(line.strip(), file=f_out)
                continue

            fields = line.strip().split("\t")
            ref = fields[3]
            alt = fields[4]
            pos = int(fields[1])
            chrom = fields[0]
            offset = offset_counter[chrom]
            new_alt = ref
            new_ref = alt
            new_pos = pos + offset
            record_offset = len(alt) - len(ref)
            offset_counter[chrom] += record_offset
            fields[1] = str(new_pos)
            fields[3] = new_ref
            fields[4] = new_alt
            print("\t".join(fields), file=f_out)