iqbal-lab-org / gramtools

Genome inference from a population reference graph
MIT License
92 stars 15 forks source link

Vcf rebasing in `discover` is not exhaustive #143

Closed bricoletc closed 5 years ago

bricoletc commented 5 years ago

I have written some tests which break the current implementation of vcf rebasing in the discover command.

For the record, vcf rebasing expresses the variation discovered on top of an inferred reference (coming out of quasimap and then infer) in original prg coordinates so that we can augment the prg.

Here is a summary of two tests which break the current implementation. In the code, the secondary_vcf_record is the record expressed against inferred reference coordinates and which we rebase to base reference coordinates.

I would like to make sure what output we expect.

    def test_RecordOverlapsVariantAndNonVariantSites(self):
        """
        A test case where the variation on top of the inferred reference overlaps: a non-variant site, a variant site,
        and a non-variant site in the prg.

        What we need is for the rebased REF to include all three sites.
        """
        # base reference:     T TAT CGG
        # inferred reference: T G   CGG

        secondary_vcf_record = _MockVcfRecord(POS=1, REF='TGCG', ALT=['GGCT'])

        expected_POS = 1; expected_REF = 'TTATCG'; expected_ALT = ['GGCT']

        secondary_regions = discover._flag_personalised_reference_regions(base_records, secondary_reference_length)
        new_vcf_record = discover._rebase_vcf_record(secondary_vcf_record, secondary_regions)

        self.assertEqual(new_vcf_record.REF, expected_REF)
Value Expected output Cur. Implementation Output
POS 1 1
REF 'TTATCG' 'TGCG'
ALT 'GGCT' 'GGCT'
    def test_SNP_OnTopOfIndel(self):
        """
        A test case where we find a SNP atop the inferred reference inside an insertion.

        What we need is for the rebased ALT to include the flanking bases which are not present in the base sequence.
        """
        # base reference:     T TAT CGG T      A
        # inferred reference: T G   CGG TCTGC A
        secondary_reference_length = 8
        base_records = [
            _MockVcfRecord(POS=2, REF="TAT", ALT=['G']),
            _MockVcfRecord(POS=8, REF="T", ALT=['TCTGC'])
        ]

        secondary_vcf_record = _MockVcfRecord(POS=9, REF='G', ALT=['A'])

        secondary_regions = discover._flag_personalised_reference_regions(base_records, secondary_reference_length)
        new_vcf_record = discover._rebase_vcf_record(secondary_vcf_record, secondary_regions)

        expected_POS = 8; expected_REF = 'T'; expected_ALT = 'TCTAC'

        self.assertEqual(new_vcf_record.ALT, expected_REF)
Value Expected output Cur. Implementation Output
POS 8 1
REF 'T' 'T'
ALT 'TCTAC' 'TAGC'

I am working on an implementation which could solve this.