I have written some tests which break the current implementation of vcf rebasing in the discover command.
For the record, vcf rebasing expresses the variation discovered on top of an inferred reference (coming out of quasimap and then infer) in original prg coordinates so that we can augment the prg.
Here is a summary of two tests which break the current implementation. In the code, the secondary_vcf_record is the record expressed against inferred reference coordinates and which we rebase to base reference coordinates.
I would like to make sure what output we expect.
def test_RecordOverlapsVariantAndNonVariantSites(self):
"""
A test case where the variation on top of the inferred reference overlaps: a non-variant site, a variant site,
and a non-variant site in the prg.
What we need is for the rebased REF to include all three sites.
"""
# base reference: T TAT CGG
# inferred reference: T G CGG
secondary_vcf_record = _MockVcfRecord(POS=1, REF='TGCG', ALT=['GGCT'])
expected_POS = 1; expected_REF = 'TTATCG'; expected_ALT = ['GGCT']
secondary_regions = discover._flag_personalised_reference_regions(base_records, secondary_reference_length)
new_vcf_record = discover._rebase_vcf_record(secondary_vcf_record, secondary_regions)
self.assertEqual(new_vcf_record.REF, expected_REF)
Value
Expected output
Cur. Implementation Output
POS
1
1
REF
'TTATCG'
'TGCG'
ALT
'GGCT'
'GGCT'
def test_SNP_OnTopOfIndel(self):
"""
A test case where we find a SNP atop the inferred reference inside an insertion.
What we need is for the rebased ALT to include the flanking bases which are not present in the base sequence.
"""
# base reference: T TAT CGG T A
# inferred reference: T G CGG TCTGC A
secondary_reference_length = 8
base_records = [
_MockVcfRecord(POS=2, REF="TAT", ALT=['G']),
_MockVcfRecord(POS=8, REF="T", ALT=['TCTGC'])
]
secondary_vcf_record = _MockVcfRecord(POS=9, REF='G', ALT=['A'])
secondary_regions = discover._flag_personalised_reference_regions(base_records, secondary_reference_length)
new_vcf_record = discover._rebase_vcf_record(secondary_vcf_record, secondary_regions)
expected_POS = 8; expected_REF = 'T'; expected_ALT = 'TCTAC'
self.assertEqual(new_vcf_record.ALT, expected_REF)
Value
Expected output
Cur. Implementation Output
POS
8
1
REF
'T'
'T'
ALT
'TCTAC'
'TAGC'
I am working on an implementation which could solve this.
I have written some tests which break the current implementation of vcf rebasing in the
discover
command.For the record, vcf rebasing expresses the variation discovered on top of an inferred reference (coming out of
quasimap
and theninfer
) in original prg coordinates so that we can augment the prg.Here is a summary of two tests which break the current implementation. In the code, the
secondary_vcf_record
is the record expressed against inferred reference coordinates and which we rebase to base reference coordinates.I would like to make sure what output we expect.
I am working on an implementation which could solve this.