nextstrain / tb

Nextstrain build for tuberculosis
https://nextstrain.org/tb

Add one to each BED mask interval to work with corrected augur mask #4

Open huddlej opened 4 years ago

huddlej commented 4 years ago

augur mask now reads BED files following the standard convention of zero-indexed, half-open intervals, such that the last value in each interval is not included in the coordinates [1]. This commit updates the mask BED file for this build to increment each interval by one to compensate for this change in augur mask.

[1] https://github.com/nextstrain/augur/pull/512#issuecomment-608962457

huddlej commented 4 years ago

Ok, this took four attempts, but I think I've worked it out. The change here is simple but the reasoning involves annoying coordinate bookkeeping. Here is an example.

In the original augur mask implementation the following BED file,

SEQ    3    5

was converted to 1-indexed positions 3, 4, 5.

Under the standard BED file format, these coordinates should instead be read as the 0-indexed positions 3, 4. These correspond to the following 1-indexed positions that vcftools would expect: 4, 5.

To get the expected 1-indexed positions for vcftools from a BED file, we need to decrement the interval start by 1:

SEQ    2    5

This produces the 0-indexed positions of 2, 3, 4 and the 1-indexed positions of 3, 4, 5.
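
The bookkeeping above can be checked with a short Python sketch. This is only an illustration of the arithmetic, not code from augur or vcftools; the function names `bed_to_positions` and `legacy_to_standard_bed` are hypothetical and exist only for this example.

```python
def bed_to_positions(start, end):
    """Expand a standard BED interval into the positions it covers.

    BED intervals are 0-indexed and half-open: `start` is included,
    `end` is not. Returns (zero_indexed, one_indexed) position lists.
    """
    zero_indexed = list(range(start, end))
    one_indexed = [pos + 1 for pos in zero_indexed]
    return zero_indexed, one_indexed


def legacy_to_standard_bed(start, end):
    """Convert an interval written for the original behavior
    (1-indexed, both ends included) into a standard BED interval
    that covers the same 1-indexed positions.

    Only the start changes: it is decremented by 1.
    """
    return start - 1, end


# "SEQ 3 5" read as standard BED.
print(bed_to_positions(3, 5))  # ([3, 4], [4, 5])

# "SEQ 3 5" adjusted to "SEQ 2 5" so it still masks 1-indexed 3, 4, 5.
print(bed_to_positions(*legacy_to_standard_bed(3, 5)))  # ([2, 3, 4], [3, 4, 5])
```

Only the start moves because the old inclusive end and the new exclusive end happen to name the same last 1-indexed position.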

genehack commented 4 days ago

@huddlej @emmahodcroft Is this still relevant, or can this old PR be closed out?