While investigating https://github.com/cov-lineages/pango-designation/issues/590, I noticed that samples with the BA.2 S2M deletion (29734:29759) were being incorrectly visualized as having reference bases in sc2rf:

Consensus View:

sc2rf View:

I think this could be for a couple of reasons:

1. When `--enable-deletions` is used, perhaps deletions should not be considered missing data?

    ```python
    missings_matches = ["N"]
    if not args.enable_deletions:
        missings_matches.append("-")
    ```

2. I think there is missing logic when detecting a run of Ns, to catch the case where that run proceeds to the end of the genome?

    ```python
    if s in missings_matches:
        # we've been tracking a run of N's, this base marks the end
        if start_n == -1:
            start_n = i  # mark the start of possible run of N's
    elif start_n >= 0:
        missings.append((start_n, i-1))  # Python-style (closed, open) interval
        start_n = -1
    # Missing logic to catch missing data at the end of the genome?
    if i == len(reference) and s in missings_matches:
        missings.append((start_n, i-1))
    ```

With these changes, the sc2rf output more closely matches the consensus sequence/my expectation:

I think this is a bug, but if it's the intended behaviour for deletions, please let me know. Thanks!
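
The two proposed changes can be sketched together as a small standalone function. This is only an illustration, not sc2rf's actual code: the function name `find_missing_runs`, its `enable_deletions` parameter, and the 1-based inclusive `(start, end)` convention are assumptions made for the example. Under that convention, a run that reaches the last base is closed at the final position itself, so the end-of-genome case may need `i` rather than `i-1`.

```python
def find_missing_runs(sequence, enable_deletions=False):
    """Return runs of missing data as 1-based, inclusive (start, end) tuples.

    Hypothetical helper for illustration; names and conventions are assumed.
    """
    # "N" is always missing data; "-" only counts as missing when
    # deletions are NOT being treated as real data.
    missing_chars = {"N"}
    if not enable_deletions:
        missing_chars.add("-")

    runs = []
    start = -1  # -1 means "not currently inside a run"
    for i, base in enumerate(sequence, 1):  # 1-based positions (assumed)
        if base in missing_chars:
            if start == -1:
                start = i  # first base of a new run
        elif start != -1:
            runs.append((start, i - 1))  # previous base was the last of the run
            start = -1

    # A run still open after the loop extends to the end of the sequence.
    if start != -1:
        runs.append((start, len(sequence)))
    return runs


if __name__ == "__main__":
    print(find_missing_runs("ACNNGT--NN"))
    print(find_missing_runs("ACNNGT--NN", enable_deletions=True))
```

Closing the open run after the loop, rather than testing for the last index inside it, avoids depending on whether positions are counted from 0 or 1.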