lenaschimmel / sc2rf

SARS-Cov-2 Recombinant Finder for fasta sequences
MIT License
48 stars 13 forks source link

Terminal Ns not recognized as missing #29

Open ktmeaton opened 2 years ago

ktmeaton commented 2 years ago

While investigating https://github.com/cov-lineages/pango-designation/issues/590, I noticed that samples with the BA.2 S2M deletion (29734:29759) were being incorrectly visualized as having reference bases in sc2rf:

Consensus View: image

sc2rf View: image

I think this could be for a couple of reasons:

  1. When --enable-deletions is used, perhaps deletions should not be considered missing data?

    missings_matches = ["N"]
    if not args.enable_deletions:
        missings_matches.append("-")
  2. I think there is missing logic when detecting a run of Ns, to catch if that runs proceeds to the end of the genome?

    if s in missings_matches:
        # we've been tracking a run of N's, this base marks the end              
        if start_n == -1:
            start_n = i  # mark the start of possible run of N's
    elif start_n >= 0:
        missings.append((start_n, i-1))  # Python-style (closed, open) interval
        start_n = -1
    
    # Missing logic to catch missing data at the end of the genome?
    if i == len(reference) and s in missings_matches:
        missings.append((start_n, i-1))

With these changes, the sc2rf output more closely matches the consensus sequence/my expectation:

image

I think this is a bug, but if it's the intended behaviour for deletions, please let me know. Thanks!

ktmeaton commented 2 years ago

Sorry, the code for both of these is:

python3 sc2rf.py pango_designation_590.fasta --ansi --unique 1 --enable-deletions