jeromekelleher / sc2ts

Infer a succinct tree sequence from SARS-COV-2 variation data
MIT License

Preprocess daily batch to find identical sequences? #158

Open jeromekelleher opened 1 year ago

jeromekelleher commented 1 year ago

It may be worth preprocessing each daily batch of sequences to find those that are identical. It might look like:

  1. Bin all the sequences in the batch by key (the masked sequence) into a defaultdict
  2. Run matching just on the keys
  3. Postprocess the path groupings to add all the identical sequences in each group, with the same values as the exemplar.
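
A minimal sketch of these three steps, not taken from the sc2ts code base. The names `group_identical`, `match_batch` and `match_func` are hypothetical, and it assumes each masked sequence is hashable (a `str` or `bytes`; a NumPy array would first need converting, e.g. with `.tobytes()`):

```python
from collections import defaultdict


def group_identical(batch):
    """Step 1: bin sample IDs by their masked sequence."""
    groups = defaultdict(list)
    for sample_id, masked_sequence in batch:
        groups[masked_sequence].append(sample_id)
    return groups


def match_batch(batch, match_func):
    """Match each distinct sequence once and share the result with duplicates.

    `batch` is an iterable of (sample_id, masked_sequence) pairs and
    `match_func` is a stand-in for whatever performs the real matching.
    """
    groups = group_identical(batch)
    # Step 2: run matching only on the keys (the distinct sequences).
    result_by_sequence = {seq: match_func(seq) for seq in groups}
    # Step 3: give every sample in a group the exemplar's result.
    return {
        sample_id: result_by_sequence[seq]
        for seq, sample_ids in groups.items()
        for sample_id in sample_ids
    }
```

Everything here is local to one call, so no state is carried between batches. Note that duplicates share the same result object, so it would need copying if results are later modified per sample.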

Older code did something like this, but kept running track of all identical sequences. Doing it per batch would be simpler and would not require state across calls.

It's unclear whether the number of identical sequences in a daily batch would actually warrant this - I suspect it would be a small performance boost.