jeromekelleher / sc2ts

Infer a succinct tree sequence from SARS-COV-2 variation data
MIT License

Refactor: change import-fasta fasta-to-zarr #37

Closed by jeromekelleher 1 year ago

jeromekelleher commented 1 year ago

Converting to a per-day .samples file is problematic: we have to do the QC filtering when we create the samples file, and information is lost when sites are masked out. To represent the same data under different filtering choices, we would need to create multiple converted files.

It would be better to do a complete, lossless conversion of the FASTA alignments into a single large zarr that we append to sequentially. The zarr would contain the exact alignments as numpy string arrays. So we'd have:

This lets us integrate the inference and filtering steps much more tightly, and avoids having to regenerate hundreds of .samples files each time.

It also lets us avoid a file dependency on tsinfer's .samples, which we don't want.
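
For illustration, here is a minimal sketch of the append-only, lossless layout proposed above. It uses an in-memory numpy array as a stand-in for the zarr, and everything in it (`parse_fasta`, `AlignmentStore`, the `masked` method, the toy 8-site alignments) is hypothetical rather than actual sc2ts code. The point it shows is that the stored alignments stay exact, and site masking is applied only when the data is read for inference.

```python
import numpy as np


def parse_fasta(text):
    """Yield (name, sequence) pairs from FASTA-formatted text."""
    name, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            name, chunks = line[1:], []
        else:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


class AlignmentStore:
    """Hypothetical append-only store of exact alignments (stand-in for a zarr)."""

    def __init__(self, num_sites):
        self.names = []
        # One row per sample, one single-byte string per site.
        self.alignments = np.empty((0, num_sites), dtype="S1")

    def append(self, fasta_text):
        """Append every alignment in fasta_text without any filtering."""
        rows = []
        for name, seq in parse_fasta(fasta_text):
            row = np.frombuffer(seq.encode("ascii"), dtype="S1")
            if row.shape[0] != self.alignments.shape[1]:
                raise ValueError(f"{name}: unexpected alignment length")
            self.names.append(name)
            rows.append(row)
        if rows:
            self.alignments = np.concatenate([self.alignments, np.stack(rows)])

    def masked(self, sites, mask_char=b"N"):
        """Return a masked copy; the stored alignments are left untouched."""
        out = self.alignments.copy()
        out[:, sites] = mask_char
        return out


# Two "daily" batches appended to the same store.
store = AlignmentStore(num_sites=8)
store.append(">s1\nACGTACGT\n>s2\nACGTNCGT\n")
store.append(">s3\nAC-TACGT\n")

# Masking is a read-time decision, so different filters reuse the same data.
masked = store.masked([0, 7])
```

Because masking happens on a copy, trying a different set of problematic sites only needs another call to `masked`, not a new conversion of the FASTA files.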

jeromekelleher commented 1 year ago

Closing this as not quite relevant. After investigation, dealing with zarr's data integrity issues wasn't worth the extra compression, so we went with a simpler option based on a key-value store for storing the alignments.
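
To give a rough idea of what a key-value approach to storing alignments can look like, here is a small sketch. The comment above does not say which store was chosen, so the use of the standard library `dbm` module, the `bz2` compression, and the helper names `store_alignment` and `load_alignment` are all assumptions made for illustration only.

```python
import bz2
import dbm
import os
import tempfile


def store_alignment(db, strain, sequence):
    """Store one alignment, compressed, under its strain name (assumed key)."""
    db[strain.encode("utf-8")] = bz2.compress(sequence.encode("ascii"))


def load_alignment(db, strain):
    """Return the exact alignment string stored for a strain."""
    return bz2.decompress(db[strain.encode("utf-8")]).decode("ascii")


with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "alignments.db")

    # First batch: create the store and add two alignments.
    with dbm.open(path, "c") as db:
        store_alignment(db, "sample_1", "ACGTACGT")
        store_alignment(db, "sample_2", "ACGTNCGT")

    # Later batch: reopen the same store and append another alignment.
    with dbm.open(path, "c") as db:
        store_alignment(db, "sample_3", "AC-TACGT")
        restored = load_alignment(db, "sample_1")
        num_stored = len(db.keys())
```

Each alignment is an independent record here, so appending a new batch never rewrites existing entries, at the cost of compressing every sequence on its own.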