iqbal-lab-org / gramtools

Genome inference from a population reference graph
MIT License

Consider storing per-base coverage info in uint16_t #106

Closed iqbal-lab closed 6 years ago

iqbal-lab commented 6 years ago
  1. Genome sequencing depth is generally <60x for big genomes, <100x for bacterial genomes and <10000x for viral genomes. Repetitive copy number is (roughly) highest in big, and lowest in viral genomes.
  2. We cannot draw good inferences in repeat regions, so they will either be low confidence or masked entirely.

So we might as well store coverage with a cap and stop counting above that ceiling, e.g. use uint16_t, whose maximum value is 65535.

We did something like this in cortex here: https://github.com/iqbal-lab/cortex/blob/59658afe054e3a4b3854dc57954547be5383a9e9/src/cortex_var/many_colours/element.c#L421

This could save us a lot of RAM.
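
A minimal sketch of the capped counter idea, for illustration only. The names `BaseCoverage`, `kCoverageCap`, `saturating_increment` and `add_read_coverage` are hypothetical and are not taken from the gramtools or cortex source; the point is just that the increment stops at the uint16_t maximum instead of wrapping around to 0.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <limits>
#include <vector>

// Hypothetical names throughout; not the actual gramtools types.
using BaseCoverage = std::uint16_t;

// Largest value a uint16_t can hold (65535); counts stop here.
constexpr BaseCoverage kCoverageCap = std::numeric_limits<BaseCoverage>::max();

// Increment a counter, but stay at the cap instead of wrapping to 0.
inline void saturating_increment(BaseCoverage &count) {
    if (count < kCoverageCap) {
        ++count;
    }
}

// Record one read covering the half-open range [start, end) of bases.
inline void add_read_coverage(std::vector<BaseCoverage> &coverage,
                              std::size_t start, std::size_t end) {
    assert(start <= end && end <= coverage.size());
    for (std::size_t i = start; i < end; ++i) {
        saturating_increment(coverage[i]);
    }
}
```

Each base then costs 2 bytes, so compared with a 64-bit counter (8 bytes) the per-base coverage storage would be roughly 4x smaller. Any base sitting at the cap should be read as "at least 65535", not as an exact count.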

iqbal-lab commented 6 years ago

This has been discussed previously; I just wanted an issue so we don't lose it. There may well be more sophisticated things we could do, either using SDSL, or using uint16_t during the initial quasimap and then a larger int when re-mapping (though I'm not at all sure the latter is worth it).

ffranr commented 6 years ago

Implemented here: https://github.com/iqbal-lab-org/gramtools/commit/442ddfe7d68fca86ad97a123792673edff815b04

Leaving issue open for a few days.

iqbal-lab commented 6 years ago

hey this is exciting