jeromekelleher opened 11 months ago
These ideas are not in conflict with also being able to scale work out across processors and servers, and they can apply even to messy data. There was a lot of effort in the Hadoop ecosystem to identify compression codecs that were splittable (our friend Tom White wrote about the topic in his book) and that had the right tradeoff between computation and storage efficiency (e.g. Snappy was an improvement at the time). Much of the work since then has gone into using instruction set extensions to build hardware-friendly codecs, and into algorithms that operate directly on compressed data, an idea discussed as far back as "Data compression and database performance" (1991), for example.
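As a toy illustration of the "operate directly on compressed data" idea (not any particular codec's or library's API), here is a minimal sketch of aggregating over a run-length-encoded column without materialising the decompressed array; the function names and the genotype example are made up for illustration:

```python
import numpy as np

def rle_encode(values):
    """Run-length encode a 1D array into (run_values, run_lengths)."""
    values = np.asarray(values)
    change = np.flatnonzero(values[1:] != values[:-1]) + 1
    starts = np.concatenate(([0], change))
    ends = np.concatenate((change, [len(values)]))
    return values[starts], ends - starts

def count_value_rle(run_values, run_lengths, target):
    """Count occurrences of `target` using only the run representation."""
    return int(run_lengths[run_values == target].sum())

# Hypothetical example: counting reference calls in a genotype column.
genotypes = np.array([0, 0, 0, 1, 1, 0, 0, 2, 2, 2, 0, 0])
vals, lens = rle_encode(genotypes)
assert count_value_rle(vals, lens, 0) == np.count_nonzero(genotypes == 0)
```

The point is just that when runs are long (as they often are for sparse genotype data), the aggregation touches far fewer elements than the decompressed column would contain, which is the same intuition the database work exploits with hardware-friendly encodings.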
This was a nice prompt to scan for recent work on this topic in the databases world; https://github.com/maxi-k/btrblocks looks quite interesting!
Loh et al. argue for the idea of compressive genomics, and follow up with the idea of compressive acceleration.
These are attractive ideas, but they only work in certain situations and on cleaned-up data. We will always start out with messy variant calls, and we need a software stack and data structures to work with those.