We are likely to walk into genomes too large to efficiently store in MongoDB w/ its max doc size of 16 mb.
GridFS?
supplementary fasta with the primary document key as header
I prefer the latter, because it seems less dependent on the db architecture.
We could implement random access to fasta (or use pyfaidx) easily, so would be quite efficient to retrieve a given sequence once we're computing stuff.
We are likely to walk into genomes too large to efficiently store in MongoDB w/ its max doc size of 16 mb.
I prefer the latter, because it seems less dependent on the db architecture.
We could implement random access to fasta (or use pyfaidx) easily, so would be quite efficient to retrieve a given sequence once we're computing stuff.