phiweger / zoo

A portable datastructure for rapid prototyping in (viral) bioinformatics (under development).

large sequences #58

Open phiweger opened 7 years ago

phiweger commented 7 years ago

We are likely to run into genomes too large to store efficiently in MongoDB, given its maximum document size of 16 MB.

Options:

- GridFS?
- a supplementary FASTA file with the primary document key as header

I prefer the latter, because it seems less dependent on the db architecture.

We could easily implement random access to the FASTA file (or use pyfaidx), so retrieving a given sequence would be quite efficient once we're computing stuff.
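A minimal sketch of what the hand-rolled random access could look like, assuming each FASTA header starts with the primary document key. The function names `build_index` and `fetch` and the example keys are made up for illustration and are not part of zoo. pyfaidx would be the off-the-shelf alternative.

```python
import os
import tempfile


def build_index(path):
    """Map each FASTA key (first word after '>') to the byte offset of its sequence."""
    index = {}
    with open(path, "rb") as fh:
        while True:
            line = fh.readline()
            if not line:
                break
            if line.startswith(b">"):
                parts = line[1:].split()
                if parts:
                    # The sequence starts right after the header line.
                    index[parts[0].decode("ascii")] = fh.tell()
    return index


def fetch(path, index, key):
    """Seek to the record for `key` and read lines until the next header or EOF."""
    chunks = []
    with open(path, "rb") as fh:
        fh.seek(index[key])
        for line in fh:
            if line.startswith(b">"):
                break
            chunks.append(line.strip())
    return b"".join(chunks).decode("ascii")


if __name__ == "__main__":
    # Hypothetical keys standing in for MongoDB document keys.
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "supplement.fa")
        with open(path, "w") as fh:
            fh.write(">5a1f example record\nACGTACGT\nTTGA\n>5a20\nGGGCCC\n")
        index = build_index(path)
        print(fetch(path, index, "5a1f"))
```

The index only stores one integer per record, so it could live in memory or next to the FASTA file, and a lookup reads just the lines of the requested record rather than the whole file.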

phiweger commented 7 years ago

If seq is empty:

- link: accession / filepath, UUID from the _id field

Links to related sequences:

- link: accession / filepath, someid, description
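One possible reading of these notes as plain Python dicts, purely as a sketch. The field names (`seq`, `link`, `related`, `uuid`, `some_id`) and the helpers `make_stub`, `add_related`, and `needs_external_lookup` are assumptions for illustration, not an agreed schema, and the accessions below are placeholders.

```python
import uuid


def make_stub(accession, filepath):
    """Build a document whose sequence is stored outside the database."""
    doc_id = str(uuid.uuid4())
    return {
        "_id": doc_id,
        "seq": "",  # empty: the sequence lives in the linked file
        "link": {
            "accession": accession,
            "filepath": filepath,
            "uuid": doc_id,  # same UUID as _id, used as the FASTA header
        },
        "related": [],
    }


def add_related(doc, accession, filepath, some_id, description):
    """Append a link to a related sequence (hypothetical field layout)."""
    doc["related"].append(
        {
            "accession": accession,
            "filepath": filepath,
            "some_id": some_id,
            "description": description,
        }
    )
    return doc


def needs_external_lookup(doc):
    """True when seq is empty and the sequence must be fetched via the link."""
    return not doc.get("seq")


if __name__ == "__main__":
    doc = make_stub("ACC0001", "supplement.fa")  # placeholder accession
    add_related(doc, "ACC0002", "supplement.fa", "some-id", "related segment")
    print(needs_external_lookup(doc), doc["link"]["uuid"] == doc["_id"])
```

Keeping the link as a small subdocument means the record stays far below the document size limit regardless of genome length, and the UUID doubles as the lookup key in the supplementary FASTA file.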