Open jsaquing opened 3 years ago
How much of an issue is this? Does this fall under "nice to optimize" or "we have to do this because we are dead in the water"?
It's something I'd like to try optimizing eventually, but it isn't necessary right now (hence the postponed label). I just wanted to write this down for future reference.
`Exon` objects in Biosurfer are considered different if they belong to different `Transcript` objects, even if they have the same coordinates. However, this leads to duplication of information in the underlying SQL database's exon table and long `insert` query times when populating the database. (I haven't measured `select` query times specifically, but I imagine they are impacted as well.)

A possible solution might be to have the exon table only store a single row for each unique exon (as defined by its genomic coordinates), while somehow still mapping each `Exon` object to a combination of a genomic exon and a transcript from the database. It seems like this can be accomplished by having a transcript-exon association table (where column 1 holds transcript IDs, column 2 holds exon IDs, and each row represents a transcript-includes-exon relationship) and then mapping `Exon` against a join of the association and exon tables, as per [1].
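For future reference, here is a rough sketch of the table layout described above, using only Python's standard-library `sqlite3` rather than Biosurfer's actual ORM models. All table names, column names, and helper functions (`genomic_exon`, `transcript_exon`, `add_exon`, `exons_of`) are hypothetical and chosen just for illustration; the real mapping would presumably be done at the ORM level as in [1].

```python
import sqlite3
from dataclasses import dataclass

# Hypothetical schema: one row per unique genomic exon, plus a two-column
# association table recording which transcripts include which exons.
SCHEMA = """
CREATE TABLE transcript (
    id   INTEGER PRIMARY KEY,
    name TEXT NOT NULL UNIQUE
);
CREATE TABLE genomic_exon (
    id         INTEGER PRIMARY KEY,
    chromosome TEXT    NOT NULL,
    start_pos  INTEGER NOT NULL,
    stop_pos   INTEGER NOT NULL,
    UNIQUE (chromosome, start_pos, stop_pos)
);
CREATE TABLE transcript_exon (
    transcript_id INTEGER NOT NULL REFERENCES transcript(id),
    exon_id       INTEGER NOT NULL REFERENCES genomic_exon(id),
    PRIMARY KEY (transcript_id, exon_id)
);
"""


@dataclass(frozen=True)
class Exon:
    """A transcript-specific exon, built from a row of the join."""
    transcript: str
    chromosome: str
    start: int
    stop: int


def add_exon(conn, transcript, chromosome, start, stop):
    """Record that `transcript` includes the exon at the given coordinates."""
    # INSERT OR IGNORE skips rows that would violate a UNIQUE constraint,
    # so shared coordinates are stored only once.
    conn.execute("INSERT OR IGNORE INTO transcript (name) VALUES (?)", (transcript,))
    conn.execute(
        "INSERT OR IGNORE INTO genomic_exon (chromosome, start_pos, stop_pos) "
        "VALUES (?, ?, ?)",
        (chromosome, start, stop),
    )
    conn.execute(
        """
        INSERT OR IGNORE INTO transcript_exon (transcript_id, exon_id)
        SELECT t.id, e.id
        FROM transcript AS t, genomic_exon AS e
        WHERE t.name = ? AND e.chromosome = ?
          AND e.start_pos = ? AND e.stop_pos = ?
        """,
        (transcript, chromosome, start, stop),
    )


def exons_of(conn, transcript):
    """Load Exon objects for one transcript from the association/exon join."""
    rows = conn.execute(
        """
        SELECT t.name, e.chromosome, e.start_pos, e.stop_pos
        FROM transcript_exon AS te
        JOIN transcript   AS t ON t.id = te.transcript_id
        JOIN genomic_exon AS e ON e.id = te.exon_id
        WHERE t.name = ?
        ORDER BY e.start_pos
        """,
        (transcript,),
    )
    return [Exon(*row) for row in rows]


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    add_exon(conn, "tx1", "chr1", 100, 200)
    add_exon(conn, "tx2", "chr1", 100, 200)  # same coordinates as tx1's exon
    add_exon(conn, "tx2", "chr1", 300, 400)
    print(exons_of(conn, "tx1"))
    print(exons_of(conn, "tx2"))
```

With this layout, two transcripts sharing an exon add one extra association row instead of a duplicated exon row, while the `Exon` objects built from the join still compare as different because they carry different transcripts.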