Refactor Exon to map against an association table between the transcript and exon tables

jsaquing commented 3 years ago

Exon objects in Biosurfer are considered different if they belong to different Transcript objects, even if they have the same coordinates. However, this duplicates information in the underlying SQL database's exon table and leads to long insert query times when populating the database. (I haven't measured select query times specifically, but I imagine they are impacted as well.)

A possible solution would be to have the exon table store only a single row for each unique exon (as defined by its genomic coordinates), while still mapping each Exon object to a combination of a genomic exon and a transcript from the database. It seems this could be accomplished with a transcript-exon association table (column 1 holds transcript IDs, column 2 holds exon IDs, and each row represents a transcript-includes-exon relationship), with Exon then mapped against a join of the association and exon tables, as per [1].
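
To make the proposed layout concrete, here is a rough sketch using Python's built-in sqlite3 module rather than Biosurfer's actual ORM models. All table names, column names, and helper functions (`add_exon`, `exons_of`) are hypothetical and chosen only for illustration; a real exon key would probably also need strand or similar fields. It shows the exon table holding one row per unique set of coordinates, the association table recording transcript-includes-exon relationships, and a join that recovers per-transcript exon rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE transcript (
    id   INTEGER PRIMARY KEY,
    name TEXT NOT NULL
);
-- One row per unique genomic exon (hypothetical columns).
CREATE TABLE exon (
    id        INTEGER PRIMARY KEY,
    chrom     TEXT NOT NULL,
    start_pos INTEGER NOT NULL,
    end_pos   INTEGER NOT NULL,
    UNIQUE (chrom, start_pos, end_pos)
);
-- Each row means "this transcript includes this exon".
CREATE TABLE transcript_exon (
    transcript_id INTEGER NOT NULL REFERENCES transcript(id),
    exon_id       INTEGER NOT NULL REFERENCES exon(id),
    PRIMARY KEY (transcript_id, exon_id)
);
""")


def add_exon(conn, transcript_id, chrom, start_pos, end_pos):
    """Link a transcript to an exon, reusing the exon row if it already exists."""
    coords = (chrom, start_pos, end_pos)
    # The UNIQUE constraint makes this a no-op for already-known coordinates.
    conn.execute(
        "INSERT OR IGNORE INTO exon (chrom, start_pos, end_pos) VALUES (?, ?, ?)",
        coords,
    )
    (exon_id,) = conn.execute(
        "SELECT id FROM exon WHERE chrom = ? AND start_pos = ? AND end_pos = ?",
        coords,
    ).fetchone()
    conn.execute(
        "INSERT INTO transcript_exon (transcript_id, exon_id) VALUES (?, ?)",
        (transcript_id, exon_id),
    )
    return exon_id


def exons_of(conn, transcript_id):
    """Return (transcript_id, exon_id, chrom, start_pos, end_pos) rows via the join."""
    return conn.execute(
        """
        SELECT te.transcript_id, e.id, e.chrom, e.start_pos, e.end_pos
        FROM transcript_exon AS te
        JOIN exon AS e ON e.id = te.exon_id
        WHERE te.transcript_id = ?
        ORDER BY e.start_pos
        """,
        (transcript_id,),
    ).fetchall()


# Two made-up transcripts that share their first exon.
conn.execute("INSERT INTO transcript (id, name) VALUES (1, 'tx-A'), (2, 'tx-B')")
add_exon(conn, 1, "chr1", 100, 200)
add_exon(conn, 1, "chr1", 300, 400)
add_exon(conn, 2, "chr1", 100, 200)  # reuses the existing exon row
add_exon(conn, 2, "chr1", 500, 600)

print(exons_of(conn, 1))
print(exons_of(conn, 2))
```

Each row returned by the join is identified by the pair (transcript ID, exon ID), which is the combination an Exon object would be mapped to under this scheme, while the coordinates themselves are stored only once.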

gsheynkman commented 3 years ago

How much of an issue is this? Does this fall under "nice to optimize" or "we have to do this because we are dead in the water"?

jsaquing commented 3 years ago

It's something I'd like to try optimizing eventually, but it isn't necessary right now (hence the postponed label). I just wanted to write this down for future reference.

sheynkman-lab / biosurfer #85