nih-cfde / cfde-deriva

Collaboration point for miscellaneous CFDE-deriva scripts

Performance regression with too many core fact combinations #301

Closed (karlcz closed 2 years ago)

karlcz commented 2 years ago

As dimensions have been added to the core fact table, a new submission can generate too many distinct combinations of dimension values, erasing the performance improvements gained by pre-aggregating c2m2 entities into equivalence classes.

It seems that (substances, compounds) and (genes) may benefit from being split into separate fact tables.
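A rough sketch of why splitting can help, using made-up toy data rather than anything from a real submission: when dimensions vary independently, the number of distinct combined facts grows with the product of their cardinalities, while separate fact tables grow only with the sum. The dimension names, sizes, and the `distinct_facts` helper below are all hypothetical, and this is not the cfde-deriva implementation.

```python
from itertools import product

# Hypothetical toy dimensions; the sizes are arbitrary.
assay_types = [f"assay{i}" for i in range(4)]
genes = [f"gene{i}" for i in range(5)]
compounds = [f"cmpd{i}" for i in range(6)]

# Worst case: every combination of dimension values occurs in some file.
files = [
    {"assay_type": a, "gene": g, "compound": c}
    for a, g, c in product(assay_types, genes, compounds)
]

def distinct_facts(rows, columns):
    """Return the set of distinct value tuples over the given columns."""
    return {tuple(row[col] for col in columns) for row in rows}

# One combined fact table over all dimensions.
combined = distinct_facts(files, ["assay_type", "gene", "compound"])

# Separate fact tables, one per group of dimensions.
core_only = distinct_facts(files, ["assay_type"])
gene_only = distinct_facts(files, ["gene"])
compound_only = distinct_facts(files, ["compound"])

print(len(combined))                                             # 120 = 4 * 5 * 6
print(len(core_only) + len(gene_only) + len(compound_only))      # 15 = 4 + 5 + 6
```

Real submissions are unlikely to hit the full cross product, but the same multiplicative effect applies whenever a high-cardinality dimension such as compounds is combined with the others.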

karlcz commented 2 years ago

A revision to split out gene_fact and pubchem_fact tables has been pushed. It has been manually tested on a few submissions in dev, but still needs full end-to-end testing via the submission pipeline and browser.

karlcz commented 2 years ago

A test build on app-dev shows more reasonable fact table sizes, with around 10k facts for 3M files. The pubchem facts are most numerous, due to the LINCS assays with many distinct compounds.