hms-dbmi / cistrome-explorer

Interactive visual analytic tool for exploring epigenomics data w/ associated metadata, powered by HiGlass and Gosling
http://cisvis.gehlenborglab.org

Determine best format for storing sample hierarchies and similarity distance measurements #10

Open keller-mark opened 4 years ago

keller-mark commented 4 years ago

Keep in mind that the hierarchical clustering results are coming from Python (probably SciPy, but this would be good to confirm).

Hierarchical clusterings returned by scipy can easily be transformed into a hierarchy of nested JSON objects: see stackoverflow, explosig-server
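As a minimal sketch of that transformation, assuming the clustering arrives in the SciPy linkage-matrix layout (row `i` is `[left_id, right_id, distance, leaf_count]` and defines cluster `n + i`), the nesting can be built in plain Python. The function name, the `name`/`distance`/`children` keys, and the sample labels are hypothetical and not taken from explosig-server:

```python
import json

def linkage_to_nested(Z, labels):
    """Convert a SciPy-style linkage matrix into nested dicts (illustrative only)."""
    n = len(labels)
    # Original observations are the leaves and get ids 0..n-1.
    nodes = {i: {"name": labels[i]} for i in range(n)}
    # Row i merges two existing nodes into a new cluster with id n + i.
    for i, (left, right, dist, _count) in enumerate(Z):
        nodes[n + i] = {
            "name": f"cluster-{n + i}",
            "distance": float(dist),
            "children": [nodes[int(left)], nodes[int(right)]],
        }
    # The last merge is the root of the hierarchy.
    return nodes[n + len(Z) - 1]

# Hand-written linkage matrix for three hypothetical samples.
Z = [[0, 1, 0.5, 2], [2, 3, 1.2, 3]]
tree = linkage_to_nested(Z, ["s0", "s1", "s2"])
print(json.dumps(tree, indent=2))
```

The resulting dict uses a `children` key, which is the default child accessor that d3.hierarchy looks for.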

This nested JSON object can be passed to d3.hierarchy to draw dendrograms: vueplotlib, d3 docs

However, this nesting is not ideal because it is difficult to manipulate for filtering, etc. For example, it would be better to store each sample's location in the hierarchy in that sample's metadata object, rather than having a standalone object that represents multiple samples.
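To illustrate why per-sample storage helps, here is a rough sketch in which each sample's metadata carries its own ancestor path. The `path` field, the cluster ids, and the `tissue` values are assumptions made up for this example, not an agreed format:

```python
# Each sample records the ids of its ancestors, from the root down to itself.
samples = [
    {"id": "s0", "tissue": "blood", "path": ["root", "c1", "s0"]},
    {"id": "s1", "tissue": "liver", "path": ["root", "c1", "s1"]},
    {"id": "s2", "tissue": "blood", "path": ["root", "s2"]},
]

def in_cluster(sample, cluster_id):
    """True if the sample sits anywhere below the given cluster."""
    return cluster_id in sample["path"]

# Filtering by cluster and by metadata is an ordinary list filter;
# no tree traversal or tree surgery is needed.
subset = [s for s in samples if in_cluster(s, "c1") and s["tissue"] == "blood"]
print([s["id"] for s in subset])
```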

See my experiments in this Observable notebook for an example of a transformation that splits the nested JSON object into a 2-dimensional array of unique identifier values, which can then be transformed back into the nested JSON object.
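One possible reading of that round trip, sketched in Python rather than copied from the notebook: each row of the array is one leaf's path of identifiers from the root, and the nested object is rebuilt by merging shared prefixes. It assumes node names are unique among siblings, that rows may have different lengths, and it drops any extra node attributes such as distances:

```python
def tree_to_paths(node, prefix=()):
    """Flatten a nested tree into one root-to-leaf identifier path per leaf."""
    path = prefix + (node["name"],)
    children = node.get("children")
    if not children:
        return [list(path)]
    rows = []
    for child in children:
        rows.extend(tree_to_paths(child, path))
    return rows

def paths_to_tree(rows):
    """Rebuild the nested tree by walking each path and reusing shared ancestors."""
    root = {"name": rows[0][0]}
    for row in rows:
        node = root
        for name in row[1:]:
            children = node.setdefault("children", [])
            match = next((c for c in children if c["name"] == name), None)
            if match is None:
                match = {"name": name}
                children.append(match)
            node = match
    return root

tree = {
    "name": "root",
    "children": [
        {"name": "s2"},
        {"name": "c1", "children": [{"name": "s0"}, {"name": "s1"}]},
    ],
}
rows = tree_to_paths(tree)
rebuilt = paths_to_tree(rows)
print(rows)
```

Each row can be stored alongside the corresponding sample, so dropping a sample only means dropping its row before rebuilding.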

Note that further exploration should still be done to ensure that the best compact hierarchy representation is chosen.

keller-mark commented 4 years ago

Also see the Python experiments here: https://github.com/keller-mark/clodius-cistrome-example/blob/master/src/split.py#L5

keller-mark commented 4 years ago

Other good options may be the Vega "nest" and "stratify" transforms.
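For comparison, the stratify approach works from a flat table in which every row names its own key and its parent's key. The following is only a Python analogue of that idea, not Vega code; the `id` and `parent` field names and the helper itself are hypothetical:

```python
def stratify(rows, key="id", parent_key="parent"):
    """Build a nested tree from flat rows that each point at their parent."""
    nodes = {r[key]: {"name": r[key]} for r in rows}
    root = None
    for r in rows:
        parent = r.get(parent_key)
        if parent is None:
            # The single row without a parent is treated as the root.
            root = nodes[r[key]]
        else:
            nodes[parent].setdefault("children", []).append(nodes[r[key]])
    return root

flat = [
    {"id": "root", "parent": None},
    {"id": "c1", "parent": "root"},
    {"id": "s2", "parent": "root"},
    {"id": "s0", "parent": "c1"},
    {"id": "s1", "parent": "c1"},
]
nested = stratify(flat)
print(nested)
```

Unlike per-leaf paths, this layout needs one row per internal cluster as well as one per sample.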

sehilyi commented 4 years ago

We can suggest the data format that works best for us.