graph-genome / component_segmentation

Read in ODGI Bin output and identify co-linear components
Apache License 2.0

v15: Sparse JSON output #29

Closed lomereiter closed 3 years ago

lomereiter commented 4 years ago

Proposed output JSON tweaks:

  1. Each link column appears twice in the output: once in the arrivals of one bin and once in the departures of another. Instead, link columns could be stored in a separate array, and each component would refer to its link columns by index.
  2. In each link column, store participants as an array of path indices instead of a boolean mask.
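
The two tweaks above can be sketched roughly as follows. This is only an illustration, not the project's actual code: the field names `arrivals`, `departures`, `upstream`, `downstream` and `participants`, and the helper names `mask_to_indices` and `deduplicate_links`, are assumptions made for the example. It also assumes a link column is uniquely identified by its `(upstream, downstream)` pair.

```python
def mask_to_indices(mask):
    """Tweak 2: turn a boolean participation mask into a list of path indices."""
    return [i for i, present in enumerate(mask) if present]


def deduplicate_links(components):
    """Tweak 1: store each link column once; components keep only indices.

    Assumes (hypothetically) that every component is a dict with
    'arrivals' and 'departures' lists, and that each link column is a dict
    with 'upstream', 'downstream' and a boolean 'participants' mask.
    """
    link_columns = []   # shared array of unique link columns
    index_of = {}       # (upstream, downstream) -> position in link_columns
    slim_components = []
    for comp in components:
        entry = {"arrivals": [], "departures": []}
        for side in ("arrivals", "departures"):
            for col in comp[side]:
                key = (col["upstream"], col["downstream"])
                if key not in index_of:
                    index_of[key] = len(link_columns)
                    link_columns.append({
                        "upstream": col["upstream"],
                        "downstream": col["downstream"],
                        "participants": mask_to_indices(col["participants"]),
                    })
                entry[side].append(index_of[key])
        slim_components.append(entry)
    return link_columns, slim_components


# The same link column shows up as a departure of one component
# and as an arrival of the next one.
shared = {"upstream": 1, "downstream": 5, "participants": [True, False, True]}
components = [
    {"arrivals": [], "departures": [dict(shared)]},
    {"arrivals": [dict(shared)], "departures": []},
]
link_columns, slim_components = deduplicate_links(components)
```

With this layout the shared column is written once, and both components point at index 0 of `link_columns`.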

josiahseaman commented 4 years ago

This will get you a 2x reduction at best. If you want to improve the JSON, I'd recommend changing the whole thing to a Map rather than a List, which could make the files 10-50x smaller. Most of the information tends to be sparse, so a giant List of [false,false,false,...] is just wasteful. I attempted this in a branch, https://github.com/graph-genome/component_segmentation/tree/experimental_v6_sparse_matrix, but ran into bugs and got discouraged.
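
To make the List versus Map point concrete, here is a minimal sketch in Python. It is not taken from the experimental branch; `to_sparse` and `to_dense` are hypothetical helpers, and the 1000-entry row is made-up data chosen only to show the shape of the two encodings.

```python
import json


def to_sparse(dense):
    """Keep only the truthy entries, keyed by position.

    JSON object keys must be strings, so indices are stored as str.
    """
    return {str(i): value for i, value in enumerate(dense) if value}


def to_dense(sparse, length, fill=False):
    """Rebuild the full list, e.g. for a consumer that still expects a List."""
    dense = [fill] * length
    for key, value in sparse.items():
        dense[int(key)] = value
    return dense


# Made-up example: 1000 paths, only two of them present.
dense = [False] * 1000
dense[3] = True
dense[742] = True

sparse = to_sparse(dense)
dense_size = len(json.dumps(dense))    # serialized size of the List form
sparse_size = len(json.dumps(sparse))  # serialized size of the Map form
```

How much smaller the Map form turns out in practice depends on how sparse the real data is, so the sizes here say nothing about the 10-50x estimate. Keeping a `to_dense` style helper around is one way to let a List-based consumer keep working during a transition.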

We may end up removing "occupants" or making "participants" sparse at a later date, since these are simply large, expanded precomputes for display convenience. But that change is not part of v13.

Any change here will also require significant code changes in Schematize, which expects a List and should ideally do as little analysis as possible.

josiahseaman commented 3 years ago

As best I can tell, this issue was addressed but, by mistake, was never closed. The new output does use indices instead of true/false arrays.