Planning: hsurf reconstruction workflow

how to forward data???

dstream responsibility:

dstream.dataframe.data_unpack_packed
dstream.dataframe.lookup_explode_unpacked
dstream.dataframe.lookup_explode_packed
dstream raw buffer format (i.e., genome dumps):
- columns: data_hex (hex)
- optional: data_id (i.e., taxon_id), downstream_version (string; warn if missing)
- schema: dstream_T_bitoffset, dstream_T_bitsize, dstream_storage_bitoffset, dstream_storage_bitsize (all measured in bits; use assert to enforce that these are whole bytes, i.e., multiples of 8)
- dstream_S, dstream_algo (string)
- rows: one per taxon
- should have a dstream tool for this (dstream.extract), producing dstream long format
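
sketch of the raw-to-parsed step for one row, to make the bit-offset schema concrete. `unpack_raw_row` is a hypothetical helper, not the actual dstream.dataframe API, and reading dstream_T as a big-endian unsigned integer is an assumption that these notes do not settle.

```python
def unpack_raw_row(
    data_hex: str,
    T_bitoffset: int,
    T_bitsize: int,
    storage_bitoffset: int,
    storage_bitsize: int,
) -> dict:
    """Split one raw hex buffer into dstream_T and dstream_storage_hex."""
    fields = (T_bitoffset, T_bitsize, storage_bitoffset, storage_bitsize)
    # enforce byte alignment, per the schema note
    assert all(f % 8 == 0 for f in fields), "bit fields must be whole bytes"
    needed_bits = max(
        T_bitoffset + T_bitsize, storage_bitoffset + storage_bitsize
    )
    assert len(data_hex) * 4 >= needed_bits, "buffer too short"

    def hex_slice(bitoffset: int, bitsize: int) -> str:
        # each hex character carries 4 bits
        return data_hex[bitoffset // 4 : (bitoffset + bitsize) // 4]

    return {
        # assumption: T is stored as a big-endian unsigned integer
        "dstream_T": int(hex_slice(T_bitoffset, T_bitsize), 16),
        "dstream_storage_hex": hex_slice(storage_bitoffset, storage_bitsize),
    }


# 32-bit T followed by 32 bits of storage
row = unpack_raw_row("0000002aDEADBEEF", 0, 32, 32, 32)
```

dstream_S and dstream_algo would pass through unchanged from the raw row to the parsed row.
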
dstream parsed buffer format:
- columns: dstream_S, dstream_T, dstream_storage_hex (hex string), dstream_algo (string)
- optional: data_id (i.e., taxon_id), downstream_version (string; warn if missing)
- rows: one per taxon
dstream long format:
- columns: hstrat_version, dstream_k, dstream_Tbar, dstream_T, dstream_value_bitsize, dstream_value_hex
- optional: data_id (i.e., taxon_id), downstream_version (string; warn if missing)
- rows: one differentia of one taxon
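
sketch of exploding one parsed row into long-format rows. `explode_storage` and its `lookup_Tbar` callback are hypothetical names; the callback stands in for whatever dstream_algo-specific site lookup supplies dstream_Tbar. Two assumptions: site 0 occupies the most significant bits, and storage divides evenly among the S sites.

```python
from typing import Callable, Dict, List, Optional


def explode_storage(
    storage_hex: str,
    S: int,
    T: int,
    lookup_Tbar: Callable[[int, int], List[Optional[int]]],
) -> List[Dict]:
    """Emit one long-format record per storage site."""
    storage_bitsize = len(storage_hex) * 4
    assert storage_bitsize % S == 0, "storage must divide evenly among sites"
    value_bitsize = storage_bitsize // S
    storage_int = int(storage_hex, 16)

    Tbars = lookup_Tbar(S, T)  # hypothetical: one entry per site
    assert len(Tbars) == S

    rows = []
    for k, Tbar in enumerate(Tbars):
        # assumption: site 0 sits in the most significant bits
        shift = storage_bitsize - (k + 1) * value_bitsize
        value = (storage_int >> shift) & ((1 << value_bitsize) - 1)
        rows.append(
            {
                "dstream_k": k,
                "dstream_Tbar": Tbar,
                "dstream_T": T,
                "dstream_value_bitsize": value_bitsize,
                "dstream_value_hex": format(value, "x"),
            }
        )
    return rows


# placeholder lookup for illustration only, NOT a real dstream algorithm
rows = explode_storage("a1b2", S=2, T=5, lookup_Tbar=lambda S, T: [3, 4])
```
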

pipeline input:

hsurf long format:
- columns: hstrat_version, taxon_id, num_strata_deposited, rank, differentia, differentia_bit_width
- optional: data_id (i.e., taxon_id), hstrat_version (string; warn if missing)
- optional columns: origin_time
- rows: one differentia of one taxon
in-memory representation to pass to C++ bindings:
- two fixed-dimension numpy tables (rank, differentia) and a vector (taxon_id)
- rows: taxa
- columns: nth sorted stratum (note: all rows have the same number of columns)
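
sketch of packing long-format columns into the in-memory layout described above. `pack_long_format` is a hypothetical helper name. It assumes every taxon retains the same number of strata, which is what makes the fixed-dimension tables possible, and that the input is non-empty.

```python
import numpy as np


def pack_long_format(taxon_id, rank, differentia):
    """Return (taxon_ids, rank_table, differentia_table).

    Tables have one row per taxon and one column per nth sorted stratum.
    """
    taxon_id = np.asarray(taxon_id)
    rank = np.asarray(rank)
    differentia = np.asarray(differentia)

    # lexsort treats the LAST key as primary: sort by taxon, then by rank
    order = np.lexsort((rank, taxon_id))
    taxon_id = taxon_id[order]
    rank = rank[order]
    differentia = differentia[order]

    unique_ids, counts = np.unique(taxon_id, return_counts=True)
    # fixed-dimension tables require equal stratum counts across taxa
    assert (counts == counts[0]).all(), "taxa differ in number of strata"

    shape = (len(unique_ids), int(counts[0]))
    return unique_ids, rank.reshape(shape), differentia.reshape(shape)


ids, ranks, diffs = pack_long_format(
    taxon_id=[7, 3, 7, 3],
    rank=[1, 0, 0, 2],
    differentia=[10, 20, 30, 40],
)
```

here `ids` is the taxon_id vector, and row i of `ranks` and `diffs` holds the rank-sorted strata of taxon `ids[i]`.
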

pipeline output:

reconstructed alifestd phylogeny
- columns: taxon_id, origin_time, ancestor_id, ancestor_ids, differentia_bit_width
- rows: one per taxon

PIPELINE:

mmore500 / hstrat