Open goodboy opened 1 year ago
We can convert this to a draft if necessary if/when #483 lands
I'm in favor of doing our own solution and I would rather stop maintaining any marketstore related code; in the end we were almost gonna spend as much work maintaining marketstore as just doing our own thing, right.
yup totally agree!
ok then i'll be putting up some finishing touches on functionality, hopefully tests, and then dropping all that junk 🏄🏼
To give an idea of what the parquet subdir looks like now: it's laid out much like marketstore's own internal per-table binary format files, except it uses less space and is actually a file format data people can use 😂
Launch pad for work towards the task list in #485 🏄🏼
As a start this introduces a new `piker.storage` subsystem to provide for database related middleware(s) as well as a new storage backend using `polars` and apache parquet files to implement a built-in, local-filesystem managed "time series database": `nativedb`.

After some extensive tinkering and brief performance measures I'm tempted to go all in on this home grown solution for a variety of reasons (see details in 27932e44) but re-summarizing some of them here:

- `polars` already has a multi-db compat layer with multi-engine support we can leverage and completely sidestep integration work with multiple standard tsdbs?

**Core dev discussion**
- [ ] we've put some work into `marketstore` support machinery including `anyio-marketstore`, an async client written and maintained by our devs.
- [ ] we can definitely accomplish ingest, pub-sub and replication on our own (without really much effort) with the following existing subsystems and frameworks:
  - a `tractor` actor which writes to apache arrow (IPC) files and flushes to parquet on size constraints.
  - a `tractor` actor and `trio-websocket`
  - `borg` (with its unofficial API client) to accomplish file syncing across many user-hosts; `borg` has a community API: https://github.com/spslater/borgapi
- [ ] should we drop all the existing `marketstore` code?
  - the `.data.history` layer.
  - `arcticdb` is a better solution longer run than mkts was anyway given its large insti usage..?

**ToDo:**
- [x] CHERRY from #519:
- [ ] CHERRY from #528
- [ ] outstanding obvious regression due to this patch set :joy:
  - `.data.history.start_backfill()`
- [ ] drop marketstore code in general depending on outcome of above discussion:
  - [ ] `.storage.marketstore` and the `anyio-marketstore` dep?
  - [ ] the `.service._ahab` layer?
  - [ ] `.data.history`!

from https://github.com/pikers/piker/issues/485:
- [ ] `.storage` with subpkgs for backends and an API / mgmt layer
- [ ] outstanding tsdb bugs:
  - #436
  - #323
- [ ] docs on new filesystem layout and config options:
  - the `nativedb/` dir
  - a `[storage]` section added to `conf.toml`
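Something along these lines for that section; the key names here are illustrative only, not a settled schema:

```toml
# conf.toml (hypothetical keys, just to show the shape)
[storage]
backend = "nativedb"
# where the per-timeseries parquet files live:
datadir = "~/.config/piker/nativedb"
```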
- [ ] from #312 we need chart-UI integration for a buncha stuff:
  - [ ] `reload history` for a highlighted section or gap B)
- [ ] `.storage.cli` refinement:
  - [ ] `--tsdb` is no longer needed since we don't have to offer optional docker activation; we don't need it using the `nativedb` backend!
  - [ ] `piker store` cmds:
    - [ ] an `anal` subcmd to do gap detection and discrepancy reporting (at the least) against market-venue known operating hours.
- [ ] new
`nativedb` backend implemented with `polars` + apache parquet files B)
- [x] since we're already moving to use `typer` in #489, let's also add confirmation support for the new `pikerd storage -d` flag:
  - [x] added and used in the new `.storage.cli`!
  - [ ] do confirms for deletes? https://typer.tiangolo.com/tutorial/prompt/#confirm
- [ ] gap backfilling (as detailed in https://github.com/pikers/piker/pull/486/commits/f45b76ed77eafdf44871d3e3305f7dc18e9de938) still requires some work for full functionality including:
  - [ ] UI needs a cross-actor event in the history chart's update loop to ensure we do a forced graphics data formatter update when gap-backfilling is complete.
- [x] rt ingest and fast parquet update deferred to #536
  - [ ] currently we aren't storing rt data (received during a data session but not previously written to storage) on teardown..
    - consider writing the arrow IPC files and then flushing to dfs and then to parquet at some frequency / on teardown?
  - [ ] related to the above, what about FSP ingest and storage?
- [ ] https://github.com/pikers/piker/issues/314 probably should be re-created but for `nativedb` with a new writeup around arrow IPC and feather formats?
- [ ] (likely as follow up) use the lazy `polars` API to do larger-than-mem processing both for charting and remote (host) processing:
  - from the guide:
  - from API docs:
- [ ] use `polars` to do price series anomaly repairs, such as those caused by stock splits, or for handling bugs in data providers where a ticker name was repurposed for a new asset and the price history has a mega gap:

  ![screenshot-2023-06-14_15-20-53](https://github.com/pikers/piker/assets/291685/0f43d9c0-888f-47f9-af8d-eaf6016eaf0f)

- [ ] deciding on file organization, naming schema, subdirs for piker subsystems, etc.
- [ ] should we store multiple files segmented by some time period and then simply use the multiple files reader support: https://pola-rs.github.io/polars-book/user-guide/io/multiple/
- [ ] current file naming scheme is `mnq.cme.20230616.ib.ohlcv1s.parquet` but we can probably change the meta-data token part `ohlcv1s` to be more parse-able and readable, as in `ohlcv.1s.<otherinfo>`?
- [ ] a `.config/piker/nativedb/fsp/` subdir?
- [ ] what is writing deltas and can we use it?
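To show why the `ohlcv.1s` style token is nicer to work with, here's a toy parser; the full `<symbol>.<venue>.<date>.<broker>.<series>.<period>` split is an assumption about the layout, not settled naming:

```python
# toy parser for the *proposed* dot-separated naming scheme,
# eg. 'mnq.cme.20230616.ib.ohlcv.1s.parquet' (hypothetical layout)
def parse_ts_fname(fname: str) -> dict:
    symbol, venue, date, broker, series, period = fname.removesuffix(
        ".parquet"
    ).split(".")
    return {
        "symbol": symbol,
        "venue": venue,
        "date": date,
        "broker": broker,
        "series": series,   # eg. 'ohlcv'
        "period": period,   # eg. '1s', now trivially separable
    }

meta = parse_ts_fname("mnq.cme.20230616.ib.ohlcv.1s.parquet")
print(meta["series"], meta["period"])
```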