oceanprotocol / pdr-backend

Instructions & code to run predictoors, traders, more.
Apache License 2.0

[CLI][ETL][Update] CLI Lake RAW + ETL update() command can do one-shot or loop-forever #1107

Closed idiom-bytes closed 3 months ago

idiom-bytes commented 4 months ago

Motivation

We need to run Lake/ETL on a thread. It should run continuously, compile the data, and keep it ready to be served.

To address these requirements, we'll have to support updating GQLDF + ETL such that st_ts & end_ts can be "natural language" dates, such as "1d ago" and "now".
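
As a rough illustration of what resolving such "natural language" dates could involve, here is a minimal sketch. It is not the pdr-backend implementation (the snippet in this issue uses `UnixTimeMs.from_timestr` for that); the function name `timestr_to_ms`, the unit table, and the accepted grammar ("now" or "<N><unit> ago") are all assumptions made for this example.

```python
import re
from datetime import datetime, timedelta, timezone
from typing import Optional

# Hypothetical unit table for this sketch; not taken from pdr-backend.
_UNIT_SECONDS = {
    "s": 1, "sec": 1, "second": 1,
    "m": 60, "min": 60, "minute": 60,
    "h": 3600, "hour": 3600,
    "d": 86400, "day": 86400,
    "w": 604800, "week": 604800,
}
_RELATIVE = re.compile(r"^(\d+)\s*([a-z]+)\s+ago$")


def timestr_to_ms(timestr: str, now: Optional[datetime] = None) -> int:
    """Resolve "now" or "<N><unit> ago" to a UTC Unix timestamp in ms."""
    if now is None:
        now = datetime.now(timezone.utc)
    text = timestr.strip().lower()
    if text == "now":
        delta = timedelta(0)
    else:
        match = _RELATIVE.match(text)
        if match is None:
            raise ValueError(f"unsupported time string: {timestr!r}")
        amount, unit = int(match.group(1)), match.group(2)
        # Treat plurals like "days" the same as "day".
        if len(unit) > 1 and unit.endswith("s"):
            unit = unit[:-1]
        if unit not in _UNIT_SECONDS:
            raise ValueError(f"unsupported time unit in: {timestr!r}")
        delta = timedelta(seconds=amount * _UNIT_SECONDS[unit])
    return int((now - delta).timestamp() * 1000)
```

Because the result depends on the current time, a long-running process has to call the parser again on every update rather than caching the result, which is why the snippet in this issue resolves the time strings inside the loop.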

Now use the implementation from the discussion... https://github.com/oceanprotocol/pdr-backend/pull/1095#discussion_r1617991978

Here is what that fix looks like:

@enforce_types
def do_lake_etl_update(_, ppss):
    """
    @description
        This runs all dependencies to build analytics
        All raw, clean, and aggregate data will be generated
        1. All subgraph data will be fetched
        2. All analytic data will be built
        3. Lake contains all required data
        4. Dashboards read from lake

        Please use nested_args to control lake_ss
        ie: st_timestr, fin_timestr, lake_dir
    """
    # Pseudocode: run this loop on a background thread.
    while True:
        # Resolve the time strings inside the loop (that code moves here),
        # so relative values are re-evaluated on every iteration.
        st_ts_ms = UnixTimeMs.from_timestr(ppss.lake_ss.st_timestr)  # e.g. "1 day ago"
        fin_ts_ms = UnixTimeMs.from_timestr(ppss.lake_ss.fin_timestr)  # e.g. "now"

        # Pass the fixed time range through the pipeline.
        gql_data_factory = GQLDataFactory(ppss)
        etl = ETL(ppss, gql_data_factory)
        etl.do_etl(st_ts_ms, fin_ts_ms)
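
The issue title asks for an update command that can run one-shot or loop forever. One possible way to structure that choice is sketched below, independent of pdr-backend. The helper `run_update` and its parameters (`loop`, `interval_s`, `max_iterations`, `sleep`) are hypothetical names for this example, and `update_once` stands in for a single pass of the resolve-timestamps-then-ETL body shown above.

```python
import time
from typing import Callable, Optional


def run_update(
    update_once: Callable[[], None],
    loop: bool = False,
    interval_s: float = 60.0,
    max_iterations: Optional[int] = None,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Call update_once once, or repeatedly when loop=True.

    Returns the number of times update_once was called.
    max_iterations and sleep exist mainly so the loop can be tested.
    """
    iterations = 0
    while True:
        update_once()
        iterations += 1
        if not loop:
            break  # one-shot mode: a single pass, then return
        if max_iterations is not None and iterations >= max_iterations:
            break  # bounded loop, e.g. in tests
        sleep(interval_s)  # wait before the next pass
    return iterations
```

With `loop=False` this behaves like a one-shot CLI call. With `loop=True` and no `max_iterations` it never returns, so it could be started on a daemon thread, for example with `threading.Thread(target=run_update, args=(update_once,), kwargs={"loop": True}, daemon=True).start()`. The sleep interval is an assumption; the pseudocode above does not specify any delay between passes.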

Todo:

idiom-bytes commented 3 months ago

This is being tracked in the backlog. Closing for now, reopen 1 by 1.