Optimize step data transmission

ezio-melotti commented 3 years ago

Currently for each step that we send, we repeat a lot of data. We could remove this duplication by sending an object once at the beginning that contains repeated information, such as:

the units for each currency
the order of the currencies/values in a group
the "nice" names for the currencies/agents

For example, the "total_production" group in each step data currently looks like:

  "total_production": {
    "atmo_co2": {
      "value": 0.025916,
      "unit": "1.0 kg"
    },
    "atmo_o2": {
      "value": 0,
      "unit": ""
    },
    "h2o_potb": {
      "value": 4.75,
      "unit": "1.0 kg"
    },
    "enrg_kwh": {
      "value": 0.000474,
      "unit": "1.0 kWh"
    }
  },

For each currency there is a corresponding object with a "value" and a "unit". We could include in the initial schema an object that maps currencies to units:

"units": {
    "atmo_co2": "kg",
    "atmo_o2": "kg",
    "h2o_potb": "kg",
    "enrg_kwh": "kWh",
    "...": "..."
}

And each resulting simplified step data will look like:

  "total_production": {
    "atmo_co2": 0.025916,
    "atmo_o2": 0,
    "h2o_potb": 4.75,
    "enrg_kwh": 0.000474
  },

By doing this we lose some flexibility, since e.g. each CO2 amount will be expressed in kilograms, even if it's a fraction of a gram. However, the frontend can take care of converting to the most appropriate unit (e.g. from kg to g or mg).
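A minimal Python sketch of the transformation above, assuming the current group structure shown earlier. The function name `split_units` is hypothetical and not part of the existing codebase:

```python
def split_units(group):
    """Split {"currency": {"value": v, "unit": u}} into a values dict and a units dict."""
    values, units = {}, {}
    for currency, entry in group.items():
        values[currency] = entry["value"]
        # Units currently arrive as "1.0 kg", or as "" when the value is 0.
        # Keep only the unit symbol and skip empty units.
        unit = entry["unit"].split()[-1] if entry["unit"] else ""
        if unit:
            units[currency] = unit
    return values, units


total_production = {
    "atmo_co2": {"value": 0.025916, "unit": "1.0 kg"},
    "atmo_o2": {"value": 0, "unit": ""},
    "h2o_potb": {"value": 4.75, "unit": "1.0 kg"},
    "enrg_kwh": {"value": 0.000474, "unit": "1.0 kWh"},
}
values, units = split_units(total_production)
```

Note that `atmo_o2` ends up without a unit here, because the step data reports an empty unit when the value is 0. This suggests the units map is better taken from a static source than derived from step data.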

We could also include the "nice" names in the initial object, and use them e.g. in the panels:

"names": {
    "atmo_co2": "Carbon Dioxide",
    "atmo_o2": "Oxygen",
    "h2o_potb": "Water",
    "enrg_kwh": "Energy",
    "...": "..."
}

If we want to optimize further, we can factor out some of the other keys. For example, we could include a schema in the initial object that specifies the order of the values for each group, e.g.:

{"total_production": ["atmo_co2", "atmo_o2", "h2o_potb", "enrg_kwh"]}

And then each step data will only include the following, without repeating the name of the currency or the unit:

{"total_production": [0.025916, 0, 4.75, 0.000474]}

Doing this might increase the complexity of the frontend code though, and might introduce bugs since the role of each value needs to be determined by looking at the initial schema.
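To illustrate the positional encoding and the kind of check that could limit such bugs, here is a rough sketch. `SCHEMA`, `encode_group` and `decode_group` are hypothetical names, and the length check is only one possible safeguard:

```python
# Sent once in the initial object: the order of the values for each group.
SCHEMA = {"total_production": ["atmo_co2", "atmo_o2", "h2o_potb", "enrg_kwh"]}


def encode_group(name, values, schema=SCHEMA):
    """Turn {"currency": value} into a list ordered according to the schema."""
    return [values[currency] for currency in schema[name]]


def decode_group(name, row, schema=SCHEMA):
    """Rebuild {"currency": value} from a positional list and the schema."""
    order = schema[name]
    if len(row) != len(order):
        # A mismatch means the step data and the schema are out of sync.
        raise ValueError(f"{name}: expected {len(order)} values, got {len(row)}")
    return dict(zip(order, row))


values = {"atmo_co2": 0.025916, "atmo_o2": 0, "h2o_potb": 4.75, "enrg_kwh": 0.000474}
row = encode_group("total_production", values)
restored = decode_group("total_production", row)
```

The length check catches a missing or extra value, but not two values that were swapped, so the schema and the encoder still have to be kept in sync by other means.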

Another possible optimization is combining the data of multiple steps. The backend already sends step data in batches, so a batch of 5 steps could be compressed into something like:

  "total_production": {
    "atmo_co2": [0.025916, ..., ..., ..., ...],
    "atmo_o2": [0, ..., ..., ..., ...],
    "h2o_potb": [4.75, ..., ..., ..., ...],
    "enrg_kwh": [0.000474, ..., ..., ..., ...]
  },

This will require some extra work on both the backend (which will have to combine the step data) and the frontend (which will have to extract them).
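The combining and extracting could look roughly like the following sketch. `combine_steps` and `extract_step` are hypothetical helpers, the steps are assumed to already use the simplified form (plain values, no units), and the second step contains placeholder values rather than real simulation output:

```python
def combine_steps(steps, group):
    """Merge one group from a batch of steps into {"currency": [v1, v2, ...]}."""
    combined = {}
    for step in steps:
        # Assumes every step reports the same currencies for this group;
        # otherwise the lists would go out of alignment.
        for currency, value in step[group].items():
            combined.setdefault(currency, []).append(value)
    return combined


def extract_step(combined, index):
    """Recover the group of a single step from the combined form."""
    return {currency: series[index] for currency, series in combined.items()}


batch = [
    {"total_production": {"atmo_co2": 0.025916, "atmo_o2": 0, "h2o_potb": 4.75, "enrg_kwh": 0.000474}},
    # Placeholder values for illustration only.
    {"total_production": {"atmo_co2": 0.0, "atmo_o2": 0, "h2o_potb": 0.0, "enrg_kwh": 0.0}},
]
combined = combine_steps(batch, "total_production")
first = extract_step(combined, 0)
```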

Regardless of the actual structure of the json, we could also look into adding compression at the network level, e.g. by gzipping the data.

For reference, this is what a single step data looks like:


```json
{
  "id": 8396705445451241000,
  "step_num": 1,
  "user_id": 5,
  "game_id": 2026786606093101000,
  "start_time": 1631907999,
  "time": 3600,
  "hours_per_step": 1,
  "is_terminated": "False",
  "termination_reason": null,
  "agent_growth": { "radish": 0 },
  "total_agent_count": { "human_agent": 1 },
  "total_production": {
    "atmo_co2": { "value": 0.025916, "unit": "1.0 kg" },
    "atmo_o2": { "value": 0, "unit": "" },
    "h2o_potb": { "value": 4.75, "unit": "1.0 kg" },
    "enrg_kwh": { "value": 0.000474, "unit": "1.0 kWh" }
  },
  "total_consumption": {
    "atmo_co2": { "value": 0, "unit": "" },
    "atmo_o2": { "value": 0.021583, "unit": "1.0 kg" },
    "h2o_potb": { "value": 0.165833, "unit": "1.0 kg" },
    "enrg_kwh": { "value": 3.723, "unit": "1.0 kWh" }
  },
  "details_per_agent": {
    "in": {
      "enrg_kwh": {
        "solid_waste_aerobic_bioreactor": { "value": 0, "unit": "" },
        "multifiltration_purifier_post_treatment": { "value": 0.012, "unit": "1.0 kWh" },
        "oxygen_generation_SFWE": { "value": 0, "unit": "" },
        "urine_recycling_processor_VCD": { "value": 0, "unit": "" },
        "co2_removal_SAWD": { "value": 0, "unit": "" },
        "co2_reduction_sabatier": { "value": 0, "unit": "" },
        "ch4_removal_agent": { "value": 0, "unit": "" },
        "dehumidifier": { "value": 0, "unit": "" },
        "crew_habitat_small": { "value": 2.711, "unit": "1.0 kWh" },
        "greenhouse_small": { "value": 1, "unit": "1.0 kWh" },
        "radish": { "value": 0, "unit": "" }
      },
      "atmo_co2": {
        "co2_removal_SAWD": { "value": 0, "unit": "" },
        "co2_reduction_sabatier": { "value": 0, "unit": "" },
        "radish": { "value": 0, "unit": "" }
      }
    }
  },
  "storage_capacities": {
    "air_storage": {
      "1": {
        "atmo_o2": { "value": 390.097667, "unit": "kg" },
        "atmo_co2": { "value": 0.795725, "unit": "kg" },
        "atmo_n2": { "value": 1454.3145, "unit": "kg" },
        "atmo_ch4": { "value": 0.003483, "unit": "kg" },
        "atmo_h2": { "value": 0.001024, "unit": "kg" },
        "atmo_h2o": { "value": 18.704167, "unit": "kg" }
      }
    },
    "water_storage": {
      "1": {
        "h2o_potb": { "value": 1345.584167, "unit": "kg" },
        "h2o_urin": { "value": 0.0625, "unit": "kg" },
        "h2o_wste": { "value": 0.087083, "unit": "kg" },
        "h2o_tret": { "value": 144.25, "unit": "kg" }
      }
    },
    "nutrient_storage": {
      "1": {
        "biomass_totl": { "value": 0, "unit": "kg" },
        "sold_n": { "value": 100, "unit": "kg" },
        "sold_p": { "value": 100, "unit": "kg" },
        "sold_k": { "value": 100, "unit": "kg" },
        "sold_wste": { "value": 0, "unit": "kg" }
      }
    },
    "power_storage": {
      "1": {
        "enrg_kwh": { "value": 996.277, "unit": "kWh" }
      }
    },
    "food_storage": {
      "1": {
        "food_edbl": { "value": 99.937083, "unit": "kg" }
      }
    }
  }
}
```

granawkins commented 2 years ago

Update

In the course of the work for ABM-Redesign, Grant added the AgentDataCollector class, which scrapes all potentially relevant data from an agent each step. It was initially developed for testing, and then became useful for the Jupyter workflow.

Now, as part of adding the ABM-Redesign functionality to the frontend, we will do a thorough update of the collection, storage and transmission of simdata.

The items in this issue (above) are directly relevant and the name works, so I'm co-opting this issue instead of creating a new one.

Plan

- Define a new schema for sharing data between frontend/backend
  - Optimize size via reorganizing and/or compression
  - Fetch specific steps/fields as-needed
- Get baseline speed/size figures for comparison
- Update AgentDataCollector to support new schema
- Update storage/transmission system (GameRunner/Redis)
- Update API (Flask/frontend) to new storage/transmission and schema

ezio-melotti commented 2 years ago

The currency_desc.json file could be used to solve the problem of the currency names/units. If these values are added in the file, it could be sent as-is to the frontend.

> Get baseline speed/size figures for comparison

At this stage I don't think we need benchmarks. There is clearly a lot of duplicate data being sent, and it will certainly go faster once we remove it. Sending gzipped data at the network level (i.e. just by specifying it in the HTTP headers) might be useful and simple enough to implement, but I wouldn't spend too much time working on custom solutions.

granawkins commented 2 years ago

A simple solution would be to have the backend send the output of AgentModel.get_data(debug=True) to the front-end directly.

It includes all fields for all agents/currencies at all steps.

The 4-human-garden full-simulation object is about 1.1MB, whereas the current one is ~8MB and includes only a small subset of the data.

I think we can also really simplify the front-end by storing this and indexing it directly from the panels.

ezio-melotti commented 2 years ago

> Regardless of the actual structure of the json, we could also look into adding compression at the network level, e.g. by gzipping the data.

Good news everyone! Looks like we already have this: The highlighted request contains 9 days (216 steps) of data sent through websocket for the 1 human preset, and it was compressed down from 422k to 17k during the transfer. The arrows show that the frontend was already accepting gzip and the backend also encoded the data as gzip.

Exporting the data and gzipping them yields similar values (the format is a bit different):

 379614 simoc-simulation-data-1h-9d.json
  19833 simoc-simulation-data-1h-9d.tar.gz
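A comparison like this can also be scripted with the Python standard library. The sketch below is only illustrative: `payload_sizes` is a hypothetical helper, and the batch repeats one identical step 216 times, which compresses far better than real simulation data would:

```python
import gzip
import json


def payload_sizes(data):
    """Return (raw_bytes, gzipped_bytes) for a JSON-serializable object."""
    raw = json.dumps(data, separators=(",", ":")).encode("utf-8")
    return len(raw), len(gzip.compress(raw))


step = {
    "total_production": {
        "atmo_co2": {"value": 0.025916, "unit": "1.0 kg"},
        "atmo_o2": {"value": 0, "unit": ""},
        "h2o_potb": {"value": 4.75, "unit": "1.0 kg"},
        "enrg_kwh": {"value": 0.000474, "unit": "1.0 kWh"},
    }
}
# 216 copies of the same step: an unrealistically repetitive stand-in
# for 9 days of data, used only to show the measurement itself.
batch = [step] * 216
raw_size, gz_size = payload_sizes(batch)
print(raw_size, gz_size)
```

Running the same function on an exported simulation file (loaded with `json.load`) would give figures comparable to the ones above.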

granawkins commented 2 years ago

Great! Ya looks like socketio uses compression by default.

overthesun / simoc

Optimize step data transmission #141
