tablelandnetwork / weeknotes

A place for weekly updates from the @tablelandnetwork team

[NOT-130] Weeknotes individual update: November 27, 2023 #111

Closed dtbuchholz closed 10 months ago

dtbuchholz commented 10 months ago

Experimenting with Basin + Python + polars for WeatherXM data analysis

by Dan Buchholz

Tableland Basin lets developers use familiar web2 workflows (like Postgres or parquet file uploads) to bring data into the web3 realm, replicating it to Filecoin for cold storage. (There's a lot that we're adding to Basin to expand the feature set and improve developer experience, too!) WeatherXM, in particular, is using a workflow where they sign & replicate, on a set schedule, large parquet files containing device data like temperature, humidity, and wind readings.

I put together a simple demonstration of how developers can use wxm data in this repo, which calculates and writes summary metrics to a CSV and markdown file. Here are some notes on what it does and how it was built:

The demo is relatively complete for anyone to run on their own. A few cleanups are still needed, but it's fully functioning. For example, each file that wxm uploads is 200-250MB, and there are (currently) 5 total files. Since wxm will be continuously pushing more data, the script needs to account for this because it could consume too much memory, particularly in the GitHub Action, which has limited memory. Also, IPFS gateways can be pretty slow, so downloading the files locally and then cleaning them up after the dataframe is created might improve the setup time.
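The local-download-and-cleanup idea could be sketched roughly as follows; the helper name and file layout are my own for illustration, not part of the repo:

```python
import tempfile
import urllib.request
from pathlib import Path


def download_files(urls: list[str], dest: Path) -> list[Path]:
    """Fetch each URL (e.g. an IPFS gateway link) into `dest` and return local paths."""
    paths = []
    for i, url in enumerate(urls):
        path = dest / f"part_{i}.parquet"
        urllib.request.urlretrieve(url, path)
        paths.append(path)
    return paths
```

Wrapping the call in a `tempfile.TemporaryDirectory()` context would then remove the large local files automatically once the dataframe has been built.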

Getting Basin deal

Basin is still early in development and doesn't have an HTTP API—yet (it's coming)! In order to dynamically fetch publication information and retrieve the data, a workaround using subprocesses was used:

```python
import subprocess

# Get publications for address 0xfc7C55c4A9e30A4e23f0e48bd5C1e4a865dA06C5 (the wxm publication signer)
command = ["basin", "publication", "list", "--address", address]
result = subprocess.run(command, capture_output=True, text=True)
out = result.stdout  # Capture the output from the CLI command

# Get all deals for the `xm_data.p1` publication
command = ["basin", "publication", "deals", "--publication", pub, "--format", "json"]
```
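Until the HTTP API lands, that subprocess pattern can be wrapped to parse the CLI's JSON output into gateway URLs. A rough sketch—note that the `cid` field name and the gateway host below are assumptions about the output shape, not confirmed details of the Basin CLI:

```python
import json
import subprocess


def basin_deals(pub: str) -> list[dict]:
    """Shell out to the Basin CLI and parse its JSON deal list."""
    command = ["basin", "publication", "deals", "--publication", pub, "--format", "json"]
    result = subprocess.run(command, capture_output=True, text=True, check=True)
    return json.loads(result.stdout)


def to_gateway_urls(deals: list[dict], gateway: str = "https://ipfs.io/ipfs") -> list[str]:
    # Assumes each deal entry carries a "cid" key; adjust to the actual CLI output
    return [f"{gateway}/{deal['cid']}" for deal in deals]
```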

Once the API is launched, it'll make fetching and retrieving this data significantly more streamlined.

Lazy dataframes & streaming

One interesting learning was how using a polars LazyFrame and the streaming option impacted performance. Here's a quick snippet of the setup; remote_files is simply an array of URLs to IPFS gateways that point to each parquet file under the xm_data.p1 publication:

```python
import polars as pl

lazy = pl.scan_parquet(source=remote_files, cache=True, retries=3)
df = lazy.collect(streaming=True)
```

The scan_parquet method is what creates a LazyFrame, which allows for whole-query optimization in addition to parallelism. Once this is created, the .collect() method materializes a pandas-style DataFrame, and the streaming option runs parts of the query in a streaming fashion. Prior to this setup, I was simply doing a .collect() on the first line, but moving to streaming cut the time to create a DataFrame by over 50%. Quite an improvement! And for what it's worth, it was my first time using polars, so there are definitely more optimizations that could be implemented. For example, changing how .collect() is used, or using read_parquet instead, might make more sense with how the queries are implemented.

Polygon gas limit bug fix due to ethers.js

We just published new patch versions for the Tableland SDK (v5.1.1) and CLI (v5.4.1). There was an issue on Polygon mainnet and Mumbai that caused all transactions to fail due to a gas limit issue. For example, a simple single-cell table mutation would fail since the gas limit set by ethers was too low, causing the transaction to consume 99%+ of the limit and leading to an "out of gas" error.


If you're curious, you can see how the logic change was implemented here. It's pretty straightforward: if the chain is Polygon, it'll add a 20% buffer to the estimated gas limit. Note that the ethers getFeeData() method doesn't seem to affect the gas limit, only the estimated price or fee per gas. This is how the logic works:

```typescript
const overrides: Overrides = {};
const gasLimit = await contract.estimateGas.setController(
  caller,
  tableId,
  controller,
  overrides
);
overrides.gasLimit = Math.floor(gasLimit.toNumber() * 1.2);
```
