tablelandnetwork / weeknotes

A place for weekly updates from the @tablelandnetwork team

[NOT-130] Weeknotes individual update: November 27, 2023 #111

Closed dtbuchholz closed 10 months ago

dtbuchholz commented 10 months ago

Experimenting with Basin + Python + polars for WeatherXM data analysis

by Dan Buchholz

Tableland Basin lets developers use familiar web2 workflows (like Postgres or parquet file uploads) to bring data into the web3 realm, replicating it to Filecoin for cold storage. (There's a lot that we're adding to Basin to expand the feature set and improve developer experience, too!) WeatherXM, in particular, is using a workflow where they sign & replicate, on a set schedule, large parquet files containing device data like temperature, humidity, and wind readings.

I put together a simple demonstration of how developers can use wxm data in this repo, which calculates and writes summary metrics to a CSV and markdown file. Here are some notes on what it does and how it was built:

The demo is relatively complete for anyone to run on their own. A few cleanups are still needed, but it's fully functioning. For example, each file that wxm uploads is 200-250MB, and there are (currently) 5 total files. Since wxm will be continuously pushing more data, the script needs to account for this because it could consume too much memory, particularly in the GitHub Action, which has limited memory. Also, IPFS gateways can be pretty slow, so downloading the files locally and then cleaning them up after the dataframe is created might improve the setup time.
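The local-download-and-cleanup idea could be sketched roughly as follows; the helper name and file layout are my own for illustration, not part of the repo:

```python
import tempfile
import urllib.request
from pathlib import Path


def download_files(urls: list[str], dest: Path) -> list[Path]:
    """Fetch each URL (e.g. an IPFS gateway link) into `dest` and return local paths."""
    paths = []
    for i, url in enumerate(urls):
        path = dest / f"part_{i}.parquet"
        urllib.request.urlretrieve(url, path)
        paths.append(path)
    return paths
```

Wrapping the call in a `tempfile.TemporaryDirectory()` context would then remove the large local files automatically once the dataframe has been built.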

Getting Basin deal

Basin is still early in development and doesn't have an HTTP API—yet (it's coming)! In order to dynamically fetch publication information and retrieve the data, a workaround using subprocesses was used:

```python
import subprocess

# Get publications for address 0xfc7C55c4A9e30A4e23f0e48bd5C1e4a865dA06C5 (the wxm publication signer)
command = ["basin", "publication", "list", "--address", address]
result = subprocess.run(command, capture_output=True, text=True)
out = result.stdout  # Capture the output from the CLI command

# Get all deals for the `xm_data.p1` publication
command = ["basin", "publication", "deals", "--publication", pub, "--format", "json"]
```
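Until the HTTP API lands, that subprocess pattern can be wrapped to parse the CLI's JSON output into gateway URLs. A rough sketch—note that the `cid` field name and the gateway host below are assumptions about the output shape, not confirmed details of the Basin CLI:

```python
import json
import subprocess


def basin_deals(pub: str) -> list[dict]:
    """Shell out to the Basin CLI and parse its JSON deal list."""
    command = ["basin", "publication", "deals", "--publication", pub, "--format", "json"]
    result = subprocess.run(command, capture_output=True, text=True, check=True)
    return json.loads(result.stdout)


def to_gateway_urls(deals: list[dict], gateway: str = "https://ipfs.io/ipfs") -> list[str]:
    # Assumes each deal entry carries a "cid" key; adjust to the actual CLI output
    return [f"{gateway}/{deal['cid']}" for deal in deals]
```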

Once the API is launched, it'll make fetching and retrieving this data significantly more streamlined.

Lazy dataframes & streaming

One interesting learning was how using a polars LazyFrame and the streaming option impacted performance. Here's a quick snippet of the setup; remote_files is simply an array of URLs to IPFS gateways that point to each parquet file under the xm_data.p1 publication:

```python
import polars as pl

lazy = pl.scan_parquet(source=remote_files, cache=True, retries=3)
df = lazy.collect(streaming=True)
```

The scan_parquet method is what creates a LazyFrame, which allows for whole-query optimization in addition to parallelism. Once this is created, the .collect() method materializes a pandas-style DataFrame, and the streaming option runs parts of the query in a streaming fashion. Prior to this setup, I was simply doing a .collect() on the first line, but moving to streaming cut the time to create a DataFrame by over 50%. Quite an improvement! And for what it's worth, it was my first time using polars, so there are definitely more optimizations that could be implemented. For example, changing how .collect() is used, or using read_parquet instead, might make more sense with how the queries are implemented.

Polygon gas limit bug fix due to ethers.js

We just published new patch versions for the Tableland SDK (v5.1.1) and CLI (v5.4.1). There was an issue on Polygon mainnet and Mumbai that caused all transactions to fail due to a gas limit issue. For example, a simple single-cell table mutation would fail since the gas limit set by ethers was too low, causing the transaction to consume 99%+ of the limit and leading to an "out of gas" error.


If you're curious, you can see how the logic change was implemented here. It's pretty straightforward: if the chain is Polygon, it'll add a 20% buffer to the estimated gas limit. Note that the ethers getFeeData() method doesn't seem to affect the gas limit, only the estimated price or fee per gas. This is how the logic works:

```typescript
const overrides: Overrides = {};
const gasLimit = await contract.estimateGas.setController(
  caller,
  tableId,
  controller,
  overrides
);
overrides.gasLimit = Math.floor(gasLimit.toNumber() * 1.2);
```
